THIS COMPONENT IS AVAILABLE STARTING FROM VERSION 4.1.0 OF PGSYS.

Vector Databases

Specialized system for storing, managing, and querying high-dimensionality data (vectors/embeddings). Unlike traditional databases that search for exact matches, vector databases find data by similarity, being fundamental for AI applications such as semantic search, recommendation systems, and LLMs.

Fundamental Concepts

Vector: List of numbers representing coordinates in multidimensional space.

Example: [12, 13, 19, 8, 9]
Just as (x, y) defines a 2D point, vectors define points in spaces of hundreds or thousands of dimensions

Embedding: Numerical vector representation of data (text, image, audio) generated by Machine Learning models.

Semantically similar data generates embeddings close in vector space
Example: "weather forecast" and "will it rain?" have numerically close embeddings

How It Works

The similarity search process involves two stages:

1. Indexing

Raw data (text, image, product) is processed by ML model
Model generates numerical vector (embedding) capturing data characteristics
Vector is stored with reference to original data
Semantically similar vectors are organized close in space

2. Search

Query is converted to numerical vector (embedding) using the same ML model
Optimized algorithms calculate distances between vectors
System returns original data from closest vectors

Figure 1: Vector Database operation flow: Indexing (1 and 2) and Query (3 and 4).

Data vectorization (embedding generation) is done once, during indexing, and vectors are stored in the vector database.

During query, the search text is vectorized and the vector database calculates similarity between the search vector and stored vectors.

Use Cases

Application	Description	Examples
Semantic Search	Finds contextually relevant information regardless of exact keywords	Image search, article recommendation, document search
Recommendation Systems	Suggests similar items based on characteristics and preferences	Products, music, movies, content
ML/Deep Learning	Long-term memory for models - stores embeddings for fast query	Feature cache, transfer learning
LLMs and Generative AI	Provides additional context via RAG (Retrieval-Augmented Generation)	ChatGPT with external knowledge, specialized assistants

Advantages

Querying ML models directly for each search is expensive, slow, and not scalable. Vector databases solve this:

Advantage	Benefit
Efficiency	Data processed once during indexing; queries in milliseconds
Cost-Effectiveness	Eliminates continuous reprocessing by AI model
Scalability	Supports billions of vectors without performance loss
Persistent Memory	Stores knowledge accessible quickly and contextually

pgvector

Open-source extension that adds vector operations and similarity search support to PostgreSQL, allowing direct storage, indexing, and querying of vector data in the database.

Main features:

Efficient storage of dense vectors
Fast similarity search with multiple distance operators
Native integration with PostgreSQL query planner
Support for specialized indexing (IVFFlat and HNSW)

vector Data Type

pgvector introduces the vector(n) data type, where n represents the number of dimensions.

Characteristics:

Dimensionality: Through parameter n you can specify vector dimensionality (e.g., vector(1536) for OpenAI embedding models). When omitted, the column accepts vectors of any dimension.
Storage: Compact binary format in PostgreSQL data pages
Validation: If n is specified, ensures dimensional consistency across all entries

Usage Example
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding VECTOR(1536)  -- OpenAI text-embedding-ada-002
);

INSERT INTO documents (content, embedding)
VALUES ('Example text', '[0.1, 0.2, ..., 0.9]');

Operators for Distance Calculation Between Vectors

Similarity search calculates distance between vectors considering three main aspects:

Magnitude (size): Measure of vector length. Higher numerical value means greater magnitude. In 2D coordinates, for example, a vector with greater magnitude will be "longer" than one with lesser magnitude.
Direction (orientation): Direction the vector points. In 2D coordinates, a vector with horizontal (or vertical) direction points right (or down).
Alignment (angle between vectors): Angle between two vectors. Smaller angle means more aligned and greater similarity. 0° angle indicates perfectly aligned vectors (same direction), while 180° indicates opposite vectors.

In Figure 2, we see an example of 2D vectors considering the three aspects above.

Figure 2: Representation of vector similarity, considering Direction, Magnitude, and Alignment.

pgvector provides three main operators for distance calculation between vectors:

Operator	Name	Description	Recommended Use
`<->`	Euclidean Distance (L2)	Straight-line distance between vectors	Images, geographic coordinates
`<=>`	Cosine Distance	Angle between vectors (direction)	Text embeddings (most common)
`<#>`	Inner Product	Projection of one vector onto another	Recommendation systems

Let's see a practical example:

Similarity Search
-- Create table with 2D vectors
CREATE TABLE items (
    id SERIAL PRIMARY KEY,
    nome TEXT,
    categoria TEXT,
    embedding VECTOR(2)
);

-- Insert sample data
INSERT INTO items (nome, categoria, embedding) VALUES
('Man', 'Human Being', '[2, 8]'),
('Child', 'Human Being', '[3, 7]'),
('Dog', 'Domestic Animal', '[6, 7]'),
('Cat', 'Domestic Animal', '[7, 7.5]');

-- Search for the 3 items closest to 'Man' vector ([2, 8]) using Euclidean Distance
SELECT nome, categoria, embedding <-> '[2, 8]' AS distancia
FROM items
ORDER BY distancia
LIMIT 3;

Result:

nome	categoria	distancia
Man	Human Being	0
Child	Human Being	1.4142
Dog	Domestic Animal	4.1231

Indexes for Vector Search

To avoid sequential searches in tables with millions of vectors, pgvector offers ANN (Approximate Nearest Neighbor) indexes that significantly speed up similarity searches.

IVFFlat (Inverted File with Flat Compression)

Clustering-based index that partitions vector space using k-means. During query, searches only in clusters closest to input vector.

Characteristics:

Requires training phase during creation, i.e., prior vector insertion
Approximate search with precision/speed trade-off
Lower memory usage and fast construction

IVFFlat Example
-- Creation (lists ≈ sqrt(total_rows))
CREATE INDEX items_embedding_idx ON items
USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);

-- Search configuration
SET ivfflat.probes = 10;  -- Number of clusters to check

-- Query
SELECT id, content FROM items
ORDER BY embedding <-> '[1, 2, 4]' LIMIT 5;

Parameters:

Parameter	Description	Recommendation
`lists`	Number of clusters	`sqrt(rows)` for large datasets
`probes`	Clusters checked in search	10 (adjust for precision/speed)

HNSW (Hierarchical Navigable Small World)

Hierarchical graph index that navigates multiple layers to find close neighbors. Offers higher precision than IVFFlat.

Characteristics:

No training phase
High precision (recall > 95%)
Higher memory usage and slower construction

HNSW Example
-- Creation
CREATE INDEX items_embedding_hnsw_idx ON items
USING hnsw (embedding vector_l2_ops)
WITH (m = 16, ef_construction = 64);

-- Search configuration
SET hnsw.ef_search = 40;  -- Dynamic list during query

-- Query
SELECT id, content FROM items
ORDER BY embedding <-> '[1, 2, 4]' LIMIT 5;

Parameters:

Parameter	Description	Default Value	Recommendation
`m`	Connections per layer	16	16-32
`ef_construction`	Construction quality	64	64-200
`ef_search`	Search precision	40	40-400

Index Comparison

Criterion	IVFFlat	HNSW
Precision	Moderate (80-95%)	High (>95%)
Speed	Fast	Fast and consistent
Memory	Moderate	High
Construction	Fast	Slow
Recommended use	Large datasets, prototyping	Production, high precision

Source: https://github.com/pgvector/pgvector

pgembed

While extensions like pgvector prepare PostgreSQL to store and query vector data, the process of generating those vectors — vectorization (embedding) — still needs to be executed separately.

Normally, this process requires an application or pipeline to communicate with an external Machine Learning model, such as those offered by OpenAI, or locally with models available in Ollama. This may require broader technical knowledge from the data team or involvement of various professionals with varied expertise.

In many cases, projects focused on validating the Minimum Viable Product (MVP) choose to develop functions in PostgreSQL itself that consume vectorization services and facilitate this activity.

However, it's common for best practices, such as using batch data to reduce network traffic, memory consumption management, security configurations, and resilience mechanisms, to be neglected in this approach.

Aiming to provide a set of functions for data vectorization in PostgreSQL, focusing on performance, security, and resilience, Tecnisys developed the pgembed extension.

Vectorization Functions

After installing and creating the pgembed extension, various functions for data vectorization will be available in the pgembed schema of the database, organized by embedding API provider and input data type:

Provider	Single Text Line	Row Set	Table
Ollama (local)	`embed_ollama`	`embed_batch_ollama`	`embed_table_ollama`
OpenAI	`embed_openai`	`embed_batch_openai`	`embed_table_openai`
Custom APIs	`embed_custom`	`embed_batch_custom`	`embed_table_custom`

Function Types:

Single Text Line (embed_*): Generates embedding for a single text line. For example:
```
SELECT pgembed.embed_openai('Hello, world!', 'text-embedding-3-small');
```

Row Set (embed_batch_*): Processes multiple text lines and returns a list of vectors. For example:

SELECT * FROM pgembed.embed_batch_openai(ARRAY['Hello, world!', 'Hello, PostgreSQL!'], 'text-embedding-3-small');

Table (embed_table_*): Reads text column from a table and automatically updates corresponding embeddings column, eliminating need for manual update operations. For example:

SELECT * FROM pgembed.embed_table_openai(
  'public',                      -- schema
  'tb_documents',                -- table name
  'content',                     -- content column
  'embedding',                   -- embedding column
  FALSE,                         -- regenerate: only update NULL embeddings (default: FALSE)
  1000,                          -- batch_size
  'text-embedding-3-small'       -- model
);

Advanced Parameters

Besides mandatory parameters, pgembed extension functions have several extremely useful advanced parameters, such as:

Request timeout (timeout, default 60s)
Whether SSL certificate should be verified (verify_ssl, default TRUE)
Whether text should be truncated if it exceeds context size (truncate, default TRUE, for Ollama models)
JSON for advanced options for each model (options, for Ollama models)
Number of dimensions of generated embeddings (dimensions, for OpenAI models)
Encoding format of returned embeddings (encode_format, for OpenAI models)

Security and Resilience

Regarding security and resilience, the pgembed extension provides parameters that can be configured at user, session, and database level, such as:

pgembed.openai_api_key or pgembed.custom_api_key to define API key used in service request
pgembed.url_allowlist to define list of authorized URLs. By default: localhost, 127.0.0.1, 0.0.0.0 (any port), *.openai.com and api.openai.com
pgembed.max_retries, pgembed.initial_backoff_ms, pgembed.max_backoff_ms for retry control
pgembed.circuit_breaker_threshold and pgembed.circuit_breaker_reset_timeout_s to prevent cascading failures from temporarily blocking requests to unavailable services

Source: https://github.com/tecnisys/pgembed

Fundamental Concepts​

How It Works​

Use Cases​

Advantages​

pgvector​

vector Data Type​

Operators for Distance Calculation Between Vectors​

Indexes for Vector Search​

IVFFlat (Inverted File with Flat Compression)​

HNSW (Hierarchical Navigable Small World)​

Index Comparison​

pgembed​

Vectorization Functions​

Advanced Parameters​

Security and Resilience​

Fundamental Concepts

How It Works

Use Cases

Advantages

pgvector

vector Data Type

Operators for Distance Calculation Between Vectors

Indexes for Vector Search

IVFFlat (Inverted File with Flat Compression)

HNSW (Hierarchical Navigable Small World)

Index Comparison

pgembed

Vectorization Functions

Advanced Parameters

Security and Resilience