Skip to main content
THIS COMPONENT IS AVAILABLE STARTING FROM VERSION 4.1.0 OF PGSYS.

Vector Databases

Specialized system for storing, managing, and querying high-dimensionality data (vectors/embeddings). Unlike traditional databases that search for exact matches, vector databases find data by similarity, being fundamental for AI applications such as semantic search, recommendation systems, and LLMs.

Fundamental Concepts

Vector: List of numbers representing coordinates in multidimensional space.

  • Example: [12, 13, 19, 8, 9]
  • Just as (x, y) defines a 2D point, vectors define points in spaces of hundreds or thousands of dimensions

Embedding: Numerical vector representation of data (text, image, audio) generated by Machine Learning models.

  • Semantically similar data generates embeddings close in vector space
  • Example: "weather forecast" and "will it rain?" have numerically close embeddings

How It Works

The similarity search process involves two stages:

1. Indexing

  • Raw data (text, image, product) is processed by ML model
  • Model generates numerical vector (embedding) capturing data characteristics
  • Vector is stored with reference to original data
  • Semantically similar vectors are organized close in space

2. Search

  • Query is converted to numerical vector (embedding) using the same ML model
  • Optimized algorithms calculate distances between vectors
  • System returns original data from closest vectors
Figure 1: Vector Database operation flow: Indexing (1 and 2) and Query (3 and 4).
Figure 1: Vector Database operation flow: Indexing (1 and 2) and Query (3 and 4).

Data vectorization (embedding generation) is done once, during indexing, and vectors are stored in the vector database.

During query, the search text is vectorized and the vector database calculates similarity between the search vector and stored vectors.

Use Cases

ApplicationDescriptionExamples
Semantic SearchFinds contextually relevant information regardless of exact keywordsImage search, article recommendation, document search
Recommendation SystemsSuggests similar items based on characteristics and preferencesProducts, music, movies, content
ML/Deep LearningLong-term memory for models - stores embeddings for fast queryFeature cache, transfer learning
LLMs and Generative AIProvides additional context via RAG (Retrieval-Augmented Generation)ChatGPT with external knowledge, specialized assistants

Advantages

Querying ML models directly for each search is expensive, slow, and not scalable. Vector databases solve this:

AdvantageBenefit
EfficiencyData processed once during indexing; queries in milliseconds
Cost-EffectivenessEliminates continuous reprocessing by AI model
ScalabilitySupports billions of vectors without performance loss
Persistent MemoryStores knowledge accessible quickly and contextually

pgvector

Open-source extension that adds vector operations and similarity search support to PostgreSQL, allowing direct storage, indexing, and querying of vector data in the database.

Main features:

  • Efficient storage of dense vectors
  • Fast similarity search with multiple distance operators
  • Native integration with PostgreSQL query planner
  • Support for specialized indexing (IVFFlat and HNSW)

vector Data Type

pgvector introduces the vector(n) data type, where n represents the number of dimensions.

Characteristics:

  • Dimensionality: Through parameter n you can specify vector dimensionality (e.g., vector(1536) for OpenAI embedding models). When omitted, the column accepts vectors of any dimension.
  • Storage: Compact binary format in PostgreSQL data pages
  • Validation: If n is specified, ensures dimensional consistency across all entries
Usage Example
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
content TEXT,
embedding VECTOR(1536) -- OpenAI text-embedding-ada-002
);

INSERT INTO documents (content, embedding)
VALUES ('Example text', '[0.1, 0.2, ..., 0.9]');

Operators for Distance Calculation Between Vectors

Similarity search calculates distance between vectors considering three main aspects:

  • Magnitude (size): Measure of vector length. Higher numerical value means greater magnitude. In 2D coordinates, for example, a vector with greater magnitude will be "longer" than one with lesser magnitude.

  • Direction (orientation): Direction the vector points. In 2D coordinates, a vector with horizontal (or vertical) direction points right (or down).

  • Alignment (angle between vectors): Angle between two vectors. Smaller angle means more aligned and greater similarity. 0° angle indicates perfectly aligned vectors (same direction), while 180° indicates opposite vectors.

In Figure 2, we see an example of 2D vectors considering the three aspects above.

Figure 2: Representation of vector similarity, considering Direction, Magnitude, and Alignment.
Figure 2: Representation of vector similarity, considering Direction, Magnitude, and Alignment.

pgvector provides three main operators for distance calculation between vectors:

OperatorNameDescriptionRecommended Use
<->Euclidean Distance (L2)Straight-line distance between vectorsImages, geographic coordinates
<=>Cosine DistanceAngle between vectors (direction)Text embeddings (most common)
<#>Inner ProductProjection of one vector onto anotherRecommendation systems

Let's see a practical example:

Similarity Search
-- Create table with 2D vectors
CREATE TABLE items (
id SERIAL PRIMARY KEY,
nome TEXT,
categoria TEXT,
embedding VECTOR(2)
);

-- Insert sample data
INSERT INTO items (nome, categoria, embedding) VALUES
('Man', 'Human Being', '[2, 8]'),
('Child', 'Human Being', '[3, 7]'),
('Dog', 'Domestic Animal', '[6, 7]'),
('Cat', 'Domestic Animal', '[7, 7.5]');

-- Search for the 3 items closest to 'Man' vector ([2, 8]) using Euclidean Distance
SELECT nome, categoria, embedding <-> '[2, 8]' AS distancia
FROM items
ORDER BY distancia
LIMIT 3;

Result:

nomecategoriadistancia
ManHuman Being0
ChildHuman Being1.4142
DogDomestic Animal4.1231

To avoid sequential searches in tables with millions of vectors, pgvector offers ANN (Approximate Nearest Neighbor) indexes that significantly speed up similarity searches.

IVFFlat (Inverted File with Flat Compression)

Clustering-based index that partitions vector space using k-means. During query, searches only in clusters closest to input vector.

Characteristics:

  • Requires training phase during creation, i.e., prior vector insertion
  • Approximate search with precision/speed trade-off
  • Lower memory usage and fast construction
IVFFlat Example
-- Creation (lists ≈ sqrt(total_rows))
CREATE INDEX items_embedding_idx ON items
USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);

-- Search configuration
SET ivfflat.probes = 10; -- Number of clusters to check

-- Query
SELECT id, content FROM items
ORDER BY embedding <-> '[1, 2, 4]' LIMIT 5;

Parameters:

ParameterDescriptionRecommendation
listsNumber of clusterssqrt(rows) for large datasets
probesClusters checked in search10 (adjust for precision/speed)

HNSW (Hierarchical Navigable Small World)

Hierarchical graph index that navigates multiple layers to find close neighbors. Offers higher precision than IVFFlat.

Characteristics:

  • No training phase
  • High precision (recall > 95%)
  • Higher memory usage and slower construction
HNSW Example
-- Creation
CREATE INDEX items_embedding_hnsw_idx ON items
USING hnsw (embedding vector_l2_ops)
WITH (m = 16, ef_construction = 64);

-- Search configuration
SET hnsw.ef_search = 40; -- Dynamic list during query

-- Query
SELECT id, content FROM items
ORDER BY embedding <-> '[1, 2, 4]' LIMIT 5;

Parameters:

ParameterDescriptionDefault ValueRecommendation
mConnections per layer1616-32
ef_constructionConstruction quality6464-200
ef_searchSearch precision4040-400

Index Comparison

CriterionIVFFlatHNSW
PrecisionModerate (80-95%)High (>95%)
SpeedFastFast and consistent
MemoryModerateHigh
ConstructionFastSlow
Recommended useLarge datasets, prototypingProduction, high precision

Source: https://github.com/pgvector/pgvector

pgembed

While extensions like pgvector prepare PostgreSQL to store and query vector data, the process of generating those vectors — vectorization (embedding) — still needs to be executed separately.

Normally, this process requires an application or pipeline to communicate with an external Machine Learning model, such as those offered by OpenAI, or locally with models available in Ollama. This may require broader technical knowledge from the data team or involvement of various professionals with varied expertise.

In many cases, projects focused on validating the Minimum Viable Product (MVP) choose to develop functions in PostgreSQL itself that consume vectorization services and facilitate this activity.

However, it's common for best practices, such as using batch data to reduce network traffic, memory consumption management, security configurations, and resilience mechanisms, to be neglected in this approach.

Aiming to provide a set of functions for data vectorization in PostgreSQL, focusing on performance, security, and resilience, Tecnisys developed the pgembed extension.

Vectorization Functions

After installing and creating the pgembed extension, various functions for data vectorization will be available in the pgembed schema of the database, organized by embedding API provider and input data type:

ProviderSingle Text LineRow SetTable
Ollama (local)embed_ollamaembed_batch_ollamaembed_table_ollama
OpenAIembed_openaiembed_batch_openaiembed_table_openai
Custom APIsembed_customembed_batch_customembed_table_custom

Function Types:

  • Single Text Line (embed_*): Generates embedding for a single text line. For example:
    SELECT pgembed.embed_openai('Hello, world!', 'text-embedding-3-small');
  • Row Set (embed_batch_*): Processes multiple text lines and returns a list of vectors. For example:
    SELECT * FROM pgembed.embed_batch_openai(ARRAY['Hello, world!', 'Hello, PostgreSQL!'], 'text-embedding-3-small');
  • Table (embed_table_*): Reads text column from a table and automatically updates corresponding embeddings column, eliminating need for manual update operations. For example:
    SELECT * FROM pgembed.embed_table_openai(
    'public', -- schema
    'tb_documents', -- table name
    'content', -- content column
    'embedding', -- embedding column
    FALSE, -- regenerate: only update NULL embeddings (default: FALSE)
    1000, -- batch_size
    'text-embedding-3-small' -- model
    );

Advanced Parameters

Besides mandatory parameters, pgembed extension functions have several extremely useful advanced parameters, such as:

  • Request timeout (timeout, default 60s)
  • Whether SSL certificate should be verified (verify_ssl, default TRUE)
  • Whether text should be truncated if it exceeds context size (truncate, default TRUE, for Ollama models)
  • JSON for advanced options for each model (options, for Ollama models)
  • Number of dimensions of generated embeddings (dimensions, for OpenAI models)
  • Encoding format of returned embeddings (encode_format, for OpenAI models)

Security and Resilience

Regarding security and resilience, the pgembed extension provides parameters that can be configured at user, session, and database level, such as:

  • pgembed.openai_api_key or pgembed.custom_api_key to define API key used in service request
  • pgembed.url_allowlist to define list of authorized URLs. By default: localhost, 127.0.0.1, 0.0.0.0 (any port), *.openai.com and api.openai.com
  • pgembed.max_retries, pgembed.initial_backoff_ms, pgembed.max_backoff_ms for retry control
  • pgembed.circuit_breaker_threshold and pgembed.circuit_breaker_reset_timeout_s to prevent cascading failures from temporarily blocking requests to unavailable services

Source: https://github.com/tecnisys/pgembed