ClickHouse

Columnar OLAP Database

ClickHouse is an open source, high-performance columnar database management system (DBMS) designed for online analytical processing (OLAP). Originally developed by Yandex for the Yandex.Metrica web analytics service, ClickHouse is capable of processing billions of rows and tens of gigabytes of data per second on a single server.

In TDP Kubernetes, ClickHouse is a component exclusive to the Kubernetes edition, offering a real-time data analytics solution complementary to Apache Superset and Trino.

Why ClickHouse?

ClickHouse excels in high-volume analytical scenarios for several reasons:

Extreme performance: analytical queries over billions of records in milliseconds
Columnar storage: efficient compression and optimized reads for queries that access few columns
Linear scalability: distribution of data and queries across multiple cluster nodes
SQL compatible: supports a rich SQL syntax, making it easy for analysts and data engineers to adopt
Real-time ingestion: capable of inserting millions of records per second
Smart compression: compression algorithms adapted to each column's data type

Columnar Storage vs. Row Storage

The main difference between ClickHouse and traditional databases (such as PostgreSQL or MySQL) lies in the storage model:

Characteristic	Row Storage	Columnar Storage
Organization	Data stored row by row	Data stored column by column
Reads	Reads all columns of a row	Reads only the required columns
Compression	Moderate (heterogeneous data)	High (homogeneous data per column)
Best for	OLTP (transactions)	OLAP (analytics)
Example	PostgreSQL, MySQL	ClickHouse, Apache Druid

In typical analytical scenarios, where queries access few columns from tables with millions or billions of rows, columnar storage is significantly more efficient.

ClickHouse Architecture

MergeTree Engine

The MergeTree is ClickHouse's primary storage engine. It organizes data into parts that are periodically merged in the background, optimizing compression and read performance.

MergeTree variants include:

ReplacingMergeTree: removes duplicate rows based on a sorting key
SummingMergeTree: automatically aggregates numeric values during merges
AggregatingMergeTree: maintains pre-computed aggregation states
CollapsingMergeTree: supports logical deletion of records through collapsing

Replication and Sharding

ClickHouse supports native replication and sharding for high availability and scalability:

Replication: uses ZooKeeper (or ClickHouse Keeper) to coordinate replicas, ensuring eventual data consistency
Sharding: distributes data across multiple nodes using configurable distribution keys

In TDP Kubernetes, ClickHouse is deployed with replication enabled by default, ensuring high availability of analytical data.

Use Cases in TDP

Log and Metrics Analysis

ClickHouse is ideal for storing and querying large volumes of logs and operational metrics, offering:

Real-time event ingestion
Configurable retention via TTL (Time-To-Live)
Ad-hoc queries with millisecond latency

Real-Time Dashboards

Integrated with Apache Superset, ClickHouse provides the data layer for interactive dashboards with real-time updates, especially useful for:

Data pipeline monitoring
Real-time business KPIs
Time series analysis

Data Warehousing

ClickHouse can be used as a data warehouse for complex analytical queries, complementing Trino in scenarios where response latency is critical.

SQL in ClickHouse

ClickHouse supports a rich and familiar SQL syntax, including:

-- Criar tabela com engine MergeTree
CREATE TABLE eventos (
    data DateTime,
    usuario_id UInt64,
    evento String,
    valor Float64
) ENGINE = MergeTree()
ORDER BY (data, usuario_id);

-- Inserir dados
INSERT INTO eventos VALUES
    ('2024-01-01 10:00:00', 1001, 'login', 0),
    ('2024-01-01 10:05:00', 1001, 'compra', 150.00);

-- Consulta analítica com agregação
SELECT
    toDate(data) AS dia,
    count() AS total_eventos,
    countIf(evento = 'compra') AS total_compras,
    sum(valor) AS valor_total
FROM eventos
WHERE data >= '2024-01-01'
GROUP BY dia
ORDER BY dia;

Specialized Functions

ClickHouse offers an extensive library of optimized functions:

Date/time functions: toStartOfHour(), toMonday(), dateDiff()
Aggregation functions: quantile(), uniq(), groupArray()
Array functions: arrayJoin(), arrayMap(), arrayFilter()
Approximate functions: uniqHLL12(), quantileTDigest() for fast estimations

Integration with the TDP Ecosystem

In TDP Kubernetes, ClickHouse integrates with the following components:

Component	Integration
Apache Superset	Data source for dashboards and visualizations
CloudBeaver	Administration and SQL queries via web interface
Apache NiFi	Data ingestion from various sources into ClickHouse
Trino	Federated queries involving data in ClickHouse
Apache Airflow	Orchestration of load and transformation pipelines

Best Practices

Choose the sorting key carefully: MergeTree's ORDER BY determines how data is physically organized; choose columns that will be frequently filtered
Use appropriate data types: prefer UInt32 over Int64 when possible; use LowCardinality(String) for columns with few distinct values
Configure TTL for retention: use TTL data + INTERVAL 90 DAY to automatically remove old data
Avoid SELECT *: query only the necessary columns to take advantage of columnar storage
Use materialized views: to pre-compute frequent aggregations and speed up dashboards

ClickHouse Project Details

ClickHouse was developed predominantly in C++, leveraging the language's native performance for low-level operations and vectorized processing of columnar data.

ClickHouse Resources

Source(s):

ClickHouse

Columnar OLAP Database

Why ClickHouse?​

Columnar Storage vs. Row Storage​

ClickHouse Architecture​

MergeTree Engine​

Replication and Sharding​

Use Cases in TDP​

Log and Metrics Analysis​

Real-Time Dashboards​

Data Warehousing​

SQL in ClickHouse​

Specialized Functions​

Integration with the TDP Ecosystem​

Best Practices​

ClickHouse Project Details​

ClickHouse Resources​