ClickHouse
Columnar OLAP Database
ClickHouse is an open source, high-performance columnar database management system (DBMS) designed for online analytical processing (OLAP). Originally developed by Yandex for the Yandex.Metrica web analytics service, ClickHouse is capable of processing billions of rows and tens of gigabytes of data per second on a single server.
In TDP Kubernetes, ClickHouse is a component exclusive to the Kubernetes edition, offering a real-time data analytics solution complementary to Apache Superset and Trino.
Why ClickHouse?
ClickHouse excels in high-volume analytical scenarios for several reasons:
- Extreme performance: analytical queries over billions of records in milliseconds
- Columnar storage: efficient compression and optimized reads for queries that access few columns
- Linear scalability: distribution of data and queries across multiple cluster nodes
- SQL compatible: supports a rich SQL syntax, making it easy for analysts and data engineers to adopt
- Real-time ingestion: capable of inserting millions of records per second
- Smart compression: compression algorithms adapted to each column's data type
Columnar Storage vs. Row Storage
The main difference between ClickHouse and traditional databases (such as PostgreSQL or MySQL) lies in the storage model:
| Characteristic | Row Storage | Columnar Storage |
|---|---|---|
| Organization | Data stored row by row | Data stored column by column |
| Reads | Reads all columns of a row | Reads only the required columns |
| Compression | Moderate (heterogeneous data) | High (homogeneous data per column) |
| Best for | OLTP (transactions) | OLAP (analytics) |
| Example | PostgreSQL, MySQL | ClickHouse, Apache Druid |
In typical analytical scenarios, where queries access few columns from tables with millions or billions of rows, columnar storage is significantly more efficient.
ClickHouse Architecture
MergeTree Engine
The MergeTree is ClickHouse's primary storage engine. It organizes data into parts that are periodically merged in the background, optimizing compression and read performance.
MergeTree variants include:
- ReplacingMergeTree: removes duplicate rows based on a sorting key
- SummingMergeTree: automatically aggregates numeric values during merges
- AggregatingMergeTree: maintains pre-computed aggregation states
- CollapsingMergeTree: supports logical deletion of records through collapsing
Replication and Sharding
ClickHouse supports native replication and sharding for high availability and scalability:
- Replication: uses ZooKeeper (or ClickHouse Keeper) to coordinate replicas, ensuring eventual data consistency
- Sharding: distributes data across multiple nodes using configurable distribution keys
In TDP Kubernetes, ClickHouse is deployed with replication enabled by default, ensuring high availability of analytical data.
Use Cases in TDP
Log and Metrics Analysis
ClickHouse is ideal for storing and querying large volumes of logs and operational metrics, offering:
- Real-time event ingestion
- Configurable retention via TTL (Time-To-Live)
- Ad-hoc queries with millisecond latency
Real-Time Dashboards
Integrated with Apache Superset, ClickHouse provides the data layer for interactive dashboards with real-time updates, especially useful for:
- Data pipeline monitoring
- Real-time business KPIs
- Time series analysis
Data Warehousing
ClickHouse can be used as a data warehouse for complex analytical queries, complementing Trino in scenarios where response latency is critical.
SQL in ClickHouse
ClickHouse supports a rich and familiar SQL syntax, including:
-- Criar tabela com engine MergeTree
CREATE TABLE eventos (
data DateTime,
usuario_id UInt64,
evento String,
valor Float64
) ENGINE = MergeTree()
ORDER BY (data, usuario_id);
-- Inserir dados
INSERT INTO eventos VALUES
('2024-01-01 10:00:00', 1001, 'login', 0),
('2024-01-01 10:05:00', 1001, 'compra', 150.00);
-- Consulta analítica com agregação
SELECT
toDate(data) AS dia,
count() AS total_eventos,
countIf(evento = 'compra') AS total_compras,
sum(valor) AS valor_total
FROM eventos
WHERE data >= '2024-01-01'
GROUP BY dia
ORDER BY dia;
Specialized Functions
ClickHouse offers an extensive library of optimized functions:
- Date/time functions:
toStartOfHour(),toMonday(),dateDiff() - Aggregation functions:
quantile(),uniq(),groupArray() - Array functions:
arrayJoin(),arrayMap(),arrayFilter() - Approximate functions:
uniqHLL12(),quantileTDigest()for fast estimations
Integration with the TDP Ecosystem
In TDP Kubernetes, ClickHouse integrates with the following components:
| Component | Integration |
|---|---|
| Apache Superset | Data source for dashboards and visualizations |
| CloudBeaver | Administration and SQL queries via web interface |
| Apache NiFi | Data ingestion from various sources into ClickHouse |
| Trino | Federated queries involving data in ClickHouse |
| Apache Airflow | Orchestration of load and transformation pipelines |
Best Practices
- Choose the sorting key carefully: MergeTree's
ORDER BYdetermines how data is physically organized; choose columns that will be frequently filtered - Use appropriate data types: prefer
UInt32overInt64when possible; useLowCardinality(String)for columns with few distinct values - Configure TTL for retention: use
TTL data + INTERVAL 90 DAYto automatically remove old data - Avoid
SELECT *: query only the necessary columns to take advantage of columnar storage - Use materialized views: to pre-compute frequent aggregations and speed up dashboards
ClickHouse Project Details
ClickHouse was developed predominantly in C++, leveraging the language's native performance for low-level operations and vectorized processing of columnar data.
ClickHouse Resources
Source(s):