Skip to main content

ClickHouse

Columnar OLAP Database

ClickHouse icon
This component is available only for TDP Kubernetes from v3.0.0.

ClickHouse is an open source, high-performance columnar database management system (DBMS) designed for online analytical processing (OLAP). Originally developed by Yandex for the Yandex.Metrica web analytics service, ClickHouse is capable of processing billions of rows and tens of gigabytes of data per second on a single server.

In TDP Kubernetes, ClickHouse is a component exclusive to the Kubernetes edition, offering a real-time data analytics solution complementary to Apache Superset and Trino.

Why ClickHouse?

ClickHouse excels in high-volume analytical scenarios for several reasons:

  • Extreme performance: analytical queries over billions of records in milliseconds
  • Columnar storage: efficient compression and optimized reads for queries that access few columns
  • Linear scalability: distribution of data and queries across multiple cluster nodes
  • SQL compatible: supports a rich SQL syntax, making it easy for analysts and data engineers to adopt
  • Real-time ingestion: capable of inserting millions of records per second
  • Smart compression: compression algorithms adapted to each column's data type

Columnar Storage vs. Row Storage

The main difference between ClickHouse and traditional databases (such as PostgreSQL or MySQL) lies in the storage model:

CharacteristicRow StorageColumnar Storage
OrganizationData stored row by rowData stored column by column
ReadsReads all columns of a rowReads only the required columns
CompressionModerate (heterogeneous data)High (homogeneous data per column)
Best forOLTP (transactions)OLAP (analytics)
ExamplePostgreSQL, MySQLClickHouse, Apache Druid

In typical analytical scenarios, where queries access few columns from tables with millions or billions of rows, columnar storage is significantly more efficient.

ClickHouse Architecture

MergeTree Engine

The MergeTree is ClickHouse's primary storage engine. It organizes data into parts that are periodically merged in the background, optimizing compression and read performance.

MergeTree variants include:

  • ReplacingMergeTree: removes duplicate rows based on a sorting key
  • SummingMergeTree: automatically aggregates numeric values during merges
  • AggregatingMergeTree: maintains pre-computed aggregation states
  • CollapsingMergeTree: supports logical deletion of records through collapsing

Replication and Sharding

ClickHouse supports native replication and sharding for high availability and scalability:

  • Replication: uses ZooKeeper (or ClickHouse Keeper) to coordinate replicas, ensuring eventual data consistency
  • Sharding: distributes data across multiple nodes using configurable distribution keys

In TDP Kubernetes, ClickHouse is deployed with replication enabled by default, ensuring high availability of analytical data.

Use Cases in TDP

Log and Metrics Analysis

ClickHouse is ideal for storing and querying large volumes of logs and operational metrics, offering:

  • Real-time event ingestion
  • Configurable retention via TTL (Time-To-Live)
  • Ad-hoc queries with millisecond latency

Real-Time Dashboards

Integrated with Apache Superset, ClickHouse provides the data layer for interactive dashboards with real-time updates, especially useful for:

  • Data pipeline monitoring
  • Real-time business KPIs
  • Time series analysis

Data Warehousing

ClickHouse can be used as a data warehouse for complex analytical queries, complementing Trino in scenarios where response latency is critical.

SQL in ClickHouse

ClickHouse supports a rich and familiar SQL syntax, including:

-- Criar tabela com engine MergeTree
CREATE TABLE eventos (
data DateTime,
usuario_id UInt64,
evento String,
valor Float64
) ENGINE = MergeTree()
ORDER BY (data, usuario_id);

-- Inserir dados
INSERT INTO eventos VALUES
('2024-01-01 10:00:00', 1001, 'login', 0),
('2024-01-01 10:05:00', 1001, 'compra', 150.00);

-- Consulta analítica com agregação
SELECT
toDate(data) AS dia,
count() AS total_eventos,
countIf(evento = 'compra') AS total_compras,
sum(valor) AS valor_total
FROM eventos
WHERE data >= '2024-01-01'
GROUP BY dia
ORDER BY dia;

Specialized Functions

ClickHouse offers an extensive library of optimized functions:

  • Date/time functions: toStartOfHour(), toMonday(), dateDiff()
  • Aggregation functions: quantile(), uniq(), groupArray()
  • Array functions: arrayJoin(), arrayMap(), arrayFilter()
  • Approximate functions: uniqHLL12(), quantileTDigest() for fast estimations

Integration with the TDP Ecosystem

In TDP Kubernetes, ClickHouse integrates with the following components:

ComponentIntegration
Apache SupersetData source for dashboards and visualizations
CloudBeaverAdministration and SQL queries via web interface
Apache NiFiData ingestion from various sources into ClickHouse
TrinoFederated queries involving data in ClickHouse
Apache AirflowOrchestration of load and transformation pipelines

Best Practices

  • Choose the sorting key carefully: MergeTree's ORDER BY determines how data is physically organized; choose columns that will be frequently filtered
  • Use appropriate data types: prefer UInt32 over Int64 when possible; use LowCardinality(String) for columns with few distinct values
  • Configure TTL for retention: use TTL data + INTERVAL 90 DAY to automatically remove old data
  • Avoid SELECT *: query only the necessary columns to take advantage of columnar storage
  • Use materialized views: to pre-compute frequent aggregations and speed up dashboards

ClickHouse Project Details

ClickHouse was developed predominantly in C++, leveraging the language's native performance for low-level operations and vectorized processing of columnar data.

ClickHouse Resources

Source(s):