# Delta Lake Configuration
The tdp-deltalake chart packages maintenance jobs for Delta Lake 4.0.0 tables, using Apache Spark 4.0.0.
## What is Delta Lake and why does it need maintenance?
Delta Lake is an open table format that adds ACID transactions to object storage (S3/Ozone).
Instead of writing Parquet files directly, Delta Lake maintains a transaction log — a history of all operations (INSERT, UPDATE, DELETE, MERGE) performed on the table.
Over time, this model generates two types of accumulation that require periodic cleanup:
- Orphan files: Parquet files that exist in S3 but are no longer referenced by any active version of the table (result of aborted operations or old versions)
- Old versions: the transaction log retains history indefinitely; without cleanup, it grows without limit
The tdp-deltalake chart is not a continuous application — it creates Kubernetes CronJobs that run periodically (like a Linux cron, but managed by Kubernetes) to perform this maintenance automatically.
See Delta Lake — Concepts for a complete overview of the format, ACID transactions, and use cases.
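As an illustration of the model above, the transaction log lives in a `_delta_log/` directory next to the table's Parquet files, so both can be inspected directly on the object store. The sketch below uses the MinIO client (`mc`); the alias and bucket names are hypothetical — adjust to your environment, or use any S3-compatible listing tool:

```shell
# Data files (Parquet) at the table root; some may be orphans
mc ls minio/warehouse/delta/table/

# Transaction log: one JSON commit file per table version,
# plus periodic Parquet checkpoints
mc ls minio/warehouse/delta/table/_delta_log/
```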
## Overview
This chart provides CronJobs for scheduled maintenance of Delta Lake tables:
- VACUUM — removes old files no longer referenced by the transaction log
- OPTIMIZE — compacts small files into larger files
- OPTIMIZE Z-ORDER — optimizes the data layout for queries with specific filters
- DESCRIBE HISTORY — records the change history for auditing
The chart uses the upstream Spark chart as a dependency to provide the runtime used by the maintenance jobs.
## Compatibility
| Component | Version |
|---|---|
| Spark | 4.0.0 |
| Delta Lake | 4.0.0 |
| Scala | 2.13 |
## Installation

```shell
helm upgrade --install <release> \
  oci://registry.tecnisys.com.br/tdp/charts/tdp-deltalake \
  -n <namespace> --create-namespace
```
Other OCI registries may exist in your organization; use the internal chart catalog or the instructions provided by Tecnisys.
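After installation, you can confirm that the chart created its scheduled jobs. This is a sketch — the exact resource names depend on the chart templates:

```shell
# Check the release status and the CronJobs it created
helm -n <namespace> status <release>
kubectl -n <namespace> get cronjobs
```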
## Maintenance jobs

Configure the maintenance jobs under maintenance.jobs:

```yaml
maintenance:
  enabled: true
  jobs:
    vacuum:
      enabled: true
      schedule: "0 2 * * *"    # Daily at 2am
      retentionHours: 168      # 7 days
    optimize:
      enabled: true
      schedule: "0 3 * * 0"    # Weekly on Sundays at 3am
    optimizeZOrder:
      enabled: false
      schedule: "0 1 * * 6"    # Saturdays at 1am
```
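Schedules use the standard five-field cron format. To test a job without waiting for its next window, Kubernetes can run a CronJob on demand; the CronJob name below is an assumption — check `kubectl get cronjobs` for the actual name:

```shell
# Trigger the VACUUM CronJob once, outside its schedule
kubectl -n <namespace> create job vacuum-manual \
  --from=cronjob/<release>-vacuum
kubectl -n <namespace> logs job/vacuum-manual -f
```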
### VACUUM

Removes files that are no longer referenced by the Delta Log:

```sql
VACUUM delta.`s3a://warehouse/delta/table` RETAIN 168 HOURS;
```
The default retention period is 7 days (168 hours). Do not reduce it below 7 days in production, as doing so may cause data loss for in-progress queries.
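Before lowering the retention, or on a first run over a large table, Delta Lake's DRY RUN option lists the files that would be removed without deleting anything:

```sql
VACUUM delta.`s3a://warehouse/delta/table` RETAIN 168 HOURS DRY RUN;
```

Note that retentionHours is expressed in hours: a 14-day retention, for example, corresponds to 336.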
### OPTIMIZE

Compacts small files into larger files to improve read performance:

```sql
OPTIMIZE delta.`s3a://warehouse/delta/table`;
```
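On large tables, compaction can be limited to recent partitions with a WHERE clause on partition columns; the event_date partition column below is hypothetical:

```sql
OPTIMIZE delta.`s3a://warehouse/delta/table`
WHERE event_date >= current_date() - INTERVAL 7 DAYS;
```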
### OPTIMIZE with Z-ORDER

Optimizes the data layout for queries with specific filters:

```sql
OPTIMIZE delta.`s3a://warehouse/delta/table` ZORDER BY (column1, column2);
```
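The fourth job listed in the overview, DESCRIBE HISTORY, is based on the Delta statement of the same name, which returns one row per table version for auditing:

```sql
DESCRIBE HISTORY delta.`s3a://warehouse/delta/table`;
```

Each row includes the version number, timestamp, operation (WRITE, MERGE, OPTIMIZE, VACUUM, ...), and operation metrics.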
## Spark configuration

Configure the Spark runtime used by the maintenance jobs:

```yaml
maintenance:
  spark:
    image:
      repository: "docker.io/bitnamilegacy/spark"
      tag: "4.0.0-debian-12-r0"
    resources:
      requests:
        cpu: 1
        memory: 2Gi
      limits:
        cpu: 2
        memory: 4Gi
```
### Enabling the Spark dependency

The Spark dependency is controlled via maintenance.spark.enabled:

```yaml
maintenance:
  spark:
    enabled: true
```
## S3/MinIO configuration

The CronJobs require a Secret named minio-credentials with the following keys:

- access-key
- secret-key
### Create the Secret

```shell
kubectl -n <namespace> create secret generic minio-credentials \
  --from-literal=access-key='<ACCESS_KEY>' \
  --from-literal=secret-key='<SECRET_KEY>'
```
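To confirm that the Secret exposes exactly the two expected keys (values are shown base64-encoded):

```shell
kubectl -n <namespace> get secret minio-credentials -o jsonpath='{.data}'
```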
### S3 endpoint configuration

```yaml
maintenance:
  spark:
    config:
      "spark.hadoop.fs.s3a.endpoint": "http://<s3-host>.<namespace>.svc.cluster.local:9000"
      "spark.hadoop.fs.s3a.path.style.access": "true"
```
## TDP integration
This chart is part of TDP and can be integrated with:
- tdp-trino — SQL queries on Delta Lake tables
- tdp-spark — Spark processing
- tdp-airflow — pipeline orchestration
## Main parameters

| Parameter | Description | Default |
|---|---|---|
| maintenance.enabled | Enable maintenance jobs | true |
| maintenance.spark.enabled | Enable Spark dependency | true |
| maintenance.spark.image.repository | Spark image | docker.io/bitnamilegacy/spark |
| maintenance.spark.image.tag | Spark image tag | 4.0.0-debian-12-r0 |
| maintenance.jobs.vacuum.enabled | Enable VACUUM job | true |
| maintenance.jobs.vacuum.schedule | VACUUM cron schedule | 0 2 * * * |
| maintenance.jobs.vacuum.retentionHours | VACUUM retention (hours) | 168 |
| maintenance.jobs.optimize.enabled | Enable OPTIMIZE job | true |
| maintenance.jobs.optimize.schedule | OPTIMIZE cron schedule | 0 3 * * 0 |
| maintenance.jobs.optimizeZOrder.enabled | Enable OPTIMIZE Z-ORDER | false |