
Delta Lake Configuration

The tdp-deltalake chart packages maintenance jobs for Delta Lake 4.0.0 tables, using Apache Spark 4.0.0.

What is Delta Lake and why does it need maintenance?

Delta Lake is an open table format that adds ACID transactions to object storage (S3/Ozone).

Instead of writing Parquet files directly, Delta Lake maintains a transaction log — a history of all operations (INSERT, UPDATE, DELETE, MERGE) performed on the table.

Over time, this model generates two types of accumulation that require periodic cleanup:

  1. Orphan files: Parquet files that exist in S3 but are no longer referenced by any active version of the table (result of aborted operations or old versions)
  2. Old versions: the transaction log retains history indefinitely; without cleanup, it grows without limit
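As a rough illustration (file names and paths are hypothetical), a Delta table in object storage is a directory of Parquet data files plus a `_delta_log/` directory of JSON commit files:

```text
s3a://warehouse/delta/table/
├── _delta_log/
│   ├── 00000000000000000000.json   # commit 0 (e.g. CREATE + INSERT)
│   ├── 00000000000000000001.json   # commit 1 (e.g. MERGE)
│   └── ...
├── part-00000-....parquet          # data file referenced by the current version
└── part-00001-....parquet          # may become an orphan after a rewrite
```

Operations such as UPDATE and MERGE rewrite data files rather than modifying them in place, which is how unreferenced Parquet files accumulate over time.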

The tdp-deltalake chart is not a continuous application — it creates Kubernetes CronJobs that run periodically (like a Linux cron, but managed by Kubernetes) to perform this maintenance automatically.

Learn more

See Delta Lake — Concepts for a complete overview of the format, ACID transactions, and use cases.

Overview

This chart provides CronJobs for scheduled maintenance of Delta Lake tables:

  • VACUUM — removes old files no longer referenced by the transaction log
  • OPTIMIZE — compacts small files into larger files
  • OPTIMIZE Z-ORDER — optimizes the data layout for queries with specific filters
  • DESCRIBE HISTORY — records the change history for auditing

The chart uses the upstream Spark chart as a dependency to provide the runtime used by the maintenance jobs.
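For example, the history job essentially runs a statement like the following (table path illustrative); each returned row records a version number, timestamp, operation, and operation metrics:

```sql
DESCRIBE HISTORY delta.`s3a://warehouse/delta/table`;
```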

Compatibility

| Component | Version |
| --- | --- |
| Spark | 4.0.0 |
| Delta Lake | 4.0.0 |
| Scala | 2.13 |

Installation

```bash
helm upgrade --install <release> \
  oci://registry.tecnisys.com.br/tdp/charts/tdp-deltalake \
  -n <namespace> --create-namespace
```

Other OCI registries may exist in your organization; use the internal chart catalog or the instructions provided by Tecnisys.

Maintenance jobs

Configure the maintenance jobs under maintenance.jobs:

```yaml
maintenance:
  enabled: true
  jobs:
    vacuum:
      enabled: true
      schedule: "0 2 * * *"   # Daily at 2am
      retentionHours: 168     # 7 days

    optimize:
      enabled: true
      schedule: "0 3 * * 0"   # Weekly on Sundays at 3am

    optimizeZOrder:
      enabled: false
      schedule: "0 1 * * 6"   # Saturdays at 1am
```

VACUUM

Removes files that are no longer referenced by the Delta Log:

```sql
VACUUM delta.`s3a://warehouse/delta/table` RETAIN 168 HOURS;
```

Important

The default retention period is 7 days (168 hours). Do not reduce it below 7 days in production, as doing so may cause data loss for in-progress queries.
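To preview which files VACUUM would delete without actually removing anything, Delta Lake's SQL syntax supports a dry run (table path illustrative):

```sql
VACUUM delta.`s3a://warehouse/delta/table` RETAIN 168 HOURS DRY RUN;
```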

OPTIMIZE

Compacts small files into larger files to improve read performance:

```sql
OPTIMIZE delta.`s3a://warehouse/delta/table`;
```

OPTIMIZE with Z-ORDER

Optimizes the data layout for queries with specific filters:

```sql
OPTIMIZE delta.`s3a://warehouse/delta/table` ZORDER BY (column1, column2);
```

Spark configuration

Configure the Spark runtime used by the maintenance jobs:

```yaml
maintenance:
  spark:
    image:
      repository: "docker.io/bitnamilegacy/spark"
      tag: "4.0.0-debian-12-r0"
    resources:
      requests:
        cpu: 1
        memory: 2Gi
      limits:
        cpu: 2
        memory: 4Gi
```

Enabling the Spark dependency

The Spark dependency is controlled via maintenance.spark.enabled:

```yaml
maintenance:
  spark:
    enabled: true
```

S3/MinIO configuration

The CronJobs require a Secret named minio-credentials with the following keys:

  • access-key
  • secret-key

Create the Secret

```bash
kubectl -n <namespace> create secret generic minio-credentials \
  --from-literal=access-key='<ACCESS_KEY>' \
  --from-literal=secret-key='<SECRET_KEY>'
```

S3 endpoint configuration

```yaml
maintenance:
  spark:
    config:
      "spark.hadoop.fs.s3a.endpoint": "http://<s3-host>.<namespace>.svc.cluster.local:9000"
      "spark.hadoop.fs.s3a.path.style.access": "true"
```

TDP integration

This chart is part of TDP and can be integrated with:

  • tdp-trino — SQL queries on Delta Lake tables
  • tdp-spark — Spark processing
  • tdp-airflow — pipeline orchestration

Main parameters

| Parameter | Description | Default |
| --- | --- | --- |
| `maintenance.enabled` | Enable maintenance jobs | `true` |
| `maintenance.spark.enabled` | Enable Spark dependency | `true` |
| `maintenance.spark.image.repository` | Spark image | `docker.io/bitnamilegacy/spark` |
| `maintenance.spark.image.tag` | Spark image tag | `4.0.0-debian-12-r0` |
| `maintenance.jobs.vacuum.enabled` | Enable VACUUM job | `true` |
| `maintenance.jobs.vacuum.schedule` | VACUUM cron schedule | `0 2 * * *` |
| `maintenance.jobs.vacuum.retentionHours` | VACUUM retention (hours) | `168` |
| `maintenance.jobs.optimize.enabled` | Enable OPTIMIZE job | `true` |
| `maintenance.jobs.optimize.schedule` | OPTIMIZE cron schedule | `0 3 * * 0` |
| `maintenance.jobs.optimizeZOrder.enabled` | Enable OPTIMIZE Z-ORDER | `false` |