
Delta Lake Configuration

The tdp-deltalake chart packages maintenance jobs for Delta Lake 4.0.0 tables, using Apache Spark 4.0.0.

What is Delta Lake and why does it need maintenance?

Delta Lake is an open table format that adds ACID transactions to object storage (S3/Ozone).

Instead of writing Parquet files directly, Delta Lake maintains a transaction log — a history of all operations (INSERT, UPDATE, DELETE, MERGE) performed on the table.

Over time, this model generates two types of accumulation that require periodic cleanup:

  1. Orphan files: Parquet files that exist in S3 but are no longer referenced by any active version of the table (result of aborted operations or old versions)
  2. Old versions: the transaction log retains history indefinitely; without cleanup, it grows without limit
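As a rough illustration (file names and paths are hypothetical), a Delta table in object storage is a directory of Parquet data files plus a `_delta_log/` directory of JSON commit files:

```text
s3a://warehouse/delta/table/
├── _delta_log/
│   ├── 00000000000000000000.json   # commit 0 (e.g. CREATE + INSERT)
│   ├── 00000000000000000001.json   # commit 1 (e.g. MERGE)
│   └── ...
├── part-00000-....parquet          # data file referenced by the current version
└── part-00001-....parquet          # may become an orphan after a rewrite
```

Operations such as UPDATE and MERGE rewrite data files rather than modifying them in place, which is how unreferenced Parquet files accumulate over time.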

The tdp-deltalake chart is not a continuous application — it creates Kubernetes CronJobs that run periodically (like a Linux cron, but managed by Kubernetes) to perform this maintenance automatically.

Learn more

See Delta Lake — Concepts for a complete overview of the format, ACID transactions, and use cases.

Overview

This chart provides CronJobs for scheduled maintenance of Delta Lake tables:

  • VACUUM — removes old files no longer referenced by the transaction log
  • OPTIMIZE — compacts small files into larger files
  • OPTIMIZE Z-ORDER — optimizes the data layout for queries with specific filters
  • DESCRIBE HISTORY — records the change history for auditing

The chart uses the upstream Spark chart as a dependency to provide the runtime used by the maintenance jobs.
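For example, the history job essentially runs a statement like the following (table path illustrative); each returned row records a version number, timestamp, operation, and operation metrics:

```sql
DESCRIBE HISTORY delta.`s3a://warehouse/delta/table`;
```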

Compatibility

| Component | Version |
| --- | --- |
| Spark | 4.0.0 |
| Delta Lake | 4.0.0 |
| Scala | 2.13 |

Installation

```bash
helm upgrade --install <release> \
  oci://registry.tecnisys.com.br/tdp/charts/tdp-deltalake \
  -n <namespace> --create-namespace
```

Other OCI registries may exist in your organization; use the internal chart catalog or the instructions provided by Tecnisys.

Maintenance jobs

Configure the maintenance jobs under maintenance.jobs:

```yaml
maintenance:
  enabled: true
  jobs:
    vacuum:
      enabled: true
      schedule: "0 2 * * *"   # Daily at 2am
      retentionHours: 168     # 7 days

    optimize:
      enabled: true
      schedule: "0 3 * * 0"   # Weekly on Sundays at 3am

    optimizeZOrder:
      enabled: false
      schedule: "0 1 * * 6"   # Saturdays at 1am
```

VACUUM

Removes files that are no longer referenced by the Delta Log:

```sql
VACUUM delta.`s3a://warehouse/delta/table` RETAIN 168 HOURS;
```

Important

The default retention period is 7 days (168 hours). Do not reduce it below 7 days in production, as doing so may cause data loss for in-progress queries.
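To preview which files VACUUM would delete without actually removing anything, Delta Lake's SQL syntax supports a dry run (table path illustrative):

```sql
VACUUM delta.`s3a://warehouse/delta/table` RETAIN 168 HOURS DRY RUN;
```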

OPTIMIZE

Compacts small files into larger files to improve read performance:

```sql
OPTIMIZE delta.`s3a://warehouse/delta/table`;
```

OPTIMIZE with Z-ORDER

Optimizes the data layout for queries with specific filters:

```sql
OPTIMIZE delta.`s3a://warehouse/delta/table` ZORDER BY (column1, column2);
```

Spark configuration

Configure the Spark runtime used by the maintenance jobs:

```yaml
maintenance:
  spark:
    image:
      repository: "docker.io/bitnamilegacy/spark"
      tag: "4.0.0-debian-12-r0"
    resources:
      requests:
        cpu: 1
        memory: 2Gi
      limits:
        cpu: 2
        memory: 4Gi
```

Enabling the Spark dependency

The Spark dependency is controlled via maintenance.spark.enabled:

```yaml
maintenance:
  spark:
    enabled: true
```

S3/MinIO configuration

The CronJobs require a Secret named minio-credentials with the following keys:

  • access-key
  • secret-key

Create the Secret

```bash
kubectl -n <namespace> create secret generic minio-credentials \
  --from-literal=access-key='<ACCESS_KEY>' \
  --from-literal=secret-key='<SECRET_KEY>'
```

S3 endpoint configuration

```yaml
maintenance:
  spark:
    config:
      "spark.hadoop.fs.s3a.endpoint": "http://<s3-host>.<namespace>.svc.cluster.local:9000"
      "spark.hadoop.fs.s3a.path.style.access": "true"
```

TDP integration

This chart is part of TDP and can be integrated with:

  • tdp-trino — SQL queries on Delta Lake tables
  • tdp-spark — Spark processing
  • tdp-airflow — pipeline orchestration

Main parameters

| Parameter | Description | Default |
| --- | --- | --- |
| `maintenance.enabled` | Enable maintenance jobs | `true` |
| `maintenance.spark.enabled` | Enable Spark dependency | `true` |
| `maintenance.spark.image.repository` | Spark image | `docker.io/bitnamilegacy/spark` |
| `maintenance.spark.image.tag` | Spark image tag | `4.0.0-debian-12-r0` |
| `maintenance.jobs.vacuum.enabled` | Enable VACUUM job | `true` |
| `maintenance.jobs.vacuum.schedule` | VACUUM cron schedule | `0 2 * * *` |
| `maintenance.jobs.vacuum.retentionHours` | VACUUM retention (hours) | `168` |
| `maintenance.jobs.optimize.enabled` | Enable OPTIMIZE job | `true` |
| `maintenance.jobs.optimize.schedule` | OPTIMIZE cron schedule | `0 3 * * 0` |
| `maintenance.jobs.optimizeZOrder.enabled` | Enable OPTIMIZE Z-ORDER | `false` |