# Iceberg Configuration

The `tdp-iceberg` chart packages maintenance jobs for Apache Iceberg 1.10.0 tables, using Apache Spark as the runtime.
## What is Apache Iceberg and why does it need maintenance?
Apache Iceberg is an open table format for large datasets, designed to overcome the limitations of Hive.
Like Delta Lake, Iceberg maintains snapshots: immutable versions of the table, one per write operation.
This enables consistent reads, time travel, and ACID operations.
With continuous use, Iceberg tables accumulate:
- Old snapshots: each write creates a new snapshot; old ones need to be expired to free metadata storage space
- Orphan files: files that were created but never referenced by a committed snapshot
- Small files: frequent write operations generate many small files that degrade read performance
The `tdp-iceberg` chart creates Kubernetes CronJobs that run these maintenance operations on a schedule, using Spark to process the tables in S3-compatible storage.
See Apache Iceberg — Concepts for a complete overview of the format, snapshots, and use cases.
## Overview
This chart provides CronJobs for scheduled maintenance of Iceberg tables:
- Expire snapshots — removes old snapshots based on a retention policy
- Remove orphan files — removes orphan files not tracked by the metadata
- Rewrite data files — rewrites data files for compaction and optimization
The chart uses the upstream Spark chart as a dependency to provide the runtime.
## Compatibility
| Component | Version |
|---|---|
| Spark | 4.0.0 |
| Iceberg (Spark runtime) | 1.10.0 |
| Scala | 2.13 |
## Installation

```shell
helm upgrade --install <release> \
  oci://registry.tecnisys.com.br/tdp/charts/tdp-iceberg \
  -n <namespace> --create-namespace
```
## Prerequisites

### S3/MinIO credentials Secret

The maintenance CronJobs require a Secret named `minio-credentials` with the following keys:

- `access-key`
- `secret-key`

```shell
kubectl -n <namespace> create secret generic minio-credentials \
  --from-literal=access-key='<ACCESS_KEY>' \
  --from-literal=secret-key='<SECRET_KEY>'
```
### Hive Metastore

By default, the Iceberg catalog is configured to use a Hive Metastore:

```yaml
maintenance:
  spark:
    config:
      "spark.sql.catalog.iceberg.type": "hive"
      "spark.sql.catalog.iceberg.uri": "thrift://<metastore-service>.<namespace>.svc.cluster.local:9083"
```

A typical value is `thrift://metastore.hive-metastore.svc.cluster.local:9083`; adjust the host and namespace to match your Hive Metastore deployment.
## Maintenance jobs

Jobs are configured under `maintenance.jobs.*`:
### Expire Snapshots

Removes old snapshots to free storage space:

```yaml
maintenance:
  jobs:
    expireSnapshots:
      enabled: true
      schedule: "0 2 * * *" # Daily at 2 AM
      command: |
        # Snapshot expiration script
        spark-sql --conf ... -e "CALL iceberg.system.expire_snapshots(...)"
```
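The procedure call is elided above. As an illustrative sketch only, a complete invocation of Iceberg's `expire_snapshots` procedure could look like the following; the table name `db.events` and the retention count are placeholders, not chart defaults (when `older_than` is omitted, Iceberg applies its own default retention window):

```yaml
maintenance:
  jobs:
    expireSnapshots:
      enabled: true
      schedule: "0 2 * * *"
      command: |
        # Example only: expire old snapshots of db.events while always
        # keeping at least the 5 most recent ones
        spark-sql -e "CALL iceberg.system.expire_snapshots(
          table => 'db.events',
          retain_last => 5)"
```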
### Remove Orphan Files

Removes orphan files that are no longer tracked by Iceberg metadata:

```yaml
maintenance:
  jobs:
    removeOrphanFiles:
      enabled: true
      schedule: "0 3 * * 0" # Weekly on Sundays at 3 AM
      command: |
        # Orphan file removal script
        spark-sql --conf ... -e "CALL iceberg.system.remove_orphan_files(...)"
```
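As a sketch (not the chart's actual script), Iceberg's `remove_orphan_files` procedure supports a `dry_run` parameter that lists candidate files without deleting them, which is useful for validating the job before enabling destructive runs; `db.events` is a placeholder table name:

```yaml
maintenance:
  jobs:
    removeOrphanFiles:
      enabled: true
      schedule: "0 3 * * 0"
      command: |
        # Example only: list orphan files of db.events without deleting
        # them; remove dry_run to actually delete
        spark-sql -e "CALL iceberg.system.remove_orphan_files(
          table => 'db.events',
          dry_run => true)"
```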
### Rewrite Data Files

Rewrites data files for compaction and performance optimization:

```yaml
maintenance:
  jobs:
    rewriteDataFiles:
      enabled: false
      schedule: "0 4 * * 6" # Saturdays at 4 AM
      command: |
        # Data file rewrite script
        spark-sql --conf ... -e "CALL iceberg.system.rewrite_data_files(...)"
```
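As an illustrative sketch, Iceberg's `rewrite_data_files` procedure accepts a `strategy` (`binpack` or `sort`) and an `options` map, for example a target file size; the table name and size below are examples, not chart defaults:

```yaml
maintenance:
  jobs:
    rewriteDataFiles:
      enabled: true
      schedule: "0 4 * * 6"
      command: |
        # Example only: bin-pack small files of db.events into
        # data files of roughly 512 MB
        spark-sql -e "CALL iceberg.system.rewrite_data_files(
          table => 'db.events',
          strategy => 'binpack',
          options => map('target-file-size-bytes', '536870912'))"
```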
Each job supports the following parameters:

| Parameter | Description |
|---|---|
| `enabled` | Enable or disable the CronJob |
| `schedule` | Cron expression for scheduling |
| `command` | Shell script executed by the container (run verbatim) |
## Spark configuration

### Spark dependency (subchart)

The upstream Spark chart is configured under the top-level `spark.*` keys:

- `spark.image.*` controls the Spark image used by master/worker Pods
- `spark.commonLabels` and `spark.master.podLabels`/`spark.worker.podLabels` apply labels to resources
### Maintenance container image

The CronJobs use `maintenance.spark.image.*` for the container image:

```yaml
maintenance:
  spark:
    enabled: true
    image:
      repository: "docker.io/bitnamilegacy/spark"
      tag: "4.0.0-debian-12-r0"
```
### Maintenance Spark configuration

Spark properties for the maintenance jobs, including the Iceberg catalog and S3A settings, are set under `maintenance.spark.config`:

```yaml
maintenance:
  spark:
    config:
      "spark.sql.catalog.iceberg": "org.apache.iceberg.spark.SparkCatalog"
      "spark.sql.catalog.iceberg.type": "hive"
      "spark.sql.catalog.iceberg.uri": "thrift://<metastore-service>.<namespace>.svc.cluster.local:9083"
      "spark.hadoop.fs.s3a.endpoint": "http://<s3-host>.<namespace>.svc.cluster.local:9000"
      "spark.hadoop.fs.s3a.path.style.access": "true"
```
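Depending on the environment, the catalog may also need a warehouse location and explicit S3A credentials. The keys below are standard Iceberg and Hadoop S3A properties, but the values are placeholders; if the chart already injects credentials from the `minio-credentials` Secret, do not duplicate them here:

```yaml
maintenance:
  spark:
    config:
      # Standard Iceberg/Hadoop S3A properties; values are placeholders
      "spark.sql.catalog.iceberg.warehouse": "s3a://<bucket>/warehouse"
      "spark.hadoop.fs.s3a.access.key": "<ACCESS_KEY>"
      "spark.hadoop.fs.s3a.secret.key": "<SECRET_KEY>"
```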
## Main parameters

| Parameter | Description | Default |
|---|---|---|
| `maintenance.spark.enabled` | Enable Spark dependency | `true` |
| `maintenance.spark.image.repository` | Spark image | `docker.io/bitnamilegacy/spark` |
| `maintenance.spark.image.tag` | Spark image tag | `4.0.0-debian-12-r0` |
| `maintenance.jobs.expireSnapshots.enabled` | Enable expire snapshots | `true` |
| `maintenance.jobs.expireSnapshots.schedule` | Expire snapshots cron | `0 2 * * *` |
| `maintenance.jobs.removeOrphanFiles.enabled` | Enable orphan file removal | `true` |
| `maintenance.jobs.removeOrphanFiles.schedule` | Orphan file removal cron | `0 3 * * 0` |
| `maintenance.jobs.rewriteDataFiles.enabled` | Enable data file rewrite | `false` |
| `maintenance.jobs.rewriteDataFiles.schedule` | Data file rewrite cron | `0 4 * * 6` |
## Uninstallation

```shell
helm uninstall <release> -n <namespace>
```