Iceberg Configuration
The tdp-iceberg chart packages maintenance jobs for Apache Iceberg 1.10.0 tables, using Apache Spark as the runtime.
This page focuses on component configuration: what the jobs do, how to enable them, and how the chart organizes the Spark runtime. Details for S3/MinIO, Hive Metastore, and other integrations are in Integrations — Iceberg.
What is Apache Iceberg and why does it need maintenance?
Apache Iceberg is an open table format for large analytic datasets, designed to overcome limitations of Hive tables.
Like Delta Lake, Iceberg maintains snapshots: each write operation produces a new immutable version of the table. With continuous use, a table accumulates:
- old snapshots;
- orphan files;
- small files that hurt read performance.
The tdp-iceberg chart creates Kubernetes CronJobs that run the maintenance routines for cleaning these up on a schedule.
See Apache Iceberg — Concepts for a complete overview of the format, snapshots, and use cases.
Overview
This chart provides three types of maintenance jobs:
- Expire snapshots: removes old snapshots based on a retention policy;
- Remove orphan files: removes files that are no longer referenced by the metadata;
- Rewrite data files: rewrites data files for compaction and optimization.
Compatibility
| Component | Version |
|---|---|
| Spark | 4.0.0 |
| Iceberg (Spark runtime) | 1.10.0 |
| Scala | 2.13 |
Installation
helm upgrade --install <release> \
  oci://registry.tecnisys.com.br/tdp/charts/tdp-iceberg \
  -n <namespace> --create-namespace
Functional prerequisites
Before enabling the jobs, the environment needs:
- access to an S3/MinIO endpoint;
- access to Hive Metastore, if the Iceberg catalog uses type: hive;
- Spark configuration consistent with the catalog and storage.
These points are detailed in Integrations — Iceberg.
Maintenance jobs
Jobs are configured under maintenance.jobs.*.
The examples below show the expected command format. Adjust catalog, endpoint, warehouse, table, and retention windows for your environment.
Expire Snapshots
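Removes snapshots older than the retention window, together with data files that only those snapshots reference, to free storage and shrink table metadata. The example schedule runs daily at 02:00: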
maintenance:
  jobs:
    expireSnapshots:
      enabled: true
      schedule: "0 2 * * *"
      retentionDays: 7
      command: |
        spark-sql \
          --packages org.apache.iceberg:iceberg-spark-runtime-4.0_2.13:1.10.0,org.apache.hadoop:hadoop-aws:3.3.4 \
          --conf spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog \
          --conf spark.sql.catalog.iceberg.type=hive \
          --conf spark.sql.catalog.iceberg.uri=thrift://metastore.hive-metastore.svc.cluster.local:9083 \
          --conf spark.sql.catalog.iceberg.warehouse=s3a://warehouse/hive \
          --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
          -e "CALL iceberg.system.expire_snapshots(table => 'iceberg.default.<table-name>', older_than => TIMESTAMP '$(date -d '7 days ago' '+%Y-%m-%d %H:%M:%S')');"
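The example above writes the 7-day window directly into the command, separately from retentionDays. As a minimal sketch, the cutoff and the CALL statement could be derived from a single value. RETENTION_DAYS, TABLE, cutoff_timestamp, and expire_sql are hypothetical names, not chart parameters, and the timestamp is computed in UTC, which assumes the Spark session time zone is also UTC.

```shell
# Sketch only: RETENTION_DAYS and TABLE are hypothetical, not chart parameters.
RETENTION_DAYS="${RETENTION_DAYS:-7}"
TABLE="${TABLE:-iceberg.default.my_table}"

# Print the UTC timestamp that lies $1 days before the epoch second $2.
# Uses the "@seconds" input form of date -d (GNU date, as in the chart example).
cutoff_timestamp() {
  days="$1"
  now="$2"
  cutoff=$(( now - days * 86400 ))
  date -u -d "@${cutoff}" '+%Y-%m-%d %H:%M:%S'
}

# Print the CALL statement for a table ($1) and a cutoff timestamp ($2).
expire_sql() {
  printf "CALL iceberg.system.expire_snapshots(table => '%s', older_than => TIMESTAMP '%s');\n" "$1" "$2"
}

expire_sql "$TABLE" "$(cutoff_timestamp "$RETENTION_DAYS" "$(date +%s)")"
```

The printed statement can then be passed to spark-sql through -e, in place of the inline date expression.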
Remove Orphan Files
Deletes files under the table location that are no longer referenced by any valid snapshot. The example schedule runs on Sundays at 03:00:
maintenance:
  jobs:
    removeOrphanFiles:
      enabled: true
      schedule: "0 3 * * 0"
      olderThanDays: 3
      command: |
        spark-sql \
          --packages org.apache.iceberg:iceberg-spark-runtime-4.0_2.13:1.10.0,org.apache.hadoop:hadoop-aws:3.3.4 \
          --conf spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog \
          --conf spark.sql.catalog.iceberg.type=hive \
          --conf spark.sql.catalog.iceberg.uri=thrift://metastore.hive-metastore.svc.cluster.local:9083 \
          --conf spark.sql.catalog.iceberg.warehouse=s3a://warehouse/hive \
          --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
          -e "CALL iceberg.system.remove_orphan_files(table => 'iceberg.default.<table-name>', older_than => TIMESTAMP '$(date -d '3 days ago' '+%Y-%m-%d %H:%M:%S')');"
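Because orphan removal deletes files permanently, it can help to preview the candidates first. The sketch below prints a statement that uses the procedure's dry_run argument, which lists the files without deleting them. TABLE, CUTOFF, and orphan_sql are hypothetical names used only for illustration.

```shell
# Sketch only: TABLE, CUTOFF and orphan_sql are hypothetical, not chart parameters.
TABLE="${TABLE:-iceberg.default.my_table}"
# Same GNU date expression as the chart example; override CUTOFF to pin a value.
CUTOFF="${CUTOFF:-$(date -d '3 days ago' '+%Y-%m-%d %H:%M:%S')}"

# $1 = table, $2 = cutoff timestamp, $3 = true or false for dry_run.
orphan_sql() {
  printf "CALL iceberg.system.remove_orphan_files(table => '%s', older_than => TIMESTAMP '%s', dry_run => %s);\n" "$1" "$2" "$3"
}

# With dry_run => true the procedure reports candidate files and deletes nothing.
orphan_sql "$TABLE" "$CUTOFF" true
```

Once the listed files look right, the same statement with dry_run set to false (or omitted) performs the deletion.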
Rewrite Data Files
Rewrites and compacts data files to improve read performance. This job is disabled by default because it is more resource-intensive than the other two. The example schedule runs on Saturdays at 01:00:
maintenance:
  jobs:
    rewriteDataFiles:
      enabled: false
      schedule: "0 1 * * 6"
      command: |
        spark-sql \
          --packages org.apache.iceberg:iceberg-spark-runtime-4.0_2.13:1.10.0,org.apache.hadoop:hadoop-aws:3.3.4 \
          --conf spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog \
          --conf spark.sql.catalog.iceberg.type=hive \
          --conf spark.sql.catalog.iceberg.uri=thrift://metastore.hive-metastore.svc.cluster.local:9083 \
          --conf spark.sql.catalog.iceberg.warehouse=s3a://warehouse/hive \
          --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
          -e "CALL iceberg.system.rewrite_data_files(table => 'iceberg.default.<table-name>');"
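The call above relies on the procedure's defaults. As a hedged sketch, the statement could also name the strategy and pass compaction options explicitly. The binpack strategy and the min-input-files option belong to the Iceberg rewrite_data_files procedure; TABLE, MIN_INPUT_FILES, and rewrite_sql are hypothetical names, and the value 5 is only an illustration.

```shell
# Sketch only: TABLE, MIN_INPUT_FILES and rewrite_sql are hypothetical names.
TABLE="${TABLE:-iceberg.default.my_table}"
MIN_INPUT_FILES="${MIN_INPUT_FILES:-5}"

# $1 = table, $2 = minimum number of files in a group before it is rewritten.
rewrite_sql() {
  printf "CALL iceberg.system.rewrite_data_files(table => '%s', strategy => 'binpack', options => map('min-input-files', '%s'));\n" "$1" "$2"
}

rewrite_sql "$TABLE" "$MIN_INPUT_FILES"
```

The printed statement is meant to replace the string passed to spark-sql through -e; the remaining flags stay as in the example.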
Job parameters
| Parameter | Description |
|---|---|
| enabled | Enable or disable the CronJob |
| schedule | Cron expression |
| retentionDays | Retention days for expireSnapshots |
| olderThanDays | Minimum age for removeOrphanFiles |
| command | Shell script executed by the container |
Spark configuration
This chart uses two distinct image contexts for Spark. It is important not to confuse them:
| Key | What it controls | Format |
|---|---|---|
| spark.image.* | Master/worker Pods of the Spark subchart | separate registry + repository |
| maintenance.spark.image.* | Maintenance CronJob containers | repository with full URL |
Spark subchart (master/worker)
spark:
  image:
    registry: registry.tecnisys.com.br
    repository: community/images/bitnamilegacy/spark
    tag: 4.0.0-debian-12-r0
    pullPolicy: IfNotPresent
Maintenance container image
maintenance:
  spark:
    enabled: true
    image:
      repository: "registry.tecnisys.com.br/community/images/bitnamilegacy/spark"
      tag: "4.0.0-debian-12-r0"
      pullPolicy: IfNotPresent
These two settings are independent. Changing spark.image.* does not change the image used by the maintenance CronJobs.
Integrations
For S3/MinIO, Hive Metastore, catalog configuration, and use from Spark, Airflow, or Trino, see Integrations — Iceberg.
Main parameters
| Parameter | Description | Default |
|---|---|---|
| maintenance.spark.enabled | Enable Spark dependency | true |
| maintenance.spark.image.repository | CronJob image | registry.tecnisys.com.br/community/images/bitnamilegacy/spark |
| maintenance.spark.image.tag | Image tag | 4.0.0-debian-12-r0 |
| maintenance.jobs.expireSnapshots.enabled | Enable expire snapshots | true |
| maintenance.jobs.expireSnapshots.retentionDays | Retention days | 7 |
| maintenance.jobs.removeOrphanFiles.enabled | Enable orphan removal | true |
| maintenance.jobs.removeOrphanFiles.olderThanDays | Minimum age of orphans | 3 |
| maintenance.jobs.rewriteDataFiles.enabled | Enable data rewrite | false |
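The parameters in the table above can be overridden at install time with Helm's --set flag. The sketch below only prints the resulting command unless DRY_RUN=0 is set. The release name, the namespace, and the run helper are hypothetical, and the two overrides are arbitrary examples rather than recommended settings.

```shell
# Sketch only: RELEASE, NAMESPACE and run are hypothetical names.
RELEASE="${RELEASE:-iceberg-maintenance}"
NAMESPACE="${NAMESPACE:-lakehouse}"

# Print the command by default; execute it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

run helm upgrade --install "$RELEASE" \
  oci://registry.tecnisys.com.br/tdp/charts/tdp-iceberg \
  -n "$NAMESPACE" --create-namespace \
  --set maintenance.jobs.expireSnapshots.enabled=true \
  --set maintenance.jobs.removeOrphanFiles.enabled=false
```

For more than a few overrides, a values file passed with -f is usually easier to review than a long list of --set flags.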
Uninstallation
helm uninstall <release> -n <namespace>