Version 3.0.0

Iceberg Configuration

The tdp-iceberg chart packages maintenance jobs for Apache Iceberg 1.10.0 tables, using Apache Spark as the runtime.

This page focuses on component configuration: what the jobs do, how to enable them, and how the chart organizes the Spark runtime. Details for S3/MinIO, Hive Metastore, and other integrations are in Integrations — Iceberg.

What is Apache Iceberg and why does it need maintenance?

Apache Iceberg is an open table format for large analytic datasets, designed to overcome limitations of Hive tables.

Like Delta Lake, Iceberg maintains snapshots: each write operation records a new immutable version of the table. With continuous use, a table tends to accumulate:

  • old snapshots;
  • orphan files;
  • small files that hurt read performance.

The tdp-iceberg chart creates Kubernetes CronJobs to run these maintenance routines on a schedule.

Learn more

See Apache Iceberg — Concepts for a complete overview of the format, snapshots, and use cases.

Overview

This chart provides three types of maintenance jobs:

  • Expire snapshots: removes old snapshots based on a retention policy;
  • Remove orphan files: removes files that are no longer referenced by the metadata;
  • Rewrite data files: rewrites data files for compaction and optimization.

Compatibility

Component                Version
Spark                    4.0.0
Iceberg (Spark runtime)  1.10.0
Scala                    2.13

Installation

Terminal input
helm upgrade --install <release> \
  oci://registry.tecnisys.com.br/tdp/charts/tdp-iceberg \
  -n <namespace> --create-namespace

Functional prerequisites

Before enabling the jobs, the environment needs:

  • access to an S3/MinIO endpoint;
  • access to Hive Metastore, if the Iceberg catalog uses type: hive;
  • Spark configuration consistent with the catalog and storage.

These points are detailed in Integrations — Iceberg.

Maintenance jobs

Jobs are configured under maintenance.jobs.*.

note

The examples below show the expected command format. Adjust catalog, endpoint, warehouse, table, and retention windows for your environment.

Expire Snapshots

Removes old snapshots to free storage and metadata space:

maintenance:
  jobs:
    expireSnapshots:
      enabled: true
      schedule: "0 2 * * *"   # every day at 02:00
      retentionDays: 7
      # expire_snapshots requires a table argument; replace <table-name> with the target table.
      command: |
        spark-sql \
          --packages org.apache.iceberg:iceberg-spark-runtime-4.0_2.13:1.10.0,org.apache.hadoop:hadoop-aws:3.3.4 \
          --conf spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog \
          --conf spark.sql.catalog.iceberg.type=hive \
          --conf spark.sql.catalog.iceberg.uri=thrift://metastore.hive-metastore.svc.cluster.local:9083 \
          --conf spark.sql.catalog.iceberg.warehouse=s3a://warehouse/hive \
          --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
          -e "CALL iceberg.system.expire_snapshots(table => 'iceberg.default.<table-name>', older_than => TIMESTAMP '$(date -d '7 days ago' '+%Y-%m-%d %H:%M:%S')');"
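
The command above hard-codes the 7-day window, while retentionDays is set separately. As a rough sketch, the cutoff could be derived from a single value instead. The cutoff_timestamp helper and the RETENTION_DAYS variable are hypothetical names, not something the chart is known to provide. The sketch assumes GNU date (as the original command does) and computes the cutoff in UTC, whereas the original command uses the container's local time.

```shell
#!/bin/sh
# Hypothetical helper: print the timestamp N days before a reference instant.
# Args: $1 = number of days; $2 = reference time in epoch seconds (defaults to now).
# Assumes GNU date (-d with an @epoch argument); output is in UTC.
cutoff_timestamp() (
  days="$1"
  ref_epoch="${2:-$(date +%s)}"
  date -u -d "@$(( ref_epoch - days * 86400 ))" '+%Y-%m-%d %H:%M:%S'
)

# RETENTION_DAYS is an illustrative variable, not a value the chart is known to export.
RETENTION_DAYS="${RETENTION_DAYS:-7}"

# Print the statement that would be passed to spark-sql -e.
echo "CALL iceberg.system.expire_snapshots(table => 'iceberg.default.<table-name>', older_than => TIMESTAMP '$(cutoff_timestamp "$RETENTION_DAYS")');"
```

Keeping the window in one variable avoids the retention value and the date expression drifting apart when one of them is changed.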

Remove Orphan Files

Removes orphan files that are no longer referenced by valid snapshots:

maintenance:
  jobs:
    removeOrphanFiles:
      enabled: true
      schedule: "0 3 * * 0"   # Sundays at 03:00
      olderThanDays: 3
      # remove_orphan_files requires a table argument; replace <table-name> with the target table.
      command: |
        spark-sql \
          --packages org.apache.iceberg:iceberg-spark-runtime-4.0_2.13:1.10.0,org.apache.hadoop:hadoop-aws:3.3.4 \
          --conf spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog \
          --conf spark.sql.catalog.iceberg.type=hive \
          --conf spark.sql.catalog.iceberg.uri=thrift://metastore.hive-metastore.svc.cluster.local:9083 \
          --conf spark.sql.catalog.iceberg.warehouse=s3a://warehouse/hive \
          --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
          -e "CALL iceberg.system.remove_orphan_files(table => 'iceberg.default.<table-name>', older_than => TIMESTAMP '$(date -d '3 days ago' '+%Y-%m-%d %H:%M:%S')');"
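
Because orphan removal deletes files, it can help to preview the result first. The remove_orphan_files procedure accepts a dry_run argument that lists the files it would delete without removing them. The sketch below only builds the statement text; orphan_files_call is a hypothetical helper name, and the table and timestamp are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: build a remove_orphan_files CALL statement.
# Args: $1 = table identifier, $2 = cutoff timestamp, $3 = dry_run (true|false, default true).
orphan_files_call() (
  table="$1"
  cutoff="$2"
  dry_run="${3:-true}"
  printf "CALL iceberg.system.remove_orphan_files(table => '%s', older_than => TIMESTAMP '%s', dry_run => %s);\n" \
    "$table" "$cutoff" "$dry_run"
)

# Example: preview which files older than the cutoff would be removed.
orphan_files_call 'iceberg.default.<table-name>' '2024-01-05 00:00:00' true
```

The printed statement can be passed to spark-sql -e with the same --conf flags as the job above. Once the listed files look right, the third argument can be set to false.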

Rewrite Data Files

Rewrites and compacts data files to improve read performance. Disabled by default because it is more resource-intensive:

maintenance:
  jobs:
    rewriteDataFiles:
      enabled: false
      schedule: "0 1 * * 6"   # Saturdays at 01:00
      command: |
        spark-sql \
          --packages org.apache.iceberg:iceberg-spark-runtime-4.0_2.13:1.10.0,org.apache.hadoop:hadoop-aws:3.3.4 \
          --conf spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog \
          --conf spark.sql.catalog.iceberg.type=hive \
          --conf spark.sql.catalog.iceberg.uri=thrift://metastore.hive-metastore.svc.cluster.local:9083 \
          --conf spark.sql.catalog.iceberg.warehouse=s3a://warehouse/hive \
          --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
          -e "CALL iceberg.system.rewrite_data_files(table => 'iceberg.default.<table-name>');"
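
The procedure call above targets a single table, so compacting several tables means issuing one statement per table. A minimal sketch of generating those statements follows. rewrite_calls is a hypothetical helper, and the table names orders and events are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print one rewrite_data_files CALL per table identifier given.
rewrite_calls() (
  for table in "$@"; do
    printf "CALL iceberg.system.rewrite_data_files(table => '%s');\n" "$table"
  done
)

# Example with illustrative table names; each printed line is one statement.
rewrite_calls 'iceberg.default.orders' 'iceberg.default.events'
```

Each printed statement can then be passed to spark-sql -e in turn, reusing the --conf flags from the job definition.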

Job parameters

Parameter      Description
enabled        Enable or disable the CronJob
schedule       Cron expression
retentionDays  Retention days for expireSnapshots
olderThanDays  Minimum age for removeOrphanFiles
command        Shell script executed by the container

Spark configuration

This chart uses two distinct image contexts for Spark. It is important not to confuse them:

Key                        What it controls                          Format
spark.image.*              Master/worker Pods of the Spark subchart  separate registry + repository
maintenance.spark.image.*  Maintenance CronJob containers            repository with full URL

Spark subchart (master/worker)

spark:
  image:
    registry: registry.tecnisys.com.br
    repository: community/images/bitnamilegacy/spark
    tag: 4.0.0-debian-12-r0
    pullPolicy: IfNotPresent

Maintenance container image

maintenance:
  spark:
    enabled: true
    image:
      repository: "registry.tecnisys.com.br/community/images/bitnamilegacy/spark"
      tag: "4.0.0-debian-12-r0"
      pullPolicy: IfNotPresent

note

These two settings are independent. Changing spark.image.* does not change the image used by the maintenance CronJobs.
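
Since the two image settings are independent, an upgrade that should move both to the same Spark image has to set both tags. One way to keep them in step is to generate both --set flags from a single value. This is only a sketch; image_tag_args is a hypothetical helper name, not part of the chart.

```shell
#!/bin/sh
# Hypothetical helper: emit --set flags that pin both Spark image tags to one value.
# Arg: $1 = image tag to apply to the subchart and to the maintenance CronJobs.
image_tag_args() (
  tag="$1"
  printf '%s\n' "--set spark.image.tag=${tag} --set maintenance.spark.image.tag=${tag}"
)

# Example: print the flags for the tag used on this page.
image_tag_args 4.0.0-debian-12-r0

# Possible use (left unquoted on purpose so the flags split into separate words):
#   helm upgrade --install <release> \
#     oci://registry.tecnisys.com.br/tdp/charts/tdp-iceberg \
#     -n <namespace> $(image_tag_args 4.0.0-debian-12-r0)
```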

Integrations

For S3/MinIO, Hive Metastore, catalog configuration, and use from Spark, Airflow, or Trino, see Integrations — Iceberg.

Main parameters

Parameter                                         Description              Default
maintenance.spark.enabled                         Enable Spark dependency  true
maintenance.spark.image.repository                CronJob image            registry.tecnisys.com.br/community/images/bitnamilegacy/spark
maintenance.spark.image.tag                       Image tag                4.0.0-debian-12-r0
maintenance.jobs.expireSnapshots.enabled          Enable expire snapshots  true
maintenance.jobs.expireSnapshots.retentionDays    Retention days           7
maintenance.jobs.removeOrphanFiles.enabled        Enable orphan removal    true
maintenance.jobs.removeOrphanFiles.olderThanDays  Minimum age of orphans   3
maintenance.jobs.rewriteDataFiles.enabled         Enable data rewrite      false

Uninstallation

Terminal input
helm uninstall <release> -n <namespace>