
Iceberg Configuration

The tdp-iceberg chart packages maintenance jobs for Apache Iceberg tables, using Apache Spark with the Iceberg 1.10.0 Spark runtime.

What is Apache Iceberg and why does it need maintenance?

Apache Iceberg is an open table format for large datasets, designed to overcome the limitations of Hive.

Like Delta Lake, Iceberg maintains snapshots — immutable versions of the table for each write operation.

This enables consistent reads, time travel, and ACID operations.

With continuous use, Iceberg tables accumulate:

  • Old snapshots: each write creates a new snapshot; old ones need to be expired to free metadata storage space
  • Orphan files: files that were created but never referenced by a committed snapshot
  • Small files: frequent write operations generate many small files that degrade read performance

The tdp-iceberg chart creates Kubernetes CronJobs that run these maintenance operations on a schedule, using Spark to process the tables in S3-compatible storage.

Learn more

See Apache Iceberg — Concepts for a complete overview of the format, snapshots, and use cases.

Overview

This chart provides CronJobs for scheduled maintenance of Iceberg tables:

  • Expire snapshots — removes old snapshots based on a retention policy
  • Remove orphan files — removes orphan files not tracked by the metadata
  • Rewrite data files — rewrites data files for compaction and optimization

The chart uses the upstream Spark chart as a dependency to provide the runtime.

Compatibility

| Component                | Version |
|--------------------------|---------|
| Spark                    | 4.0.0   |
| Iceberg (Spark runtime)  | 1.10.0  |
| Scala                    | 2.13    |

Installation

```sh
helm upgrade --install <release> \
  oci://registry.tecnisys.com.br/tdp/charts/tdp-iceberg \
  -n <namespace> --create-namespace
```

Prerequisites

S3/MinIO credentials Secret

The maintenance CronJobs require a Secret named minio-credentials with the following keys:

  • access-key
  • secret-key
```sh
kubectl -n <namespace> create secret generic minio-credentials \
  --from-literal=access-key='<ACCESS_KEY>' \
  --from-literal=secret-key='<SECRET_KEY>'
```

Hive Metastore

By default, the Iceberg catalog is configured to use a Hive Metastore:

```yaml
maintenance:
  spark:
    config:
      "spark.sql.catalog.iceberg.type": "hive"
      "spark.sql.catalog.iceberg.uri": "thrift://<metastore-service>.<namespace>.svc.cluster.local:9083"
```
Tip

The typical example value is thrift://metastore.hive-metastore.svc.cluster.local:9083; adjust the host and namespace to your Hive Metastore.
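As a concrete values override, assuming a metastore Service named `metastore` in a `hive-metastore` namespace (both names are illustrative, not chart defaults), the catalog could be wired up like this:

```yaml
# values-metastore.yaml -- Service name and namespace are illustrative
maintenance:
  spark:
    config:
      "spark.sql.catalog.iceberg.type": "hive"
      "spark.sql.catalog.iceberg.uri": "thrift://metastore.hive-metastore.svc.cluster.local:9083"
```

Pass the file to the install command with `-f values-metastore.yaml`.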

Maintenance jobs

Jobs are configured under maintenance.jobs.*:

Expire Snapshots

Removes old snapshots to free storage space:

```yaml
maintenance:
  jobs:
    expireSnapshots:
      enabled: true
      schedule: "0 2 * * *" # Daily at 2 AM
      command: |
        # Snapshot expiration script
        spark-sql --conf ... -e "CALL iceberg.system.expire_snapshots(...)"
```
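The `command` above elides the actual procedure call. As a sketch, an expiration for a hypothetical table `db.events` might look like the following (the table name, cutoff timestamp, and `retain_last` count are all illustrative; consult the Iceberg `expire_snapshots` procedure documentation for the full parameter list):

```sql
-- All values illustrative: adjust the table, cutoff, and snapshot count
CALL iceberg.system.expire_snapshots(
  table => 'db.events',
  older_than => TIMESTAMP '2025-01-01 00:00:00',
  retain_last => 10
);
```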

Remove Orphan Files

Removes orphan files that are no longer tracked by Iceberg metadata:

```yaml
maintenance:
  jobs:
    removeOrphanFiles:
      enabled: true
      schedule: "0 3 * * 0" # Weekly on Sundays at 3 AM
      command: |
        # Orphan file removal script
        spark-sql --conf ... -e "CALL iceberg.system.remove_orphan_files(...)"
```
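A hedged sketch of the elided procedure call, for a hypothetical table `db.events` (table name and cutoff are illustrative). Note that the procedure applies a safety interval by default, only considering files older than a few days, so files from in-flight writes are not deleted; keep any explicit cutoff well in the past:

```sql
-- Table name and cutoff are illustrative; keep the cutoff well in the past
CALL iceberg.system.remove_orphan_files(
  table => 'db.events',
  older_than => TIMESTAMP '2025-01-01 00:00:00'
);
```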

Rewrite Data Files

Rewrites data files for compaction and performance optimization:

```yaml
maintenance:
  jobs:
    rewriteDataFiles:
      enabled: false
      schedule: "0 4 * * 6" # Saturdays at 4 AM
      command: |
        # Data file rewrite script
        spark-sql --conf ... -e "CALL iceberg.system.rewrite_data_files(...)"
```
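A sketch of the elided compaction call, again for a hypothetical `db.events` table; the strategy and target file size shown are illustrative choices, not chart defaults:

```sql
-- Table, strategy, and target file size (512 MiB here) are illustrative
CALL iceberg.system.rewrite_data_files(
  table => 'db.events',
  strategy => 'binpack',
  options => map('target-file-size-bytes', '536870912')
);
```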

Each job supports the following parameters:

| Parameter  | Description                                          |
|------------|------------------------------------------------------|
| `enabled`  | Enable or disable the CronJob                        |
| `schedule` | Cron expression for scheduling                       |
| `command`  | Shell script executed by the container (run verbatim) |

Spark configuration

Spark dependency (subchart)

The upstream Spark chart is configured under the top-level spark.* keys:

  • spark.image.* controls the Spark image used by master/worker Pods
  • spark.commonLabels and spark.master.podLabels / spark.worker.podLabels apply labels to resources

Maintenance container image

The CronJobs use maintenance.spark.image.* for the container image:

```yaml
maintenance:
  spark:
    enabled: true
    image:
      repository: "docker.io/bitnamilegacy/spark"
      tag: "4.0.0-debian-12-r0"
```

Maintenance Spark configuration

```yaml
maintenance:
  spark:
    config:
      "spark.sql.catalog.iceberg": "org.apache.iceberg.spark.SparkCatalog"
      "spark.sql.catalog.iceberg.type": "hive"
      "spark.sql.catalog.iceberg.uri": "thrift://<metastore-service>.<namespace>.svc.cluster.local:9083"
      "spark.hadoop.fs.s3a.endpoint": "http://<s3-host>.<namespace>.svc.cluster.local:9000"
      "spark.hadoop.fs.s3a.path.style.access": "true"
```
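Inside a job's `command`, these settings end up as `--conf` flags on `spark-sql`. A sketch, assuming the `minio-credentials` Secret is exposed to the container as environment variables (the variable names below are an assumption; check the rendered CronJob for the actual names):

```sh
# Assumption: ACCESS_KEY / SECRET_KEY are injected from the minio-credentials Secret
spark-sql \
  --conf spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.iceberg.type=hive \
  --conf spark.sql.catalog.iceberg.uri=thrift://<metastore-service>.<namespace>.svc.cluster.local:9083 \
  --conf spark.hadoop.fs.s3a.endpoint=http://<s3-host>.<namespace>.svc.cluster.local:9000 \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --conf spark.hadoop.fs.s3a.access.key="${ACCESS_KEY}" \
  --conf spark.hadoop.fs.s3a.secret.key="${SECRET_KEY}" \
  -e "CALL iceberg.system.expire_snapshots(...)"
```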

Main parameters

| Parameter                                   | Description                 | Default                         |
|---------------------------------------------|-----------------------------|---------------------------------|
| `maintenance.spark.enabled`                 | Enable Spark dependency     | `true`                          |
| `maintenance.spark.image.repository`        | Spark image                 | `docker.io/bitnamilegacy/spark` |
| `maintenance.spark.image.tag`               | Spark image tag             | `4.0.0-debian-12-r0`            |
| `maintenance.jobs.expireSnapshots.enabled`  | Enable expire snapshots     | `true`                          |
| `maintenance.jobs.expireSnapshots.schedule` | Expire snapshots cron       | `0 2 * * *`                     |
| `maintenance.jobs.removeOrphanFiles.enabled`| Enable orphan file removal  | `true`                          |
| `maintenance.jobs.removeOrphanFiles.schedule`| Orphan file removal cron   | `0 3 * * 0`                     |
| `maintenance.jobs.rewriteDataFiles.enabled` | Enable data file rewrite    | `false`                         |
| `maintenance.jobs.rewriteDataFiles.schedule`| Data file rewrite cron      | `0 4 * * 6`                     |

Uninstallation

```sh
helm uninstall <release> -n <namespace>
```