
Iceberg Configuration

The tdp-iceberg chart packages maintenance jobs for Apache Iceberg tables, using Apache Spark with the Iceberg 1.10.0 Spark runtime.

What is Apache Iceberg and why does it need maintenance?

Apache Iceberg is an open table format for large datasets, designed to overcome the limitations of Hive.

Like Delta Lake, Iceberg maintains snapshots — immutable versions of the table for each write operation.

This enables consistent reads, time travel, and ACID operations.

With continuous use, Iceberg tables accumulate:

  • Old snapshots: each write creates a new snapshot; old ones need to be expired to free metadata storage space
  • Orphan files: files that were created but never referenced by a committed snapshot
  • Small files: frequent write operations generate many small files that degrade read performance

The tdp-iceberg chart creates Kubernetes CronJobs that run these maintenance operations on a schedule, using Spark to process the tables in S3-compatible storage.

Learn more

See Apache Iceberg — Concepts for a complete overview of the format, snapshots, and use cases.

Overview

This chart provides CronJobs for scheduled maintenance of Iceberg tables:

  • Expire snapshots — removes old snapshots based on a retention policy
  • Remove orphan files — removes orphan files not tracked by the metadata
  • Rewrite data files — rewrites data files for compaction and optimization

The chart uses the upstream Spark chart as a dependency to provide the runtime.

Compatibility

| Component                | Version |
|--------------------------|---------|
| Spark                    | 4.0.0   |
| Iceberg (Spark runtime)  | 1.10.0  |
| Scala                    | 2.13    |

Installation

```sh
helm upgrade --install <release> \
  oci://registry.tecnisys.com.br/tdp/charts/tdp-iceberg \
  -n <namespace> --create-namespace
```

Prerequisites

S3/MinIO credentials Secret

The maintenance CronJobs require a Secret named minio-credentials with the following keys:

  • access-key
  • secret-key
```sh
kubectl -n <namespace> create secret generic minio-credentials \
  --from-literal=access-key='<ACCESS_KEY>' \
  --from-literal=secret-key='<SECRET_KEY>'
```

Hive Metastore

By default, the Iceberg catalog is configured to use a Hive Metastore:

```yaml
maintenance:
  spark:
    config:
      "spark.sql.catalog.iceberg.type": "hive"
      "spark.sql.catalog.iceberg.uri": "thrift://<metastore-service>.<namespace>.svc.cluster.local:9083"
```
Tip

The typical example value is thrift://metastore.hive-metastore.svc.cluster.local:9083; adjust the host and namespace to your Hive Metastore.
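As a concrete values override, assuming a metastore Service named `metastore` in a `hive-metastore` namespace (both names are illustrative, not chart defaults), the catalog could be wired up like this:

```yaml
# values-metastore.yaml -- Service name and namespace are illustrative
maintenance:
  spark:
    config:
      "spark.sql.catalog.iceberg.type": "hive"
      "spark.sql.catalog.iceberg.uri": "thrift://metastore.hive-metastore.svc.cluster.local:9083"
```

Pass the file to the install command with `-f values-metastore.yaml`.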

Maintenance jobs

Jobs are configured under maintenance.jobs.*:

Expire Snapshots

Removes old snapshots to free storage space:

```yaml
maintenance:
  jobs:
    expireSnapshots:
      enabled: true
      schedule: "0 2 * * *" # Daily at 2 AM
      command: |
        # Snapshot expiration script
        spark-sql --conf ... -e "CALL iceberg.system.expire_snapshots(...)"
```
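The `command` above elides the actual procedure call. As a sketch, an expiration for a hypothetical table `db.events` might look like the following (the table name, cutoff timestamp, and `retain_last` count are all illustrative; consult the Iceberg `expire_snapshots` procedure documentation for the full parameter list):

```sql
-- All values illustrative: adjust the table, cutoff, and snapshot count
CALL iceberg.system.expire_snapshots(
  table => 'db.events',
  older_than => TIMESTAMP '2025-01-01 00:00:00',
  retain_last => 10
);
```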

Remove Orphan Files

Removes orphan files that are no longer tracked by Iceberg metadata:

```yaml
maintenance:
  jobs:
    removeOrphanFiles:
      enabled: true
      schedule: "0 3 * * 0" # Weekly on Sundays at 3 AM
      command: |
        # Orphan file removal script
        spark-sql --conf ... -e "CALL iceberg.system.remove_orphan_files(...)"
```
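A hedged sketch of the elided procedure call, for a hypothetical table `db.events` (table name and cutoff are illustrative). Note that the procedure applies a safety interval by default, only considering files older than a few days, so files from in-flight writes are not deleted; keep any explicit cutoff well in the past:

```sql
-- Table name and cutoff are illustrative; keep the cutoff well in the past
CALL iceberg.system.remove_orphan_files(
  table => 'db.events',
  older_than => TIMESTAMP '2025-01-01 00:00:00'
);
```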

Rewrite Data Files

Rewrites data files for compaction and performance optimization:

```yaml
maintenance:
  jobs:
    rewriteDataFiles:
      enabled: false
      schedule: "0 4 * * 6" # Saturdays at 4 AM
      command: |
        # Data file rewrite script
        spark-sql --conf ... -e "CALL iceberg.system.rewrite_data_files(...)"
```
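A sketch of the elided compaction call, again for a hypothetical `db.events` table; the strategy and target file size shown are illustrative choices, not chart defaults:

```sql
-- Table, strategy, and target file size (512 MiB here) are illustrative
CALL iceberg.system.rewrite_data_files(
  table => 'db.events',
  strategy => 'binpack',
  options => map('target-file-size-bytes', '536870912')
);
```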

Each job supports the following parameters:

| Parameter  | Description                                          |
|------------|------------------------------------------------------|
| `enabled`  | Enable or disable the CronJob                        |
| `schedule` | Cron expression for scheduling                       |
| `command`  | Shell script executed by the container (run verbatim) |

Spark configuration

Spark dependency (subchart)

The upstream Spark chart is configured under the top-level spark.* keys:

  • spark.image.* controls the Spark image used by master/worker Pods
  • spark.commonLabels and spark.master.podLabels / spark.worker.podLabels apply labels to resources

Maintenance container image

The CronJobs use maintenance.spark.image.* for the container image:

```yaml
maintenance:
  spark:
    enabled: true
    image:
      repository: "docker.io/bitnamilegacy/spark"
      tag: "4.0.0-debian-12-r0"
```

Maintenance Spark configuration

```yaml
maintenance:
  spark:
    config:
      "spark.sql.catalog.iceberg": "org.apache.iceberg.spark.SparkCatalog"
      "spark.sql.catalog.iceberg.type": "hive"
      "spark.sql.catalog.iceberg.uri": "thrift://<metastore-service>.<namespace>.svc.cluster.local:9083"
      "spark.hadoop.fs.s3a.endpoint": "http://<s3-host>.<namespace>.svc.cluster.local:9000"
      "spark.hadoop.fs.s3a.path.style.access": "true"
```
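Inside a job's `command`, these settings end up as `--conf` flags on `spark-sql`. A sketch, assuming the `minio-credentials` Secret is exposed to the container as environment variables (the variable names below are an assumption; check the rendered CronJob for the actual names):

```sh
# Assumption: ACCESS_KEY / SECRET_KEY are injected from the minio-credentials Secret
spark-sql \
  --conf spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.iceberg.type=hive \
  --conf spark.sql.catalog.iceberg.uri=thrift://<metastore-service>.<namespace>.svc.cluster.local:9083 \
  --conf spark.hadoop.fs.s3a.endpoint=http://<s3-host>.<namespace>.svc.cluster.local:9000 \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --conf spark.hadoop.fs.s3a.access.key="${ACCESS_KEY}" \
  --conf spark.hadoop.fs.s3a.secret.key="${SECRET_KEY}" \
  -e "CALL iceberg.system.expire_snapshots(...)"
```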

Main parameters

| Parameter                                   | Description                 | Default                         |
|---------------------------------------------|-----------------------------|---------------------------------|
| `maintenance.spark.enabled`                 | Enable Spark dependency     | `true`                          |
| `maintenance.spark.image.repository`        | Spark image                 | `docker.io/bitnamilegacy/spark` |
| `maintenance.spark.image.tag`               | Spark image tag             | `4.0.0-debian-12-r0`            |
| `maintenance.jobs.expireSnapshots.enabled`  | Enable expire snapshots     | `true`                          |
| `maintenance.jobs.expireSnapshots.schedule` | Expire snapshots cron       | `0 2 * * *`                     |
| `maintenance.jobs.removeOrphanFiles.enabled`| Enable orphan file removal  | `true`                          |
| `maintenance.jobs.removeOrphanFiles.schedule`| Orphan file removal cron   | `0 3 * * 0`                     |
| `maintenance.jobs.rewriteDataFiles.enabled` | Enable data file rewrite    | `false`                         |
| `maintenance.jobs.rewriteDataFiles.schedule`| Data file rewrite cron      | `0 4 * * 6`                     |

Uninstallation

```sh
helm uninstall <release> -n <namespace>
```