
Spark Configuration

What is Apache Spark in TDP?

Apache Spark is TDP's large-scale data processing engine.

While Trino is optimized for interactive SQL queries, Spark is optimized for batch data transformations and ETL pipelines — processing massive data volumes, applying complex transformations, and writing results back to storage.

In TDP Kubernetes, Spark is used in two main ways:

  1. Through Airflow: scheduled pipelines that process data with SparkSubmitOperator
  2. Through JupyterHub: interactive notebook analyses that connect to the Spark cluster
Learn more

See Apache Spark — Concepts for a complete overview of the tool, its architecture and how it works.

Master-worker architecture

Spark in TDP uses the standalone architecture (master + workers):

| Component | Role |
| --- | --- |
| Master | Manages the cluster, accepts applications and distributes work |
| Workers | Run the executors for Spark applications |
| History Server | Web interface for viewing execution history |

The Spark application driver (the process that submits the job) can run from Airflow, a Jupyter notebook, or any client pointing to spark://<master>:7077.
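As an illustration, any client with the Spark binaries can submit an application directly to the standalone master. This is a sketch only: the service name and namespace are placeholders, and the example class and JAR path assume the stock Spark image layout — adjust them to your deployment.

```shell
# Illustrative only: submit the bundled SparkPi example to the standalone master.
# <spark-master-service> and <namespace> are placeholders; discover the service
# with `kubectl get svc -n <namespace>`.
spark-submit \
  --master spark://<spark-master-service>.<namespace>.svc.cluster.local:7077 \
  --deploy-mode client \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.memory=2g \
  local:///opt/spark/examples/jars/spark-examples_2.13-4.0.0.jar 100
```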

Catalogs and table formats

The tdp-spark chart supports optional blocks for Delta Lake and Iceberg — the two open table formats in TDP.

Enabling these blocks (deltaLake.enabled, iceberg.enabled) configures the Spark properties required to read and write in these formats, using Hive Metastore as the table catalog.
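For example, both blocks could be switched on in a values file (a minimal sketch; the keys come from the chart's values.yaml, and additional Spark properties may still be required, as described under "Delta Lake and Iceberg"):

```yaml
# Minimal sketch: enable both optional table-format blocks.
deltaLake:
  enabled: true
iceberg:
  enabled: true
```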

This page describes the Apache Spark configuration in TDP Kubernetes. The tdp-spark chart packages an upstream Spark chart and adds platform-specific configuration and integrations (Jupyter, Airflow, optional Delta Lake and Iceberg blocks).

Overview

| Property | Value |
| --- | --- |
| Chart | tdp-spark |
| Spark Version | 4.0.0 |
| Chart Version | 3.0.0 |
| Registry (example) | oci://registry.tecnisys.com.br/tdp/charts/tdp-spark |

Main parameters (chart)

| Parameter | Description | Typical default |
| --- | --- | --- |
| spark.enabled | Enable Spark | true |
| spark.image.repository / spark.image.tag | Spark image | See chart values.yaml |
| spark.master.resources | Master resources | See values.yaml |
| spark.worker.replicaCount | Workers | 2 |
| spark.worker.autoscaling.enabled | HPA on workers | false |
| spark.historyServer.enabled | History Server | true |
| spark.sparkConf | Spark configuration map | See values.yaml |
| hadoopConfig | Values for core-site.xml rendered by the chart | See values.yaml |
| customSparkConfig.properties | Content of spark-defaults.conf | See values.yaml |
| integration.jupyter.enabled | Jupyter integration block | true |
| integration.airflow.enabled | Airflow integration block | false |
| deltaLake.enabled | Optional Delta Lake block | false |
| iceberg.enabled | Optional Iceberg block | false |

The chart uses a serviceAccount block at the root and spark.serviceAccount for the upstream dependency; both are in values.yaml.
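As a sketch of how the three configuration surfaces from the table fit together in one values file (keys as listed above; the property values themselves are illustrative placeholders):

```yaml
# Illustrative values fragment: one deployment can combine all three surfaces.
spark:
  sparkConf:                               # per-application Spark properties
    "spark.eventLog.enabled": "true"
hadoopConfig:                              # rendered into core-site.xml
  "fs.s3a.path.style.access": "true"
customSparkConfig:
  properties: |                            # rendered as spark-defaults.conf
    spark.serializer org.apache.spark.serializer.KryoSerializer
```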

Installation

Use your organization's OCI registry and replace <release> and <namespace>:

Example
helm install <release> oci://registry.tecnisys.com.br/tdp/charts/tdp-spark \
  -n <namespace> --create-namespace

Adjust resources and replicas via your values file or --set, according to your environment needs (development typically uses less CPU/memory than production).
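For instance, a small development values file might look like the following. The numbers are illustrative only, and the resources schema is assumed to follow the usual Kubernetes requests/limits shape — confirm against the chart's values.yaml.

```yaml
# Illustrative sizing for a development environment; tune for production.
spark:
  worker:
    replicaCount: 2
  master:
    resources:
      requests:
        cpu: 500m
        memory: 1Gi
      limits:
        cpu: "1"
        memory: 2Gi
```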

Spark configuration (sparkConf and files)

Executor and driver (example)

spark:
  sparkConf:
    "spark.executor.instances": "2"
    "spark.executor.cores": "2"
    "spark.executor.memory": "4g"
    "spark.driver.memory": "2g"
    "spark.driver.cores": "1"

Event log (local example)

spark:
  sparkConf:
    "spark.eventLog.enabled": "true"
    "spark.eventLog.dir": "file:///tmp/spark-events"
    "spark.history.fs.logDirectory": "file:///tmp/spark-events"

Object storage (S3/S3A)

Credentials and endpoints must be provided by the client (Secrets, values injected by CI/CD, etc.). The chart accepts properties in:

  • spark.sparkConf (e.g. spark.hadoop.fs.s3a.*)
  • hadoopConfig (e.g. fs.s3a.*)
  • customSparkConfig.properties

Example with placeholders:

spark:
  sparkConf:
    "spark.hadoop.fs.s3a.endpoint": "https://<s3-endpoint>"
    "spark.hadoop.fs.s3a.path.style.access": "true"
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
    "spark.hadoop.fs.defaultFS": "s3a://<bucket-warehouse>"
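The same S3A settings can also be supplied through hadoopConfig (without the spark.hadoop. prefix). The credentials below are placeholders — inject them from a Secret or your CI/CD pipeline, never commit them to a values file:

```yaml
# Placeholder credentials: inject from a Secret or CI/CD, do not commit.
hadoopConfig:
  "fs.s3a.endpoint": "https://<s3-endpoint>"
  "fs.s3a.access.key": "<access-key>"
  "fs.s3a.secret.key": "<secret-key>"
  "fs.s3a.path.style.access": "true"
```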

Delta Lake and Iceberg

The deltaLake and iceberg blocks in values.yaml are optional. Enabling them alone does not apply all the required Spark properties: use customSparkConfig.properties and spark.sparkConf for the dependencies and parameters required by your version of Spark and the connectors.

Illustrative example (Delta — adjust packages/versions to your image):

spark:
  sparkConf:
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog"
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension"
    "spark.sql.warehouse.dir": "s3a://<bucket>/delta"

Illustrative example (Iceberg with Hive Metastore — replace URI and warehouse):

spark:
  sparkConf:
    "spark.sql.catalog.iceberg": "org.apache.iceberg.spark.SparkCatalog"
    "spark.sql.catalog.iceberg.type": "hive"
    "spark.sql.catalog.iceberg.uri": "thrift://<hive-metastore-host>:9083"
    "spark.sql.catalog.iceberg.warehouse": "s3a://<bucket>/hive"
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"

Horizontal Pod Autoscaling (workers)

spark:
  worker:
    autoscaling:
      enabled: true
      minReplicas: 2
      maxReplicas: 10
      targetCPU: 70
      targetMemory: 80

HPA requires Metrics Server in the cluster.
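To verify that Metrics Server is available before enabling the HPA, a quick check could look like this (the kube-system namespace is the common location, but it may vary by distribution):

```shell
# Check that Metrics Server is deployed and that pod metrics are being served.
kubectl get deployment metrics-server -n kube-system
kubectl top pods -n <namespace>
```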

Integrations (Jupyter and Airflow)

The chart renders ConfigMaps with configuration for other components:

  • integration.jupyter.sparkConfig → defaults for Jupyter environments
  • integration.airflow.sparkConfig → defaults for Spark clients in Airflow

Use the actual hostname and port of the Spark master in your namespace (discover with kubectl get svc -n <namespace>). Generic example:

integration:
  jupyter:
    enabled: true
    sparkConfig:
      "spark.master": "spark://<spark-master-service>.<namespace>.svc.cluster.local:7077"
  airflow:
    enabled: true
    sparkConfig:
      "spark.master": "spark://<spark-master-service>.<namespace>.svc.cluster.local:7077"
      "spark.driver.memory": "1g"
      "spark.executor.memory": "2g"
      "spark.executor.cores": "1"

Enabling only the Jupyter integration:

Example
helm upgrade --install <release> oci://registry.tecnisys.com.br/tdp/charts/tdp-spark \
  -n <namespace> \
  --set integration.jupyter.enabled=true

Access and troubleshooting

kubectl -n <namespace> get svc -l app.kubernetes.io/instance=<release>
kubectl -n <namespace> get pods
kubectl -n <namespace> get events --sort-by=.lastTimestamp

Uninstallation

helm uninstall <release> -n <namespace>

For the full list of parameters, see the chart's values.yaml.