Spark Configuration
What is Apache Spark in TDP?
Apache Spark is TDP's large-scale data processing engine.
While Trino is optimized for interactive SQL queries, Spark is optimized for batch data transformations and ETL pipelines — processing massive data volumes, applying complex transformations, and writing results back to storage.
In TDP Kubernetes, Spark is used in two main ways:
- Through Airflow: scheduled pipelines that process data with SparkSubmitOperator
- Through JupyterHub: interactive notebook analyses that connect to the Spark cluster
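As a sketch of the Airflow path: a SparkSubmitOperator task ultimately assembles a spark-submit invocation against the standalone master. The small helper below mirrors that assembly; the master URL, application path, and conf values are illustrative placeholders, not chart defaults.

```python
def build_spark_submit(master, app, conf=None):
    """Assemble a spark-submit command line as a list of arguments."""
    cmd = ["spark-submit", "--master", master]
    for key, value in (conf or {}).items():
        # Each Spark property is passed as a --conf key=value pair
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    return cmd

cmd = build_spark_submit(
    "spark://spark-master:7077",         # standalone master URL (placeholder host)
    "local:///opt/jobs/etl_job.py",      # application path (hypothetical)
    conf={"spark.executor.memory": "2g"},
)
```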
See Apache Spark — Concepts for a complete overview of the tool, its architecture and how it works.
Master-worker architecture
Spark in TDP uses the standalone architecture (master + workers):
| Component | Role |
|---|---|
| Master | Manages the cluster, accepts applications and distributes work |
| Workers | Run the executors for Spark applications |
| History Server | Web interface for viewing execution history |
The Spark application driver (the job submitter) can be Airflow, a Jupyter notebook, or any client pointing to spark://<master>:7077.
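Inside the cluster, that master address is normally the Kubernetes service DNS name of the master. A minimal sketch of deriving it (service and namespace names are placeholders for your deployment):

```python
def master_url(service: str, namespace: str, port: int = 7077) -> str:
    """Build the standalone master URL from the in-cluster service DNS name."""
    return f"spark://{service}.{namespace}.svc.cluster.local:{port}"

# Hypothetical service/namespace names; discover yours with `kubectl get svc`
url = master_url("tdp-spark-master-svc", "tdp")
```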
Catalogs and table formats
The tdp-spark chart supports optional blocks for Delta Lake and Iceberg — the two open table formats in TDP.
Enabling these blocks (deltaLake.enabled, iceberg.enabled) configures the Spark properties required to read and write in these formats, using Hive Metastore as the table catalog.
This page describes the Apache Spark configuration in TDP Kubernetes. The tdp-spark chart packages an upstream Spark chart and adds platform-specific configuration and integrations (Jupyter, Airflow, optional Delta Lake and Iceberg blocks).
Overview
| Property | Value |
|---|---|
| Chart | tdp-spark |
| Spark Version | 4.0.0 |
| Chart Version | 3.0.0 |
| Registry (examples) | oci://registry.tecnisys.com.br/tdp/charts/tdp-spark |
Main parameters (chart)
| Parameter | Description | Typical default |
|---|---|---|
| spark.enabled | Enable Spark | true |
| spark.image.repository / spark.image.tag | Spark image | See chart values.yaml |
| spark.master.resources | Master resources | See values.yaml |
| spark.worker.replicaCount | Workers | 2 |
| spark.worker.autoscaling.enabled | HPA on workers | false |
| spark.historyServer.enabled | History Server | true |
| spark.sparkConf | Spark configuration map | See values.yaml |
| hadoopConfig | Values for core-site.xml rendered by the chart | See values.yaml |
| customSparkConfig.properties | Content of spark-defaults.conf | See values.yaml |
| integration.jupyter.enabled | Jupyter integration block | true |
| integration.airflow.enabled | Airflow integration block | false |
| deltaLake.enabled | Optional Delta Lake block | false |
| iceberg.enabled | Optional Iceberg block | false |
The chart uses a serviceAccount block at the root and spark.serviceAccount for the upstream dependency; both are in values.yaml.
Installation
Use your organization's OCI registry and replace <release> and <namespace>:
```shell
helm install <release> oci://registry.tecnisys.com.br/tdp/charts/tdp-spark \
  -n <namespace> --create-namespace
```
Adjust resources and replicas via your values file or --set, according to your environment needs (development typically uses less CPU/memory than production).
Spark configuration (sparkConf and files)
Executor and driver (example)
```yaml
spark:
  sparkConf:
    "spark.executor.instances": "2"
    "spark.executor.cores": "2"
    "spark.executor.memory": "4g"
    "spark.driver.memory": "2g"
    "spark.driver.cores": "1"
```
Event log (local example)
```yaml
spark:
  sparkConf:
    "spark.eventLog.enabled": "true"
    "spark.eventLog.dir": "file:///tmp/spark-events"
    "spark.history.fs.logDirectory": "file:///tmp/spark-events"
```
Object storage (S3/S3A)
Credentials and endpoints must be provided by the client (Secrets, values injected by CI/CD, etc.). The chart accepts properties in:
- spark.sparkConf (e.g. spark.hadoop.fs.s3a.*)
- hadoopConfig (e.g. fs.s3a.*)
- customSparkConfig.properties
Example with placeholders:
```yaml
spark:
  sparkConf:
    "spark.hadoop.fs.s3a.endpoint": "https://<s3-endpoint>"
    "spark.hadoop.fs.s3a.path.style.access": "true"
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
    "spark.hadoop.fs.defaultFS": "s3a://<bucket-warehouse>"
```
Delta Lake and Iceberg
The deltaLake and iceberg blocks in values.yaml are optional. Enabling them alone does not apply all the required Spark properties: use customSparkConfig.properties and spark.sparkConf for the dependencies and parameters required by your version of Spark and the connectors.
Illustrative example (Delta — adjust packages/versions to your image):
```yaml
spark:
  sparkConf:
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog"
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension"
    "spark.sql.warehouse.dir": "s3a://<bucket>/delta"
```
Illustrative example (Iceberg with Hive Metastore — replace URI and warehouse):
```yaml
spark:
  sparkConf:
    "spark.sql.catalog.iceberg": "org.apache.iceberg.spark.SparkCatalog"
    "spark.sql.catalog.iceberg.type": "hive"
    "spark.sql.catalog.iceberg.uri": "thrift://<hive-metastore-host>:9083"
    "spark.sql.catalog.iceberg.warehouse": "s3a://<bucket>/hive"
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
```
Horizontal Pod Autoscaling (workers)
```yaml
spark:
  worker:
    autoscaling:
      enabled: true
      minReplicas: 2
      maxReplicas: 10
      targetCPU: 70
      targetMemory: 80
```
HPA requires Metrics Server in the cluster.
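To reason about what these targets do, recall the HPA scaling rule: desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue), clamped to the min/max bounds. A minimal sketch with the bounds from the example above:

```python
import math

def desired_replicas(current, metric_pct, target_pct, min_r=2, max_r=10):
    """Kubernetes HPA scaling rule: ceil(current * metric / target), clamped."""
    want = math.ceil(current * metric_pct / target_pct)
    return max(min_r, min(max_r, want))

# 2 workers at 140% of the CPU request vs a 70% target -> scale to 4
scaled_up = desired_replicas(2, 140, 70)
# Very high load is capped at maxReplicas; very low load at minReplicas
capped = desired_replicas(2, 700, 70)
floored = desired_replicas(2, 10, 70)
```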
Integrations (Jupyter and Airflow)
The chart renders ConfigMaps with configuration for other components:
- integration.jupyter.sparkConfig → defaults for Jupyter environments
- integration.airflow.sparkConfig → defaults for Spark clients in Airflow
Use the actual hostname and port of the Spark master in your namespace (discover with kubectl get svc -n <namespace>). Generic example:
```yaml
integration:
  jupyter:
    enabled: true
    sparkConfig:
      "spark.master": "spark://<spark-master-service>.<namespace>.svc.cluster.local:7077"
  airflow:
    enabled: true
    sparkConfig:
      "spark.master": "spark://<spark-master-service>.<namespace>.svc.cluster.local:7077"
      "spark.driver.memory": "1g"
      "spark.executor.memory": "2g"
      "spark.executor.cores": "1"
```
Enabling only the Jupyter integration:
```shell
helm upgrade --install <release> oci://registry.tecnisys.com.br/tdp/charts/tdp-spark \
  -n <namespace> \
  --set integration.jupyter.enabled=true
```
Access and troubleshooting
```shell
# Services for the release (master, workers, History Server)
kubectl -n <namespace> get svc -l app.kubernetes.io/instance=<release>
# Pod status
kubectl -n <namespace> get pods
# Recent events, newest last
kubectl -n <namespace> get events --sort-by=.lastTimestamp
```
Uninstallation
```shell
helm uninstall <release> -n <namespace>
```
For the full list of parameters, see the chart's values.yaml.