
Spark Configuration

What is Apache Spark in TDP?

Apache Spark is TDP's large-scale data processing engine.

While Trino is optimized for interactive SQL queries, Spark is optimized for batch data transformations and ETL pipelines — processing massive data volumes, applying complex transformations, and writing results back to storage.

In TDP Kubernetes, Spark is used in two main ways:

  1. Through Airflow: scheduled pipelines that process data with SparkSubmitOperator
  2. Through JupyterHub: interactive notebook analyses that connect to the Spark cluster
Learn more

See Apache Spark — Concepts for a complete overview of the tool, its architecture and how it works.

Master-worker architecture

Spark in TDP uses the standalone architecture (master + workers):

| Component | Role |
| --- | --- |
| Master | Manages the cluster, accepts applications and distributes work |
| Workers | Run the executors for Spark applications |
| History Server | Web interface for viewing execution history |

The Spark application driver (the process that submits the job) can run from Airflow, a Jupyter notebook, or any client pointing to spark://<master>:7077.
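As an illustration, any client with the Spark binaries can submit an application directly to the standalone master. This is a sketch only: the service name and namespace are placeholders, and the example class and JAR path assume the stock Spark image layout — adjust them to your deployment.

```shell
# Illustrative only: submit the bundled SparkPi example to the standalone master.
# <spark-master-service> and <namespace> are placeholders; discover the service
# with `kubectl get svc -n <namespace>`.
spark-submit \
  --master spark://<spark-master-service>.<namespace>.svc.cluster.local:7077 \
  --deploy-mode client \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.memory=2g \
  local:///opt/spark/examples/jars/spark-examples_2.13-4.0.0.jar 100
```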

Catalogs and table formats

The tdp-spark chart supports optional blocks for Delta Lake and Iceberg — the two open table formats in TDP.

Enabling these blocks (deltaLake.enabled, iceberg.enabled) configures the Spark properties required to read and write in these formats, using Hive Metastore as the table catalog.
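For example, both blocks could be switched on in a values file (a minimal sketch; the keys come from the chart's values.yaml, and additional Spark properties may still be required, as described under "Delta Lake and Iceberg"):

```yaml
# Minimal sketch: enable both optional table-format blocks.
deltaLake:
  enabled: true
iceberg:
  enabled: true
```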

This page describes the Apache Spark configuration in TDP Kubernetes. The tdp-spark chart packages an upstream Spark chart and adds platform-specific configuration and integrations (Jupyter, Airflow, optional Delta Lake and Iceberg blocks).

Overview

| Property | Value |
| --- | --- |
| Chart | tdp-spark |
| Spark Version | 4.0.0 |
| Chart Version | 3.0.0 |
| Registry (example) | oci://registry.tecnisys.com.br/tdp/charts/tdp-spark |

Main parameters (chart)

| Parameter | Description | Typical default |
| --- | --- | --- |
| spark.enabled | Enable Spark | true |
| spark.image.repository / spark.image.tag | Spark image | See chart values.yaml |
| spark.master.resources | Master resources | See values.yaml |
| spark.worker.replicaCount | Workers | 2 |
| spark.worker.autoscaling.enabled | HPA on workers | false |
| spark.historyServer.enabled | History Server | true |
| spark.sparkConf | Spark configuration map | See values.yaml |
| hadoopConfig | Values for core-site.xml rendered by the chart | See values.yaml |
| customSparkConfig.properties | Content of spark-defaults.conf | See values.yaml |
| integration.jupyter.enabled | Jupyter integration block | true |
| integration.airflow.enabled | Airflow integration block | false |
| deltaLake.enabled | Optional Delta Lake block | false |
| iceberg.enabled | Optional Iceberg block | false |

The chart uses a serviceAccount block at the root and spark.serviceAccount for the upstream dependency; both are in values.yaml.
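As a sketch of how the three configuration surfaces from the table fit together in one values file (keys as listed above; the property values themselves are illustrative placeholders):

```yaml
# Illustrative values fragment: one deployment can combine all three surfaces.
spark:
  sparkConf:                               # per-application Spark properties
    "spark.eventLog.enabled": "true"
hadoopConfig:                              # rendered into core-site.xml
  "fs.s3a.path.style.access": "true"
customSparkConfig:
  properties: |                            # rendered as spark-defaults.conf
    spark.serializer org.apache.spark.serializer.KryoSerializer
```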

Installation

Use your organization's OCI registry and replace <release> and <namespace>:

Example
helm install <release> oci://registry.tecnisys.com.br/tdp/charts/tdp-spark \
  -n <namespace> --create-namespace

Adjust resources and replicas via your values file or --set, according to your environment needs (development typically uses less CPU/memory than production).
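For instance, a small development values file might look like the following. The numbers are illustrative only, and the resources schema is assumed to follow the usual Kubernetes requests/limits shape — confirm against the chart's values.yaml.

```yaml
# Illustrative sizing for a development environment; tune for production.
spark:
  worker:
    replicaCount: 2
  master:
    resources:
      requests:
        cpu: 500m
        memory: 1Gi
      limits:
        cpu: "1"
        memory: 2Gi
```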

Spark configuration (sparkConf and files)

Executor and driver (example)

spark:
  sparkConf:
    "spark.executor.instances": "2"
    "spark.executor.cores": "2"
    "spark.executor.memory": "4g"
    "spark.driver.memory": "2g"
    "spark.driver.cores": "1"

Event log (local example)

spark:
  sparkConf:
    "spark.eventLog.enabled": "true"
    "spark.eventLog.dir": "file:///tmp/spark-events"
    "spark.history.fs.logDirectory": "file:///tmp/spark-events"

Object storage (S3/S3A)

Credentials and endpoints must be provided by the client (Secrets, values injected by CI/CD, etc.). The chart accepts properties in:

  • spark.sparkConf (e.g. spark.hadoop.fs.s3a.*)
  • hadoopConfig (e.g. fs.s3a.*)
  • customSparkConfig.properties

Example with placeholders:

spark:
  sparkConf:
    "spark.hadoop.fs.s3a.endpoint": "https://<s3-endpoint>"
    "spark.hadoop.fs.s3a.path.style.access": "true"
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
    "spark.hadoop.fs.defaultFS": "s3a://<bucket-warehouse>"
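The same S3A settings can also be supplied through hadoopConfig (without the spark.hadoop. prefix). The credentials below are placeholders — inject them from a Secret or your CI/CD pipeline, never commit them to a values file:

```yaml
# Placeholder credentials: inject from a Secret or CI/CD, do not commit.
hadoopConfig:
  "fs.s3a.endpoint": "https://<s3-endpoint>"
  "fs.s3a.access.key": "<access-key>"
  "fs.s3a.secret.key": "<secret-key>"
  "fs.s3a.path.style.access": "true"
```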

Delta Lake and Iceberg

The deltaLake and iceberg blocks in values.yaml are optional. Enabling them alone does not apply all the required Spark properties: use customSparkConfig.properties and spark.sparkConf for the dependencies and parameters required by your version of Spark and the connectors.

Illustrative example (Delta — adjust packages/versions to your image):

spark:
  sparkConf:
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog"
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension"
    "spark.sql.warehouse.dir": "s3a://<bucket>/delta"

Illustrative example (Iceberg with Hive Metastore — replace URI and warehouse):

spark:
  sparkConf:
    "spark.sql.catalog.iceberg": "org.apache.iceberg.spark.SparkCatalog"
    "spark.sql.catalog.iceberg.type": "hive"
    "spark.sql.catalog.iceberg.uri": "thrift://<hive-metastore-host>:9083"
    "spark.sql.catalog.iceberg.warehouse": "s3a://<bucket>/hive"
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"

Horizontal Pod Autoscaling (workers)

spark:
  worker:
    autoscaling:
      enabled: true
      minReplicas: 2
      maxReplicas: 10
      targetCPU: 70
      targetMemory: 80

HPA requires Metrics Server in the cluster.
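To verify that Metrics Server is available before enabling the HPA, a quick check could look like this (the kube-system namespace is the common location, but it may vary by distribution):

```shell
# Check that Metrics Server is deployed and that pod metrics are being served.
kubectl get deployment metrics-server -n kube-system
kubectl top pods -n <namespace>
```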

Integrations (Jupyter and Airflow)

The chart renders ConfigMaps with configuration for other components:

  • integration.jupyter.sparkConfig → defaults for Jupyter environments
  • integration.airflow.sparkConfig → defaults for Spark clients in Airflow

Use the actual hostname and port of the Spark master in your namespace (discover with kubectl get svc -n <namespace>). Generic example:

integration:
  jupyter:
    enabled: true
    sparkConfig:
      "spark.master": "spark://<spark-master-service>.<namespace>.svc.cluster.local:7077"
  airflow:
    enabled: true
    sparkConfig:
      "spark.master": "spark://<spark-master-service>.<namespace>.svc.cluster.local:7077"
      "spark.driver.memory": "1g"
      "spark.executor.memory": "2g"
      "spark.executor.cores": "1"

Enabling only the Jupyter integration:

Example
helm upgrade --install <release> oci://registry.tecnisys.com.br/tdp/charts/tdp-spark \
  -n <namespace> \
  --set integration.jupyter.enabled=true

Access and troubleshooting

kubectl -n <namespace> get svc -l app.kubernetes.io/instance=<release>
kubectl -n <namespace> get pods
kubectl -n <namespace> get events --sort-by=.lastTimestamp

Uninstallation

helm uninstall <release> -n <namespace>

For the full list of parameters, see the chart's values.yaml.