Version 3.0.0

Integrations — JupyterLab

Integration with Spark

The tdp-jupyter chart integrates with Apache Spark via the tdpSparkIntegration mechanism. When the integration is enabled, a ConfigMap (tdp-jupyter-spark-integration) is created with spark-defaults.conf configuration and a jupyter-spark-env.sh helper script.

Notebook pods mount this ConfigMap at /opt/bitnami/spark/conf and run the script during postStart, so that each Spark session automatically finds the correct master.

For end users, the main choices are straightforward:

  • use local PySpark for quick tests and development;
  • use an external Spark cluster when you want distributed processing;
  • use Iceberg from the notebook only after Spark integration and the Iceberg catalog are already configured in the environment.

Modes of operation

| Mode | tdpSparkIntegration.enabled | Resolved value for spark.master | Typical use |
| --- | --- | --- | --- |
| Local PySpark | false | local[*] | Runs Spark inside the notebook pod itself (default for development) |
| External cluster | true | spark://<release>-spark-master-svc.<namespace>.svc.cluster.local:7077 | Connects to an existing Spark deployment |

Tip: The spark.master parameter in values.yaml is left empty by default. The template chooses the correct value at render time based on tdpSparkIntegration.enabled. You can still supply a custom URL if needed.
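
The resolution rule in the table can be sketched in Python. This is only an illustration of the documented behaviour, not the chart's actual template logic; the function name and its parameters are hypothetical, and the precedence of a custom URL over the automatic value is an assumption based on the tip above.

```python
def resolve_spark_master(enabled: bool, release: str, namespace: str,
                         custom_master: str = "") -> str:
    """Illustrative re-statement of how spark.master is resolved (hypothetical helper)."""
    # Assumption: a non-empty custom URL wins over the automatic value.
    if custom_master:
        return custom_master
    # Integration disabled: Spark runs inside the notebook pod.
    if not enabled:
        return "local[*]"
    # Integration enabled: point at the Spark master service of the release.
    return f"spark://{release}-spark-master-svc.{namespace}.svc.cluster.local:7077"


print(resolve_spark_master(False, "tdp", "tdp-project"))
print(resolve_spark_master(True, "tdp", "tdp-project"))
```

With the example release tdp and namespace tdp-project, the second call yields the same service address used as an example later on this page.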

Components involved

| Component | Purpose |
| --- | --- |
| templates/spark-integration-configmap.yaml | Renders Spark defaults and the environment helper script |
| singleuser.extraEnv | Defines Spark-related environment variables for each notebook pod |
| singleuser.lifecycleHooks.postStart | Sources jupyter-spark-env.sh before JupyterLab starts |
| singleuser.networkPolicy.egress | Allows notebook pods to reach the Spark master and auxiliary services |

Environment variables injected into notebook pods

SPARK_HOME=/opt/bitnami/spark
PYTHONPATH=/opt/bitnami/spark/python:/opt/bitnami/spark/python/lib/py4j-0.10.9.7-src.zip
SPARK_CONF_DIR=/opt/bitnami/spark/conf
PYSPARK_PYTHON=/opt/conda/envs/py312/bin/python
PYSPARK_DRIVER_PYTHON=/opt/conda/envs/py312/bin/python
SPARK_MASTER_URL=<auto> # local[*] or spark://... based on tdpSparkIntegration.enabled
SPARK_DRIVER_PORT=2222
SPARK_BLOCKMANAGER_PORT=7777
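
A quick way to confirm from a notebook cell that these variables were injected is sketched below. The helper name and the choice of which variables count as required are assumptions made for illustration.

```python
import os

# Variables listed above that a Spark session cannot do without
# (the selection is an assumption for this sketch).
REQUIRED_SPARK_VARS = (
    "SPARK_HOME",
    "PYTHONPATH",
    "SPARK_CONF_DIR",
    "PYSPARK_PYTHON",
    "PYSPARK_DRIVER_PYTHON",
    "SPARK_MASTER_URL",
)


def missing_spark_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SPARK_VARS if not env.get(name)]


print("Missing Spark variables:", missing_spark_vars())
```

An empty list means every variable is present; any name that is printed points at a gap in singleuser.extraEnv.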

Volumes mounted in notebook pods

| Path | Type | Contents |
| --- | --- | --- |
| /opt/bitnami/spark/conf | ConfigMap | spark-defaults.conf and helper scripts |
| /tmp/spark-local | emptyDir | Spark shuffle and temporary data |
| /tmp/spark-logs | emptyDir | Spark driver logs |
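
To see which properties the mounted spark-defaults.conf actually carries, the file can be read from a notebook cell. The parser below is a minimal sketch that assumes the common "key, whitespace, value" layout with # comments; the function name is hypothetical and other property-file syntaxes are not handled.

```python
from pathlib import Path


def parse_spark_defaults(text: str) -> dict:
    """Parse whitespace-separated 'key value' lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        props[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return props


# Only read the mounted file when it exists (i.e. inside a notebook pod).
conf_file = Path("/opt/bitnami/spark/conf/spark-defaults.conf")
if conf_file.exists():
    for key, value in parse_spark_defaults(conf_file.read_text()).items():
        print(f"{key} = {value}")
```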

How to configure

Mode 1 — Local PySpark (default)

This mode does not require an external Spark cluster. Spark runs inside the notebook pod with local[*]:

tdpSparkIntegration:
  enabled: false
  deploySparkCluster: false
  configMap:
    sparkConfig:
      "spark.master": "" # resolves to local[*]

Test in a notebook:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Local PySpark").getOrCreate()
print(spark.sparkContext.master) # local[*]

Mode 2 — External Spark cluster

This mode connects notebooks to an existing Spark deployment in the Kubernetes cluster:

tdpSparkIntegration:
  enabled: true
  deploySparkCluster: false # false = point to existing deployment
  configMap:
    sparkConfig:
      "spark.kubernetes.namespace": "tdp-project" # optional
      "spark.master": "" # resolves to spark://<release>-spark-master-svc.<ns>:7077
      "spark.driver.host": "" # leave empty to use the notebook admin service
      "spark.executor.instances": "2"
      "spark.executor.memory": "4g"
      "spark.executor.cores": "3"

tdp-spark:
  spark:
    worker:
      replicaCount: 2
      resources:
        limits:
          cpu: 4
          memory: 6Gi
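
As a rough sanity check on the values above, the sketch below estimates how many executors fit on one worker. It assumes each worker offers Spark exactly the CPU and memory of its Kubernetes limits, which may not hold in a real deployment, and it ignores any memory overhead. Both helper names are hypothetical.

```python
def gib(size: str) -> int:
    """Read a whole number of GiB from a Spark ('4g') or Kubernetes ('6Gi') size."""
    s = size.strip()
    for suffix in ("Gi", "g"):
        if s.endswith(suffix):
            return int(s[: -len(suffix)])
    raise ValueError(f"unsupported size: {size!r}")


def executors_per_worker(executor_cores: int, executor_memory: str,
                         worker_cpu: int, worker_memory: str) -> int:
    """Executors that fit on one worker, limited by whichever resource runs out first."""
    by_cpu = worker_cpu // executor_cores
    by_memory = gib(worker_memory) // gib(executor_memory)
    return min(by_cpu, by_memory)


per_worker = executors_per_worker(3, "4g", 4, "6Gi")
print("Executors per worker:", per_worker)        # 1
print("Executors on 2 workers:", per_worker * 2)  # 2
```

Under these assumptions the two workers hold one executor each, which matches spark.executor.instances set to 2.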

Ensure the Spark master service is reachable from the notebook namespace (e.g. tdp-spark-master-svc.tdp-project.svc.cluster.local:7077).

NetworkPolicy considerations

  • Notebook pods add an egress rule matching any Spark master (app.kubernetes.io/component: master, app.kubernetes.io/name: spark).
  • If the Spark chart has its own NetworkPolicy, allow inbound connections from the notebook namespace.

Mode 3 — Bundled Spark cluster (optional)

Set tdpSparkIntegration.deploySparkCluster: true to install the tdp-spark subchart alongside JupyterHub:

tdpSparkIntegration:
  enabled: true
  deploySparkCluster: true

Adjust the stack/charts/tdp/tdp-spark values as required.

Using Iceberg from Jupyter

Iceberg support in Jupyter is not a separate tdp-jupyter integration. In practice it happens through Spark:

  • the notebook connects to Spark;
  • Spark must know the Iceberg catalog;
  • the Iceberg catalog needs access to Hive Metastore and S3/MinIO storage.

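Therefore, use Iceberg from a notebook only after Spark integration and the Iceberg catalog are already configured in the environment.

Note: Do not treat Iceberg as mandatory for Jupyter. It is an optional scenario for notebooks that need to query or maintain Iceberg tables through Spark.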
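
For orientation, the sketch below assembles the Spark properties that a Hive-backed Iceberg catalog typically needs. The catalog name, Metastore URI and warehouse location are hypothetical placeholders, not values shipped with the chart, and the Iceberg Spark runtime is assumed to be available to Spark already.

```python
def iceberg_catalog_conf(catalog: str, metastore_uri: str, warehouse: str) -> dict:
    """Build Spark properties for an Iceberg catalog backed by Hive Metastore."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "hive",
        f"{prefix}.uri": metastore_uri,
        f"{prefix}.warehouse": warehouse,
    }


# Hypothetical values; replace them with the ones used in your environment.
conf = iceberg_catalog_conf(
    "tdp_iceberg", "thrift://hive-metastore:9083", "s3a://warehouse/iceberg"
)
for key, value in conf.items():
    print(f"{key} = {value}")

# In a notebook the properties could then be applied to the session builder:
# from pyspark.sql import SparkSession
# builder = SparkSession.builder.appName("Iceberg from Jupyter")
# for key, value in conf.items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
# spark.sql("SHOW NAMESPACES IN tdp_iceberg").show()
```

If the environment already sets these properties in spark-defaults.conf, nothing needs to be configured in the notebook itself.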


Installing or upgrading JupyterHub

Terminal input
helm upgrade --install <release> \
  oci://registry.tecnisys.com.br/tdp/charts/tdp-jupyter \
  -n <namespace> \
  -f values.yaml

After upgrades

Whenever ConfigMaps or environment variables are modified, restart user pods (Stop Server → Start Server in JupyterHub) for the new settings to take effect.


Verification checklist

  1. Pods running

    Terminal input
    kubectl get pods -n <namespace> | grep jupyter
  2. Spark ConfigMap created

    Terminal input
    kubectl get configmap tdp-jupyter-spark-integration -n <namespace> -o yaml
  3. Network connectivity (from a notebook pod)

    Terminal input
    kubectl exec -n <namespace> <pod> -- \
    curl -sv tdp-spark-master-svc.<namespace>.svc.cluster.local:7077
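
The connectivity check in step 3 can also be run from a notebook cell without curl. The helper below is a minimal sketch that only tests whether a TCP connection can be opened; it does not speak the Spark protocol, and the function name is hypothetical.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False
```

For example, can_reach("tdp-spark-master-svc.<namespace>.svc.cluster.local", 7077) with your namespace filled in should return True when the egress rule and the Spark service are in place.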

Testing the integration

Provided test notebook

The chart includes a test notebook (tdp-jupyter-spark-test ConfigMap). To extract it:

Terminal input
kubectl get configmap tdp-jupyter-spark-test -n <namespace> \
-o jsonpath='{.data.spark-integration-test\.ipynb}' \
> spark-integration-test.ipynb

Upload the notebook through JupyterLab and execute each cell.
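
Before uploading, it can be useful to look at what the extracted notebook will run. The sketch below relies only on the standard notebook JSON layout (a top-level cells list whose entries have cell_type and source); the helper name is hypothetical.

```python
import json
from pathlib import Path


def code_cells(notebook_json: str) -> list:
    """Return the source of every code cell in a notebook given as a JSON string."""
    notebook = json.loads(notebook_json)
    sources = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            source = cell.get("source", "")
            # 'source' may be a single string or a list of lines.
            sources.append("".join(source) if isinstance(source, list) else source)
    return sources


# Only inspect the file if it was extracted into the current directory.
notebook_file = Path("spark-integration-test.ipynb")
if notebook_file.exists():
    for number, source in enumerate(code_cells(notebook_file.read_text()), start=1):
        print(f"--- cell {number} ---\n{source}")
```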

Manual smoke test

Run the following code in a notebook to validate the integration:

import os
from pyspark.sql import SparkSession

print("SPARK_HOME:", os.environ.get("SPARK_HOME"))
print("SPARK_MASTER_URL:", os.environ.get("SPARK_MASTER_URL"))

spark = SparkSession.builder.appName("TDP-Jupyter Smoke Test").getOrCreate()
print("Spark version:", spark.version)
print("Active master:", spark.sparkContext.master)

spark.range(5).show()
spark.stop()

Troubleshooting

| Symptom | Likely cause | Suggested action |
| --- | --- | --- |
| JAVA_GATEWAY_EXITED or Py4J errors | SPARK_HOME/PYTHONPATH misconfigured | Ensure singleuser.extraEnv uses /opt/bitnami/spark paths |
| IllegalStateException: Cannot call methods on a stopped SparkContext | Spark master unreachable or NetworkPolicy blocking egress/ingress | Confirm tdpSparkIntegration.enabled matches intent, verify Spark service, adjust NetworkPolicies |
| Notebook pod crashes at startup (ImportError for zmq) | PYTHONPATH polluted with PySpark site-packages | Do not append /opt/conda/envs/py312/lib/python3.12/site-packages to PYTHONPATH |
| Spark driver cannot bind/communicate | SPARK_DRIVER_HOST not resolvable | Leave blank to use the notebook admin service or supply a reachable DNS entry |
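
The PYTHONPATH problem from the table can be spotted with a few lines of Python. This is a minimal sketch with a hypothetical helper name; it simply flags any PYTHONPATH entry that ends in site-packages.

```python
import os


def site_packages_entries(pythonpath: str) -> list:
    """Return PYTHONPATH entries that point at a site-packages directory."""
    entries = [entry for entry in pythonpath.split(":") if entry]
    return [entry for entry in entries if entry.rstrip("/").endswith("site-packages")]


suspicious = site_packages_entries(os.environ.get("PYTHONPATH", ""))
print("site-packages entries in PYTHONPATH:", suspicious)
```

An empty list is the expected result for the PYTHONPATH value documented on this page.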

Useful debug commands

Terminal input
# Notebook pod logs
kubectl logs -n <namespace> <notebook-pod>

# Inspect Spark env vars inside the pod
kubectl exec -n <namespace> <notebook-pod> -- env | grep SPARK

# List mounted files
kubectl exec -n <namespace> <notebook-pod> -- ls -R /opt/bitnami/spark/conf

# Check Spark master service endpoints
kubectl get svc -n <namespace> | grep spark-master

Advanced customisation

  • Append extra Spark properties under tdpSparkIntegration.configMap.sparkConfig.
  • Define notebook size presets via singleuser.profileList and adjust per-profile Spark environment variables.
  • When running multiple Spark clusters, override spark.master per profile or via user environment.

Cleanup

Terminal input
helm uninstall <release> -n <namespace>
kubectl delete configmap tdp-jupyter-spark-integration -n <namespace> --ignore-not-found