Integrations — JupyterLab
Integration with Spark
The tdp-jupyter chart integrates with Apache Spark via the tdpSparkIntegration mechanism. When the integration is enabled, a ConfigMap (tdp-jupyter-spark-integration) is created with spark-defaults.conf configuration and a jupyter-spark-env.sh helper script.
Notebook pods mount this ConfigMap at /opt/bitnami/spark/conf and run the script during postStart, so that each Spark session automatically finds the correct master.
For end users, the main choices are straightforward:
- use local PySpark for quick tests and development;
- use an external Spark cluster when you want distributed processing;
- use Iceberg from the notebook only after Spark integration and the Iceberg catalog are already configured in the environment.
Modes of operation
| Mode | tdpSparkIntegration.enabled | Resolved value for spark.master | Typical use |
|---|---|---|---|
| Local PySpark | false | local[*] | Runs Spark inside the notebook pod itself (default for development) |
| External cluster | true | spark://<release>-spark-master-svc.<namespace>.svc.cluster.local:7077 | Connects to an existing Spark deployment |
The spark.master parameter in values.yaml is left empty by default. The template chooses the correct value at render time based on tdpSparkIntegration.enabled. You can still supply a custom URL if needed.
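The resolution rule described above can be sketched in Python. This only illustrates the documented behaviour and is not the chart's Helm template; the function `resolve_spark_master`, its parameters, and the release name `tdp` are hypothetical, and the precedence given to a custom URL is an assumption.

```python
def resolve_spark_master(enabled: bool, release: str, namespace: str, custom: str = "") -> str:
    """Mimic the documented choice of spark.master (illustrative only)."""
    # Assumption: a non-empty custom URL takes precedence over the automatic value.
    if custom:
        return custom
    if enabled:
        # External cluster: address of the Spark master service inside the cluster.
        return f"spark://{release}-spark-master-svc.{namespace}.svc.cluster.local:7077"
    # Integration disabled: Spark runs inside the notebook pod.
    return "local[*]"


print(resolve_spark_master(False, "tdp", "tdp-project"))  # local[*]
print(resolve_spark_master(True, "tdp", "tdp-project"))
# spark://tdp-spark-master-svc.tdp-project.svc.cluster.local:7077
```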
Components involved
| Component | Purpose |
|---|---|
| templates/spark-integration-configmap.yaml | Renders Spark defaults and the environment helper script |
| singleuser.extraEnv | Defines Spark-related environment variables for each notebook pod |
| singleuser.lifecycleHooks.postStart | Sources jupyter-spark-env.sh before JupyterLab starts |
| singleuser.networkPolicy.egress | Allows notebook pods to reach the Spark master and auxiliary services |
Environment variables injected into notebook pods
SPARK_HOME=/opt/bitnami/spark
PYTHONPATH=/opt/bitnami/spark/python:/opt/bitnami/spark/python/lib/py4j-0.10.9.7-src.zip
SPARK_CONF_DIR=/opt/bitnami/spark/conf
PYSPARK_PYTHON=/opt/conda/envs/py312/bin/python
PYSPARK_DRIVER_PYTHON=/opt/conda/envs/py312/bin/python
SPARK_MASTER_URL=<auto> # local[*] or spark://... based on tdpSparkIntegration.enabled
SPARK_DRIVER_PORT=2222
SPARK_BLOCKMANAGER_PORT=7777
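To confirm from inside a notebook that these variables were actually injected, a small check such as the one below may help. It is a minimal sketch: the helper name `missing_spark_vars` is hypothetical, and it only tests for presence, not for correct values.

```python
import os

# Variable names taken from the list above.
EXPECTED_VARS = [
    "SPARK_HOME",
    "PYTHONPATH",
    "SPARK_CONF_DIR",
    "PYSPARK_PYTHON",
    "PYSPARK_DRIVER_PYTHON",
    "SPARK_MASTER_URL",
    "SPARK_DRIVER_PORT",
    "SPARK_BLOCKMANAGER_PORT",
]


def missing_spark_vars(env=None):
    """Return the expected variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in EXPECTED_VARS if not env.get(name)]


print("Missing variables:", missing_spark_vars() or "none")
```

An empty result means every variable is present; it says nothing about whether the paths they point to exist.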
Volumes mounted in notebook pods
| Path | Type | Contents |
|---|---|---|
| /opt/bitnami/spark/conf | ConfigMap | spark-defaults.conf and helper scripts |
| /tmp/spark-local | emptyDir | Spark shuffle and temporary data |
| /tmp/spark-logs | emptyDir | Spark driver logs |
How to configure
Mode 1 — Local PySpark (default)
Does not require an external Spark cluster. Spark runs inside the notebook pod with local[*]:
tdpSparkIntegration:
  enabled: false
  deploySparkCluster: false
  configMap:
    sparkConfig:
      "spark.master": "" # resolves to local[*]
Test in a notebook:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Local PySpark").getOrCreate()
print(spark.sparkContext.master) # local[*]
Mode 2 — External Spark cluster
Connects notebooks to an existing Spark deployment in the Kubernetes cluster:
tdpSparkIntegration:
  enabled: true
  deploySparkCluster: false # false = point to existing deployment
  configMap:
    sparkConfig:
      "spark.kubernetes.namespace": "tdp-project" # optional
      "spark.master": "" # resolves to spark://<release>-spark-master-svc.<ns>:7077
      "spark.driver.host": "" # leave empty to use the notebook admin service
      "spark.executor.instances": "2"
      "spark.executor.memory": "4g"
      "spark.executor.cores": "3"
tdp-spark:
  spark:
    worker:
      replicaCount: 2
      resources:
        limits:
          cpu: 4
          memory: 6Gi
Ensure the Spark master service is reachable from the notebook namespace (e.g. tdp-spark-master-svc.tdp-project.svc.cluster.local:7077).
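As a rough sanity check on the executor and worker sizes in the example values, the arithmetic below estimates how many executors fit on each worker. It is a simplified sketch: it assumes each worker offers its full CPU and memory limits to Spark, treats "4g" as 4 GiB, and ignores JVM and container overhead, so real capacity will be somewhat lower. The function name is hypothetical.

```python
def executors_per_worker(worker_cpu, worker_mem_gib, executor_cores, executor_mem_gib):
    """Estimate how many executors fit on one worker (ignores overhead)."""
    by_cpu = worker_cpu // executor_cores               # limited by cores
    by_mem = int(worker_mem_gib // executor_mem_gib)    # limited by memory
    return min(by_cpu, by_mem)


# Numbers from the example values: worker limits 4 CPU / 6Gi,
# executors with 3 cores / 4g, and 2 worker replicas.
per_worker = executors_per_worker(4, 6, 3, 4)
total = per_worker * 2
print(per_worker, total)  # 1 2
```

Under these assumptions the two workers can host the two requested executors, one per worker, with little headroom left on either.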
NetworkPolicy considerations
- Notebook pods add an egress rule matching any Spark master (app.kubernetes.io/component: master, app.kubernetes.io/name: spark).
- If the Spark chart has its own NetworkPolicy, allow inbound connections from the notebook namespace.
Mode 3 — Bundled Spark cluster (optional)
Set tdpSparkIntegration.deploySparkCluster: true to install the tdp-spark subchart alongside JupyterHub:
tdpSparkIntegration:
  enabled: true
  deploySparkCluster: true
Adjust the stack/charts/tdp/tdp-spark values as required.
Using Iceberg from Jupyter
Iceberg support in Jupyter is not a separate tdp-jupyter integration. In practice it happens through Spark:
- the notebook connects to Spark;
- Spark must know the Iceberg catalog;
- the Iceberg catalog needs access to Hive Metastore and S3/MinIO storage.
Therefore:
- Jupyter configuration stays on this page;
- Iceberg catalog configuration is in Integrations — Iceberg;
- Spark configuration is in Integrations — Spark.
Do not treat Iceberg as mandatory for Jupyter. It is an optional scenario for notebooks that need to query or maintain Iceberg tables through Spark.
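Before running Iceberg queries, it can be useful to check whether the Spark session has an Iceberg catalog configured at all. The sketch below inspects a plain dictionary of Spark properties and lists catalogs whose implementation is one of Iceberg's Spark catalog classes. The helper name `iceberg_catalogs` and the catalog name `lake` are hypothetical; the real catalog name depends on how the environment was configured.

```python
# Catalog implementations shipped with the Iceberg Spark runtime.
ICEBERG_CATALOG_CLASSES = {
    "org.apache.iceberg.spark.SparkCatalog",
    "org.apache.iceberg.spark.SparkSessionCatalog",
}
PREFIX = "spark.sql.catalog."


def iceberg_catalogs(conf: dict) -> list:
    """Return the names of catalogs backed by an Iceberg catalog class."""
    names = []
    for key, value in conf.items():
        if not key.startswith(PREFIX):
            continue
        name = key[len(PREFIX):]
        # Keys such as spark.sql.catalog.<name>.type are options, not catalogs.
        if "." not in name and value in ICEBERG_CATALOG_CLASSES:
            names.append(name)
    return sorted(names)


# In a notebook with an active session, the properties can be read with:
#   conf = dict(spark.sparkContext.getConf().getAll())
sample = {
    "spark.sql.catalog.lake": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.lake.type": "hive",
    "spark.master": "local[*]",
}
print(iceberg_catalogs(sample))  # ['lake']
```

An empty list suggests the catalog has not been configured yet and the Iceberg and Spark integration pages should be followed first.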
Installing or upgrading JupyterHub
helm upgrade --install <release> \
oci://registry.tecnisys.com.br/tdp/charts/tdp-jupyter \
-n <namespace> \
-f values.yaml
Whenever ConfigMaps or environment variables are modified, restart user pods (Stop Server → Start Server in JupyterHub) for the new settings to take effect.
Verification checklist
- Pods running:
  kubectl get pods -n <namespace> | grep jupyter
- Spark ConfigMap created:
  kubectl get configmap tdp-jupyter-spark-integration -n <namespace> -o yaml
- Network connectivity (from a notebook pod):
  kubectl exec -n <namespace> <pod> -- \
    curl -sv tdp-spark-master-svc.<namespace>.svc.cluster.local:7077
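If curl is not available in the notebook image, the same connectivity check can be run from a notebook cell with Python's standard library. Port 7077 speaks Spark's RPC protocol rather than HTTP, so what matters is only whether the TCP connection opens. This is a minimal sketch; the helper name `can_connect` is hypothetical.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False


# Example call (replace <namespace> with the Spark namespace):
#   can_connect("tdp-spark-master-svc.<namespace>.svc.cluster.local", 7077)
```

A `False` result does not say which side blocked the connection, so check both the notebook egress rule and any NetworkPolicy on the Spark side.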
Testing the integration
Provided test notebook
The chart includes a test notebook (tdp-jupyter-spark-test ConfigMap). To extract it:
kubectl get configmap tdp-jupyter-spark-test -n <namespace> \
-o jsonpath='{.data.spark-integration-test\.ipynb}' \
> spark-integration-test.ipynb
Upload the notebook through JupyterLab and execute each cell.
Manual smoke test
Run the following code in a notebook to validate the integration:
import os
from pyspark.sql import SparkSession
print("SPARK_HOME:", os.environ.get("SPARK_HOME"))
print("SPARK_MASTER_URL:", os.environ.get("SPARK_MASTER_URL"))
spark = SparkSession.builder.appName("TDP-Jupyter Smoke Test").getOrCreate()
print("Spark version:", spark.version)
print("Active master:", spark.sparkContext.master)
spark.range(5).show()
spark.stop()
Troubleshooting
| Symptom | Likely cause | Suggested action |
|---|---|---|
| JAVA_GATEWAY_EXITED or Py4J errors | SPARK_HOME/PYTHONPATH misconfigured | Ensure singleuser.extraEnv uses /opt/bitnami/spark paths |
| IllegalStateException: Cannot call methods on a stopped SparkContext | Spark master unreachable or NetworkPolicy blocking egress/ingress | Confirm tdpSparkIntegration.enabled matches intent, verify Spark service, adjust NetworkPolicies |
| Notebook pod crashes at startup (ImportError for zmq) | PYTHONPATH polluted with PySpark site-packages | Do not append /opt/conda/envs/py312/lib/python3.12/site-packages to PYTHONPATH |
| Spark driver cannot bind/communicate | SPARK_DRIVER_HOST not resolvable | Leave blank to use the notebook admin service or supply a reachable DNS entry |
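The PYTHONPATH symptom in the table can be checked from a terminal or notebook with a few lines of Python. The sketch below flags any PYTHONPATH entry that ends in site-packages, which the table advises against; the helper name `site_packages_entries` is hypothetical.

```python
import os


def site_packages_entries(pythonpath: str) -> list:
    """Return PYTHONPATH entries that point at a site-packages directory."""
    entries = [p for p in pythonpath.split(os.pathsep) if p]
    return [p for p in entries if p.rstrip("/").endswith("site-packages")]


suspicious = site_packages_entries(os.environ.get("PYTHONPATH", ""))
print("Suspicious PYTHONPATH entries:", suspicious or "none")
```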
Useful debug commands
# Notebook pod logs
kubectl logs -n <namespace> <notebook-pod>
# Inspect Spark env vars inside the pod
kubectl exec -n <namespace> <notebook-pod> -- env | grep SPARK
# List mounted files
kubectl exec -n <namespace> <notebook-pod> -- ls -R /opt/bitnami/spark/conf
# Check Spark master service endpoints
kubectl get svc -n <namespace> | grep spark-master
Advanced customisation
- Append extra Spark properties under tdpSparkIntegration.configMap.sparkConfig.
- Define notebook size presets via singleuser.profileList and adjust per-profile Spark environment variables.
- When running multiple Spark clusters, override spark.master per profile or via user environment.
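For the last point, one way to override the master from the notebook side is to pick the URL explicitly when building the session. The sketch below is an assumption about how this could be done, not something the chart provides: `choose_master` and `session_for` are hypothetical helpers, and the fallback order (explicit value, then SPARK_MASTER_URL, then local[*]) is illustrative.

```python
import os


def choose_master(explicit=None, env=None):
    """Pick a master URL: explicit value, then SPARK_MASTER_URL, then local[*]."""
    env = os.environ if env is None else env
    return explicit or env.get("SPARK_MASTER_URL") or "local[*]"


def session_for(master_url, app_name="profile-override"):
    """Build a Spark session against the given master."""
    # Imported lazily so the helpers can be defined where PySpark is absent.
    from pyspark.sql import SparkSession

    return SparkSession.builder.master(master_url).appName(app_name).getOrCreate()


# Example call inside a notebook:
#   spark = session_for(choose_master("spark://<other-master>:7077"))
```

Because `getOrCreate()` returns an already active session if one exists, stop any running session with `spark.stop()` before switching to a different master.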
Cleanup
helm uninstall <release> -n <namespace>
kubectl delete configmap tdp-jupyter-spark-integration -n <namespace>