Version 3.0

Integrations — JupyterLab

Chart version: 3.0.1, Type: application, App version: 5.3.0

Compatibility: Kubernetes 1.32+, OpenShift 4.19+, Rancher 2.10.x+

Integration overview

JupyterHub uses SQLite internally by default; an external PostgreSQL (tdp-postgresql) can replace it — see Jupyter Configuration.

Spark integration

The tdp-jupyter chart integrates with Apache Spark via the tdpSparkIntegration mechanism. When the integration is enabled, a ConfigMap (tdp-jupyter-spark-integration) is created with spark-defaults.conf settings and a helper script jupyter-spark-env.sh.

Notebook pods mount this ConfigMap at /opt/bitnami/spark/conf and run the script during postStart, so each Spark session automatically finds the correct master.

For the end user, the main decision is simple:

use local PySpark for quick testing and development;
use an external Spark cluster when you want to distribute processing;
use Iceberg from notebooks only after the Spark integration and the Iceberg catalog are already configured in the environment.

Operating modes

Mode	`tdpSparkIntegration.enabled`	Resolved `spark.master` value	Typical usage
Local PySpark	`false`	`local[*]`	Runs Spark inside the notebook pod (default for development)
External cluster	`true`	`spark://<RELEASE_NAME>-spark-master-svc.<NAMESPACE>.svc.cluster.local:7077`	Connects to an existing Spark deployment

Tip: The spark.master entry in values.yaml is empty by default. The template chooses the correct value at render time based on tdpSparkIntegration.enabled. You can still provide a custom URL if needed.
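
The selection described above can be sketched in Python. This is only an illustration of the decision, not the chart's actual Helm template: the function name resolve_spark_master and its parameters are hypothetical, and the rule that a non-empty custom URL takes precedence is an assumption based on the remark that a custom URL can still be provided.

```python
def resolve_spark_master(enabled, release_name, namespace, custom_master=""):
    """Mimic how spark.master is chosen (hypothetical helper, not chart code)."""
    # Assumption: a non-empty custom URL overrides the automatic choice.
    if custom_master:
        return custom_master
    # Integration disabled: Spark runs inside the notebook pod.
    if not enabled:
        return "local[*]"
    # Integration enabled: point at the Spark master service of the release.
    return (
        f"spark://{release_name}-spark-master-svc."
        f"{namespace}.svc.cluster.local:7077"
    )


print(resolve_spark_master(False, "tdp", "tdp-project"))
print(resolve_spark_master(True, "tdp", "tdp-project"))
```

The release name tdp and namespace tdp-project are sample values only.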

Components involved

Component	Purpose
`templates/spark-integration-configmap.yaml`	Renders Spark defaults and environment helper script
`singleuser.extraEnv`	Sets Spark environment variables for each notebook pod
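`singleuser.lifecycleHooks.postStart`	Runs `jupyter-spark-env.sh` when the notebook container starts (a Kubernetes postStart hook is not guaranteed to finish before the container entrypoint)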
`singleuser.networkPolicy.egress`	Allows notebook pods to reach Spark master and auxiliary services

Environment variables injected into notebook pods

SPARK_HOME=/opt/bitnami/spark
PYTHONPATH=/opt/bitnami/spark/python:/opt/bitnami/spark/python/lib/py4j-0.10.9.7-src.zip
SPARK_CONF_DIR=/opt/bitnami/spark/conf
PYSPARK_PYTHON=/opt/conda/envs/py312/bin/python
PYSPARK_DRIVER_PYTHON=/opt/conda/envs/py312/bin/python
SPARK_MASTER_URL=<AUTO_DETECTED>  # local[*] or spark://... based on tdpSparkIntegration.enabled
SPARK_DRIVER_PORT=2222
SPARK_BLOCKMANAGER_PORT=7777
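
A quick way to confirm that these variables reached a notebook is to check them from a cell. The sketch below uses only the variable names listed above; the helper name missing_spark_vars is hypothetical and it checks only that each variable is set, not that its value is correct.

```python
import os

# Variable names taken from the list above.
EXPECTED_VARS = [
    "SPARK_HOME",
    "PYTHONPATH",
    "SPARK_CONF_DIR",
    "PYSPARK_PYTHON",
    "PYSPARK_DRIVER_PYTHON",
    "SPARK_MASTER_URL",
    "SPARK_DRIVER_PORT",
    "SPARK_BLOCKMANAGER_PORT",
]


def missing_spark_vars(env=None):
    """Return the expected variables that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in EXPECTED_VARS if not env.get(name)]


missing = missing_spark_vars()
print("Missing variables:", ", ".join(missing) if missing else "none")
```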

Volumes mounted on notebook pods

Path	Type	Content
`/opt/bitnami/spark/conf`	ConfigMap	`spark-defaults.conf` and helper scripts
`/tmp/spark-local`	emptyDir	Spark temporary data and shuffle
`/tmp/spark-logs`	emptyDir	Spark driver logs

How to configure

Mode 1 — Local PySpark (default)

This mode does not require an external Spark cluster. Spark runs inside the notebook pod with local[*]:

tdpSparkIntegration:
  enabled: false
  deploySparkCluster: false
  configMap:
    sparkConfig:
      "spark.master": ""  # resolves to local[*]

Test in a notebook:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Local PySpark").getOrCreate()
print(spark.sparkContext.master)  # local[*]

Mode 2 — External Spark cluster

This mode connects notebooks to an existing Spark deployment in the Kubernetes cluster:

tdpSparkIntegration:
  enabled: true
  deploySparkCluster: false   # false = point to an existing deployment
  configMap:
    sparkConfig:
      "spark.kubernetes.namespace": "<NAMESPACE>"   # optional
      "spark.master": ""       # resolves to spark://<RELEASE_NAME>-spark-master-svc.<NAMESPACE>:7077
      "spark.driver.host": ""  # leave empty to use the notebook admin service
      "spark.executor.instances": "2"
      "spark.executor.memory": "4g"
      "spark.executor.cores": "3"

tdp-spark:
  spark:
    worker:
      replicaCount: 2
      resources:
        limits:
          cpu: 4
          memory: 6Gi

Make sure the Spark master service is accessible from the notebook namespace (e.g. tdp-spark-master-svc.tdp-project.svc.cluster.local:7077).

NetworkPolicy considerations

The NetworkPolicy for notebook pods includes an egress rule that matches any Spark master pod (labels app.kubernetes.io/component: master and app.kubernetes.io/name: spark).
If the Spark chart ships its own NetworkPolicy, allow inbound connections from the notebook namespace.
For Spark Workers to connect back to the notebook driver, also configure the ingress rule described in Security — JupyterLab.

Known limitations

Each notebook may start its own Spark session, with fixed ports pre-configured for the driver (2222) and the BlockManager (7777).
If multiple Spark sessions are opened in the same pod — for example, several active kernels — or if a previous session did not release resources correctly, Spark may find those ports already in use and raise a BindException.
When the environment uses Spark Connect, setting SPARK_CONNECT_PORT: "0" makes the endpoint choose a random free port, which avoids conflicts on the default port 15002. This setting does not address driver or BlockManager port conflicts.

tdp-jupyter:
  singleuser:
    extraEnv:
      SPARK_CONNECT_PORT: "0"   # random port, avoids conflicts when multiple notebooks run simultaneously
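
Before starting a new session in a pod that may still hold an older one, it can help to check whether the fixed driver and BlockManager ports are free. The following is a minimal sketch using only the Python standard library; port_is_free is a hypothetical helper, and a free port at check time does not guarantee that it stays free until Spark binds it.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can bind to the port right now (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


# 2222 and 7777 are the driver and BlockManager ports listed above.
for name, port in (("driver", 2222), ("BlockManager", 7777)):
    state = "free" if port_is_free(port) else "in use"
    print(f"{name} port {port}: {state}")
```

If a port is reported as in use, stopping the other kernel's Spark session (spark.stop()) or restarting the user server usually releases it.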

Mode 3 — Bundled Spark cluster (optional)

Set tdpSparkIntegration.deploySparkCluster: true to install the tdp-spark subchart alongside JupyterHub:

tdpSparkIntegration:
  enabled: true
  deploySparkCluster: true

Adjust the tdp-spark subchart values as needed.

Using Iceberg from Jupyter

Iceberg support in Jupyter is not a separate integration of the tdp-jupyter chart. In practice, it happens via Spark:

the notebook connects to Spark;
Spark needs to know the Iceberg catalog;
the Iceberg catalog needs access to Hive Metastore and S3/MinIO storage.

Therefore:

Jupyter configuration is on this page;
Iceberg catalog configuration is in Integrations — Iceberg;
Spark configuration is in Integrations — Spark.

Note: Do not treat Iceberg as mandatory for Jupyter. It is just an additional scenario for notebooks that need to query or maintain Iceberg tables via Spark.

Using Delta Lake from Jupyter

Delta Lake support in Jupyter is also not a separate integration of the tdp-jupyter chart — like Iceberg, it happens via Spark:

the notebook connects to Spark;
the deltaLake block of the tdp-spark chart enables support but does not populate spark.sparkConf on its own; the required Spark properties must be provided via customSparkConfig.properties or spark.sparkConf;
unlike Iceberg, Delta Lake tables do not go through Hive Metastore — access is direct to the paths in S3/MinIO storage.

Therefore:

Jupyter configuration is on this page;
Delta Lake block configuration is in Integrations — Delta Lake;
Spark configuration is in Integrations — Spark.
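
As an illustration of the Spark properties mentioned above, the sketch below renders the two session properties that the upstream Delta Lake documentation lists for enabling Delta in Spark. Whether the tdp-spark chart requires additional properties (for example S3/MinIO credentials) is not covered here, and the helper to_properties is hypothetical.

```python
# Standard Delta Lake session properties from the upstream Delta Lake docs.
# Assumption: the deployment may need more settings than these two.
DELTA_PROPERTIES = {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}


def to_properties(conf):
    """Render a dict as spark-defaults.conf style lines (hypothetical helper)."""
    return "\n".join(f"{key} {value}" for key, value in sorted(conf.items()))


print(to_properties(DELTA_PROPERTIES))
```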

Install or upgrade JupyterHub

Terminal input
helm upgrade --install <RELEASE_NAME> \
  oci://registry.tecnisys.com.br/tdp/charts/tdp-jupyter \
  -n <NAMESPACE> \
  -f values.yaml

After upgrades

Whenever you modify ConfigMaps or environment variables, restart user pods (Stop Server → Start Server in JupyterHub) for the new settings to take effect.

Verification checklist

Pods running

Terminal input

kubectl get pods -n <NAMESPACE> | grep jupyter

Spark ConfigMap created

Terminal input

kubectl get configmap tdp-jupyter-spark-integration -n <NAMESPACE> -o yaml

Network connectivity (from a notebook pod)

Terminal input

kubectl exec -n <NAMESPACE> <POD_NAME> -- \
  curl -sv tdp-spark-master-svc.<NAMESPACE>.svc.cluster.local:7077
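
The same reachability check can be run from a notebook cell without curl. This is a minimal sketch, assuming SPARK_MASTER_URL is set as described earlier on this page; can_connect is a hypothetical helper and only proves that a TCP connection opens, not that the Spark master accepts the application.

```python
import os
import socket
from urllib.parse import urlparse


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# SPARK_MASTER_URL is local[*] in local mode, so only spark:// URLs are checked.
master_url = os.environ.get("SPARK_MASTER_URL", "")
if master_url.startswith("spark://"):
    parsed = urlparse(master_url)
    print("Spark master reachable:", can_connect(parsed.hostname, parsed.port))
else:
    print("No spark:// master configured; nothing to check.")
```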

Test the integration

Test notebook included in the chart

The chart includes a test notebook (tdp-jupyter-spark-test ConfigMap). To extract it:

Terminal input
kubectl get configmap tdp-jupyter-spark-test -n <NAMESPACE> \
  -o jsonpath='{.data.spark-integration-test\.ipynb}' \
  > spark-integration-test.ipynb

Upload the notebook through JupyterLab and run each cell.

Manual smoke test

Run the following code in a notebook to validate the integration:

import os
from pyspark.sql import SparkSession

print("SPARK_HOME:", os.environ.get("SPARK_HOME"))
print("SPARK_MASTER_URL:", os.environ.get("SPARK_MASTER_URL"))

spark = SparkSession.builder.appName("TDP-Jupyter Smoke Test").getOrCreate()
print("Spark version:", spark.version)
print("Active master:", spark.sparkContext.master)

spark.range(5).show()
spark.stop()

Troubleshooting

Symptom	Likely cause	Suggested action
`JAVA_GATEWAY_EXITED` or Py4J errors	`SPARK_HOME`/`PYTHONPATH` misconfigured	Ensure `singleuser.extraEnv` uses `/opt/bitnami/spark` paths
`IllegalStateException: Cannot call methods on a stopped SparkContext`	Spark master unreachable or NetworkPolicy blocking egress/ingress	Confirm `tdpSparkIntegration.enabled`, check Spark service, adjust NetworkPolicies
Notebook pod fails to start (`ImportError` for `zmq`)	`PYTHONPATH` polluted with PySpark site-packages	Do not append `/opt/conda/envs/py312/lib/python3.12/site-packages` to `PYTHONPATH`
Spark driver cannot bind/communicate	`SPARK_DRIVER_HOST` not resolvable	Leave blank to use the notebook admin service or supply a reachable DNS entry
Workers cannot reach the driver (`Connecting to /<ip>:2222 timed out`)	The single-user pod NetworkPolicy is blocking ingress from Spark Worker pods	Add the `ingress` rule described in Security — JupyterLab and upgrade the release
`java.net.UnknownHostException: <pod-name>`	`spark.driver.host` is resolving to the pod hostname instead of its IP	Ensure `spark.driver.host` is empty in `sparkConfig` and that `SPARK_DRIVER_HOST` is injected via Downward API (`fieldPath: status.podIP`)
`CANNOT_MODIFY_CONFIG` warnings	Spark configuration applied via `SparkSession.builder.config()` after PySpark import	Pass `spark.driver.host` and JARs via `PYSPARK_SUBMIT_ARGS` before importing PySpark, not via `SparkSession.builder.config()`
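
For the last row of the table, the sketch below shows one way to assemble PYSPARK_SUBMIT_ARGS before PySpark is imported. The helper build_submit_args is hypothetical, and the 127.0.0.1 fallback is for illustration only; in a notebook pod, SPARK_DRIVER_HOST is expected to carry the pod IP.

```python
import os


def build_submit_args(conf, jars=()):
    """Build a PYSPARK_SUBMIT_ARGS string (hypothetical helper)."""
    parts = [f"--conf {key}={value}" for key, value in conf.items()]
    if jars:
        parts.append("--jars " + ",".join(jars))
    # PySpark expects the value to end with "pyspark-shell".
    parts.append("pyspark-shell")
    return " ".join(parts)


# Run this before the first PySpark import in the kernel.
# Assumption: SPARK_DRIVER_HOST holds the pod IP; the fallback is a placeholder.
driver_host = os.environ.get("SPARK_DRIVER_HOST", "127.0.0.1")
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(
    {"spark.driver.host": driver_host}
)
print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

After this cell runs, SparkSession.builder.getOrCreate() can be called without passing the same settings through builder.config().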

Diagnostic commands

Terminal input
# Notebook pod logs
kubectl logs -n <NAMESPACE> <POD_NAME>

# Spark environment variables inside the pod
kubectl exec -n <NAMESPACE> <POD_NAME> -- env | grep SPARK

# List mounted files
kubectl exec -n <NAMESPACE> <POD_NAME> -- ls -R /opt/bitnami/spark/conf

# List Spark master services in the namespace
kubectl get svc -n <NAMESPACE> | grep spark-master

Advanced customization

Add extra Spark properties under tdpSparkIntegration.configMap.sparkConfig.
Define notebook size presets via singleuser.profileList and adjust per-profile Spark environment variables.
When running multiple Spark clusters, override spark.master per profile or via user environment.

Cleanup

Terminal input

helm uninstall <RELEASE_NAME> -n <NAMESPACE>
kubectl delete configmap <RELEASE_NAME>-spark-integration -n <NAMESPACE>
