Hive Metastore Configuration
The tdp-hive-metastore chart packages Apache Hive Metastore 4.0.0 for Kubernetes.
What is Hive Metastore and why does it exist?
Apache Hive Metastore is the metadata catalog service of the TDP.
It answers a fundamental question: when Spark or Trino want to read a table called sales, how do they know where that table's files are in S3?
What is the schema (columns and types)? Which partitions exist?
Hive Metastore stores exactly this information: the mapping between table names and physical locations in object storage (S3/Ozone).
Without it, Spark and Trino cannot discover tables by name — they would need to know the exact path of every file.
See Apache Hive — Concepts for a complete overview of the tool and its role in the data ecosystem.
How Hive Metastore fits into TDP
Spark / Trino
│
│ "Where is the 'sales' table?"
▼
Hive Metastore ──► PostgreSQL (stores metadata)
│
│ "It's at s3a://warehouse/sales/"
▼
Apache Ozone / S3 (where the actual data lives)
Hive Metastore is an infrastructure service, not an end-user interface.
You rarely interact with it directly — it operates in the background so that Spark and Trino work correctly with tables in Hive, Iceberg, and Delta Lake formats.
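As an illustration of this background role, a query engine typically only needs the metastore's Thrift address. A hypothetical Trino catalog file might look like this (the service DNS name is an assumption; adapt release and namespace):

```properties
# hive.properties — hypothetical Trino catalog pointing at the metastore
connector.name=hive
# Thrift endpoint of the metastore Service on port 9083 (name/namespace are placeholders)
hive.metastore.uri=thrift://<release>-metastore.<namespace>.svc:9083
```

Spark uses the same endpoint via the hive.metastore.uris property; in both cases the engine resolves table names through this single address.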
Deployed components
| Component | Description |
|---|---|
| Metastore Server | Thrift service on port 9083 that responds to metadata queries |
| PostgreSQL (internal or external) | Database that stores schemas, tables, partitions, and locations |
Prerequisites
- Kubernetes and Helm compatible with the chart
- If using external PostgreSQL: an instance accessible from the cluster
Installation (OCI)
helm upgrade --install <release> oci://registry.tecnisys.com.br/tdp/charts/tdp-hive-metastore \
-n <namespace> --create-namespace
PostgreSQL requirements
Hive Metastore requires sufficient connection capacity in PostgreSQL:
| Requirement | Minimum | Recommended | Example (TDP PostgreSQL) |
|---|---|---|---|
| max_connections | 300 | 400+ | 400 |
If using external PostgreSQL, ensure adequate limits and network access; the TDP PostgreSQL chart documents max_connections = 400 as an example in postgresql.conf.
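For reference, the corresponding setting on an external instance lives in postgresql.conf; a fragment matching the recommended value from the table above might look like:

```conf
# postgresql.conf — example matching the TDP PostgreSQL reference value
max_connections = 400
```

After changing max_connections, PostgreSQL must be restarted (a reload is not enough for this parameter).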
Hive Metastore needs a relational database to store table metadata (schemas, locations, partitions).
The chart supports an embedded or external database — to understand when to use each, see Internal vs. external PostgreSQL in the General Configuration.
Internal database (embedded PostgreSQL)
tdp-hive-metastore:
postgres:
enabled: true
internal:
image: "postgres:15"
pvcSize: 1Gi
maxConnections: 500
sharedBuffers: 512MB
effectiveCacheSize: 1.2GB
maintenanceWorkMem: 128MB
workMem: 32MB
walBuffers: 16MB
database: hive
username: hive
password: "<POSTGRES_PASSWORD>"
External database
Set tdp-hive-metastore.postgres.enabled=false and configure tdp-hive-metastore.postgres.external.*.
tdp-hive-metastore:
postgres:
enabled: false
external:
host: "<POSTGRESQL_HOST>"
port: 5432
TDPConfigurations:
externalDatabase:
enabled: true
recreate: false
externalSecret:
releaseName: "<postgresql-release>"
The hook jobs reuse TDPConfigurations.externalDatabase: they authenticate to the upstream PostgreSQL using the Secret associated with the name in externalSecret.releaseName (key postgres-password, TDP standard).
- Use recreate=true only when you explicitly want to recreate the Hive database/user; for upgrades that preserve data, keep recreate=false (the recommended default).
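For a first-time installation where the Hive database and user should be created from scratch, the flag can also be set in a values file (release name is a placeholder):

```yaml
TDPConfigurations:
  externalDatabase:
    enabled: true
    recreate: true            # first install only; set back to false for upgrades
    externalSecret:
      releaseName: "<postgresql-release>"
```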
CLI overrides (example):
helm upgrade --install <release> oci://registry.tecnisys.com.br/tdp/charts/tdp-hive-metastore \
-n <namespace> --create-namespace \
--set TDPConfigurations.externalDatabase.recreate=false \
--set TDPConfigurations.externalDatabase.externalSecret.releaseName=<postgresql-release>
Main parameters
| Parameter | Description | Default / example |
|---|---|---|
| namespace | Deployment namespace | As per the Helm release |
| postgres.enabled | Internal PostgreSQL | true |
| postgres.external.host | External PostgreSQL host | Example service DNS |
| metastore.type | Metastore type (file or s3) | file |
Metastore resources
Typical reference values: memory 1Gi request / 2Gi limit, CPU 500m request / 1000m limit, configurable persistent volume (confirm with helm show values).
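Expressed as a values override, the reference figures above could look like the following sketch (the exact key path is an assumption; confirm it against helm show values before use):

```yaml
# Hypothetical values fragment mirroring the reference figures above
metastore:
  resources:
    requests:
      memory: 1Gi
      cpu: 500m
    limits:
      memory: 2Gi
      cpu: 1000m
```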
Verify installation
kubectl -n <namespace> get all
Expect a Deployment/Service for the metastore (and for the internal PostgreSQL if postgres.enabled=true), with names derived from <release>.
Service override (hive-service-override.yaml)
Compared to the upstream chart, the Deployment labels its pods with app: {{ .Release.Name }}-metastore, while the upstream Service may use a fixed selector that does not match those labels. The TDP chart template aligns the selector with the pod labels so the Service actually gets endpoints.
Verification:
kubectl get endpoints -n <namespace> <release>-metastore
If <none> appears, the selector does not match the pod labels.
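To pinpoint the mismatch, compare the Service's selector with the labels actually present on the pods (names below are placeholders for your release and namespace):

```
# Selector the Service routes on:
kubectl -n <namespace> get svc <release>-metastore -o jsonpath='{.spec.selector}'

# Labels on the metastore pods (compare with the selector above):
kubectl -n <namespace> get pods --show-labels | grep metastore
```

Every key/value pair in the selector must appear among the pod labels for the endpoints to populate.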
Local test:
kubectl port-forward -n <namespace> svc/<release>-metastore 9083:9083
Hive database credentials (optional)
HIVE_PASSWORD=$(kubectl get secret --namespace <namespace> \
<release>-hive-database -o jsonpath="{.data.password}" | base64 -d)
kubectl run postgres-client --rm --tty -i --restart='Never' \
--namespace <namespace> --image postgres:14 \
--env="PGPASSWORD=$HIVE_PASSWORD" \
--command -- psql -h <postgres-service> -p 5432 -U hive -d hive
Replace <postgres-service> with the hostname of the PostgreSQL the metastore uses (internal or external).
Troubleshooting
- Connection pool exhaustion: ensure max_connections ≥ 300 in PostgreSQL.
- Authentication: verify secrets and mounts.
- Storage: check PV permissions.
- Network: ensure the PostgreSQL service is reachable.
Useful commands (adapting release/pod names):
kubectl -n <namespace> exec <postgresql-pod> -- psql -c "SHOW max_connections;"
kubectl -n <namespace> exec <postgresql-pod> -- psql -c "SELECT count(*) FROM pg_stat_activity;"
kubectl -n <namespace> logs deployment/<release>-metastore