
Hive Metastore Configuration

The tdp-hive-metastore chart packages Apache Hive Metastore 4.0.0 for Kubernetes.

What is Hive Metastore and why does it exist?

Apache Hive Metastore is the metadata catalog service of the TDP.

It answers a fundamental question: when Spark or Trino wants to read a table called sales, how does it know where that table's files live in S3? What is the schema (columns and types)? Which partitions exist?

Hive Metastore stores exactly this information: the mapping between table names and physical locations in object storage (S3/Ozone).

Without it, Spark and Trino cannot discover tables by name — they would need to know the exact path of every file.

Learn more

See Apache Hive — Concepts for a complete overview of the tool and its role in the data ecosystem.

How Hive Metastore fits into TDP

Spark / Trino
      │  "Where is the 'sales' table?"
      ▼
Hive Metastore ──► PostgreSQL (stores metadata)
      │  "It's at s3a://warehouse/sales/"
      ▼
Apache Ozone / S3 (where the actual data lives)

Hive Metastore is an infrastructure service, not an end-user interface.

You rarely interact with it directly — it operates in the background so that Spark and Trino work correctly with tables in Hive, Iceberg, and Delta Lake formats.
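As an illustration of how a client consumes the service, a Spark application only needs the metastore's Thrift URI. The in-cluster service DNS name below is an assumption (it depends on your release and namespace); adjust it to your deployment:

```
# spark-defaults.conf (illustrative fragment; service name assumed)
spark.hadoop.hive.metastore.uris  thrift://<release>-metastore.<namespace>.svc.cluster.local:9083
spark.sql.catalogImplementation   hive
```

With this in place, a query such as SHOW TABLES resolves table names through the metastore instead of requiring explicit file paths.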

Deployed components

Component                           Description
Metastore Server                    Thrift service on port 9083 that responds to metadata queries
PostgreSQL (internal or external)   Database that stores schemas, tables, partitions, and locations

Prerequisites

  • Kubernetes and Helm compatible with the chart
  • If using external PostgreSQL: an instance accessible from the cluster

Installation (OCI)

Terminal input
helm upgrade --install <release> oci://registry.tecnisys.com.br/tdp/charts/tdp-hive-metastore \
-n <namespace> --create-namespace

PostgreSQL requirements

Hive Metastore requires sufficient connection capacity in PostgreSQL:

Requirement        Minimum   Recommended   Example (TDP PostgreSQL)
max_connections    300       400+          400

If using external PostgreSQL, ensure adequate limits and network access; the TDP PostgreSQL chart documents max_connections = 400 as an example in postgresql.conf.
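In postgresql.conf terms, the TDP reference value above corresponds to a single line (shown here as an illustrative fragment):

```
# postgresql.conf — connection capacity (TDP reference example)
max_connections = 400
```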

Hive Metastore needs a relational database to store table metadata (schemas, locations, partitions).

The chart supports an embedded or external database — to understand when to use each, see Internal vs. external PostgreSQL in the General Configuration.

Internal database (embedded PostgreSQL)

tdp-hive-metastore:
  postgres:
    enabled: true

    internal:
      image: "postgres:15"
      pvcSize: 1Gi
      maxConnections: 500
      sharedBuffers: 512MB
      effectiveCacheSize: 1.2GB
      maintenanceWorkMem: 128MB
      workMem: 32MB
      walBuffers: 16MB

    database: hive
    username: hive
    password: "<POSTGRES_PASSWORD>"

External database

Set tdp-hive-metastore.postgres.enabled=false and configure tdp-hive-metastore.postgres.external.*.

tdp-hive-metastore:
  postgres:
    enabled: false
    external:
      host: "<POSTGRESQL_HOST>"
      port: 5432

TDPConfigurations:
  externalDatabase:
    enabled: true
    recreate: false
    externalSecret:
      releaseName: "<postgresql-release>"

The hook jobs reuse TDPConfigurations.externalDatabase: they authenticate to the upstream PostgreSQL using the Secret associated with the name in externalSecret.releaseName (key postgres-password, TDP standard).

  • Use recreate=true only when you explicitly want to recreate the Hive database/user; for upgrades that preserve data, keep recreate=false (recommended default).
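To confirm which credential the hook jobs will pick up, you can decode the Secret directly. The secret name pattern below is an assumption (many PostgreSQL charts expose a <release>-postgresql Secret; confirm with kubectl get secrets); the key postgres-password is the TDP standard mentioned above:

```shell
# Decode the password the hook jobs will use to reach PostgreSQL.
# Secret name assumed from the release name; adjust to your cluster.
kubectl -n <namespace> get secret <postgresql-release>-postgresql \
  -o jsonpath='{.data.postgres-password}' | base64 -d
```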

CLI overrides (example):

Terminal input
helm upgrade --install <release> oci://registry.tecnisys.com.br/tdp/charts/tdp-hive-metastore \
-n <namespace> --create-namespace \
--set TDPConfigurations.externalDatabase.recreate=false \
--set TDPConfigurations.externalDatabase.externalSecret.releaseName=<postgresql-release>

Main parameters

Parameter                Description                    Default / example
namespace                Deployment namespace           As per the Helm release
postgres.enabled         Internal PostgreSQL            true
postgres.external.host   External PostgreSQL host       Example service DNS
metastore.type           Metastore type (file or s3)    file

Metastore resources

Typical reference values: memory 1Gi request / 2Gi limit, CPU 500m request / 1000m limit, configurable persistent volume (confirm with helm show values).

Verify installation

Terminal input
kubectl -n <namespace> get all

Expect a Deployment/Service for the metastore (and for the internal PostgreSQL if postgres.enabled=true), with names derived from <release>.

Service override (hive-service-override.yaml)

The Deployment in this chart labels pods with app: {{ .Release.Name }}-metastore, while the upstream Service may ship a fixed selector that does not match those labels. The TDP chart template aligns the selector so the Service actually gets endpoints.

Verification:

Terminal input
kubectl get endpoints -n <namespace> <release>-metastore

If <none> appears, the selector does not match the pod labels.
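A quick way to diagnose this is to print the Service selector and the pod labels side by side (output formatting may vary slightly by kubectl version); they must match for endpoints to be populated:

```shell
# Show the Service selector...
kubectl -n <namespace> get svc <release>-metastore -o jsonpath='{.spec.selector}'; echo
# ...and the labels actually carried by the metastore pods.
kubectl -n <namespace> get pods --show-labels | grep metastore
```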

Local test:

Terminal input
kubectl port-forward -n <namespace> svc/<release>-metastore 9083:9083
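With the port-forward running in another terminal, a bash-only TCP probe (no extra tools assumed) confirms that the Thrift port accepts connections:

```shell
# Exits 0 only if something is listening on the forwarded port 9083.
timeout 3 bash -c 'exec 3<>/dev/tcp/127.0.0.1/9083' && echo "metastore port 9083 reachable"
```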

Hive database credentials (optional)

Terminal input
HIVE_PASSWORD=$(kubectl get secret --namespace <namespace> \
<release>-hive-database -o jsonpath="{.data.password}" | base64 -d)

kubectl run postgres-client --rm --tty -i --restart='Never' \
--namespace <namespace> --image postgres:14 \
--env="PGPASSWORD=$HIVE_PASSWORD" \
--command -- psql -h <postgres-service> -p 5432 -U hive -d hive

Replace <postgres-service> with the hostname of the PostgreSQL the metastore uses (internal or external).

Troubleshooting

  1. Connection pool exhaustion: ensure max_connections ≥ 300 (400+ recommended) in PostgreSQL.
  2. Authentication failures: verify the secrets and their mounts in the pods.
  3. Storage: check permissions on the persistent volume.
  4. Network: ensure the PostgreSQL service is reachable from the cluster.

Useful commands (adapting release/pod names):

Terminal input
kubectl -n <namespace> exec <postgresql-pod> -- psql -c "SHOW max_connections;"
kubectl -n <namespace> exec <postgresql-pod> -- psql -c "SELECT count(*) FROM pg_stat_activity;"
kubectl -n <namespace> logs deployment/<release>-metastore