Concepts

The TDP Platform is built from the main open-source software of the Big Data ecosystem. This page gives a conceptual overview of each available component.

Airflow - Pipeline Orchestration

Apache Airflow lets you author, schedule, and monitor workflows as code, which makes it especially suitable for implementing efficient, batch-oriented data pipelines.

Ambari - Centralized Administration

Apache Ambari enables automated deployment, service and node (host) management, system-wide environment monitoring, configuration versioning, and more.

Atlas - Data Governance

Apache Atlas lets organizations build a catalog of their data assets, classify and govern those assets, and collaborate on them across data science and governance teams.

Druid - Real-Time Data Analytics

Designed for fast OLAP queries on large datasets, Druid suits use cases where real-time ingestion, fast query performance, and high uptime are important.

Flink - Distributed Data Processing

Apache Flink is a distributed processing framework and engine for stateful computation on bounded and unbounded data streams.

Great Expectations (GX) - Data Quality Assurance

Great Expectations (GX) is an open-source tool designed to make the data validation and quality assurance process more accessible, automated, and efficient.

HBase - Distributed NoSQL

HBase provides efficient random access and real-time read/write on large distributed datasets.

Hive - Data Exploration and Analysis

Hive lets data-warehouse users who know SQL query large datasets without having to learn Java or other languages.

Iceberg - Table Format

Apache Iceberg is an open table format for large analytic datasets. Its snapshot-based design gives consistent access to both current and historical table states while preserving data integrity.

Kafka - Data Streaming

Apache Kafka is a widely used distributed event streaming platform for collecting, processing, storing, and integrating data at scale.

Kerberos - Authentication and Identity Propagation

Kerberos is an open network authentication protocol that provides mutual authentication and Single Sign-On (SSO) through a trusted third-party service. Hadoop clusters use it to secure access.

Knox - Gateway and Unified Access

Apache Knox acts as an application proxy at the perimeter layer, receiving client requests and forwarding them to the appropriate service.

Livy - Spark Session Management

Livy provides a simple way to interact with an Apache Spark cluster via a REST API, enabling applications to submit and manage Spark jobs remotely.

NiFi - Data Flow Management and Automation

Apache NiFi automates the flow of data between systems, with features that support compliance, privacy, and security in that exchange.

Ozone - Mass Data Storage

Apache Ozone is a redundant, distributed object store optimized for Big Data workloads. It is designed to scale to billions of objects.

Ranger - Authorization and Auditing

Apache Ranger is a framework for enabling, monitoring, and managing data security across the Hadoop platform.

Ranger-KMS - Cryptographic Key Management

Ranger KMS is a key management service designed to manage and protect encryption keys, helping to ensure the confidentiality and integrity of sensitive data.

Spark - Distributed Computing

Apache Spark is a unified analytics engine for large-scale distributed data processing.

Sqoop - Data Ingestion

Apache Sqoop is a CLI tool for bulk data transfer between Apache Hadoop and structured datastores such as relational databases.

Solr - Search and Text Indexing

Apache Solr is an enterprise search platform built on Apache Lucene, designed for document retrieval.

Superset - BI and Data Visualization

Apache Superset is a lightweight and robust web BI tool for exploring and visualizing data, with features for users of all skill levels.

Trino - Distributed SQL Query Engine

Trino is a distributed SQL engine for running complex queries on large volumes of data stored across multiple sources without moving or duplicating it.

Zeppelin - Collaborative Notebook

Apache Zeppelin is a web-based notebook for interactive data ingestion, exploration, visualization, and collaboration with Hadoop and Spark.

ZooKeeper - Distributed Coordination

Apache ZooKeeper is a centralized open-source service for coordinating distributed applications. It provides simple APIs for building higher-level distributed services.