Concepts
The TDP Platform is composed of the main open-source Big Data ecosystem software. This section provides conceptual information about every available component.
Airflow - Pipeline Orchestration
Apache Airflow schedules and orchestrates batch-oriented data pipelines defined as code, handling task dependencies, retries, and monitoring.
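As a minimal sketch (the DAG name, schedule, and task logic below are hypothetical), an Airflow pipeline is a Python file declaring tasks and the dependencies between them:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    # Placeholder task: pull data from a source system.
    print("extracting...")

def load():
    # Placeholder task: write data to a target system.
    print("loading...")

# A daily pipeline with two dependent tasks.
with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task  # load runs only after extract succeeds
```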
Ambari - Centralized Administration
Apache Ambari enables automated deployment, service and node (host) management, system-wide environment monitoring, configuration versioning, and more.
Atlas - Data Governance
Apache Atlas allows organizations to build a catalog of their assets, classify them, manage them, and provide collaboration capabilities for their use by data scientists and governance teams.
Druid - Real-Time Data Analytics
Designed for fast OLAP queries on large datasets, Druid powers use cases where real-time ingestion, fast query performance, and high availability are important.
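For illustration, queries can be sent to Druid's SQL endpoint over HTTP; the router address and the datasource below are placeholders:

```python
import requests

# Router address and datasource name are hypothetical.
DRUID_SQL = "http://druid-router.example.com:8888/druid/v2/sql"

query = """
SELECT channel, COUNT(*) AS edits
FROM wikipedia
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY channel
ORDER BY edits DESC
LIMIT 5
"""

# Druid's SQL endpoint returns query results as JSON rows.
rows = requests.post(DRUID_SQL, json={"query": query}).json()
print(rows)
```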
Flink - Distributed Data Processing
Apache Flink is a distributed processing framework and engine for stateful computation on bounded and unbounded data streams.
Great Expectations (GX) - Data Quality Assurance
Great Expectations (GX) is an open-source tool designed to make the data validation and quality assurance process more accessible, automated, and efficient.
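A minimal sketch, assuming the legacy (pre-1.0) pandas-oriented GX API and made-up column names:

```python
import great_expectations as ge
import pandas as pd

# Wrap an ordinary DataFrame so expectations can be declared against it.
df = ge.from_pandas(
    pd.DataFrame({"user_id": [1, 2, 3], "amount": [10.0, 5.5, 7.2]})
)

# Each expectation is checked immediately and returns a result object.
print(df.expect_column_values_to_not_be_null("user_id"))
print(df.expect_column_values_to_be_between("amount", min_value=0))
```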
Hadoop
Apache Hadoop provides a framework for the distributed, parallel processing of large data volumes across clusters of servers, using simple programming models.
HDFS - Distributed File System
HDFS (Hadoop Distributed File System) is the main storage system used by Hadoop.
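Besides the hdfs dfs command line, HDFS can be reached over WebHDFS; a minimal sketch using the third-party hdfs Python package (NameNode address and paths are hypothetical):

```python
from hdfs import InsecureClient  # WebHDFS client from the `hdfs` package

# Connect through the NameNode's WebHDFS endpoint (address is hypothetical).
client = InsecureClient("http://namenode.example.com:9870", user="hdfs")

# Create a directory, upload a local file, and list the directory contents.
client.makedirs("/data/raw")
client.upload("/data/raw/events.csv", "events.csv")
print(client.list("/data/raw"))
```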
MapReduce - Distributed Data Processing Framework
Hadoop MapReduce is a framework for writing applications that process large amounts of data in parallel across large clusters of commodity hardware.
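The programming model can be sketched with Hadoop Streaming, which runs any executable as mapper and reducer (native jobs are typically written in Java); the word-count example below is illustrative:

```python
#!/usr/bin/env python3
# Word-count mapper and reducer usable with Hadoop Streaming.
# Hadoop runs many copies of each in parallel, feeding them splits of the
# input on stdin and collecting key/value pairs from stdout.
import sys

def mapper():
    # Emit one "word<TAB>1" pair per word of the input split.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key, so counts for a word are contiguous.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```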
YARN - Resource Manager
Hadoop YARN manages the cluster's resources and is responsible for job scheduling and resource allocation in the Hadoop ecosystem.
HBase - Distributed NoSQL
Apache HBase provides efficient random, real-time read/write access to large distributed datasets.
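A minimal sketch using the third-party happybase client through the HBase Thrift server (host, table, and column family are hypothetical):

```python
import happybase  # Thrift-based HBase client (requires the HBase Thrift server)

# Connection parameters and table name are hypothetical.
connection = happybase.Connection("hbase-thrift.example.com", port=9090)
table = connection.table("users")

# Random-access write and read on a single row.
table.put(b"user:42", {b"info:name": b"Ada", b"info:city": b"Paris"})
row = table.row(b"user:42")
print(row[b"info:name"])
```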
Hive - Data Exploration and Analysis
Apache Hive enables data-warehouse users who know SQL to query large datasets without having to learn Java or other programming languages.
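A minimal sketch using the PyHive client against HiveServer2 (host, credentials, and table are hypothetical):

```python
from pyhive import hive  # assumes HiveServer2 is reachable

# Connection parameters are hypothetical.
conn = hive.connect(host="hiveserver2.example.com", port=10000, username="analyst")
cursor = conn.cursor()

# Plain SQL (HiveQL) over a potentially very large table.
cursor.execute("SELECT country, COUNT(*) AS nb FROM sales GROUP BY country")
for country, nb in cursor.fetchall():
    print(country, nb)
```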
Iceberg - Table Format
Apache Iceberg is an open table format that gives consistent access to both current and historical table data (time travel), ensuring data integrity and consistency.
Kafka - Data Streaming
Kafka is the most commonly used event streaming platform for collecting, processing, storing, and integrating data at scale.
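A minimal sketch using the kafka-python client (broker address and topic are hypothetical):

```python
from kafka import KafkaConsumer, KafkaProducer  # kafka-python client

# Broker address and topic name are hypothetical.
producer = KafkaProducer(bootstrap_servers="kafka.example.com:9092")
producer.send("events", b'{"user": 42, "action": "login"}')
producer.flush()

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="kafka.example.com:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating when idle for 5 s
)
for message in consumer:
    print(message.value)
```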
Kerberos - Authentication and Identity Propagation
Kerberos is an open-source network authentication protocol that enables Single Sign-On (SSO) through a trusted mutual authentication service. It is used in Hadoop clusters for secure access.
Knox - Gateway and Unified Access
Apache Knox acts as an application proxy at the perimeter layer, receiving client requests and forwarding them to the appropriate service.
Livy - Spark Session Management
Livy provides a simple way to interact with an Apache Spark cluster via a REST API, enabling applications to submit and manage Spark jobs remotely.
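A minimal sketch of the REST workflow (the Livy URL and job file are hypothetical):

```python
import requests

LIVY_URL = "http://livy.example.com:8998"  # hypothetical Livy endpoint

# Submit a Spark application already available on the cluster as a batch job.
resp = requests.post(
    f"{LIVY_URL}/batches",
    json={"file": "hdfs:///jobs/wordcount.py", "args": ["/data/raw/events.csv"]},
)
batch_id = resp.json()["id"]

# Poll the batch state until Livy reports completion.
state = requests.get(f"{LIVY_URL}/batches/{batch_id}/state").json()["state"]
print(batch_id, state)
```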
NiFi - Data Flow Management and Automation
NiFi automates dataflows, ensuring compliance, privacy, and security in data exchange between systems.
Ozone - Mass Data Storage
Apache Ozone is a redundant, distributed object store optimized for Big Data workloads. It is designed for scalability and can handle billions of objects.
Ranger - Authorization and Auditing
Apache Ranger is a framework for enabling, monitoring, and managing data security across the Hadoop platform.
Ranger-KMS - Cryptographic Key Management
Ranger KMS is a cryptographic key management service designed to manage and protect encryption keys, ensuring the confidentiality and integrity of sensitive data.
Spark - Distributed Computing
Apache Spark is a unified analytics engine for large-scale distributed data processing.
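A minimal PySpark sketch (file path and column names are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-aggregation").getOrCreate()

# Read a (hypothetical) CSV file stored on the cluster.
df = (
    spark.read.option("header", True)
    .option("inferSchema", True)
    .csv("hdfs:///data/raw/sales.csv")
)

# The aggregation is expressed once and executed in parallel across the cluster.
df.groupBy("country").agg(F.sum("amount").alias("total")).show()

spark.stop()
```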
Sqoop - Data Ingestion
Apache Sqoop is a CLI tool for bulk data transfer between Apache Hadoop and structured datastores such as relational databases.
Solr - Search and Text Indexing
Apache Solr is an enterprise search platform built on Apache Lucene, designed for document retrieval.
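A minimal sketch using the third-party pysolr client (Solr URL, collection, and documents are hypothetical):

```python
import pysolr  # simple HTTP client for Solr

# Solr URL and collection name are hypothetical.
solr = pysolr.Solr("http://solr.example.com:8983/solr/documents", always_commit=True)

# Index a couple of documents, then run a full-text query against them.
solr.add([
    {"id": "1", "title": "Introduction to Hadoop"},
    {"id": "2", "title": "Search with Apache Solr"},
])
for result in solr.search("title:hadoop"):
    print(result["id"], result["title"])
```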
Superset - BI and Data Visualization
Apache Superset is a lightweight and robust web BI tool for exploring and visualizing data, with features for users of all skill levels.
Trino - Distributed SQL Query Engine
Trino is a distributed SQL engine for running complex queries on large volumes of data stored across multiple sources without moving or duplicating it.
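A minimal sketch using the Trino Python client (coordinator address, catalog, and table are hypothetical):

```python
import trino  # Trino Python client

# Coordinator address, catalog, and schema are hypothetical.
conn = trino.dbapi.connect(
    host="trino.example.com",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cursor = conn.cursor()

# The same SQL interface can query and join data from multiple catalogs.
cursor.execute("SELECT country, COUNT(*) FROM sales GROUP BY country")
print(cursor.fetchall())
```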
Zeppelin - Collaborative Notebook
Apache Zeppelin is a web-based notebook for interactive data ingestion, exploration, visualization, and collaboration with Hadoop and Spark.
ZooKeeper - Distributed Coordination
Apache ZooKeeper is a centralized, open-source service for coordinating distributed applications. It provides simple APIs on which higher-level distributed services can be built.
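A minimal sketch using the Kazoo client (ensemble address and znode paths are hypothetical):

```python
from kazoo.client import KazooClient  # Kazoo, a Python ZooKeeper client

# Ensemble address and znode paths are hypothetical.
zk = KazooClient(hosts="zookeeper.example.com:2181")
zk.start()

# znodes act as small, versioned pieces of shared state for coordination.
zk.ensure_path("/app/config")
zk.set("/app/config", b"max_workers=8")
value, stat = zk.get("/app/config")
print(value, stat.version)

zk.stop()
```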