Concepts
The TDP Platform is composed of the main open-source Big Data ecosystem software. This section provides conceptual information about every available component.
Airflow - Pipeline Orchestration
Apache Airflow schedules and orchestrates batch-oriented data pipelines defined as code, handling task dependencies, retries, and monitoring.
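As a minimal sketch (the DAG name, schedule, and task logic below are hypothetical), an Airflow pipeline is a Python file declaring tasks and the dependencies between them:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    # Placeholder task: pull data from a source system.
    print("extracting...")

def load():
    # Placeholder task: write data to a target system.
    print("loading...")

# A daily pipeline with two dependent tasks.
with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task  # load runs only after extract succeeds
```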
Ambari - Centralized Administration
Apache Ambari enables automated deployment, service and node (host) management, system-wide environment monitoring, configuration versioning, and more.
Atlas - Data Governance
Apache Atlas allows organizations to build a catalog of their assets, classify them, manage them, and provide collaboration capabilities for their use by data scientists and governance teams.
Druid - Real-Time Data Analytics
Designed for fast OLAP queries on large datasets, Druid powers use cases where real-time ingestion, fast query performance, and high availability are important.
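For illustration, queries can be sent to Druid's SQL endpoint over HTTP; the router address and the datasource below are placeholders:

```python
import requests

# Router address and datasource name are hypothetical.
DRUID_SQL = "http://druid-router.example.com:8888/druid/v2/sql"

query = """
SELECT channel, COUNT(*) AS edits
FROM wikipedia
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY channel
ORDER BY edits DESC
LIMIT 5
"""

# Druid's SQL endpoint returns query results as JSON rows.
rows = requests.post(DRUID_SQL, json={"query": query}).json()
print(rows)
```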
Flink - Distributed Data Processing
Apache Flink is a distributed processing framework and engine for stateful computation on bounded and unbounded data streams.
Great Expectations (GX) - Data Quality Assurance
Great Expectations (GX) is an open-source tool designed to make the data validation and quality assurance process more accessible, automated, and efficient.
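A minimal sketch, assuming the legacy (pre-1.0) pandas-oriented GX API and made-up column names:

```python
import great_expectations as ge
import pandas as pd

# Wrap an ordinary DataFrame so expectations can be declared against it.
df = ge.from_pandas(
    pd.DataFrame({"user_id": [1, 2, 3], "amount": [10.0, 5.5, 7.2]})
)

# Each expectation is checked immediately and returns a result object.
print(df.expect_column_values_to_not_be_null("user_id"))
print(df.expect_column_values_to_be_between("amount", min_value=0))
```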
Hadoop
Apache Hadoop provides a framework for the distributed, parallel processing of large data volumes across clusters of servers, using simple programming models.
HDFS - Distributed File System
HDFS (Hadoop Distributed File System) is the main storage system used by Hadoop.
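Besides the hdfs dfs command line, HDFS can be reached over WebHDFS; a minimal sketch using the third-party hdfs Python package (NameNode address and paths are hypothetical):

```python
from hdfs import InsecureClient  # WebHDFS client from the `hdfs` package

# Connect through the NameNode's WebHDFS endpoint (address is hypothetical).
client = InsecureClient("http://namenode.example.com:9870", user="hdfs")

# Create a directory, upload a local file, and list the directory contents.
client.makedirs("/data/raw")
client.upload("/data/raw/events.csv", "events.csv")
print(client.list("/data/raw"))
```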
MapReduce - Distributed Data Processing Framework
Hadoop MapReduce is a framework for writing applications that process large amounts of data in parallel across large clusters of commodity hardware.
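The programming model can be sketched with Hadoop Streaming, which runs any executable as mapper and reducer (native jobs are typically written in Java); the word-count example below is illustrative:

```python
#!/usr/bin/env python3
# Word-count mapper and reducer usable with Hadoop Streaming.
# Hadoop runs many copies of each in parallel, feeding them splits of the
# input on stdin and collecting key/value pairs from stdout.
import sys

def mapper():
    # Emit one "word<TAB>1" pair per word of the input split.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key, so counts for a word are contiguous.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```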
YARN - Resource Manager
Hadoop YARN manages the cluster's resources and is responsible for job scheduling and resource allocation in the Hadoop ecosystem.
HBase - Distributed NoSQL
Apache HBase provides efficient random, real-time read/write access to large distributed datasets.
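A minimal sketch using the third-party happybase client through the HBase Thrift server (host, table, and column family are hypothetical):

```python
import happybase  # Thrift-based HBase client (requires the HBase Thrift server)

# Connection parameters and table name are hypothetical.
connection = happybase.Connection("hbase-thrift.example.com", port=9090)
table = connection.table("users")

# Random-access write and read on a single row.
table.put(b"user:42", {b"info:name": b"Ada", b"info:city": b"Paris"})
row = table.row(b"user:42")
print(row[b"info:name"])
```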
Hive - Data Exploration and Analysis
Apache Hive enables data-warehouse users who know SQL to query large datasets without having to learn Java or other programming languages.
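A minimal sketch using the PyHive client against HiveServer2 (host, credentials, and table are hypothetical):

```python
from pyhive import hive  # assumes HiveServer2 is reachable

# Connection parameters are hypothetical.
conn = hive.connect(host="hiveserver2.example.com", port=10000, username="analyst")
cursor = conn.cursor()

# Plain SQL (HiveQL) over a potentially very large table.
cursor.execute("SELECT country, COUNT(*) AS nb FROM sales GROUP BY country")
for country, nb in cursor.fetchall():
    print(country, nb)
```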
Iceberg - Table Format
Apache Iceberg is an open table format that gives consistent access to both current and historical table data (time travel), ensuring data integrity and consistency.
Kafka - Data Streaming
Kafka is the most commonly used event streaming platform for collecting, processing, storing, and integrating data at scale.
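A minimal sketch using the kafka-python client (broker address and topic are hypothetical):

```python
from kafka import KafkaConsumer, KafkaProducer  # kafka-python client

# Broker address and topic name are hypothetical.
producer = KafkaProducer(bootstrap_servers="kafka.example.com:9092")
producer.send("events", b'{"user": 42, "action": "login"}')
producer.flush()

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="kafka.example.com:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating when idle for 5 s
)
for message in consumer:
    print(message.value)
```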
Kerberos - Authentication and Identity Propagation
Kerberos is an open-source network authentication protocol that enables Single Sign-On (SSO) through a trusted mutual authentication service. It is used in Hadoop clusters for secure access.
Knox - Gateway and Unified Access
Apache Knox acts as an application proxy at the perimeter layer, receiving client requests and forwarding them to the appropriate service.
Livy - Spark Session Management
Livy provides a simple way to interact with an Apache Spark cluster via a REST API, enabling applications to submit and manage Spark jobs remotely.
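A minimal sketch of the REST workflow (the Livy URL and job file are hypothetical):

```python
import requests

LIVY_URL = "http://livy.example.com:8998"  # hypothetical Livy endpoint

# Submit a Spark application already available on the cluster as a batch job.
resp = requests.post(
    f"{LIVY_URL}/batches",
    json={"file": "hdfs:///jobs/wordcount.py", "args": ["/data/raw/events.csv"]},
)
batch_id = resp.json()["id"]

# Poll the batch state until Livy reports completion.
state = requests.get(f"{LIVY_URL}/batches/{batch_id}/state").json()["state"]
print(batch_id, state)
```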
NiFi - Data Flow Management and Automation
NiFi automates dataflows, ensuring compliance, privacy, and security in data exchange between systems.
Ozone - Mass Data Storage
Apache Ozone is a redundant, distributed object store optimized for Big Data workloads. It is designed for scalability and can handle billions of objects.
Ranger - Authorization and Auditing
Apache Ranger is a framework for enabling, monitoring, and managing data security across the Hadoop platform.
Ranger-KMS - Cryptographic Key Management
Ranger KMS is a cryptographic key management service designed to manage and protect encryption keys, ensuring the confidentiality and integrity of sensitive data.
Spark - Distributed Computing
Apache Spark is a unified analytics engine for large-scale distributed data processing.
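A minimal PySpark sketch (file path and column names are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-aggregation").getOrCreate()

# Read a (hypothetical) CSV file stored on the cluster.
df = (
    spark.read.option("header", True)
    .option("inferSchema", True)
    .csv("hdfs:///data/raw/sales.csv")
)

# The aggregation is expressed once and executed in parallel across the cluster.
df.groupBy("country").agg(F.sum("amount").alias("total")).show()

spark.stop()
```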
Sqoop - Data Ingestion
Apache Sqoop is a CLI tool for bulk data transfer between Apache Hadoop and structured datastores such as relational databases.
Solr - Search and Text Indexing
Apache Solr is an enterprise search platform built on Apache Lucene, designed for document retrieval.
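A minimal sketch using the third-party pysolr client (Solr URL, collection, and documents are hypothetical):

```python
import pysolr  # simple HTTP client for Solr

# Solr URL and collection name are hypothetical.
solr = pysolr.Solr("http://solr.example.com:8983/solr/documents", always_commit=True)

# Index a couple of documents, then run a full-text query against them.
solr.add([
    {"id": "1", "title": "Introduction to Hadoop"},
    {"id": "2", "title": "Search with Apache Solr"},
])
for result in solr.search("title:hadoop"):
    print(result["id"], result["title"])
```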
Superset - BI and Data Visualization
Apache Superset is a lightweight and robust web BI tool for exploring and visualizing data, with features for users of all skill levels.
Trino - Distributed SQL Query Engine
Trino is a distributed SQL engine for running complex queries on large volumes of data stored across multiple sources without moving or duplicating it.
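A minimal sketch using the Trino Python client (coordinator address, catalog, and table are hypothetical):

```python
import trino  # Trino Python client

# Coordinator address, catalog, and schema are hypothetical.
conn = trino.dbapi.connect(
    host="trino.example.com",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cursor = conn.cursor()

# The same SQL interface can query and join data from multiple catalogs.
cursor.execute("SELECT country, COUNT(*) FROM sales GROUP BY country")
print(cursor.fetchall())
```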
Zeppelin - Collaborative Notebook
Apache Zeppelin is a web-based notebook for interactive data ingestion, exploration, visualization, and collaboration with Hadoop and Spark.
ZooKeeper - Distributed Coordination
Apache ZooKeeper is a centralized, open-source service for coordinating distributed applications. It provides simple APIs on which higher-level distributed services can be built.
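A minimal sketch using the Kazoo client (ensemble address and znode paths are hypothetical):

```python
from kazoo.client import KazooClient  # Kazoo, a Python ZooKeeper client

# Ensemble address and znode paths are hypothetical.
zk = KazooClient(hosts="zookeeper.example.com:2181")
zk.start()

# znodes act as small, versioned pieces of shared state for coordination.
zk.ensure_path("/app/config")
zk.set("/app/config", b"max_workers=8")
value, stat = zk.get("/app/config")
print(value, stat.version)

zk.stop()
```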