What is the Hadoop Framework?
Hadoop is an open-source software framework used to store and process huge amounts of data. It is implemented in several distinct, specialized modules:
- Storage, principally employing the Hadoop File System (HDFS)
- Resource management and scheduling for computational tasks, using YARN (Yet Another Resource Negotiator)
- Distributed processing programming model based on MapReduce
- Common utilities and software libraries necessary for the entire Hadoop platform
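The MapReduce programming model mentioned above is easiest to see in the classic word-count example. The sketch below is plain Python with no Hadoop dependency; it mimics the three phases a Hadoop Streaming job would run (map emits (word, 1) pairs, shuffle groups pairs by key, reduce sums the counts). The function names are illustrative, not part of any Hadoop API.

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the line.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle phase: group values by key, as the framework does
    # between the map and reduce stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reducer(word, counts):
    # Reduce phase: sum the 1s emitted for each word.
    return (word, sum(counts))

lines = ["Hadoop stores data", "Hadoop processes data"]
pairs = [pair for line in lines for pair in mapper(line)]
result = dict(reducer(w, c) for w, c in shuffle(pairs))
print(result)  # each word mapped to its total count
```

In a real cluster the map and reduce functions run in parallel across many nodes, and the framework performs the shuffle over the network.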
The Hadoop Ecosystem
- Core Components
- Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
- WebHDFS: A REST API that exposes HDFS file system operations over HTTP.
- Yet Another Resource Negotiator (YARN)
- Slider: A tool to deploy applications on a YARN cluster, monitor them and make them larger or smaller as desired even while the application is running.
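WebHDFS lets any HTTP client talk to HDFS without a Hadoop client library. As a rough sketch, a client only has to construct a URL of the documented form `http://<host>:<port>/webhdfs/v1/<path>?op=...`; the host, port, user, and path below are placeholder values, while `LISTSTATUS` is a standard WebHDFS operation name.

```python
from urllib.parse import urlencode

def webhdfs_url(host, port, path, op, user, **params):
    # WebHDFS URLs take the form:
    #   http://<host>:<port>/webhdfs/v1/<path>?op=<OP>&user.name=<user>&...
    query = urlencode({"op": op, "user.name": user, **params})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

# Hypothetical NameNode address; against a running cluster one would
# issue the request with e.g. urllib.request.urlopen(url).
url = webhdfs_url("namenode.example.com", 9870, "/data/logs", "LISTSTATUS", "alice")
print(url)
```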
- Cluster Support Services
- ZooKeeper: A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
- Quorum Journal Manager: A group of JournalNodes that provides shared edit-log storage for HDFS NameNode high availability.
- Timeline Server: Maintains historical state and provides metrics visibility for YARN applications.
- Core Execution Engines
- MapReduce: A YARN-based system for parallel processing of large data sets.
- Tez: A generalized data-flow programming framework.
- Spark: A fast and general compute engine for Hadoop data.
- Spark Streaming: Spark's module for processing live data streams.
- MLlib: Spark's machine learning library.
- Advanced Execution Engines
- Pig: A high-level data-flow language and execution framework for parallel computation.
- DataFU: A collection of user-defined functions for Apache Pig.
- Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
- Mahout: A scalable machine learning and data mining library.
- Data Processing
- Flink: A stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications.
- Giraph: An iterative graph processing system built for high scalability.
- Storm: A stream processing framework.
- Solr: An open source enterprise search platform.
- Elasticsearch: A search engine based on Lucene.
- Drill: A framework for interactive analysis of large-scale datasets.
- Impala: A SQL query engine for data stored on a Hadoop cluster.
- Data Storage
- HBase: A distributed database that runs on top of HDFS.
- Phoenix: A relational database engine on top of HBase.
- Accumulo: A distributed key-value store on top of HDFS.
- Data Ingestion
- Sqoop: A tool to transfer bulk data between Hadoop and structured data stores (e.g. relational databases).
- Flume: A tool for collecting, aggregating and moving large amounts of log data from many sources to a centralized data store.
- Kafka: A general purpose distributed publish-subscribe messaging system.
- NFS Gateway: Allows clients to mount HDFS as part of their local file system using the NFS protocol.
- Data Governance, Workflow Management
- Oozie: A workflow scheduler system to manage Apache Hadoop jobs.
- Falcon: A feed processing and data management framework for data lifecycle and governance on Hadoop clusters.
- Data Security
- Knox: A REST API gateway that provides perimeter security for Hadoop clusters.
- Ranger: A framework for centralized security administration, fine-grained authorization, and auditing across the Hadoop platform.
- Sentry: A system for enforcing fine-grained, role-based authorization to data stored in Hadoop.
- User Interfaces
- Hue (Hadoop User Experience): A web interface that supports Apache Hadoop and its ecosystem.
- Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters.
- Ganglia: A scalable distributed monitoring system for high-performance computing systems.
- Nagios: A network and server monitoring system.
Common Terminology
- Node
- a simple computer, typically non-enterprise, commodity hardware
- Rack
- a collection of nodes physically stored close together, connected to the same network switch
- network bandwidth between two nodes in the same rack is greater than between two nodes on different racks
- Cluster
- a collection of racks
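Because same-rack bandwidth is higher, Hadoop prefers rack-local data transfers where it can. A minimal sketch of that preference (the topology and node names below are invented for illustration; real clusters supply this mapping through a configurable rack-awareness script):

```python
# Hypothetical topology: node name -> rack name.
TOPOLOGY = {
    "node1": "rack-a", "node2": "rack-a",
    "node3": "rack-b", "node4": "rack-b",
}

def pick_replica(reader, replicas, topology=TOPOLOGY):
    # Prefer a replica on the same rack as the reading node, since
    # intra-rack bandwidth exceeds inter-rack bandwidth; fall back
    # to any available replica otherwise.
    same_rack = [n for n in replicas if topology[n] == topology[reader]]
    return same_rack[0] if same_rack else replicas[0]

print(pick_replica("node1", ["node3", "node2"]))  # rack-local node2 wins
```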