What is the Hadoop Framework?

Hadoop is an open-source software framework for storing and processing very large data sets on clusters of commodity hardware. It is organized into several distinct, specialized modules:

  1. Storage, principally through the Hadoop Distributed File System (HDFS)
  2. Resource management and scheduling for computational tasks, using YARN (Yet Another Resource Negotiator)
  3. Distributed processing, based on the MapReduce programming model (see the sketch after this list)
  4. Common utilities and software libraries necessary for the entire Hadoop platform
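
To make the programming model concrete, here is a minimal sketch of the classic WordCount job written against the org.apache.hadoop.mapreduce API. Class names and paths are illustrative; the input and output locations would typically live in HDFS.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in this mapper's input split.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the per-word counts produced by all mappers.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            result.set(sum);
            context.write(word, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-aggregate locally to cut shuffle traffic
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

YARN schedules the map and reduce tasks across the cluster, preferring nodes that already hold the input blocks.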

The Hadoop Ecosystem

(Figure: the Hadoop ecosystem)

  • Core Components
    • Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data (see the FileSystem API sketch after this list).
      • WebHDFS: A REST API that exposes HDFS file operations over HTTP.
    • Yet Another Resource Negotiator (YARN): A framework for cluster resource management and job scheduling.
      • Slider: A tool to deploy applications on a YARN cluster, monitor them, and scale them up or down while they are running.
  • Cluster Support Services
    • ZooKeeper: A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
    • Quorum Journal Manager: Replicates the HDFS NameNode's edit log across a quorum of JournalNodes to support NameNode high availability.
    • Timeline Server: Maintains historical state and provides metrics visibility for YARN applications.
  • Core Execution Engines
    • MapReduce: A YARN-based system for parallel processing of large data sets.
    • Tez: A generalized data-flow programming framework.
    • Spark: A fast and general compute engine for Hadoop data.
      • Spark Streaming: Spark's stream-processing extension.
      • MLlib: Spark's machine learning library.
  • Advanced Execution Engines
    • Pig: A high-level data-flow language and execution framework for parallel computation.
      • DataFU: A collection of user-defined functions for Apache Pig.
    • Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
    • Mahout: A scalable machine learning and data mining library.
  • Data Processing
    • Flink: A stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications.
    • Giraph: An iterative graph processing system built for high scalability.
    • Storm: A distributed real-time stream processing framework.
    • Solr: An open-source enterprise search platform built on Apache Lucene.
    • Elasticsearch: A search engine based on Lucene.
    • Drill: A framework for interactive analysis of large-scale datasets.
    • Impala: A SQL query engine for data stored on a Hadoop cluster.
  • Data Storage
    • HBase: A distributed database that runs on top of HDFS.
      • Phoenix: A relational database engine on top of HBase.
    • Accumulo: A distributed key-value store on top of HDFS.
  • Data Ingestion
    • Sqoop: A tool to transfer bulk data between Hadoop and structured data stores (e.g. relational databases).
    • Flume: A tool for collecting, aggregating and moving large amounts of log data from many sources to a centralized data store.
    • Kafka: A general-purpose distributed publish-subscribe messaging system (see the producer sketch after this list).
    • NFS Gateway: Lets clients mount HDFS as part of their local file system via the NFS protocol.
  • Data Governance, Workflow Management
    • Oozie: A workflow scheduler system to manage Apache Hadoop jobs.
    • Falcon: A data governance and feed-management framework for coordinating data pipelines, replication, and lifecycle on Hadoop clusters.
  • Data Security
    • Knox: An application gateway that provides a single, secured point of access to the REST APIs and UIs of a Hadoop cluster.
    • Ranger: A framework for centralized security administration and fine-grained access control across Hadoop components.
    • Sentry: A system for enforcing fine-grained, role-based authorization to data and metadata on a Hadoop cluster.
  • User Interfaces
    • Hue (Hadoop User Experience): A web interface that supports Apache Hadoop and its ecosystem.
    • Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters.
      • Ganglia: A scalable distributed monitoring system for high-performance computing systems.
      • Nagios: A network and server monitoring system.
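
As referenced under HDFS above, applications typically reach the file system through the org.apache.hadoop.fs.FileSystem API. The following sketch writes a small file and reads it back; the NameNode address and path are made up, and in practice fs.defaultFS would come from core-site.xml.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // illustrative NameNode address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/tmp/hello.txt");

            // HDFS is write-once, read-many: create the file, write, close.
            try (FSDataOutputStream out = fs.create(file, /* overwrite */ true)) {
                out.write("hello from HDFS\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read the file back line by line.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }
}
```

WebHDFS exposes the same file operations over HTTP for clients that cannot link against the Java API.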
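
Kafka's role in data ingestion is easiest to see from its client API. Below is a minimal, illustrative producer; the broker address, topic name, and record contents are made up.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // illustrative broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one record to a hypothetical "web-logs" topic;
            // records with the same key are routed to the same partition.
            producer.send(new ProducerRecord<>("web-logs", "host42", "GET /index.html 200"));
        }
    }
}
```

Downstream consumers (Flume, Spark Streaming, Flink, Storm, and so on) can then pull the stream into HDFS or process it directly.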

Common Terminology

  • Node
    • a single computer, typically commodity (non-enterprise) hardware
  • Rack
    • a collection of nodes physically stored close together, connected to the same network switch
    • network bandwidth between two nodes in the same rack is greater than the bandwidth between two nodes on different racks, so Hadoop prefers to keep data transfers rack-local (see the sketch after this list)
  • Cluster
    • a collection of racks
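
That bandwidth asymmetry is why Hadoop places replicas and schedules tasks rack-aware, preferring node-local over rack-local over off-rack transfers. Here is a minimal, self-contained sketch of that preference order (plain Java with hypothetical names, not a Hadoop API):

```java
import java.util.Comparator;
import java.util.List;

public class ReplicaChooser {

    // Toy model of a cluster member: a host living in a named rack.
    record Node(String host, String rack) {}

    // Cheaper transfers rank first: same node, then same rack, then off-rack.
    static int cost(Node reader, Node replica) {
        if (reader.host().equals(replica.host())) return 0; // node-local
        if (reader.rack().equals(replica.rack())) return 1; // rack-local: one switch hop
        return 2;                                           // off-rack: crosses rack switches
    }

    static Node closestReplica(Node reader, List<Node> replicas) {
        return replicas.stream()
                .min(Comparator.comparingInt(replica -> cost(reader, replica)))
                .orElseThrow();
    }

    public static void main(String[] args) {
        Node reader = new Node("node3", "rack1");
        List<Node> replicas = List.of(
                new Node("node7", "rack2"),
                new Node("node4", "rack1"),
                new Node("node9", "rack3"));
        // Prints "node4": the rack-local replica, so the read never leaves rack1's switch.
        System.out.println(closestReplica(reader, replicas).host());
    }
}
```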

(Figure: a Hadoop cluster)