Hadoop is best known for Map Reduce and it’s Distributed File System (HDFS). Recently other productivity tools developed on top of these will form a complete Ecosystem of Hadoop. Most of the projects are hosted under Apache Software Foundation. Hadoop Ecosystem projects are listed below.
A set of components and interfaces for Distributed File System and I/O (serialization, Java RPC, Persistent data structures)
A distributed file system that runs on large clusters of commodity hardware. Hadoop Distributed File System, HDFS renamed form NDFS. Scalable data store that stores semi-structured, un-structured and structured data.
Map Reduce is the distributed, parallel computing programming model for Hadoop. Inspired from Google Map Reduce research paper. Hadoop includes implementation of Map Reduce programming model. In Map Reduce there are two phases, not surprisingly Map and Reduce. To be precise in between Map and Reduce phase, there is another phase called sort and shuffle. Job Tracker in Name Node machine manages other cluster nodes. Map Reduce programming can be written in Java. If you like SQL or other non- Java languages, you are still in luck. You can use utility called Hadoop Streaming.
A utility to enable Map Reduce code in many languages like C, Perl, Python, C++, Bash etc., Examples include a Python mapper and AWK reducer.
A serialization system for efficient, cross-language RPC and persistent data storage. Avro is a framework for performing remote procedure calls and data serialization. In the context of Hadoop, it can be used to pass data from one program or language to another, e.g. from C to Pig. It is particularly suited for use with scripting languages such as Pig, because data is always stored with its schema in Avro.
Apache Thrift allows you to define data types and service interfaces in a simple definition file. Taking that file as input, the compiler generates code to be used to easily build RPC clients and servers that communicate seamlessly across programming languages. Instead of writing a load of boilerplate code to serialize and transport your objects and invoke remote methods, you can get right down to business.
Hive and Hue
If you like SQL, you would be delighted to hear that you can write SQL and Hive convert it to a Map Reduce job. But, you don’t get a full ANSI-SQL environment. Hue gives you a browser based graphical interface to do your Hive work.
Hue features a File Browser for HDFS, a Job Browser for Map Reduce/YARN, an HBase Browser, query editors for Hive, Pig, Cloudera Impala and Sqoop2.It also ships with an Oozie Application for creating and monitoring workflows, a Zookeeper Browser and an SDK.
A high-level programming data flow language and execution environment to do Map Reduce coding The Pig language is called Pig Latin. You may find naming conventions some what un-conventional, but you get incredible price-performance and high availability.
JAQL is a functional, declarative programming language designed especially for working with large volumes of structured, semi-structured and unstructured data. As its name implies, a primary use of JAQL is to handle data stored as JSON documents, but JAQL can work on various types of data. For example, it can support XML, comma-separated values (CSV) data and flat files. A “SQL within JAQL” capability lets programmers work with structured SQL data while employing a JSON data model that’s less restrictive than its Structured Query Language counterparts.
Sqoop provides a bi-directional data transfer between Hadoop -HDFS and your favorite relational database. For example you might be storing your app data in relational store such as Oracle, now you want to scale your application with Hadoop so you can migrate Oracle database data to Hadoop HDFS using Sqoop.
Manages Hadoop workflow. This doesn’t replace your scheduler or BPM tooling, but it will provide if-then-else branching and control with Hadoop jobs.
A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building the highly scalable applications. It is used to manage synchronization for cluster.
Based on Google’s Bigtable, HBase “is an open-source, distributed, version, column-oriented store” that sits on top of HDFS. A super scalable key-value store. It works very much like a persistent hash-map (for python developers think like a Dictionary). It is not a conventional relational database. It is a distributed, column oriented database. HBase uses HDFS for it’s underlying. Supports both batch-style computations using Map Reduce and point queries for random reads.
A column oriented NoSQL data store which offers scalability, high availability with out compromising on performance. It perfect platform for commodity hardware and cloud infrastructure.Cassandra’s data model offers the convenience of column indexes with the performance of log-structured updates, strong support for de-normalization and materialized views, and powerful built-in caching.
A real time loader for streaming your data into Hadoop. It stores data in HDFS and HBase.Flume “channels” data between “sources” and “sinks” and its data harvesting can either be scheduled or event-driven. Possible sources for Flume include Avro, files, and system logs, and possible sinks include HDFS and HBase.
Machine Learning for Hadoop, used for predictive analytics and other advanced analysis.
There are currently four main groups of algorithms in Mahout:
- recommendations, a.k.a. collective filtering
- classification, a.k.a categorization
- frequent item set mining, a.k.a parallel frequent pattern mining
Mahout is not simply a collection of pre-existing algorithms; many machine learning algorithms are intrinsically non-scalable; that is, given the types of operations they perform, they cannot be executed as a set of parallel processes. Algorithms in the Mahout library belong to the subset that can be executed in a distributed fashion.
Makes the HDFS system to look like a regular file system so that you can use ls, rm, cd etc., directly on HDFS data.
Apache Whirr is a set of libraries for running cloud services.
Whirr provides a cloud-neutral way to run services. You don’t have to worry about the idiosyncrasies of each provider.A common service API. The details of provisioning are particular to the service. Smart defaults for services. You can get a properly configured system running quickly, while still being able to override settings as needed.
You can also use Whirr as a command line tool for deploying clusters.
An open source graph processing API like Pregel from Google
Chukwa, an incubator project on Apache, is a data collection and analysis system built on top of HDFS and Map Reduce. Tailored for collecting logs and other data from distributed monitoring systems, Chukwa provides a workflow that allows for incremental data collection, processing and storage in Hadoop. It is included in the Apache Hadoop distribution as an independent module.
Apache Drill, an incubator project on Apache, is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Drill is the open source version of Google’s Dremel system which is available as an IaaS service called Google Big Query. One explicitly stated design goal is that Drill is able to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds.
Released by Cloudera, Impala is an open-source project which, like Apache Drill, was inspired by Google’s paper on Dremel; the purpose of both is to facilitate real-time querying of data in HDFS or HBase. Impala uses an SQL-like language that, though similar to HiveQL, is currently more limited than HiveQL. Because Impala relies on the Hive Meta store, Hive must be installed on a cluster in order for Impala to work.
The secret behind Impala’s speed is that it “circumvents Map Reduce to directly access the data through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs.” (Source: Cloudera)