Tuesday 29 September 2015

A very small introduction to Hadoop and MapReduce

Hadoop is an open-source software framework from the Apache Software Foundation, capable of processing large amounts of heterogeneous data in a distributed fashion (via MapReduce) across clusters of commodity hardware, on top of a distributed storage layer (HDFS).  Hadoop exposes a simplified programming model, and the result is a reliable shared storage and analysis system.
MapReduce is a software framework that lets developers write programs that perform complex computations on massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers.  MapReduce libraries exist in many programming languages (most commonly Java), with different levels of optimization.  It works by breaking a large computation down into many smaller tasks, assigning those tasks to individual worker/slave nodes, and taking care of coordinating them and consolidating their results.  A MapReduce program is composed of a Map() procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies).  The classic word-count job sketched below shows both phases in code.
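
As an illustration, here is a minimal sketch of the classic word-count job written against Hadoop's MapReduce Java API.  The class name and the input/output paths passed on the command line are placeholders; the mapper emits (word, 1) pairs and the reducer sums the counts per word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: split each input line into words and emit (word, 1)
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The framework handles splitting the input, scheduling the map and reduce tasks on the worker nodes, and shuffling the intermediate (word, count) pairs so that all values for a given key arrive at the same reducer.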

Hadoop Distributed File System (HDFS) is a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.  When data is pushed to HDFS, it is automatically split into blocks (128 MB by default) that are stored and replicated across the various DataNodes, ensuring high availability and fault tolerance.
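
A small sketch, assuming a NameNode reachable at hdfs://namenode:8020 and illustrative file paths, of copying a local file into HDFS through Hadoop's Java FileSystem API and asking the cluster for its default block size and replication factor:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Illustrative NameNode address; in practice this comes from core-site.xml
    conf.set("fs.defaultFS", "hdfs://namenode:8020");
    FileSystem fs = FileSystem.get(conf);

    // Copy a local file into HDFS; HDFS splits it into blocks and replicates them
    Path target = new Path("/data/input.txt");
    fs.copyFromLocalFile(new Path("/tmp/input.txt"), target);

    // Report the defaults the cluster applies to new files at this path
    System.out.println("Default block size:  " + fs.getDefaultBlockSize(target));
    System.out.println("Default replication: " + fs.getDefaultReplication(target));

    fs.close();
  }
}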

The NameNode holds the metadata of the cluster: the other nodes in the Hadoop cluster, the files present in HDFS, the blocks that make up each file and where those blocks are located, and other information needed to operate the cluster.  Each DataNode is responsible for holding the actual data blocks.  The JobTracker keeps track of the individual tasks/jobs assigned to each of the nodes and coordinates the exchange of information and results.  Each TaskTracker is responsible for running the tasks/computations assigned to it.
