Apache Hadoop:
Hadoop is an open-source, Java-based programming framework that supports the distributed storage and parallel processing of extremely large data sets across clusters of computers.
Fundamental terms of the Hadoop environment:
1)Hadoop Cluster:
A set/group of machines running HDFS and MapReduce programs is known as a Hadoop Cluster.
A cluster can have as many or as few nodes as needed; in general, more nodes means better performance and capacity.
2)Node:
The individual machines in a cluster are known as Nodes.
Hadoop core components:
Hadoop consists of two core components:
Hadoop Distributed File System (HDFS):
*A distributed file system that serves as Hadoop's storage unit and provides high-throughput access to application data.
*Responsible for storing data on the cluster.
*Data is split into blocks and distributed across multiple nodes in the cluster.
*Each block is typically 64 MB (Hadoop 1.x) or 128 MB (Hadoop 2.x) by default.
Key Feature of HDFS:
To ensure reliability and availability of data, each block is replicated multiple times (three times by default) across different nodes in the cluster.
Note: HDFS follows a write-once, read-many access model.
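As a minimal sketch of how an application sees this storage layer, the Java snippet below copies a local file into HDFS and then prints its block size, replication factor, and the nodes holding each block. It uses the standard org.apache.hadoop.fs.FileSystem API; the file paths are hypothetical, and it assumes the cluster's core-site.xml/hdfs-site.xml are on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/input.txt");  // hypothetical HDFS path
        fs.copyFromLocalFile(new Path("input.txt"), file);

        // Inspect how HDFS split the file into blocks and where the replicas live.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size  : " + status.getBlockSize());
        System.out.println("Replication : " + status.getReplication());
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Block at offset " + loc.getOffset()
                    + " stored on " + String.join(", ", loc.getHosts()));
        }
    }
}

The same block and replica information can also be inspected from the command line with hdfs fsck <path> -files -blocks -locations.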
Hadoop MapReduce:
MapReduce is used to process large data sets in parallel across the Hadoop Cluster.
MapReduce consists of two tasks:
i)Map task
ii)Reduce task
Between the two tasks there is a shuffle-and-sort phase, which the framework runs automatically.
i)Map task:
Each Map task operates on a discrete/individual portion of the overall dataset, i.e., one HDFS block of data.
When the Map tasks complete, their intermediate key/value pairs are sent on to the Reduce tasks.
ii)Reduce task:
The Reduce task is an aggregation step over the keys generated by the Map tasks; a worked example follows below.
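To make the Map / shuffle-and-sort / Reduce flow concrete, here is the classic WordCount program in the style of the official Hadoop MapReduce tutorial: the Mapper emits a (word, 1) pair for every word in its input split, the framework shuffles and sorts the pairs by key automatically, and the Reducer sums the counts for each word. The input and output paths are hypothetical HDFS directories passed as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map task: runs on one input split (typically one HDFS block) and
    // emits an intermediate (word, 1) pair for every word it sees.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce task: after the automatic shuffle-and-sort, receives each key
    // together with all of its values and aggregates them into a total count.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, it runs with a command along the lines of: hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output (paths hypothetical).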
A newer component (introduced in Hadoop 2.x) that sits between HDFS and processing engines such as MapReduce:
Hadoop YARN:
A framework for job scheduling and cluster resource management.
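As a small illustrative sketch (not a complete cluster setup), the standard Hadoop 2.x configuration keys below direct a MapReduce job through YARN's ResourceManager instead of the local runner. The hostname is a hypothetical placeholder, and on a real cluster these values normally live in mapred-site.xml and yarn-site.xml rather than in code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class YarnSubmitSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Standard Hadoop 2.x key: route the job to YARN instead of the local runner.
        conf.set("mapreduce.framework.name", "yarn");
        // Hypothetical ResourceManager host; usually configured in yarn-site.xml.
        conf.set("yarn.resourcemanager.hostname", "rm.example.com");
        Job job = Job.getInstance(conf, "word count on yarn");
        System.out.println("Framework: " + conf.get("mapreduce.framework.name"));
    }
}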