Apache Hadoop:
Hadoop is an open-source, Java-based programming framework that supports the distributed storage and parallel processing of extremely large data sets across clusters of computers.
Fundamental terms of the Hadoop environment:
1)Hadoop Cluster:
A set/group of machines running HDFS and MapReduce programs is known as a Hadoop Cluster.
A cluster can have as many or as few nodes as needed; in general, more nodes means better performance and capacity.
2)Node:
The individual machines in a cluster are known as Nodes.
Hadoop core components:
Hadoop consists of two core components:
Hadoop Distributed File System (HDFS):
*A distributed file system that serves as Hadoop's storage unit and provides high-throughput access to application data.
*Responsible for storing data on the cluster.
*Data is split into blocks and distributed across multiple nodes in the cluster.
*Each block is typically 64 MB (Hadoop 1.x) or 128 MB (Hadoop 2.x) by default.
Key Feature of HDFS:
To ensure reliability and availability of data, each block is replicated multiple times (three times by default) across different nodes in the cluster.
Note: HDFS follows a write-once, read-many access model.
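As a minimal sketch of how an application sees this storage layer, the Java snippet below copies a local file into HDFS and then prints its block size, replication factor, and the nodes holding each block. It uses the standard org.apache.hadoop.fs.FileSystem API; the file paths are hypothetical, and it assumes the cluster's core-site.xml/hdfs-site.xml are on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/input.txt");  // hypothetical HDFS path
        fs.copyFromLocalFile(new Path("input.txt"), file);

        // Inspect how HDFS split the file into blocks and where the replicas live.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size  : " + status.getBlockSize());
        System.out.println("Replication : " + status.getReplication());
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Block at offset " + loc.getOffset()
                    + " stored on " + String.join(", ", loc.getHosts()));
        }
    }
}

The same block and replica information can also be inspected from the command line with hdfs fsck <path> -files -blocks -locations.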
Hadoop MapReduce:
MapReduce is used to process large data sets in parallel across the Hadoop Cluster.
MapReduce consists of two tasks:
i)Map task
ii)Reduce task
Between the two tasks there is a shuffle-and-sort phase, which the framework runs automatically.
i)Map task:
Each Map task operates on a discrete/individual portion of the overall dataset, i.e., one HDFS block of data.
When the Map tasks complete, their intermediate key/value pairs are sent on to the Reduce tasks.
ii)Reduce task:
The Reduce task is an aggregation step over the keys generated by the Map tasks; a worked example follows below.
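To make the Map / shuffle-and-sort / Reduce flow concrete, here is the classic WordCount program in the style of the official Hadoop MapReduce tutorial: the Mapper emits a (word, 1) pair for every word in its input split, the framework shuffles and sorts the pairs by key automatically, and the Reducer sums the counts for each word. The input and output paths are hypothetical HDFS directories passed as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map task: runs on one input split (typically one HDFS block) and
    // emits an intermediate (word, 1) pair for every word it sees.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce task: after the automatic shuffle-and-sort, receives each key
    // together with all of its values and aggregates them into a total count.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, it runs with a command along the lines of: hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output (paths hypothetical).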
A newer component (introduced in Hadoop 2.x) that sits between HDFS and processing engines such as MapReduce:
Hadoop YARN:
A framework for job scheduling and cluster resource management.
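As a small illustrative sketch (not a complete cluster setup), the standard Hadoop 2.x configuration keys below direct a MapReduce job through YARN's ResourceManager instead of the local runner. The hostname is a hypothetical placeholder, and on a real cluster these values normally live in mapred-site.xml and yarn-site.xml rather than in code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class YarnSubmitSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Standard Hadoop 2.x key: route the job to YARN instead of the local runner.
        conf.set("mapreduce.framework.name", "yarn");
        // Hypothetical ResourceManager host; usually configured in yarn-site.xml.
        conf.set("yarn.resourcemanager.hostname", "rm.example.com");
        Job job = Job.getInstance(conf, "word count on yarn");
        System.out.println("Framework: " + conf.get("mapreduce.framework.name"));
    }
}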