Saturday 2 September 2017

Hadoop

Apache Hadoop: 

Hadoop is an open-source software framework that allows us to store and process large data sets in a parallel and distributed fashion.

Note:
The library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

We might wonder: why Hadoop for Big Data?
Below are the reasons:

  • Distributed Computing
  • Parallel processing
  • Low latency
  • Data availability
  • Fault tolerance
  • Cross platform

There are two parts of Hadoop:
  1. Hadoop Distributed File System (HDFS)
  2. MapReduce

1) Hadoop Distributed File System (HDFS):
The Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications.

The three main HDFS concepts are:
1. Sharding
2. Distribution
3. Replication Factor

1. Sharding:

Sharding means a single file is split into multiple blocks.

Highlights:

In Hadoop version 1, the default block size is 64 MB.
In Hadoop version 2, the default block size is 128 MB.
The block size is not fixed; we can increase or decrease it. For example, with the 128 MB default a 300 MB file is stored as three blocks (128 MB + 128 MB + 44 MB).

Note:
The block size can be set and changed in the hdfs-site.xml file (property dfs.blocksize).
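As a minimal sketch (using the standard Hadoop Java client API; the 256 MB value and the file path are purely illustrative, not recommendations), the block size can also be overridden from client code through the Configuration object:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();

        // Override the cluster default (128 MB in Hadoop 2) with 256 MB (illustrative value)
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);

        // Files created through this FileSystem instance use the overridden block size
        try (FSDataOutputStream out = fs.create(new Path("/tmp/blocksize-demo.txt"))) {
            out.writeUTF("stored with a 256 MB block size");
        }
    }
}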

2. Distribution:

Each block produced by sharding is stored across the cluster, on different machines.

3. Replication Factor:

Each block is replicated (copied) across the cluster.

  • Default replication factor is 3.
  • Minimum replication factor is 1.
  • Maximum replication factor is 512.
Note:
dfs.replication - this property in hdfs-site.xml is where we can change the replication factor.
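As a small sketch (standard Hadoop Java client API; the path and the replication value of 2 are just illustrative), the replication factor can also be changed per client or per existing file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Default for new files created by this client (illustrative value)
        conf.setInt("dfs.replication", 2);

        FileSystem fs = FileSystem.get(conf);

        // Change the replication factor of an already existing file to 2
        boolean scheduled = fs.setReplication(new Path("/tmp/blocksize-demo.txt"), (short) 2);
        System.out.println("Re-replication scheduled: " + scheduled);
    }
}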

These are the reasons for using the HDFS concepts above:
  • Data Availability
  • Fault Tolerance
- these work based on the replication factor
  • Parallel Processing
- this works based on distribution.
Note:
HDFS follows a MASTER/SLAVE architecture.


HDFS core components (daemons):
  • NameNode
  • DataNode
  • Secondary NameNode
i) NameNode:
  • It is the Master node.
  • Maintains and manages all the DataNodes.
  • It stores only metadata, not the actual data, i.e. information about the data blocks, e.g. location of blocks, size of files, permissions, etc.
  • Receives heartbeats (every 3 seconds) and block reports from the DataNodes.
  • By default there is only one NameNode, so it is a single point of failure; High Availability (HA) addresses this with an Active --> Standby NameNode pair.

NameNode stores metadata in two files:

  1. FsImage  - a snapshot of the complete file system namespace, persisted on the hard disk.
  2. EditLog  - the changes made since the last FsImage; the up-to-date namespace is also kept in RAM for fast access.
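As a hedged sketch of what this metadata looks like from a client's point of view (the file path is a placeholder), the FileSystem API exposes the size, permissions, replication and block locations that the NameNode tracks:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MetadataLookup {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/input/employee.csv"));

        // Metadata kept by the NameNode: size, permissions, replication, block size
        System.out.println("size        : " + status.getLen());
        System.out.println("permissions : " + status.getPermission());
        System.out.println("replication : " + status.getReplication());
        System.out.println("block size  : " + status.getBlockSize());

        // Which DataNodes hold each block of the file
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("block at offset " + loc.getOffset()
                    + " on hosts " + String.join(",", loc.getHosts()));
        }
    }
}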

ii) DataNode:
  • It is a Slave node.
  • Stores the actual data.
  • Serves read and write requests from the clients.
Note:
Version 1 - up to about 4,000 DataNodes per cluster.
Version 2 - up to about 35,000 DataNodes per cluster.
Highlight:
If the heartbeat signal is not received every 3 seconds, the NameNode waits for up to 10 minutes. If no heartbeat arrives within those 10 minutes, it declares the DataNode a dead node.
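As a rough sketch of where these intervals come from (assuming the standard hdfs-site.xml property names dfs.heartbeat.interval and dfs.namenode.heartbeat.recheck-interval, and the commonly documented dead-node formula 2 * recheck-interval + 10 * heartbeat-interval):

import org.apache.hadoop.conf.Configuration;

public class DeadNodeTimeout {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Default heartbeat interval: 3 seconds
        long heartbeatSec = conf.getLong("dfs.heartbeat.interval", 3);
        // Default recheck interval: 5 minutes (in milliseconds)
        long recheckMs = conf.getLong("dfs.namenode.heartbeat.recheck-interval", 300_000);

        // Commonly documented formula for declaring a DataNode dead (~10 minutes by default)
        long timeoutMs = 2 * recheckMs + 10 * heartbeatSec * 1000;
        System.out.println("DataNode is marked dead after ~" + timeoutMs / 1000 + " seconds");
    }
}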
iii) Secondary NameNode:
  • It is the checkpoint node: it merges the EditLog into the FsImage file to form an updated, persisted FsImage.
  • It does this periodically (default: every 1 hour, controlled by dfs.namenode.checkpoint.period).
  • This keeps the NameNode's FsImage up to date with recent changes.


2) MapReduce:

MapReduce is a programming framework that allows us to perform distributed and parallel processing of large data sets in a distributed environment.

Note:
  • Processing works like the Divide & Conquer method (divide the job & combine the results).
  • Input/Output for MapReduce --> HDFS (read the data from HDFS & write the output to HDFS), as the driver sketch below shows.
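A minimal, hedged sketch of that HDFS-in/HDFS-out flow (the input and output paths are placeholders; the base Mapper and Reducer classes are used as identity pass-throughs):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PassThroughJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "hdfs-in-hdfs-out");
        job.setJarByClass(PassThroughJob.class);

        // The base Mapper and Reducer are identity functions: records are
        // read from HDFS, shuffled, and written straight back to HDFS.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Placeholder HDFS paths - both input and output live in HDFS
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}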
Version 1 - MapReduce daemons:
  • JobTracker
  • TaskTracker
iv) JobTracker:
  • Coordination,
  • Failure handling,
  • Resource management,
  • Scheduling.
v) TaskTracker:

It performs the tasks assigned to it by the JobTracker.

Version 2 - YARN daemons (Yet Another Resource Negotiator):
  • ResourceManager
  • NodeManager
iv) ResourceManager:
  • Plays the role of the JobTracker.
v) NodeManager:
  • Plays the role of the TaskTracker.
Why we use MapReduce:
  • To process unstructured data into structured data for analysis.
  • Processing of large datasets can be done in a parallel and distributed manner.

Four tasks in MapReduce:
  1. Map Task
  2. Reducer Task
  3. Combiner Task
  4. Partitioner Task
1. Map Task:
  • The map work is assigned to the worker nodes.
  • Number of mappers = number of blocks / input splits (e.g. a 1 GB file with 128 MB blocks gives 8 input splits, so 8 mappers).

2. Reducer Task:
  • Performs aggregate functions.
  • Number of reducers = number of output files.
  • Select e_name, e_id from employee --> Map-only job.
  • Select e_name, e_id, sum(e_salary) from employee --> Map & Reduce job (see the sketch after this list).
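A minimal, hedged sketch of that second query as a MapReduce job (it assumes a hypothetical CSV-like employee file with e_id, e_name and e_salary columns; the field positions and class names are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SalarySum {

    // Map: emit (e_name, e_salary) for every input line "e_id,e_name,e_salary"
    public static class SalaryMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length >= 3) {
                String name = fields[1].trim();
                // assumes e_salary is numeric in the third column
                long salary = Long.parseLong(fields[2].trim());
                context.write(new Text(name), new LongWritable(salary));
            }
        }
    }

    // Reduce: aggregate - sum all salaries seen for the same e_name
    public static class SalaryReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable v : values) {
                total += v.get();
            }
            context.write(key, new LongWritable(total));
        }
    }
}

Because a sum is associative, the same SalaryReducer could also be registered as a combiner with job.setCombinerClass(SalaryReducer.class), which is the idea behind the Combiner task described next.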

3. Combiner Task:
  • A local reducer that runs on the map side, reducing the mapper's output before it is shuffled to the reducers.
4. Partitioner Task:

The map output is split (partitioned) among the reducers to increase performance.

Two types of partitioning:
Default Partitioner: partitions based on the hash value of the key.
Custom Partitioner: we decide which data goes to which partition (see the sketch below).
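A small, hedged sketch of a custom partitioner (the 50,000 salary threshold and the two-reducer split are purely illustrative; it would be wired into a job with job.setPartitionerClass(SalaryPartitioner.class) and job.setNumReduceTasks(2)):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sends "high salary" records to reducer 0 and everything else to reducer 1
public class SalaryPartitioner extends Partitioner<Text, LongWritable> {
    @Override
    public int getPartition(Text key, LongWritable value, int numReduceTasks) {
        if (numReduceTasks < 2) {
            return 0; // only one reducer available, nothing to decide
        }
        return value.get() >= 50_000 ? 0 : 1;
    }
}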

Note:
Both the combiner and the partitioner are used for optimization.

