Thursday 3 August 2017

7) Hive - Partitioning

Partitioning:

The main purpose of partitioning is to improve the performance of query processing.

Rationale behind the partition concept:
  • Consider a simple query with a filter condition, such as SELECT * FROM table_name WHERE column_name = "some_value". Even though the query is simple, Hive still reads the entire data set, and this becomes a bottleneck for MapReduce jobs over large data sets.
  • We can overcome this issue by implementing partitions in Hive. Hive makes partitioning easy to implement through the automatic partition scheme applied when the table is created.
  • Hive partitioning divides a large amount of data into a number of sub-folders based on the values of the table's partition columns.

Note: If you want to use partitioning in Hive, you should specify the PARTITIONED BY (COL1, COL2, …etc.) clause while creating the Hive table.
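For reference, each partition becomes its own sub-directory under the table's directory in the warehouse. For example, a table RADIO_PARTITION partitioned by IS_SHARED would be laid out roughly as below (the path assumes the default warehouse location and the test database used later in this post; the is_shared=0 directory is shown only for illustration):

hdfs://localhost:9000/user/hive/warehouse/test.db/radio_partition/is_shared=0/
hdfs://localhost:9000/user/hive/warehouse/test.db/radio_partition/is_shared=1/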

Types of Partition:

Custom Partition:
Custom partitions are user-defined partitions.

i) Static Partition:
  • The partition column values are known at COMPILE TIME (they are specified explicitly in the query).
  • Static partitions are usually preferred when loading big files into Hive tables.
  • We can alter (add or drop) partitions in static partitioning; see the sketch after this list.
  • Static partitioning can be performed on Hive managed tables or external tables.
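A minimal sketch of altering static partitions, using the RADIO_PARTITION table created later in this section (the partition value 0 is shown only to illustrate the syntax):

ALTER TABLE RADIO_PARTITION ADD PARTITION (IS_SHARED=0);
ALTER TABLE RADIO_PARTITION DROP PARTITION (IS_SHARED=0);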

Hive table creation:

Step-1: Create a Hive table

CREATE TABLE IF NOT EXISTS RADIO(USER_ID STRING,TRACK_ID INT,IS_SHARED INT,RADIO INT,IS_SKIPPED INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' STORED AS TEXTFILE;

Step-2: Load data into the Hive table

LOAD DATA LOCAL INPATH '/home/hadoop/Mano/Radio_log.txt' INTO TABLE RADIO;

DATA SET OF Radio_log.txt:

12345|521|0|1|0
34567|521|1|0|0
12345|222|0|1|1
23456|225|1|0|0


1. Using an existing column:
In this type of partition, the partition column is one that is already present in the source table. Note that the partition column (here IS_SHARED) is moved out of the regular column list and into the PARTITIONED BY clause.

Example:

Step-1: Create a Hive partition table

hive> CREATE TABLE IF NOT EXISTS RADIO_PARTITION(USER_ID STRING,TRACK_ID INT,RADIO INT,IS_SKIPPED INT) PARTITIONED BY (IS_SHARED INT);
OK
Time taken: 0.169 seconds

Step-2: Insert data into the partitioned table
INSERT OVERWRITE TABLE RADIO_PARTITION PARTITION (IS_SHARED=1) SELECT USER_ID,TRACK_ID,RADIO,IS_SKIPPED FROM RADIO WHERE IS_SHARED=1;

OUTPUT:

hive> INSERT OVERWRITE TABLE RADIO_PARTITION PARTITION (IS_SHARED=1) SELECT USER_ID,TACK_ID,RADIO,IS_SKIPPED FROM RADIO WHERE IS_SHARED=1;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20170807121225_3997eead-24be-4591-a11c-b64274216160
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1501077058026_0057, Tracking URL = http://Manohar:8088/proxy/application_1501077058026_0057/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job  -kill job_1501077058026_0057
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2017-08-07 12:12:34,007 Stage-1 map = 0%,  reduce = 0%
2017-08-07 12:12:40,470 Stage-1 map = 100%,  reduce = 0%
MapReduce Total cumulative CPU time: 2 seconds 260 msec
Ended Job = job_1501077058026_0057
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://localhost:9000/user/hive/warehouse/test.db/radio_partition/is_shared=1/.hive-staging_hive_2017-08-07_12-12-25_461_889701329221204816-1/-ext-10000
Loading data to table test.radio_partition partition (is_shared=1)
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 2.26 sec   HDFS Read: 4693 HDFS Write: 116 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 260 msec
OK
user_id tack_id radio   is_skipped
Time taken: 17.829 seconds



Step-3: View the partitioned data
hive> SELECT * FROM RADIO_PARTITION
    > ;
OK
radio_partition.user_id radio_partition.track_id        radio_partition.radio   radio_partition.is_skipped      radio_partition.is_shared
34567   521     0       0       1
23456   225     0       0       1
Time taken: 0.142 seconds, Fetched: 2 row(s)
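Only the IS_SHARED=1 rows were loaded above; the remaining rows would need a second static INSERT. A minimal sketch of loading the other partition and then listing the partitions that exist on the table (the listing noted in the comment is what we would expect, not captured output):

INSERT OVERWRITE TABLE RADIO_PARTITION PARTITION (IS_SHARED=0) SELECT USER_ID,TRACK_ID,RADIO,IS_SKIPPED FROM RADIO WHERE IS_SHARED=0;

SHOW PARTITIONS RADIO_PARTITION;
-- expected to list: is_shared=0 and is_shared=1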

2. Using a new column:

In this type of partition, we use a new partition column that is not present in the source table; its value is supplied at insert time.

Example:

Step-1: Create a Partitioned Hive table

hive> CREATE TABLE IF NOT EXISTS RADIO_PARTITION1(USER_ID STRING,TRACK_ID INT,IS_SHARED INT,RADIO INT,IS_SKIPPED INT) PARTITIONED BY (YEAR STRING);

Step-2: Insert data into the table

hive> INSERT OVERWRITE TABLE RADIO_PARTITION1 PARTITION (YEAR='2017') SELECT * FROM RADIO;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20170807122740_fd6767aa-8272-49d9-9034-4c5b4b45d913
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1501077058026_0058, Tracking URL = http://Manohar:8088/proxy/application_1501077058026_0058/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job  -kill job_1501077058026_0058
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2017-08-07 12:27:48,486 Stage-1 map = 0%,  reduce = 0%
2017-08-07 12:27:54,886 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.47 sec
MapReduce Total cumulative CPU time: 1 seconds 470 msec
Ended Job = job_1501077058026_0058
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://localhost:9000/user/hive/warehouse/test.db/radio_partition1/year=2017/.hive-staging_hive_2017-08-07_12-27-40_364_5142052446115216443-1/-ext-10000
Loading data to table test.radio_partition1 partition (year=2017)
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 1.47 sec   HDFS Read: 4418 HDFS Write: 151 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 470 msec
OK
radio.user_id   radio.tack_id   radio.is_shared radio.radio     radio.is_skipped
Time taken: 15.997 seconds




Step-3: View the partitioned data:

hive> SELECT * FROM RADIO_PARTITION1;
OK
radio_partition1.user_id        radio_partition1.track_id       radio_partition1.is_shared      radio_partition1.radio  radio_partition1.is_skipped     radio_partition1.year
12345   521     0       1       0       2017
34567   521     1       0       0       2017
12345   222     0       1       1       2017
23456   225     1       0       0       2017
Time taken: 0.126 seconds, Fetched: 4 row(s)


Dynamic Partition:
  • The partition column values are known only at RUN/EXECUTION TIME.
  • We use dynamic partitioning while loading data from an existing table that is not partitioned.
  • We cannot perform ALTER on a dynamic partition.
  • Dynamic partitioning can be performed on Hive external tables and managed tables.

NOTE: In dynamic partitioning, the partition column of the partitioned Hive table must come last in the SELECT list from the existing table (here, IS_SKIPPED is the last column of RADIO1, so SELECT * works); see the sketch below.
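Equivalently, the insert in Step-5 below can be written with an explicit column list; the only requirement is that the dynamic partition column comes last (a minimal sketch using the tables created in the following steps):

INSERT INTO TABLE RADIO_PARTITION_DYNAMIC PARTITION (IS_SKIPPED) SELECT USER_ID,TRACK_ID,IS_SHARED,RADIO,IS_SKIPPED FROM RADIO1;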

Hive table creation:

Step-1: Create a Hive table

hive> CREATE TABLE IF NOT EXISTS RADIO1(USER_ID STRING,TRACK_ID INT,IS_SHARED INT,RADIO INT,IS_SKIPPED INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
OK
Time taken: 0.029 seconds

Step-2: Load data into the Hive table

hive> LOAD DATA LOCAL INPATH '/home/hadoop/Mano/Radio_log.txt' INTO TABLE RADIO1;
Loading data to table test.radio1
OK
Time taken: 0.273 seconds

DATA SET OF Radio_log.txt:
12345|521|0|1|0
34567|521|1|0|0
12345|222|0|1|1
23456|225|1|0|0

Step-3: Before creating the partitioned table in Hive, first set the properties required for dynamic partitioning

SET hive.exec.dynamic.partition=true;

SET hive.exec.dynamic.partition.mode=nonstrict;
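Optionally, the limits on how many dynamic partitions a single statement may create can also be adjusted. These are standard Hive settings; the values shown are illustrative and not required for this small data set:

SET hive.exec.max.dynamic.partitions=1000;
SET hive.exec.max.dynamic.partitions.pernode=100;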

Step-4: Create a partitioned table in hive

hive> CREATE TABLE IF NOT EXISTS RADIO_PARTITION_DYNAMIC(USER_ID STRING,TRACK_ID INT,IS_SHARED INT,RADIO INT)PARTITIONED BY (IS_SKIPPED INT);
OK
Time taken: 0.114 seconds

Step-5: Insert data into the partitioned table

hive> INSERT INTO TABLE RADIO_PARTITION_DYNAMIC PARTITION (IS_SKIPPED) SELECT * FROM RADIO1;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20170807124755_4ab2f92f-9008-4dc4-b3f5-9f9db1256964
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1501077058026_0059, Tracking URL = http://Manohar:8088/proxy/application_1501077058026_0059/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job  -kill job_1501077058026_0059
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2017-08-07 12:48:04,052 Stage-1 map = 0%,  reduce = 0%
2017-08-07 12:48:10,515 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.53 sec
MapReduce Total cumulative CPU time: 2 seconds 530 msec
Ended Job = job_1501077058026_0059
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://localhost:9000/user/hive/warehouse/test.db/radio_partition_dynamic/.hive-staging_hive_2017-08-07_12-47-55_730_8469767265901029936-1/-ext-10000
Loading data to table test.radio_partition_dynamic partition (is_skipped=null)

Loaded : 2/2 partitions.
         Time taken to load dynamic partitions: 0.248 seconds
         Time taken for adding to write entity : 0.0 seconds
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 2.53 sec   HDFS Read: 5195 HDFS Write: 261 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 530 msec
OK
_col0   _col1   _col2   _col3   _col4
Time taken: 16.462 seconds
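Step-6: Verify the dynamically created partitions

The log above reports "Loaded : 2/2 partitions", so each distinct IS_SKIPPED value (0 and 1) became its own partition. A minimal sketch of verifying this (the expected listing is noted as a comment, not captured output):

SHOW PARTITIONS RADIO_PARTITION_DYNAMIC;
-- expected to list: is_skipped=0 and is_skipped=1

SELECT * FROM RADIO_PARTITION_DYNAMIC;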



Advantages of Hive partitioning:
  1. The execution load is distributed horizontally across partitions.
  2. There is no need to scan the entire table for a query that filters on the partition column; only the matching partition folders are read (as illustrated in the sketch below).

Disadvantages of Hive partitioning:
  1. There is a possibility of creating too many folders (and small files) in HDFS, which puts an extra burden on the NameNode's metadata.
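For example, with RADIO_PARTITION1 partitioned by YEAR, a filter on the partition column lets Hive prune partitions and read only the matching folder (a minimal sketch):

SELECT * FROM RADIO_PARTITION1 WHERE YEAR = '2017';
-- Only the .../radio_partition1/year=2017/ directory is read, not the whole table.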
