Thursday 3 August 2017

7) Hive - Partitioning

Partitioning:

The main purpose of partitioning is to improve the performance of query processing.

Rationale behind the partition concept:
  • Consider a simple query with a filter condition, such as SELECT * FROM table_name WHERE column_name = "some_value". Even though the query is simple, Hive still reads the entire data set, and this becomes a bottleneck for MapReduce jobs over large data sets.
  • We can overcome this issue by implementing partitions in Hive. Hive makes partitioning easy to implement through the automatic partition scheme applied when the table is created.
  • Hive partitioning divides a large amount of data into a number of sub-folders based on the values of the table's partition columns.

Note: If you want to use partitioning in Hive, you should specify the PARTITIONED BY (COL1, COL2, …etc.) clause while creating the Hive table.
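For reference, each partition becomes its own sub-directory under the table's directory in the warehouse. For example, a table RADIO_PARTITION partitioned by IS_SHARED would be laid out roughly as below (the path assumes the default warehouse location and the test database used later in this post; the is_shared=0 directory is shown only for illustration):

hdfs://localhost:9000/user/hive/warehouse/test.db/radio_partition/is_shared=0/
hdfs://localhost:9000/user/hive/warehouse/test.db/radio_partition/is_shared=1/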

Types of Partition:

Custom Partition:
Custom partitions are user-defined partitions.

i) Static Partition:
  • The partition column values are known at COMPILE TIME (they are specified explicitly in the query).
  • Static partitions are usually preferred when loading big files into Hive tables.
  • We can alter (add or drop) partitions in static partitioning; see the sketch after this list.
  • Static partitioning can be performed on Hive managed tables or external tables.
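A minimal sketch of altering static partitions, using the RADIO_PARTITION table created later in this section (the partition value 0 is shown only to illustrate the syntax):

ALTER TABLE RADIO_PARTITION ADD PARTITION (IS_SHARED=0);
ALTER TABLE RADIO_PARTITION DROP PARTITION (IS_SHARED=0);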

Hive table creation:

Step-1: Create a Hive table

CREATE TABLE IF NOT EXISTS RADIO(USER_ID STRING,TRACK_ID INT,IS_SHARED INT,RADIO INT,IS_SKIPPED INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' STORED AS TEXTFILE;

Step-2: Load data into the Hive table

LOAD DATA LOCAL INPATH '/home/hadoop/Mano/Radio_log.txt' INTO TABLE RADIO;

DATA SET OF Radio_log.txt:

12345|521|0|1|0
34567|521|1|0|0
12345|222|0|1|1
23456|225|1|0|0


1. Using an existing column:
In this type of partition, the partition column is one that is already present in the source table. Note that the partition column (here IS_SHARED) is moved out of the regular column list and into the PARTITIONED BY clause.

Example:

Step-1: Create a Hive partition table

hive> CREATE TABLE IF NOT EXISTS RADIO_PARTITION(USER_ID STRING,TRACK_ID INT,RADIO INT,IS_SKIPPED INT) PARTITIONED BY (IS_SHARED INT);
OK
Time taken: 0.169 seconds

Step-2: Insert data into the partitioned table
INSERT OVERWRITE TABLE RADIO_PARTITION PARTITION (IS_SHARED=1) SELECT USER_ID,TRACK_ID,RADIO,IS_SKIPPED FROM RADIO WHERE IS_SHARED=1;

OUTPUT:

hive> INSERT OVERWRITE TABLE RADIO_PARTITION PARTITION (IS_SHARED=1) SELECT USER_ID,TACK_ID,RADIO,IS_SKIPPED FROM RADIO WHERE IS_SHARED=1;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20170807121225_3997eead-24be-4591-a11c-b64274216160
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1501077058026_0057, Tracking URL = http://Manohar:8088/proxy/application_1501077058026_0057/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job  -kill job_1501077058026_0057
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2017-08-07 12:12:34,007 Stage-1 map = 0%,  reduce = 0%
2017-08-07 12:12:40,470 Stage-1 map = 100%,  reduce = 0%
MapReduce Total cumulative CPU time: 2 seconds 260 msec
Ended Job = job_1501077058026_0057
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://localhost:9000/user/hive/warehouse/test.db/radio_partition/is_shared=1/.hive-staging_hive_2017-08-07_12-12-25_461_889701329221204816-1/-ext-10000
Loading data to table test.radio_partition partition (is_shared=1)
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 2.26 sec   HDFS Read: 4693 HDFS Write: 116 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 260 msec
OK
user_id tack_id radio   is_skipped
Time taken: 17.829 seconds



Step-3: View the partitioned data
hive> SELECT * FROM RADIO_PARTITION
    > ;
OK
radio_partition.user_id radio_partition.track_id        radio_partition.radio   radio_partition.is_skipped      radio_partition.is_shared
34567   521     0       0       1
23456   225     0       0       1
Time taken: 0.142 seconds, Fetched: 2 row(s)
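Only the IS_SHARED=1 rows were loaded above; the remaining rows would need a second static INSERT. A minimal sketch of loading the other partition and then listing the partitions that exist on the table (the listing noted in the comment is what we would expect, not captured output):

INSERT OVERWRITE TABLE RADIO_PARTITION PARTITION (IS_SHARED=0) SELECT USER_ID,TRACK_ID,RADIO,IS_SKIPPED FROM RADIO WHERE IS_SHARED=0;

SHOW PARTITIONS RADIO_PARTITION;
-- expected to list: is_shared=0 and is_shared=1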

2. Using a new column:

In this type of partition, we use a new partition column that is not present in the source table; its value is supplied at insert time.

Example:

Step-1: Create a Partitioned Hive table

hive> CREATE TABLE IF NOT EXISTS RADIO_PARTITION1(USER_ID STRING,TRACK_ID INT,IS_SHARED INT,RADIO INT,IS_SKIPPED INT) PARTITIONED BY (YEAR STRING);

Step-2: Insert data into the table

hive> INSERT OVERWRITE TABLE RADIO_PARTITION1 PARTITION (YEAR='2017') SELECT * FROM RADIO;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20170807122740_fd6767aa-8272-49d9-9034-4c5b4b45d913
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1501077058026_0058, Tracking URL = http://Manohar:8088/proxy/application_1501077058026_0058/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job  -kill job_1501077058026_0058
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2017-08-07 12:27:48,486 Stage-1 map = 0%,  reduce = 0%
2017-08-07 12:27:54,886 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.47 sec
MapReduce Total cumulative CPU time: 1 seconds 470 msec
Ended Job = job_1501077058026_0058
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://localhost:9000/user/hive/warehouse/test.db/radio_partition1/year=2017/.hive-staging_hive_2017-08-07_12-27-40_364_5142052446115216443-1/-ext-10000
Loading data to table test.radio_partition1 partition (year=2017)
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 1.47 sec   HDFS Read: 4418 HDFS Write: 151 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 470 msec
OK
radio.user_id   radio.tack_id   radio.is_shared radio.radio     radio.is_skipped
Time taken: 15.997 seconds




Step-3: View the partitioned data:

hive> SELECT * FROM RADIO_PARTITION1;
OK
radio_partition1.user_id        radio_partition1.track_id       radio_partition1.is_shared      radio_partition1.radio  radio_partition1.is_skipped     radio_partition1.year
12345   521     0       1       0       2017
34567   521     1       0       0       2017
12345   222     0       1       1       2017
23456   225     1       0       0       2017
Time taken: 0.126 seconds, Fetched: 4 row(s)


Dynamic Partition:
  • The partition column values are known only at RUN/EXECUTION TIME.
  • We use dynamic partitioning while loading data from an existing table that is not partitioned.
  • We cannot perform ALTER on a dynamic partition.
  • Dynamic partitioning can be performed on Hive external tables and managed tables.

NOTE: In dynamic partitioning, the partition column of the partitioned Hive table must come last in the SELECT list from the existing table (here, IS_SKIPPED is the last column of RADIO1, so SELECT * works); see the sketch below.
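Equivalently, the insert in Step-5 below can be written with an explicit column list; the only requirement is that the dynamic partition column comes last (a minimal sketch using the tables created in the following steps):

INSERT INTO TABLE RADIO_PARTITION_DYNAMIC PARTITION (IS_SKIPPED) SELECT USER_ID,TRACK_ID,IS_SHARED,RADIO,IS_SKIPPED FROM RADIO1;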

Hive table creation:

Step-1: Create a Hive table

hive> CREATE TABLE IF NOT EXISTS RADIO1(USER_ID STRING,TRACK_ID INT,IS_SHARED INT,RADIO INT,IS_SKIPPED INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
OK
Time taken: 0.029 seconds

Step-2: Load data into the Hive table

hive> LOAD DATA LOCAL INPATH '/home/hadoop/Mano/Radio_log.txt' INTO TABLE RADIO1;
Loading data to table test.radio1
OK
Time taken: 0.273 seconds

DATA SET OF Radio_log.txt:
12345|521|0|1|0
34567|521|1|0|0
12345|222|0|1|1
23456|225|1|0|0

Step-3: Before creating the partitioned table in Hive, first set the properties required for dynamic partitioning

SET hive.exec.dynamic.partition=true;

SET hive.exec.dynamic.partition.mode=nonstrict;
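Optionally, the limits on how many dynamic partitions a single statement may create can also be adjusted. These are standard Hive settings; the values shown are illustrative and not required for this small data set:

SET hive.exec.max.dynamic.partitions=1000;
SET hive.exec.max.dynamic.partitions.pernode=100;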

Step-4: Create a partitioned table in hive

hive> CREATE TABLE IF NOT EXISTS RADIO_PARTITION_DYNAMIC(USER_ID STRING,TRACK_ID INT,IS_SHARED INT,RADIO INT)PARTITIONED BY (IS_SKIPPED INT);
OK
Time taken: 0.114 seconds

Step-5: Insert data into the partitioned table

hive> INSERT INTO TABLE RADIO_PARTITION_DYNAMIC PARTITION (IS_SKIPPED) SELECT * FROM RADIO1;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20170807124755_4ab2f92f-9008-4dc4-b3f5-9f9db1256964
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1501077058026_0059, Tracking URL = http://Manohar:8088/proxy/application_1501077058026_0059/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job  -kill job_1501077058026_0059
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2017-08-07 12:48:04,052 Stage-1 map = 0%,  reduce = 0%
2017-08-07 12:48:10,515 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.53 sec
MapReduce Total cumulative CPU time: 2 seconds 530 msec
Ended Job = job_1501077058026_0059
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://localhost:9000/user/hive/warehouse/test.db/radio_partition_dynamic/.hive-staging_hive_2017-08-07_12-47-55_730_8469767265901029936-1/-ext-10000
Loading data to table test.radio_partition_dynamic partition (is_skipped=null)

Loaded : 2/2 partitions.
         Time taken to load dynamic partitions: 0.248 seconds
         Time taken for adding to write entity : 0.0 seconds
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 2.53 sec   HDFS Read: 5195 HDFS Write: 261 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 530 msec
OK
_col0   _col1   _col2   _col3   _col4
Time taken: 16.462 seconds
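Step-6: Verify the dynamically created partitions

The log above reports "Loaded : 2/2 partitions", so each distinct IS_SKIPPED value (0 and 1) became its own partition. A minimal sketch of verifying this (the expected listing is noted as a comment, not captured output):

SHOW PARTITIONS RADIO_PARTITION_DYNAMIC;
-- expected to list: is_skipped=0 and is_skipped=1

SELECT * FROM RADIO_PARTITION_DYNAMIC;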



Advantages of Hive partitioning:
  1. The execution load is distributed horizontally across partitions.
  2. There is no need to scan the entire table for a query that filters on the partition column; only the matching partition folders are read (as illustrated in the sketch below).

Disadvantages of Hive partitioning:
  1. There is a possibility of creating too many folders (and small files) in HDFS, which puts an extra burden on the NameNode's metadata.
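For example, with RADIO_PARTITION1 partitioned by YEAR, a filter on the partition column lets Hive prune partitions and read only the matching folder (a minimal sketch):

SELECT * FROM RADIO_PARTITION1 WHERE YEAR = '2017';
-- Only the .../radio_partition1/year=2017/ directory is read, not the whole table.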
