Partitioning:
The main concept of partitioning is to improve the performance of processing.
The idea behind the partition concept:
- Consider a simple query with a condition in the WHERE clause (SELECT * FROM table_name WHERE column_name = 'some_value'). Even though it is a simple query, it reads the entire data set, and this becomes a bottleneck for MapReduce jobs over large data sets.
- We can overcome this issue by implementing partitions in Hive. Hive makes partitions easy to implement through the automatic partition scheme declared when the table is created.
- Hive partitioning divides a large amount of data into a number of folders based on the values of the table's partition columns (see the illustration after this list).
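As a rough illustration (a sketch, not output from this environment: the warehouse path is the one that appears in the job logs later in this post, the RADIO_PARTITION table is created in the static-partition example below, and the is_shared=0 folder is hypothetical since that example loads only one partition), a table partitioned on IS_SHARED is stored as one sub-folder per partition value, and a query that filters on that column reads only the matching folder:

-- Illustrative partition folders under the warehouse directory
--   /user/hive/warehouse/test.db/radio_partition/is_shared=0/
--   /user/hive/warehouse/test.db/radio_partition/is_shared=1/
-- A filter on the partition column touches only the is_shared=1 folder:
SELECT * FROM RADIO_PARTITION WHERE IS_SHARED = 1;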
Note: If you want to use partitioning in Hive, you must add a PARTITIONED BY (COL1, COL2, ...) clause when creating the Hive table.
Types of Partition:
Custom Partition:
Custom partitions are user-defined partitions.
i) Static Partition:
- The partition column values are known at COMPILE TIME (they are hard-coded in the statement).
- Static partitions are usually preferred when loading big files into Hive tables.
- We can alter partitions (add or drop them explicitly) under static partitioning; see the ALTER sketch after this list.
- Static partitioning can be performed on both Hive managed tables and external tables.
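For example, because the partition values are known up front, static partitions can be added or dropped explicitly (a sketch against the RADIO_PARTITION table created later in this post; the value 0 is only illustrative):

-- Add an empty partition whose value is fixed in the statement
ALTER TABLE RADIO_PARTITION ADD PARTITION (IS_SHARED=0);
-- Drop that partition (for a managed table this also removes its data)
ALTER TABLE RADIO_PARTITION DROP PARTITION (IS_SHARED=0);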
Hive table creation:
Step-1: Create a Hive table
CREATE TABLE IF NOT EXISTS RADIO(USER_ID STRING,TRACK_ID INT,IS_SHARED INT,RADIO INT,IS_SKIPPED INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
Step-2: Load value into the Hive table
LOAD DATA LOCAL INPATH '/home/hadoop/Mano/Radio_log.txt' INTO TABLE RADIO;
DATA SET OF Radio_log.txt (fields: USER_ID|TRACK_ID|IS_SHARED|RADIO|IS_SKIPPED):
12345|521|0|1|0
34567|521|1|0|0
12345|222|0|1|1
23456|225|1|0|0
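To confirm that the '|' delimiter was parsed correctly, the table can be queried directly (a hedged sketch; the four rows above should come back as separate columns):

hive> SELECT * FROM RADIO;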
1. Using an existing column:
In this type of partition, we use a partition column that is already present in the source table.
Example:
Step-1: Create a Hive partition table
hive> CREATE TABLE IF NOT EXISTS RADIO_PARTITION(USER_ID STRING,TRACK_ID INT,RADIO INT,IS_SKIPPED INT) PARTITIONED BY (IS_SHARED INT);
OK
Time taken: 0.169 seconds
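A DESCRIBE on the new table would list IS_SHARED under the partition information rather than among the ordinary data columns (a sketch; output not reproduced here):

hive> DESCRIBE FORMATTED RADIO_PARTITION;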
Step-2: Insert values into the partitioned table
INSERT OVERWRITE TABLE RADIO_PARTITION PARTITION (IS_SHARED=1) SELECT USER_ID,TRACK_ID,RADIO,IS_SKIPPED FROM RADIO WHERE IS_SHARED=1;
OUTPUT:
hive> INSERT OVERWRITE TABLE RADIO_PARTITION PARTITION (IS_SHARED=1) SELECT USER_ID,TRACK_ID,RADIO,IS_SKIPPED FROM RADIO WHERE IS_SHARED=1;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20170807121225_3997eead-24be-4591-a11c-b64274216160
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1501077058026_0057, Tracking URL = http://Manohar:8088/proxy/application_1501077058026_0057/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job -kill job_1501077058026_0057
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2017-08-07 12:12:34,007 Stage-1 map = 0%, reduce = 0%
2017-08-07 12:12:40,470 Stage-1 map = 100%, reduce = 0%
MapReduce Total cumulative CPU time: 2 seconds 260 msec
Ended Job = job_1501077058026_0057
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://localhost:9000/user/hive/warehouse/test.db/radio_partition/is_shared=1/.hive-staging_hive_2017-08-07_12-12-25_461_889701329221204816-1/-ext-10000
Loading data to table test.radio_partition partition (is_shared=1)
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 2.26 sec HDFS Read: 4693 HDFS Write: 116 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 260 msec
OK
user_id track_id radio is_skipped
Time taken: 17.829 seconds
Step-3: View the partitioned values
hive> SELECT * FROM RADIO_PARTITION;
OK
radio_partition.user_id radio_partition.track_id radio_partition.radio radio_partition.is_skipped radio_partition.is_shared
34567 521 0 0 1
23456 225 0 0 1
Time taken: 0.142 seconds, Fetched: 2 row(s)
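To list the partitions that now exist for the table, SHOW PARTITIONS can be used (a sketch; for this example it should report a single partition, is_shared=1, but the output is not a captured log):

hive> SHOW PARTITIONS RADIO_PARTITION;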
2. Using a new column:
In this type of partition, we use a new partition column that is not present in the source table.
Example:
Step-1: Create a Partitioned Hive table
hive> CREATE TABLE IF NOT EXISTS RADIO_PARTITION1(USER_ID STRING,TRACK_ID INT,IS_SHARED INT,RADIO INT,IS_SKIPPED INT) PARTITIONED BY (YEAR STRING);
Step-2: Insert the values into the table
hive> INSERT OVERWRITE TABLE RADIO_PARTITION1 PARTITION (YEAR='2017') SELECT * FROM RADIO;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20170807122740_fd6767aa-8272-49d9-9034-4c5b4b45d913
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1501077058026_0058, Tracking URL = http://Manohar:8088/proxy/application_1501077058026_0058/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job -kill job_1501077058026_0058
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2017-08-07 12:27:48,486 Stage-1 map = 0%, reduce = 0%
2017-08-07 12:27:54,886 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.47 sec
MapReduce Total cumulative CPU time: 1 seconds 470 msec
Ended Job = job_1501077058026_0058
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://localhost:9000/user/hive/warehouse/test.db/radio_partition1/year=2017/.hive-staging_hive_2017-08-07_12-27-40_364_5142052446115216443-1/-ext-10000
Loading data to table test.radio_partition1 partition (year=2017)
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.47 sec HDFS Read: 4418 HDFS Write: 151 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 470 msec
OK
radio.user_id radio.track_id radio.is_shared radio.radio radio.is_skipped
Time taken: 15.997 seconds
Step-3: View the partitioned values:
hive> SELECT * FROM RADIO_PARTITION1;
OK
radio_partition1.user_id radio_partition1.track_id radio_partition1.is_shared radio_partition1.radio radio_partition1.is_skipped radio_partition1.year
12345 521 0 1 0 2017
34567 521 1 0 0 2017
12345 222 0 1 1 2017
23456 225 1 0 0 2017
Time taken: 0.126 seconds, Fetched: 4 row(s)
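Since the partition column behaves like an ordinary column in queries but is stored as a directory, filtering on YEAR reads only that year's folder (a hedged sketch):

hive> SELECT USER_ID, TRACK_ID FROM RADIO_PARTITION1 WHERE YEAR = '2017';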
ii) Dynamic Partition:
* The partition column values are known only at RUN/EXECUTION TIME.
* We use dynamic partitioning while loading data from an existing table that is not partitioned.
* We can't perform ALTER on a dynamic partition.
* Dynamic partitioning can be performed on Hive external tables and managed tables.
NOTE: In dynamic partitioning, the partition column of the partitioned Hive table must be the last column selected from the existing Hive table, as sketched below.
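In other words, if the insert in Step-5 below were written with an explicit column list instead of SELECT *, the partition column IS_SKIPPED would have to come last (a sketch of the same statement):

INSERT INTO TABLE RADIO_PARTITION_DYNAMIC PARTITION (IS_SKIPPED)
SELECT USER_ID, TRACK_ID, IS_SHARED, RADIO, IS_SKIPPED FROM RADIO1;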
Hive table creation:
Step-1: Create a Hive table
hive> CREATE TABLE IF NOT EXISTS RADIO1(USER_ID STRING,TRACK_ID INT,IS_SHARED INT,RADIO INT,IS_SKIPPED INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
OK
Time taken: 0.029 seconds
Step-2: Load value into the Hive table
hive> LOAD DATA LOCAL INPATH '/home/hadoop/Mano/Radio_log.txt' INTO TABLE RADIO1;
Loading data to table test.radio1
OK
Time taken: 0.273 seconds
DATA SET OF Radio_log.txt (fields: USER_ID|TRACK_ID|IS_SHARED|RADIO|IS_SKIPPED):
12345|521|0|1|0
34567|521|1|0|0
12345|222|0|1|1
23456|225|1|0|0
Step-3: Before creating the partitioned table in Hive, first set the properties required for dynamic partitioning (dynamic partitioning is disabled by default, and the default strict mode requires at least one statically specified partition column; nonstrict mode lets every partition column be resolved at run time):
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
Step-4: Create a partitioned table in hive
hive> CREATE TABLE IF NOT EXISTS RADIO_PARTITION_DYNAMIC(USER_ID STRING,TRACK_ID INT,IS_SHARED INT,RADIO INT)PARTITIONED BY (IS_SKIPPED INT);
OK
Time taken: 0.114 seconds
Step-5: Insert values into the partitioned table
hive> INSERT INTO TABLE RADIO_PARTITION_DYNAMIC PARTITION (IS_SKIPPED) SELECT * FROM RADIO1;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20170807124755_4ab2f92f-9008-4dc4-b3f5-9f9db1256964
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1501077058026_0059, Tracking URL = http://Manohar:8088/proxy/application_1501077058026_0059/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job -kill job_1501077058026_0059
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2017-08-07 12:48:04,052 Stage-1 map = 0%, reduce = 0%
2017-08-07 12:48:10,515 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.53 sec
MapReduce Total cumulative CPU time: 2 seconds 530 msec
Ended Job = job_1501077058026_0059
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://localhost:9000/user/hive/warehouse/test.db/radio_partition_dynamic/.hive-staging_hive_2017-08-07_12-47-55_730_8469767265901029936-1/-ext-10000
Loading data to table test.radio_partition_dynamic partition (is_skipped=null)
Loaded : 2/2 partitions.
Time taken to load dynamic partitions: 0.248 seconds
Time taken for adding to write entity : 0.0 seconds
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 2.53 sec HDFS Read: 5195 HDFS Write: 261 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 530 msec
OK
_col0 _col1 _col2 _col3 _col4
Time taken: 16.462 seconds
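As with the static examples, the loaded partitions can be verified (a sketch; the output for this run was not captured):

hive> SHOW PARTITIONS RADIO_PARTITION_DYNAMIC;
hive> SELECT * FROM RADIO_PARTITION_DYNAMIC;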
Advantages with Hive Partition:
- Distributes the execution load horizontally.
- No need to scan the entire table for a single record; a query that filters on the partition column reads only the matching partition folders (see the EXPLAIN sketch after this list).
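One way to check that only the intended partitions are being read is to inspect the query plan with EXPLAIN (a sketch; the plan text is version-dependent and not reproduced here):

hive> EXPLAIN SELECT * FROM RADIO_PARTITION1 WHERE YEAR = '2017';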
Disadvantages with Hive Partition:
- There is a possibility of creating too many folders in HDFS, which puts an extra burden on NameNode metadata (the limits sketched below can help cap this).
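Hive has configurable guards against creating too many dynamic partitions; the property names below are real, but the values shown are only illustrative (defaults vary by version):

-- Cap the total number of dynamic partitions one statement may create
SET hive.exec.max.dynamic.partitions=1000;
-- Cap the number of dynamic partitions created per mapper/reducer node
SET hive.exec.max.dynamic.partitions.pernode=100;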