Share my learning's: 8)Hive bucketing:

Default Partition:

Default Partition in Hive is called as Bucketing.

Usually Partitioning in Hive offers a way of segregating hive table data into multiple files/directories. But partitioning gives effective results when,

1.There are limited number of partitions,

2.Comparatively equal sized partitions.

Hive bucketing:

Hive bucketing is responsible for dividing the data into number of equal parts
If you want to use bucketing in hive then you should use CLUSTERED BY (Col) command while creating a table in Hive
We can perform Hive bucketing concept on Hive Managed tables or External tables
We can perform Hive bucketing optimization only on one column only not more than one.
The value of this column will be hashed by a user-defined number into buckets.

Advantages with Hive Bucketing:
Due to equal volumes of data in each partition, joins at Map side will be quicker.
Faster query response like partitioning

Disadvantages with Hive Bucketing :
You can define number of buckets during table creation but loading of equal volume of data has to be done manually by programmers.

Hive table creation:
Step-1: Create a Hive table

CREATE TABLE IF NOT EXISTS RADIO(USER_ID STRING,TRACK_ID INT,IS_SHARED INT,RADIO INT,IS_SKIPPED INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' STORED AS TEXTFILE;

Step-2: Load value into the Hive table
LOAD DATA LOCAL INPATH '/home/hadoop/Mano/Radio_log.txt' INTO TABLE RADIO;

Step-3: First set the property before create bucketing table in hive
set hive.enforce.bucketing =true;

Step-4: Create a bucketing table in Hive

CREATE TABLE IF NOT EXISTS RADIO_BUCKET(USER_ID STRING,TRACK_ID INT,IS_SHARED INT,RADIO INT,IS_SKIPPED INT) CLUSTERED BY (IS_SHARED) INTO 2 BUCKETS;

Step-5: Insert the value into the bucketing table

hive> INSERT OVERWRITE TABLE RADIO_BUCKET SELECT * FROM RADIO;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20170807130351_a9453218-2518-4527-a61d-7697f282dd19
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 2
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1501077058026_0060, Tracking URL = http://Manohar:8088/proxy/application_1501077058026_0060/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job -kill job_1501077058026_0060
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
2017-08-07 13:03:59,493 Stage-1 map = 0%, reduce = 0%
2017-08-07 13:04:04,985 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.77 sec
2017-08-07 13:04:14,735 Stage-1 map = 100%, reduce = 50%, Cumulative CPU 3.94 sec
2017-08-07 13:04:15,785 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.04 sec
MapReduce Total cumulative CPU time: 6 seconds 40 msec
Ended Job = job_1501077058026_0060
Loading data to table test.radio_bucket
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 2 Cumulative CPU: 6.04 sec HDFS Read: 13297 HDFS Write: 210 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 40 msec
OK
radio.user_id radio.tack_id radio.is_shared radio.radio radio.is_skipped
Time taken: 26.662 seconds

Step-6: View the value of first bucket in the bucketing table
hive> SELECT * FROM RADIO_BUCKET TABLESAMPLE(BUCKET 1 OUT OF 2 ON IS_SHARED);
OK
radio_bucket.user_id radio_bucket.track_id radio_bucket.is_shared radio_bucket.radio radio_bucket.is_skipped
12345 222 0 1 1
12345 521 0 1 0
Time taken: 0.146 seconds, Fetched: 2 row(s)

Step-7: View the value of SECOND bucket in the bucketing table

hive> SELECT * FROM RADIO_BUCKET TABLESAMPLE(BUCKET 2 OUT OF 2 ON IS_SHARED);
OK
radio_bucket.user_id radio_bucket.track_id radio_bucket.is_shared radio_bucket.radio radio_bucket.is_skipped
23456 225 1 0 0
34567 521 1 0 0
Time taken: 0.076 seconds, Fetched: 2 row(s)

Step-8: View the 10% of value in the bucketing table

hive> SELECT * FROM RADIO_BUCKET TABLESAMPLE(10 PERCENT);
OK
radio_bucket.user_id radio_bucket.track_id radio_bucket.is_shared radio_bucket.radio radio_bucket.is_skipped
12345 222 0 1 1
Time taken: 0.049 seconds, Fetched: 1 row(s)

Step-9: View the value in limit 2

hive> SELECT * FROM RADIO_BUCKET LIMIT 2;
OK
radio_bucket.user_id radio_bucket.track_id radio_bucket.is_shared radio_bucket.radio radio_bucket.is_skipped
12345 222 0 1 1
12345 521 0 1 0
Time taken: 0.092 seconds, Fetched: 2 row(s)

Share my learning's

Thursday, 3 August 2017

8)Hive bucketing:

No comments:

Post a Comment

Fundamentals of Python programming