Thursday 3 August 2017

8)Hive bucketing:

Default Partition:

Default Partition in Hive is called as Bucketing.

Usually Partitioning in Hive offers a way of segregating hive table data into multiple files/directories. But partitioning gives effective results when,

1.There are limited number of partitions,

2.Comparatively equal sized partitions.

Hive bucketing:

  • Hive bucketing is responsible for dividing the data into number of equal parts
  • If you want to use bucketing in hive then you should use CLUSTERED BY (Col) command while creating a table in Hive
  • We can perform Hive bucketing concept on Hive Managed tables or External tables
  • We can perform Hive bucketing optimization only on one column only not more than one.
  • The value of this column will be hashed by a user-defined number into buckets.

Advantages with Hive Bucketing: 
Due to equal volumes of data in each partition, joins at Map side will be quicker.
Faster query response like partitioning

Disadvantages with Hive Bucketing :
You can define number of buckets during table creation but loading of equal volume of data has to be done manually by programmers.

Hive table creation:
Step-1: Create a Hive table

CREATE TABLE IF NOT EXISTS RADIO(USER_ID STRING,TRACK_ID INT,IS_SHARED INT,RADIO INT,IS_SKIPPED INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' STORED AS TEXTFILE;

Step-2: Load value into the Hive table
LOAD DATA LOCAL INPATH '/home/hadoop/Mano/Radio_log.txt' INTO TABLE RADIO;

Step-3: First set the property before create bucketing table in hive
set hive.enforce.bucketing =true;

Step-4: Create a bucketing table in Hive

CREATE TABLE IF NOT EXISTS RADIO_BUCKET(USER_ID STRING,TRACK_ID INT,IS_SHARED INT,RADIO INT,IS_SKIPPED INT) CLUSTERED BY (IS_SHARED) INTO 2 BUCKETS;

Step-5: Insert the value into the bucketing table

hive> INSERT OVERWRITE TABLE RADIO_BUCKET SELECT * FROM RADIO;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20170807130351_a9453218-2518-4527-a61d-7697f282dd19
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 2
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1501077058026_0060, Tracking URL = http://Manohar:8088/proxy/application_1501077058026_0060/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job  -kill job_1501077058026_0060
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
2017-08-07 13:03:59,493 Stage-1 map = 0%,  reduce = 0%
2017-08-07 13:04:04,985 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.77 sec
2017-08-07 13:04:14,735 Stage-1 map = 100%,  reduce = 50%, Cumulative CPU 3.94 sec
2017-08-07 13:04:15,785 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.04 sec
MapReduce Total cumulative CPU time: 6 seconds 40 msec
Ended Job = job_1501077058026_0060
Loading data to table test.radio_bucket
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 2   Cumulative CPU: 6.04 sec   HDFS Read: 13297 HDFS Write: 210 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 40 msec
OK
radio.user_id   radio.tack_id   radio.is_shared radio.radio     radio.is_skipped
Time taken: 26.662 seconds

Step-6: View the value of first bucket in the bucketing table
hive> SELECT * FROM RADIO_BUCKET TABLESAMPLE(BUCKET 1 OUT OF 2 ON IS_SHARED);
OK
radio_bucket.user_id    radio_bucket.track_id   radio_bucket.is_shared  radio_bucket.radio      radio_bucket.is_skipped
12345   222     0       1       1
12345   521     0       1       0
Time taken: 0.146 seconds, Fetched: 2 row(s)



Step-7: View the value of SECOND bucket in the bucketing table

hive> SELECT * FROM RADIO_BUCKET TABLESAMPLE(BUCKET 2 OUT OF 2 ON IS_SHARED);
OK
radio_bucket.user_id    radio_bucket.track_id   radio_bucket.is_shared  radio_bucket.radio      radio_bucket.is_skipped
23456   225     1       0       0
34567   521     1       0       0
Time taken: 0.076 seconds, Fetched: 2 row(s)

Step-8: View the 10% of value in the bucketing table

hive> SELECT * FROM RADIO_BUCKET TABLESAMPLE(10 PERCENT);
OK
radio_bucket.user_id    radio_bucket.track_id   radio_bucket.is_shared  radio_bucket.radio      radio_bucket.is_skipped
12345   222     0       1       1
Time taken: 0.049 seconds, Fetched: 1 row(s)

Step-9: View the value in limit 2

hive> SELECT * FROM RADIO_BUCKET LIMIT 2;
OK
radio_bucket.user_id    radio_bucket.track_id   radio_bucket.is_shared  radio_bucket.radio      radio_bucket.is_skipped
12345   222     0       1       1
12345   521     0       1       0
Time taken: 0.092 seconds, Fetched: 2 row(s)

No comments:

Post a Comment

Fundamentals of Python programming

Fundamentals of Python programming: Following below are the fundamental constructs of Python programming: Python Data types Python...