Share my learning's: 6)Hive Data Definition Language(DDL)- Tables

Hive Data Definition Language(DDL)- Tables

Hive Tables:In Hive, Tables are collection of data records which have same schema for all the records.

Note:
Hive Tables data stored in HDFS and Metadata (Schema of the table) stored in RDBMS.
Hive Table Types:

Create/Drop/Truncate Table:

The below are sections which are going to try one by one:

Create Table

Managed and External Tables

Storage Formats

Row Formats & SerDe

Partitioned Tables

External Tables

Create Table As Select (CTAS)

Create Table Like

Bucketed Sorted Tables

Skewed Tables

Temporary Tables

Constraints

Drop Table

Truncate Table

Create Table:

Syntax:

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[SKEWED BY (col_name, col_name, ...)
     ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
     [STORED AS DIRECTORIES]
[
   [ROW FORMAT row_format]
   [STORED AS file_format]
     | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
[AS select_statement];

Usage of arguments:

TEMPORARY – Specified for creation of temporary tables

EXTERNAL – Specified only for external tables

IF NOT EXISTS – Suppresses error messages when a table already exists with same name and ignores creation of table again even if there is a schema difference between existing table and new table.

db_name – Optional but can be used to specify the table under a target database, if we are not already working under it.

COMMENT – we can add comments to table as well as to columns (within single quotes) to provide descriptive information to users.

constraint_specification - to define primary or foreign key for the table

PARTITIONED BY – This clause is useful to partition the tables based on particular columns.

CLUSTERED BY – This clause is used to provide more structure to tables and partitions.

SKEWED BY – This clause is useful to create skewed tables.

ROW FORMAT – This clause is used to specify the format of each row in the input data.
Note:
If data fields are delimited by certain characters we can use DELIMITED sub-clause or we need to provide a SERDE that can serialize or deserialize the input data records.

Syntax for DELIMITED Clause:

row_format
: DELIMITED
[FIELDS TERMINATED BY char [ESCAPED BY char]]
[COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char]
[LINES TERMINATED BY char]
[NULL DEFINED AS char]

Default values of DELIMITED Clause:

ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\001'
    COLLECTION ITEMS TERMINATED BY '\002'
    MAP KEYS TERMINATED BY '\003'
    LINES TERMINATED BY '\n'

Note:
Default field delimiter is Ctrl+A (octal representation – ‘\001’) (also represented as ^A),
Collection delimiter is Ctrl+B (‘\002’ or ^B)
Map keys are terminated by Ctrl+C (‘\003’ or ^C)
and lines terminated by new line feed ‘\n’.

Syntax for SERDE Clause:
row_format
:SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]

STORED AS – Storage file format can be specified in this clause

Available storage file formats:

SEQUENCEFILE
TEXTFILE
RCFILE
PARQUET
ORC
AVRO
INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname

STORED BY class_name [WITH SERDEPROPERTIES (…)]
It is an alternative to above two clauses (ROW FORMAT & STORED AS) to provide custom row format handler class_name and custom serde properties.

LOCATION – Directory location for table data

TBLPROPERTIES – Metadata key/value pairs can be tagged to the table.

Sample TBLPROPERTIES:
TBLPROPERTIES ("comment"="table_comment")
TBLPROPERTIES ("hbase.table.name"="table_name") //for hbase integration
TBLPROPERTIES ("immutable"="true") or ("immutable"="false")
TBLPROPERTIES ("orc.compress"="ZLIB") or ("orc.compress"="SNAPPY") or ("orc.compress"="NONE")
TBLPROPERTIES ("transactional"="true") or ("transactional"="false") default is "false"
TBLPROPERTIES ("NO_AUTO_COMPACTION"="true") or ("NO_AUTO_COMPACTION"="false"), the default is "false"
TBLPROPERTIES ("EXTERNAL"="TRUE") , To change the managed table to external table vice versa.
AS select_statement – AS clause is used to create table similar to the schema of the select_statement and also know as CTAS (Create Table AS) clause.

Create Table Like:
LIKE clause with CREATE TABLE can be used to create tables similar to another existing tables or views.

Syntax:
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
LIKE existing_table_or_view_name
[LOCATION hdfs_path];

1.Managed and External Tables:

Basically two types of Hive tables:

Managed Tables
External Tables

*Managed Tables:

Managed tables data is managed by Hive, moving data into its warehouse directory, configured via hive.metastore.warehouse.dir

Note: A managed table is stored under the hive.metastore.warehouse.dir path property, by default in a folder path similar to /apps/hive/warehouse/databasename.db/tablename/

If managed table is dropped both data and metadata (schema) are deleted. i.e. these tables are owned by Hive.

Less convenient to share with other tools like Pig, HBase etc, as these are maintained by Hive and data can be deleted without informing these tools

Example:

Example: Managed Table:

Dataset:
true|10|100|1000|10000|4.0|20.0|2.2222|1969-12-31 15:59:58.174|1970-01-01 00:00:00|hello|hello|k1:v1,k2:v2|100,200|10, "foo"
true|20|200|2000|20000|8.0|40.0|4.2222|1970-12-31 15:59:58.174|1971-01-01 00:00:00|||k3:v3,k4:v4|200,300|20, "bar"

Hive QL:

hive> CREATE TABLE IF NOT EXISTS MANAGED_TABLE
> (
> BOOL BOOLEAN,
> TI1 TINYINT,
> SMI SMALLINT,
> IN1 INT,
> BINT BIGINT,
> FL FLOAT,
> DB DOUBLE,
> DEC DECIMAL,
> TMS1 TIMESTAMP,
> TMS2 TIMESTAMP,
> DATA1 STRING,
> DATA2 VARCHAR(5),
> KMAP MAP<STRING,STRING>,
> VARAY ARRAY<INT>,
> STR STRUCT<C1:INT,NAME:STRING>
> )
> COMMENT "EXAMPLE TABLE THAT USES ALL THE DATA TYPES OF HIVE"
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '|'
> COLLECTION ITEMS TERMINATED BY ','
> MAP KEYS TERMINATED BY ':'
> STORED AS TEXTFILE;
OK
Time taken: 0.232 seconds

Output:

hive> LOAD DATA LOCAL INPATH '/home/hadoop/apache-hive-2.1.1-bin/examples/files/Done/alltypes.txt' INTO TABLE MANAGED_TABLE;

Loading data to table default.managed_table

OK

Time taken: 0.213 seconds

hive> SELECT * FROM MANAGED_TABLE;
OK
managed_table.bool managed_table.ti1 managed_table.smi managed_table.in1 managed_table.bint managed_table.fl managed_table.db managed_table.dec managed_table.tms1 managed_table.tms2 managed_table.data1 managed_table.data2 managed_table.kmap managed_table.varay managed_table.str
true 10 100 1000 10000 4.0 20.0 2 1969-12-31 15:59:58.174 1970-01-01 00:00:00 hello hello {"k1":"v1","k2":"v2"} [100,200] {"c1":10,"name":" \"foo\""}
true 20 200 2000 20000 8.0 40.0 4 1970-12-31 15:59:58.174 1971-01-01 00:00:00 {"k3":"v3","k4":"v4"} [200,300] {"c1":20,"name":" \"bar\""}
Time taken: 0.109 seconds, Fetched: 2 row(s)

Web UI:

*External Tables:

External tables are not managed or owned by Hive and data will not be copied into hive warehouse directory but maintained at external location

If external tables are dropped only the schema from metastore will be deleted but not the data files from external location.

Provides convenience to share the tables data with other tools like Pig, HBase, etc…

Note:

“Location” Clause is mandatory to create an external table
If the structure or partitioning of an external table is changed, an MSCK REPAIR TABLE table_name statement can be used to refresh metadata information.

Example: External Table

Load or Make the data exist on hdfs at any other directory apart from hive warehouse,

hadoop@Mano:~$ hadoop fs -put '/home/hadoop/Mano/alltypes2.txt' /Mano/External_table
17/09/19 07:52:57 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

hive> CREATE EXTERNAL TABLE IF NOT EXISTS MANO_BB.EXTERNAL_TABLE
> (
> BOOL BOOLEAN,
> TINY1 TINYINT,
> SMIN SMALLINT,
> DINT INT,
> BINT BIGINT,
> FL FLOAT,
> DB DOUBLE,
> DEC DECIMAL,
> TIMEST1 TIMESTAMP,
> TIMEST2 TIMESTAMP,
> DATA1 STRING,
> DATA2 VARCHAR(10),
> DATA3 CHAR(20),
> DMAP MAP<STRING,STRING>,
> DARR ARRAY<INT>,
> DSTRC STRUCT<C1:INT,C2:STRING>
> )
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '|'
> COLLECTION ITEMS TERMINATED BY ','
> MAP KEYS TERMINATED BY ':'
> STORED AS TEXTFILE
> LOCATION '/Mano/External_table/';

Output:
hive> SELECT * FROM MANO_BB.EXTERNAL_TABLE;
OK
external_table.bool external_table.tiny1 external_table.smin external_table.dint external_table.bint external_table.fl external_table.db external_table.dec external_table.timest1 external_table.timest2 external_table.data1 external_table.data2 external_table.data3 external_table.dmap external_table.darr external_table.dstrc
true 10 100 1000 10000 4.0 20.0 4 1969-12-31 15:59:58.174 NULL string hello hello {"k1":"v1","k2":"v2"} [100,200] {"c1":10,"c2":" \"foo\""}
false 20 200 2000 20000 8.0 40.0 2 1970-12-31 15:59:58.174 NULL abcd world world {"k3":"v3","k4":"v4"} [200,300] {"c1":20,"c2":" \"bar\""}
Time taken: 0.916 seconds, Fetched: 2 row(s)

Note:

Even though if we drop the external table, the data will exist always,

2.Storage/File Formats:

A file format is a way in which information is stored or encoded in a computer file. In Hive it refers to how records are stored inside the file.

As we are dealing with structured data, each record has to be its own structure. How records are encoded in a file defines a file format.
These file formats mainly vary between data encoding, compression rate, usage of space and disk I/O.

Note:

Hive supports both built-in and custom-developed file formats.
Hive does not verify whether the data that you are loading matches the schema for the table or not. However, it verifies if the file format matches the table definition or not.

File formats can be set on the property called hive.default.fileformat

Below are the some built in Hive formats:

STORED AS TEXTFILE
STORED AS SEQUENCEFILE
STORED AS RCFILE
STORED AS ORC
STORED AS AVRO
STORED AS PARQUET
STORED BY -- Need to check
INPUTFORMAT and OUTPUTFORMAT -- Need to check

Let's discuss each one by one,

1)STORED AS TEXTFILE:

In Hive if we define a table as TEXTFILE it can load data of from CSV (Comma Separated Values), delimited by Tabs, Spaces, and JSON data.

Note:

TEXTFILE format is the default

TEXTFILE format in Hive:

Syntax:

CREATE [TEMPORARY|EXTERNAL] TABLE IF NOT EXISTS DB_NAME.TABLE_ABLE(SCHEMA)
ROW FORMAT row_formats
STORED AS TEXTFILE;

Note:

TEXTFILE input and TEXTFILE output format are present in the Hadoop package

org.apache.hadoop.mapred.TextInputFormat
org.apache.hadoop.mapred.TextOutputFormat

Example:

Creating TEXTFILE

Hive QL:

hive> CREATE TABLE IF NOT EXISTS TEXTFILE_FORMAT
> (
> EMP_ID INT,
> NAME STRING,
> PRIMARY KEY(EMP_ID) DISABLE NOVALIDATE)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '|'
> LINES TERMINATED BY '\n'
> STORED AS TEXTFILE;

Since it's TEXTFILE format , so can load any data files, either any text file,csv,tabs and JSON files,

Loading Data: which is of text file
hive> LOAD DATA LOCAL INPATH '/home/hadoop/Mano/employee.dat' OVERWRITE INTO TABLE TEXTFILE_FORMAT;

hive> SELECT * FROM TEXTFILE_FORMAT;

2)STORED AS SEQUENCEFILE

Sequence files are flat files consisting of binary key-value pairs.

Note:
Files are in the binary format which can be split and the main use of these files is to club two or more smaller files and make them as a one sequence file.

There are three types of sequence files:

Uncompressed key/value records.
Record compressed key/value records – only ‘values’ are compressed here
Block compressed key/value records – both keys and values are collected in ‘blocks’ separately and compressed.

SEQUENCEFILE format in Hive:

Syntax:
CREATE [TEMPORARY|EXTERNAL] TABLE IF NOT EXISTS DB_NAME.TABLE_ABLE(SCHEMA)
ROW FORMAT row_formats
STORED AS SEQUENCEFILE;

SEQUENCEFILE input and output formats from the following packages:

org.apache.hadoop.mapred.SequenceFileInputFormat
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

Note:
Hive, it verifies if the file format matches the table definition or not.

Example issue:

If we try to load the text file into SEQUENCE FILE FORMAT table in hive, returns an error

hive> LOAD DATA LOCAL INPATH '/home/hadoop/Mano/employee.dat' INTO TABLE SEQUENCEFILE_FORMAT;

FAILED: SemanticException Unable to load data to destination table. Error: The file that you are trying to load does not match the file format of the destination table.

Example:
Creating SEQUENCEFILE

Hive QL:

hive> CREATE TABLE IF NOT EXISTS SEQUENCEFILE_FORMAT(EMP_ID INT,NAME STRING)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
> LINES TERMINATED BY '\n'
> STORED AS SEQUENCEFILE;

Loading Data: from another table or it should be binary file

INSERT OVERWRITE TABLE SEQUENCEFILE_FORMAT SELECT * FROM TEXTFILE_FORMAT;

hive> SELECT * FROM SEQUENCEFILE_FORMAT;

WEB UI: RC FORMAT - 483B

3)STORED AS RC:

RCFILE is Record Columnar File which is another type of binary file format which offers high compression rate on the top of the rows.

RCFILE format in Hive:

Syntax:
CREATE [TEMPORARY|EXTERNAL] TABLE IF NOT EXISTS DB_NAME.TABLE_ABLE(SCHEMA)
ROW FORMAT row_formats
STORED AS RCFILE;

RCFILE input and output formats from the following packages:

org.apache.hadoop.hive.ql.io.RCFileInputFormat
org.apache.hadoop.hive.ql.io.RCFileOutputFormat

RCFILE format in Hive:

Syntax:
CREATE [TEMPORARY|EXTERNAL] TABLE IF NOT EXISTS DB_NAME.TABLE_ABLE(SCHEMA)
ROW FORMAT row_formats
STORED AS RCFILE;

Example issue:
IF we try to load the text file into RCFILE FORMAT table in hive, returns an error

hive> LOAD DATA LOCAL INPATH '/home/hadoop/Mano/employee.dat' INTO TABLE RCFILE_FORMAT;
FAILED: SemanticException Unable to load data to destination table. Error: The file that you are trying to load does not match the file format of the destination table.

Example:
Creating RCFILE

Hive QL:
hive> CREATE TABLE IF NOT EXISTS RCFILE_FORMAT(EMP_ID INT,NAME STRING)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
> LINES TERMINATED BY '\n'
> STORED AS RCFILE;

Loading Data: from another table or it should be binary file

INSERT OVERWRITE TABLE RCFILE_FORMAT SELECT * FROM TEXTFILE_FORMAT;

WEB UI: RC FORMAT - 207B

4)STORED AS ORC:
ORC stands for Optimized Row Columnar which means it can store data in an optimized way than the other file formats.
ORC reduces the size of the original data up to 75%(eg: 100GB file will become 25GB).

Note:
As a result the speed of data processing also increases.

ORC format in Hive:
Syntax:

CREATE [TEMPORARY|EXTERNAL] TABLE IF NOT EXISTS DB_NAME.TABLE_ABLE(SCHEMA)
ROW FORMAT row_formats
STORED AS ORC;

Hive has on it's packages, ORC input and output formats from the following packages:

org.apache.hadoop.hive.ql.io.orc

Example issue:
IF we try to load the text file into ORCFILE FORMAT table in hive, returns an error

hive> LOAD DATA LOCAL INPATH '/home/hadoop/Mano/employee.dat' INTO TABLE ORC_FORMAT;
FAILED: SemanticException Unable to load data to destination table. Error: The file that you are trying to load does not match the file format of the destination table.

Example:
Creating RCFILE

Hive QL:

hive> CREATE TABLE IF NOT EXISTS ORC_FORMAT(EMP_ID INT,NAME STRING)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
> LINES TERMINATED BY '\n'
> STORED AS ORC;

Loading Data: from another table or it should be binary file

LOAD DATA LOCAL INPATH '/home/hadoop/Mano/employee.dat' INTO TABLE ORC_FORMAT;

Other Tables:
1)Temporary Tables:

These are temporary and available till end of current session only.

Useful in case of creating intermediate tables to copy data records from one table to another but can be deleted after our copy operation.

Table’s data will be stored in the user’s scratch directory configured by hive.exec.scratchdir, and deleted at the end of the session.

Note:
Temporary tables doesn’t support Partitioning & Indexing.

2)Skewed Tables:

Skewed tables to improve performance of tables with one or more columns having skewed (repeated) values.

Hive will split the skewed (very often) values records into separate files.

Skewed tables are not separate table types, it can be managed or external.

6 comments:

Tejuteju24 June 2018 at 22:57
Excellent article. Very interesting to read. I really love to read such a nice article. Thanks! keep rocking. Big data Hadoop online training India
Alfred Avina29 January 2020 at 01:11
The main motive of the AWS big data consultant is to spread the knowledge so that they can give more big data engineers to the world.

Elon Musk6 September 2020 at 07:40
That is really nice to hear. thank you for the update and good luck. for more info
SHAHZAIB3 November 2020 at 00:53
Excellent article. Very interesting to read. I really love to read such a nice article. Thanks! keep rocking. https://www.vanityliving.com/collections/side-tables-dubai-uae
usama27 May 2021 at 00:27
I read your blog frequently and I just thought I’d say keep up the amazing work! ankara dil kursu
SEO28 January 2022 at 18:21
I went over this website and I believe you have a lot of wonderful information, saved to my bookmarks transfer file

Share my learning's

Friday, 8 September 2017

6)Hive Data Definition Language(DDL)- Tables

6 comments:

Fundamentals of Python programming