Share my learning's: 1)Apache Hive:

Monday, 14 August 2017

1)Apache Hive:

Apache Hive is data warehouse software built on top of Apache Hadoop for data analysis. It facilitates reading, writing, and managing large data sets residing in distributed storage (HDFS).

Note:
Hive provides a mechanism to impose structure for a variety of data formats on Hadoop and to query that data using a SQL-like language called HiveQL (HQL).
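To make "imposing structure" concrete, here is a minimal Python sketch of the schema-on-read idea. It is not Hive code: the sample data, the column names, and the read_with_schema helper are all made up for illustration. In HiveQL the rough counterpart would be a CREATE EXTERNAL TABLE statement with ROW FORMAT DELIMITED over files that already exist.

```python
import csv
import io

# Hypothetical raw, tab-delimited text, as it might already sit in HDFS.
RAW = "1\talice\t34.5\n2\tbob\t12.0\n"

# Hypothetical schema: column names with Python types standing in for Hive types.
SCHEMA = [("id", int), ("name", str), ("amount", float)]


def read_with_schema(raw_text, schema, delimiter="\t"):
    """Apply the schema while reading; the stored text is never rewritten."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text), delimiter=delimiter):
        row = {}
        for (name, cast), value in zip(schema, fields):
            try:
                row[name] = cast(value)
            except ValueError:
                # A value that does not fit the declared type becomes None,
                # much as Hive returns NULL for such fields in text tables.
                row[name] = None
        rows.append(row)
    return rows


rows = read_with_schema(RAW, SCHEMA)
print(rows)
```

The point of the sketch is that the structure lives in the table definition and is applied at query time, so the same files can be read under a different schema without being converted first.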

Hive originated at Facebook.

Apache Hive provides the following features:

Tools to enable easy access to data via a SQL interface, thus enabling data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis.
A mechanism to impose structure on a variety of data formats.
Access to files stored either directly in Apache HDFS™ or in other data storage systems such as Apache HBase™.
Query execution via Apache Tez™, Apache Spark™, or MapReduce (the default).

Limitations of Hive:

• Hive is not designed for online transaction processing (OLTP); it is intended for online analytical processing (OLAP).

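• Hive supports overwriting data, but row-level updates and deletes are not available on ordinary tables; they work only on transactional (ACID) tables, in Hive 0.14 and later.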
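For an ordinary (non-transactional) table, the overwrite-only behaviour can be pictured with a toy model in Python. This is only a sketch under simplifying assumptions: ToyTable and emulate_update are hypothetical names, and a real table is a directory of files in HDFS rather than a Python list.

```python
class ToyTable:
    """Toy model of a non-transactional Hive table: a list of immutable files."""

    def __init__(self):
        self.files = []  # each "file" is a tuple of rows

    def insert_into(self, rows):
        # Like INSERT INTO: add a new file, leave existing files untouched.
        self.files.append(tuple(rows))

    def insert_overwrite(self, rows):
        # Like INSERT OVERWRITE: drop every existing file, write one new file.
        self.files = [tuple(rows)]

    def scan(self):
        # A query reads every row of every file.
        return [row for f in self.files for row in f]


def emulate_update(table, predicate, change):
    """Without row-level UPDATE, changing one row means rewriting the table."""
    rewritten = [change(r) if predicate(r) else r for r in table.scan()]
    table.insert_overwrite(rewritten)


t = ToyTable()
t.insert_into([(1, "a"), (2, "b")])
t.insert_into([(3, "c")])
emulate_update(t, lambda r: r[0] == 2, lambda r: (r[0], "z"))
print(t.scan())
```

Because every change rewrites the whole data set (or at least a whole partition), this pattern suits batch loads and not the frequent small writes of an OLTP workload.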

Why is Hive used instead of Pig?

HiveQL is a declarative language like SQL, whereas Pig Latin is a data-flow language.
Pig: a data-flow language and environment for exploring very large datasets.
Hive: a distributed data warehouse.
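The difference between the two styles can be sketched in Python. This is only an analogy: SQLite (from the standard library) stands in for a SQL-like engine such as Hive, plain Python steps stand in for a Pig Latin script, and the orders data is invented.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (customer, amount)
rows = [("alice", 30), ("bob", 5), ("alice", 20), ("carol", 50)]

# Declarative style (HiveQL-like): say WHAT result is wanted,
# and let the engine decide how to compute it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
declarative = conn.execute(
    "SELECT customer, SUM(amount) AS total FROM orders "
    "WHERE amount >= 10 GROUP BY customer ORDER BY customer"
).fetchall()

# Data-flow style (Pig Latin-like): spell out HOW, one named step at a time.
filtered = [r for r in rows if r[1] >= 10]        # step 1: filter
grouped = defaultdict(list)                       # step 2: group by customer
for customer, amount in filtered:
    grouped[customer].append(amount)
dataflow = sorted((c, sum(a)) for c, a in grouped.items())  # step 3: sum per group

print(declarative)
print(dataflow)
```

Both approaches give the same totals; the choice is mostly about whether you prefer to write one query or a pipeline of explicit transformations.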

Components of Hive:

1)HCatalog:

HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools — Pig, MapReduce — to more easily read and write data on the grid.

2)WebHCat:

WebHCat provides a service that you can use to run Hadoop MapReduce (or YARN), Pig, and Hive jobs, or to perform Hive metadata operations, through an HTTP (REST-style) interface.
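As a rough illustration of the REST style, the Python sketch below only builds a WebHCat request URL; it does not contact a server. The host name and user are made up, the webhcat_url helper is hypothetical, and the sketch assumes the default WebHCat port (50111) and the /templeton/v1 base path, so check both against your own cluster.

```python
from urllib.parse import urlencode


def webhcat_url(host, resource, user, port=50111, **params):
    """Build a WebHCat URL (hypothetical helper, assuming default port and base path)."""
    # WebHCat expects the calling user in the user.name query parameter.
    query = urlencode({"user.name": user, **params})
    return f"http://{host}:{port}/templeton/v1/{resource}?{query}"


# A GET on this URL would list the tables in the "default" database.
url = webhcat_url("hive-host.example.com", "ddl/database/default/table", "hadoop_user")
print(url)
```

The resulting URL can be passed to any HTTP client (curl, a browser, or a Python HTTP library) to perform the metadata operation.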

Hive Execution engines and properties:

There are currently three execution engines:

1. Default MapReduce engine,