Hadoop Guru: Hive Basic Understanding

Hive is Petabyte scale dataware house system on Hadoop.
Hadoop based system for querying & managing structured data.
Its used to Query Big Data in SQL fashion.

For Execution Hive uses - Map/Reduce
For Storage Hive uses    - HDFS
For Metadata                 - RDBMS

Origin of Hive -
Hive was designed by Facebook for querying from petabytes of data. There was sudden data explosion at Facebook which was impossible to store in traditional DBMS & query.

Hive made users job extremly esay to query data stored on HDFS.
Hive now became parallel DBMS which uses Hadoop for its storage & execution architecture.

Why Hive -
Hive is another dataware house system designed because existing Dataware house systems do not meet all the requirement in scalable , agile & cost effeciant way.

Programming model used in Hadoop is - MapReduce. Its very difficult to write Map-Reduce program for every small or big reports. Also it's requires highly skilled resources to write such a complex code.
Using Hive one can simply issue the query as simple & similar we do in SQL. But here Hive generates Map Reduce code for user based on Query issued.

Advantages of Hive -
Hive can work with very large data (100's to Terabytes).
Hive can work on large hadoop cluster (100's of Nodes).
Data stored on Hive has defined Schema.
Hive is used for Batch jobs also (Load & Query).

Where not to use Hive -
If you need responses in seconds.
If you don't want to impose a schema.
If traditional DBMS already can do the job.
If your data is measured in GB's or even less.
If you don't have enough time & highly skilled resources.

Hive Entities -
Database, Table, Partitions, Bucketing Columns.
MORE....

Hive Data Types -

Primitive Data Types
TINYINT          1 Byte Signed Integer
SMALLINT         2 Byte Signed Integer
INT              4 Byte Signed Integer
BIGINT           8 Byte Signed Integer
BOOLEAN          True or False (Boolean)
FLOAT            Single precision floating bytes
DOUBLE           Double precision floating point
STRING           Sequence of charaters (within Sigle or double quotes)
TIMESTAMP        java.sql.Timestamp format
etc...

Collection Data Types
STRUCT           Similar to Structure in C.
MAP              (Key,Value) pair
ARRAY            Ordered sequence of similar data types.

Hive operations -

DDL operations
[CREATE/ALTER/DROP] [TABLE/VIEW/PARTITION]
CREATE TABLE AS SELECT

DML operations
INSERT OVERWRITE

Queries...
Sub-Queries within "FROM" clause.
Joins [Inner join & Outer (Left, Right & Full outer join)]
Multi-Table insert
Sampling

Interfaces
JDBC/ODBC/THRIFT

Hive UDF's -
Hive user defined functions

UDF's allows user to extend HiveQL.
Builtin UDF's are available in Hive (e.g  SUM, COUNT, .... etc)
Custom UDF's created by User. These can be created by user based on requirement.

Benefits of UDF's
It's possible to extend HiveQL functionality.
Integration of Systems.
Controlling the Map & Reduce stages.
Formatting output data before sending to Pig, Mahout or other programs.

MORE on UDFs....

Hadoop Guru

Pages

Tuesday, 19 August 2014

Hive Basic Understanding

2 comments: