
Wednesday 20 August 2014

Pig Basic Understanding



Pig is a high-level data flow system on top of MapReduce.
It is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.


Pig consists of two layers -
-  An infrastructure layer: a compiler that produces sequences of Map-Reduce programs.
-  A language layer: a textual language called Pig Latin (a short sketch follows below).
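A minimal Pig Latin sketch of what such a script looks like (the file path and field names are made up for illustration):

  users  = LOAD '/data/users.txt' USING PigStorage(',') AS (name:chararray, age:int);
  adults = FILTER users BY age >= 18;
  byage  = GROUP adults BY age;
  counts = FOREACH byage GENERATE group AS age, COUNT(adults) AS cnt;
  STORE counts INTO '/data/age_counts';

Each statement defines a relation; nothing actually runs until a STORE or DUMP is reached.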


Pig Origin -
  1. Pig was originally created at Yahoo! to answer a need similar to the one Hive addresses.
  2. Many developers did not have the Java and/or MapReduce knowledge required to write standard MapReduce programs.
  3. But they still needed a query language.
 The solution they got was - Pig


Pig With MapReduce -
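Behind the scenes, Pig translates a script like the one above into one or more MapReduce jobs; operators such as GROUP and JOIN need a shuffle and therefore a reduce phase. A quick way to see this is the EXPLAIN command, sketched here with the same hypothetical data:

  users   = LOAD '/data/users.txt' USING PigStorage(',') AS (name:chararray, age:int);
  grouped = GROUP users BY age;   -- GROUP forces a shuffle, i.e. a reduce phase
  EXPLAIN grouped;                -- prints the logical, physical and MapReduce plans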
Running Pig -
Pig Engine       - Parser, Optimizer, distributed query execution.
Grunt Shell      - Pig’s interactive shell to enter Pig commands.
Script File      - Place Pig commands in a script file & run a script.
Embedded Program - Embed Pig commands in a host language & run the program.
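A sketch of the script-file route (the script name and data are illustrative); the same statements can also be typed one by one at the grunt> prompt:

  -- age_report.pig : run with   pig age_report.pig
  -- (add  -x local  to run against the local filesystem instead of the cluster)
  users  = LOAD '/data/users.txt' USING PigStorage(',') AS (name:chararray, age:int);
  sorted = ORDER users BY age DESC;
  top5   = LIMIT sorted 5;
  DUMP top5;   -- DUMP prints to the console; STORE would write to HDFS instead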





Pig Data Types -



Scalar Types
INT
LONG
FLOAT
DOUBLE
CHARARRAY
BYTEARRAY
BOOLEAN



Complex Types
MAP
TUPLE
BAG
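A sketch of a LOAD schema that mixes the scalar and complex types listed above (path and field names are invented):

  -- a tuple, a bag of tuples, and a map alongside scalar fields
  sessions = LOAD '/data/sessions' AS (
                userid:chararray,
                location:tuple(city:chararray, zip:int),
                pages:bag{t:tuple(url:chararray)},
                props:map[]);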








Pig Features


Pig provides many features which allow developers to perform sophisticated data analysis without writing MapReduce programs. 
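For instance, a join followed by an aggregation, which would otherwise be a full MapReduce program, takes only a few lines of Pig Latin (relations, paths and fields are invented for illustration):

  users  = LOAD '/data/users.txt'  USING PigStorage(',') AS (name:chararray, age:int);
  orders = LOAD '/data/orders.txt' USING PigStorage(',') AS (buyer:chararray, amount:double);
  joined = JOIN users BY name, orders BY buyer;
  byuser = GROUP joined BY users::name;
  totals = FOREACH byuser GENERATE group AS name, SUM(joined.orders::amount) AS total;
  DUMP totals;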


Pig vs Hive
Both have strengths & weaknesses, so it's better to spend some time investigating each and make an informed decision to choose either Pig or Hive, depending on the requirement and the type of data to be stored & processed.








Tuesday 19 August 2014

Hive Basic Understanding


Hive is a petabyte-scale data warehouse system on Hadoop.
It is a Hadoop-based system for querying & managing structured data.
It's used to query Big Data in an SQL-like fashion.

For Execution Hive uses - MapReduce
For Storage Hive uses   - HDFS
For Metadata Hive uses  - an RDBMS (the metastore)
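A minimal HiveQL sketch of that split (table, columns and file path are made up): the table definition goes into the metastore, the rows live as files in HDFS, and the query runs as MapReduce.

  CREATE TABLE page_views (userid STRING, url STRING, hits INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';                   -- definition stored in the metastore

  LOAD DATA INPATH '/data/page_views.csv' INTO TABLE page_views;   -- data ends up in HDFS

  SELECT url, SUM(hits) FROM page_views GROUP BY url;              -- executed as a MapReduce job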



Origin of Hive -
Hive was designed at Facebook for querying petabytes of data. There was a sudden data explosion at Facebook, which made the data impossible to store and query in a traditional DBMS.

Hive made it extremely easy for users to query data stored on HDFS.
Hive has since become a parallel DBMS that uses Hadoop for its storage & execution architecture.


Why Hive -
Hive is another data warehouse system, designed because existing data warehouse systems do not meet all the requirements in a scalable, agile & cost-efficient way.

The programming model used in Hadoop is MapReduce. It is very difficult to write a MapReduce program for every small or big report, and it requires highly skilled resources to write such complex code.
Using Hive, one can simply issue a query much as we do in SQL; Hive then generates the MapReduce code for the user based on the query issued.
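For example, a report that would otherwise need a hand-written MapReduce job becomes a single HiveQL statement (the table and columns are the hypothetical ones from the sketch above):

  -- the map phase reads the rows, the shuffle groups them by userid, the reduce phase counts
  SELECT userid, COUNT(*) AS visits
  FROM   page_views
  WHERE  url LIKE '%/checkout%'
  GROUP  BY userid
  ORDER  BY visits DESC
  LIMIT  10;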


Advantages of Hive -
Hive can work with very large data sets (hundreds of gigabytes to terabytes).
Hive can run on large Hadoop clusters (hundreds of nodes).
Data stored in Hive has a defined schema.
Hive is also used for batch jobs (load & query).


Where not to use Hive -
If you need responses in seconds.
If you don't want to impose a schema.
If traditional DBMS already can do the job.
If your data is measured in GBs or even less.
If you don't have enough time & highly skilled resources.



Hive Entities -
Database, Table, Partitions, Bucketing Columns.
MORE....
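A sketch of a table definition that uses partitions and bucketing columns (all names are invented):

  CREATE TABLE logs (userid STRING, url STRING, hits INT)
  PARTITIONED BY (dt STRING)                   -- one HDFS sub-directory per dt value
  CLUSTERED BY (userid) INTO 32 BUCKETS        -- rows hashed on userid into 32 files per partition
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';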



Hive Data Types -

Primitive Data Types
TINYINT          1 Byte Signed Integer
SMALLINT         2 Byte Signed Integer
INT              4 Byte Signed Integer
BIGINT           8 Byte Signed Integer
BOOLEAN          True or False (Boolean)
FLOAT            Single precision floating point
DOUBLE           Double precision floating point
STRING           Sequence of characters (within single or double quotes)
TIMESTAMP        java.sql.Timestamp format
etc...


Collection Data Types
STRUCT           Similar to Structure in C.
MAP              (Key,Value) pair
ARRAY            Ordered sequence of similar data types.
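A sketch of a table mixing primitive and collection types (table and columns are illustrative):

  CREATE TABLE employees (
    name    STRING,
    salary  FLOAT,
    skills  ARRAY<STRING>,                                -- ordered list of strings
    phones  MAP<STRING, STRING>,                          -- e.g. key 'home', value '12345'
    address STRUCT<street:STRING, city:STRING, zip:INT>   -- like a struct in C
  );
  -- collection fields are addressed as skills[0], phones['home'], address.city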



Hive operations -

DDL operations
[CREATE/ALTER/DROP] [TABLE/VIEW/PARTITION]
CREATE TABLE AS SELECT

DML operations
INSERT OVERWRITE

Queries...
Sub-Queries within "FROM" clause.
Joins  [Inner join & Outer (Left, Right & Full outer join)]
Multi-Table insert
Sampling

Interfaces
JDBC/ODBC/THRIFT
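Rough sketches of a few of the operations listed above; the table names are hypothetical, and the multi-insert targets are assumed to exist already.

  -- CREATE TABLE AS SELECT
  CREATE TABLE top_urls AS
  SELECT url, SUM(hits) AS total FROM page_views GROUP BY url;

  -- multi-table insert: page_views is scanned once, two tables are written
  FROM page_views pv
  INSERT OVERWRITE TABLE hits_by_user SELECT pv.userid, COUNT(*) GROUP BY pv.userid
  INSERT OVERWRITE TABLE hits_by_url  SELECT pv.url,    COUNT(*) GROUP BY pv.url;

  -- sampling: read one bucket out of ten
  SELECT * FROM page_views TABLESAMPLE(BUCKET 1 OUT OF 10 ON rand()) s;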




Hive UDFs -
Hive user-defined functions

UDFs allow users to extend HiveQL.
Built-in UDFs are available in Hive (e.g. SUM, COUNT, etc.)
Custom UDFs can be created by the user based on requirements.

Benefits of UDFs
It's possible to extend HiveQL functionality.
Integration of systems.
Controlling the Map & Reduce stages.
Formatting output data before sending it to Pig, Mahout or other programs.
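A rough sketch: the built-in functions are standard Hive, while the JAR path and the UDF class are invented placeholders for a custom UDF.

  -- built-in UDFs
  SELECT userid, upper(userid), length(url) FROM page_views;

  -- custom UDF (hypothetical jar and class)
  ADD JAR /tmp/my-udfs.jar;
  CREATE TEMPORARY FUNCTION clean_url AS 'com.example.hive.udf.CleanUrl';
  SELECT clean_url(url) FROM page_views;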

MORE on UDFs....