
Sunday 24 February 2013

HDFS Overview




Hadoop comes with a distributed file system called HDFS, which stands for
Hadoop Distributed Filesystem. HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. 



Distributed File System

A distributed file system is needed because datasets are growing beyond the storage capacity of a single machine. But a distributed file system brings a few problems of its own: the chance of data loss due to machine failures, and the complexity of network programming, since it is a network-based file system. This makes it more complicated than a regular file system.




HDFS is designed for specific requirements.   

Where is HDFS a good fit?
  1. Storing large datasets that run into terabytes, petabytes, or even more (Data Volume).
  2. Storing different varieties of data - structured, unstructured, or semi-structured (Data Variety).
  3. Storing data on commodity hardware (economical).


Where is HDFS not a good fit?
  1. Low-latency data access (HBase is a better option).
  2. A huge number of small files (up to millions is fine, but billions is beyond the capacity of current hardware; the NameNode's metadata storage capacity is the limiting factor - see the rough estimate after this list).
  3. Random writes (arbitrary in-place modification of files is not supported; Hadoop does not support OLTP, and an RDBMS is the best fit for OLTP operations).
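
To see why billions of small files overwhelm the NameNode, here is a rough, back-of-the-envelope sketch in Java. It assumes the commonly quoted rule of thumb of about 150 bytes of NameNode heap per namespace object; the numbers are illustrative, not measurements.

    // Rough estimate of NameNode heap needed for one billion small files.
    // Assumption (rule of thumb, not an exact figure): ~150 bytes of heap
    // per namespace object (file inode or block).
    public class NameNodeHeapEstimate {
      public static void main(String[] args) {
        long files = 1_000_000_000L;   // one billion small files
        long objectsPerFile = 2;       // roughly one inode + one block per small file
        long bytesPerObject = 150;     // rule-of-thumb heap cost per object
        long heapBytes = files * objectsPerFile * bytesPerObject;
        System.out.println("Estimated NameNode heap: ~"
            + heapBytes / (1024L * 1024 * 1024) + " GB");  // roughly 280 GB
      }
    }

A few hundred gigabytes of heap for a single JVM is far beyond what the NameNode can realistically be given, which is why HDFS prefers a smaller number of large files.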





HDFS has a Master/Slave Architecture

Master ==>  NameNode
Slave   ==>  DataNode





Block Placement

  1. Current strategy - the first replica is placed on the local node, the second replica on a node in a remote rack, the third replica on a different node in that same remote rack, and additional replicas (if the replication factor is greater than 3) are placed randomly.
  2. Clients read from the nearest replica (a small configuration sketch follows this list).
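
The replication factor that drives this placement can be set per cluster or per file. The sketch below is a minimal illustration from a Java client, assuming fs.defaultFS points at the cluster; the file path is a hypothetical example, and dfs.replication is the standard hdfs-site.xml property.

    // Minimal sketch: control the replication factor from a Java client.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");  // default replication for files created by this client

        FileSystem fs = FileSystem.get(conf);
        // Raise the replication factor of an existing file; the NameNode schedules
        // the extra copies using the placement strategy described above.
        fs.setReplication(new Path("/data/events.log"), (short) 5);
        fs.close();
      }
    }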



Heartbeats

  1. DataNodes send a heartbeat to the NameNode every 3 seconds.
  2. The NameNode uses these heartbeats to detect DataNode failures (the related settings are sketched below).
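
A minimal sketch of the heartbeat-related settings, assuming the standard hdfs-site.xml property names; the dead-node formula in the comment is how recent HDFS releases derive the timeout, stated here as an approximation.

    // Heartbeat interval and the derived "dead node" timeout.
    import org.apache.hadoop.conf.Configuration;

    public class HeartbeatSettings {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        long heartbeatSec = conf.getLong("dfs.heartbeat.interval", 3);          // 3 s by default
        long recheckMs = conf.getLong("dfs.namenode.heartbeat.recheck-interval",
            5 * 60 * 1000);                                                     // 5 min by default
        // A DataNode that misses heartbeats for ~2*recheck + 10*heartbeat is declared dead.
        long deadTimeoutMs = 2 * recheckMs + 10 * heartbeatSec * 1000;
        System.out.println("DataNode considered dead after ~" + (deadTimeoutMs / 1000) + " s");  // ~630 s
      }
    }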




Replication Engine

When the NameNode detects a DataNode failure, it:
  1. Chooses new DataNodes to hold the new replicas.
  2. Balances disk usage across the cluster.
  3. Balances communication traffic to the DataNodes.


 

NameNode Failure

  1. The NameNode is a single point of failure.
  2. Transaction logs are therefore stored in multiple directories (see the configuration sketch below):

         - A directory on the local file system.
         - A directory on a remote file system (NFS/CIFS).
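
A minimal sketch of pointing the NameNode at two metadata directories, one on a local disk and one on an NFS mount. The paths are illustrative assumptions; in practice this property lives in hdfs-site.xml (older releases call it dfs.name.dir, newer ones dfs.namenode.name.dir).

    // The FsImage and EditLog are written to every listed directory, so losing
    // the local disk does not lose the file system namespace.
    import org.apache.hadoop.conf.Configuration;

    public class NameNodeDirs {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("dfs.namenode.name.dir",
            "/data/hadoop/namenode,/mnt/nfs/hadoop/namenode");  // local dir + NFS-mounted dir
      }
    }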



Data Pipelining

  1. The client retrieves a list of DataNodes on which to place the replicas of a block.
  2. The client writes the block to the first DataNode.

         - The first DataNode forwards the data to the next DataNode in the pipeline.
         - When all replicas are written, the client moves on to write the next block of the file (a write sketch follows this list).
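
Below is a minimal sketch of a client write that goes through the replica pipeline. The path, replication factor, and block size are illustrative assumptions, not values from the original post.

    // Write a file to HDFS; each block is streamed through a pipeline of DataNodes.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PipelinedWrite {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/events.log");
        // create(path, overwrite, bufferSize, replication, blockSize):
        // the NameNode hands back a pipeline of 3 DataNodes per block.
        FSDataOutputStream out = fs.create(file, true, 4096, (short) 3, 128 * 1024 * 1024L);

        // Bytes go to the first DataNode, which forwards them to the second,
        // which forwards them to the third.
        for (int i = 0; i < 1000; i++) {
          out.writeBytes("event-" + i + "\n");
        }
        out.close();  // returns once the replicas of the last block are acknowledged
        fs.close();
      }
    }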



Secondary NameNode

  1. Copies the FsImage & transaction log from the NameNode to a temporary directory.
  2. Merges the FsImage and transaction log into a new FsImage in the temporary directory.
  3. Uploads the new FsImage to the NameNode, and the transaction log on the NameNode is purged (the checkpoint settings are sketched below).
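
How often this checkpoint runs is configurable. A minimal sketch, assuming the Hadoop 2.x property names (Hadoop 1.x used fs.checkpoint.period); the values shown are the usual defaults.

    // Checkpoint settings honoured by the Secondary NameNode (Hadoop 2.x key names).
    import org.apache.hadoop.conf.Configuration;

    public class CheckpointSettings {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Merge the EditLog into a new FsImage at least once an hour...
        conf.setLong("dfs.namenode.checkpoint.period", 3600);
        // ...or sooner, once this many uncheckpointed transactions accumulate.
        conf.setLong("dfs.namenode.checkpoint.txns", 1_000_000);
      }
    }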




The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata. For example, creating a new file in HDFS causes the NameNode to insert a record into the EditLog indicating this. Similarly, changing the replication factor of a file causes a new record to be inserted into the EditLog. The NameNode uses a file in its local host OS file system to store the EditLog. The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage. The FsImage is stored as a file in the NameNode’s local file system too.
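
As an illustration of which client operations produce EditLog records, here is a small sketch. The paths are hypothetical, and the opcode names in the comments follow HDFS's FSEditLogOpCodes enum as an indicative mapping rather than an exhaustive one.

    // Each namespace-changing call below causes the NameNode to append a record
    // to the EditLog; the data bytes themselves go to DataNodes, not the EditLog.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class EditLogOperations {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        fs.mkdirs(new Path("/logs"));                                       // OP_MKDIR
        fs.create(new Path("/logs/app.log")).close();                       // OP_ADD + OP_CLOSE
        fs.setReplication(new Path("/logs/app.log"), (short) 2);            // OP_SET_REPLICATION
        fs.rename(new Path("/logs/app.log"), new Path("/logs/app.1.log"));  // OP_RENAME

        fs.close();
      }
    }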




Storing & Querying Big Data in Hadoop Distributed File System:
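
Below is a minimal sketch of storing a file in HDFS and querying it back from a Java client. The NameNode URI and the paths are illustrative assumptions, not values from a real cluster.

    // Store a local file in HDFS, then query it: list the directory and read it back.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsStoreAndQuery {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);  // hypothetical NameNode address

        // Store: copy a local file into HDFS.
        fs.copyFromLocalFile(new Path("/tmp/sales.csv"), new Path("/bigdata/sales.csv"));

        // Query: list the directory, then read the file back line by line.
        for (FileStatus status : fs.listStatus(new Path("/bigdata"))) {
          System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(fs.open(new Path("/bigdata/sales.csv"))))) {
          String line;
          while ((line = reader.readLine()) != null) {
            System.out.println(line);
          }
        }
        fs.close();
      }
    }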


Why Hadoop ?

  1. Hadoop processes huge datasets on large clusters of computers (commodity hardware).
  2. Highly scalable.
  3. Highly fault tolerant - Failure is expected, rather than exception.
  4. Provides a common infrastructure - efficient, reliable, easy to use, open source (Apache License).
  5. Applicable to many computing problems.
  6. Cross platform (pure Java).


Who uses Hadoop?

  1. Amazon/A9
  2. Facebook
  3. Yahoo
  4. IBM : Blue Cloud ?
  5. Joost
  6. Last.fm
  7. New York Times
  8. Powerset
  9. Veoh! .... etc.

.....More information will be shared soon

Friday 22 February 2013

Hadoop Framework



Hadoop is a framework written in Java.


1.  A scalable, fault-tolerant distributed system for large-scale data storage & processing.

2.  Designed to solve problems that involve storing, processing & analyzing large data (terabytes, petabytes, etc.).

3.  The programming model is based on Google's MapReduce (a word-count sketch follows this list).

4.  The infrastructure is based on Google's BigTable & the Google File System (GFS).

5.  Key values:

     ~  Highly flexible    -   Store data without a schema & add one later as needed.

     ~  Very economical    -   Use commodity hardware for data storage.

     ~  Broadly adopted    -   A large & active ecosystem.

     ~  Proven at scale    -   Many petabyte-scale implementations in production today.
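
To make item 3 concrete, here is a minimal word-count sketch in the style of the classic Hadoop example, using the org.apache.hadoop.mapreduce API (Hadoop 2.x). Input and output directories are taken from the command line; everything else follows the standard pattern.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: emit (word, 1) for every word in the input split.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce phase: sum the counts for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }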




Origin of Hadoop:


Doug Cutting is the founder of the Apache Hadoop project and an architect at Hadoop provider Cloudera. The name "Hadoop" came from Doug Cutting, who named it after his son's toy elephant. Doug used the name for his open-source project because it was easy to pronounce and easy to Google.

Hadoop is a project that aims to develop open-source software for the analysis, processing, and distribution of big data. It was originally developed to support distribution for the Nutch search engine project.

For better or worse, Hadoop has become synonymous with big data. In just a few years it has gone from a fringe technology to the de facto standard.





Big Data Overview




Big Data is very popular today because of its large-scale data storage & processing capabilities, which traditional relational databases cannot handle.




But Big Data is about more than just the bigness of the data. It has a few more defining challenges, which are variously called:

  • the 3 defining characteristics of Big Data
  • the 3 challenges of Big Data
  • the 3 V's of Big Data



Volume, Velocity & Variety

More details on the 3 V's of Big Data:




In today's world, almost 80% of data is unstructured - books, journals, documents, metadata, health records, audio, video, analog data, files, and unstructured text such as the body of an e-mail message, a web page, or a word-processor document. Every year this data grows by around 40%, which doubles it roughly every two years.


Today there are around 40 billion web pages, which is huge. Enormous amounts of data are generated every day on social networking websites such as Facebook and Twitter, and this data runs into terabytes, petabytes, or even more.
Big Data is unstructured, semi-structured, or structured data that is terabytes, petabytes, or even more in size.

This flood of data is coming from many sources. Consider the following:

  • The New York Stock Exchange generates about one terabyte of new trade data per day.
  • Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
  • Ancestry.com, the genealogy site, stores around 2.5 petabytes of data.
  • The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20 terabytes per month.
  • The Large Hadron Collider near Geneva, Switzerland, will produce about 15 petabytes of data per year. 


Some of the products available in the market to process Big Data are:
1.  Apache Hadoop   (open source and very popular today)
2.  SAP HANA
3.  IBM Netezza
etc.