Saturday 22 June 2013

Data Velocity



Big Data is not only about storing large amounts of data and processing it economically to pull some oil, or value, out of it.

We also have to handle the speed of data generation and data delivery.

Let's look at some examples -

Sensor Data     [Thermometer sensing temperature and collecting data continuously]
CCTV Cameras    [Capturing images continuously]
FB Posts        [Facebook adds 500+ TB per day]
Twitter Data    [98,000+ tweets per 60 seconds]
YouTube Videos  [60 hours of video uploaded per minute]
Mobile Towers   [Collect a few GB of data every hour]
...etc
(Approximate figures collected from some websites)


There are many more real-time examples that generate data at very high speed.
Big Data technologies have to handle this challenge, which we call "Data Velocity".
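To get a feel for these rates, here is a rough sketch that converts the approximate figures above into per-second numbers; all values are illustrative only, taken from the list above.

```python
# Approximate figures from the examples above, converted to per-second rates.
fb_tb_per_day = 500         # Facebook: 500+ TB of posts per day
tweets_per_minute = 98_000  # Twitter: 98,000+ tweets per 60 seconds
yt_hours_per_minute = 60    # YouTube: 60 hours of video uploaded per minute

fb_gb_per_second = fb_tb_per_day * 1024 / (24 * 60 * 60)
tweets_per_second = tweets_per_minute / 60
yt_hours_per_second = yt_hours_per_minute / 60

print(f"Facebook: ~{fb_gb_per_second:.1f} GB of posts per second")
print(f"Twitter:  ~{tweets_per_second:.0f} tweets per second")
print(f"YouTube:  ~{yt_hours_per_second:.0f} hour(s) of video per second")
```

Even these rough numbers show why ingestion speed, not just storage size, is a first-class problem.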




Data Variety



Big Data is very popular for handling different varieties of data. Variety here refers to the type of data to be stored and processed.



Types of Data Variety:

  1. Structured Data
  2. Unstructured Data
  3. Semi-Structured Data



Structured Data

Data which has some structure, or more precisely, data which has a rigid schema.

Example:
Relational Databases   -  IBM DB2, Teradata, Oracle Database, etc.

All the above databases have a rigid structure, and we have to follow that structure while loading the data. If we consider a table in a relational database, it is created with a column definition, and whenever we load data into this table, the column definition has to be followed; we cannot enter character data into an integer column.


Some Useful Terminologies:
When we have to strictly follow the schema, it is termed a Rigid Schema.

When we have to strictly follow the table schema while loading or writing the data, it is termed Schema on Write.

When we have to follow the schema while reading or accessing the data, it is termed Schema on Read.
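The two terms can be sketched in a few lines of plain Python; the function names here are made up for illustration, not a real database API.

```python
# Sketch: Schema on Write vs Schema on Read (illustrative names only).
schema = {"id": int, "name": str}

def write_record(table, record):
    # Schema on Write: validate against the schema BEFORE storing;
    # bad data is rejected at load time.
    for column, expected_type in schema.items():
        if not isinstance(record[column], expected_type):
            raise TypeError(f"column {column!r} expects {expected_type.__name__}")
    table.append(record)

def read_records(raw_lines):
    # Schema on Read: raw text is stored as-is, and the schema is
    # applied only when the data is read back.
    for line in raw_lines:
        id_text, name = line.split(",", 1)
        yield {"id": int(id_text), "name": name}

table = []
write_record(table, {"id": 1, "name": "alpha"})          # accepted
try:
    write_record(table, {"id": "oops", "name": "beta"})  # rejected on write
except TypeError as error:
    print("rejected at write time:", error)

raw = ["2,gamma", "3,delta"]            # stored without any checks
print(list(read_records(raw)))          # schema applied only on read
```

Relational databases follow the first pattern; as we will see, Hadoop-style systems lean on the second.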



Unstructured Data

Data which has no structure, or more precisely, data which has no schema.

Example:
Text data, Facebook posts, Twitter tweets, images, videos, logs (web logs, audit logs, system logs), emails, sensor data, CCTV footage, market events, data from social feeds, mobile phone calls, call-center conversations, etc.


Today, close to 90% of the world's data is unstructured, and it is growing at very high speed.
About 90% of this unstructured data was created within the last 3-4 years alone. This sudden data explosion has given Big Data technologies very high growth.

In 2010, the Big Data market was ~ $3.2 Billion
In 2015, forecasted ~ $17 Billion
In 2017, ~ $20 Billion
(Don't rely on the above forecasted values; collected from some websites, these are approximations only)



Semi-Structured Data

Data which has some structure, or we can say data which has some schema, but where the schema need not be followed rigidly.

Example:  Excel sheets

In Excel sheets, we can store data in the form of rows and columns. We can declare a definition for the columns, as we do in a relational database table declaration, but here we can still enter character data in a numeric column and numeric data in character columns.

Data with a structure or schema that need not be followed rigidly while loading data is termed Semi-Structured data.
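JSON is another common semi-structured format, and it makes the idea easy to see in code. The records below are invented for illustration: each one carries field names (a loose schema), but a field may be missing or change type, and nothing rejects the record at load time.

```python
import json

# Sketch: semi-structured records with a loose, non-rigid schema.
raw_records = [
    '{"id": 1, "name": "alpha", "age": 30}',
    '{"id": "two", "name": "beta"}',   # "id" holds text, "age" is absent
]

for raw in raw_records:
    record = json.loads(raw)             # both records load fine
    age = record.get("age", "unknown")   # tolerate the missing field
    print(record["id"], record["name"], age)
```

This is exactly the Excel behaviour described above: the columns are declared, but the data is not forced to match them.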




Big Data - Volume | Velocity | Variety



3 Vs of Big Data




Volume: Size, Amount or Quantity of Data.


Velocity: Speed of data.
  • Speed at which data must be stored.
  • Speed at which data must be processed.


Variety: Type of data to be stored or processed.

  • Structured Data
  • Unstructured Data
  • Semi-Structured Data

Data Volume



When we talk about Big Data, the first thing that comes to mind is the amount, size, or quantity of data.

Here the problem is -
How to store large amounts of data ?           ...No problem... use... HDFS
How to process large amounts of data ?         ...No problem... use... Hadoop MapReduce


Big Data volume refers to data that is larger than the normal volume of data stored and processed so far by traditional systems.

If the data is larger than the normal amount, it is definitely challenging and expensive to load and process. Let's see how this can be achieved in the Hadoop Framework.
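To preview how Hadoop MapReduce processes large data, here is a minimal single-machine sketch of its word-count model in plain Python. A real Hadoop job would run the same two phases distributed across a cluster; this only illustrates the shape of the computation.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # Shuffle + Reduce: group the pairs by word and sum the counts.
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

lines = ["big data is big", "data velocity and data variety"]
mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = reduce_phase(mapped)
print(word_counts)
```

Because each line can be mapped independently, the map phase scales out across machines; that independence is what makes processing huge volumes feasible.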



Data units in Big Data


Unit       Disk Storage   Processor or Virtual Storage   Bytes
1 GB       1000 MB        1024 MB                        1 × 10⁹
1 TB       1000 GB        1024 GB                        1 × 10¹²
1 PB       1000 TB        1024 TB                        1 × 10¹⁵
1 EB       1000 PB        1024 PB                        1 × 10¹⁸
1 ZB       1000 EB        1024 EB                        1 × 10²¹
1 YB       1000 ZB        1024 ZB                        1 × 10²⁴
1 BB       1000 YB        1024 YB                        1 × 10²⁷
1 GeopB    1000 BB        1024 BB                        1 × 10³⁰
Here,

GB       -     GigaBytes
TB       -     TeraBytes
PB       -     PetaBytes
EB       -     ExaBytes
ZB       -     ZettaBytes
YB       -     YottaBytes
BB       -     BrontoBytes
GeopB    -     GeopBytes