Saturday 22 June 2013

Data Velocity



Big Data is not only about storing large amounts of data and processing it economically to pull some oil, or value, out of it.

We also have to handle the speed of data generation and data delivery.

Let's look at some examples -

Sensor Data     [Thermometer sensing temperature and collecting data continuously]
CCTV Cameras    [Capturing images continuously]
FB Posts        [Facebook adds 500+ TB per day]
Twitter Data    [98,000+ tweets per 60 seconds]
YouTube Videos  [60 hours of video uploaded per minute]
Mobile Towers   [Collect a few GB of data every hour]
...etc
(Approximate figures collected from some websites)


There are many more real-time examples that generate data at very high speed.
Big Data technologies have to handle this challenge, which we call "Data Velocity".
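To get a feel for these rates, here is a rough sketch that converts the approximate figures above into per-second numbers; all values are illustrative only, taken from the list above.

```python
# Approximate figures from the examples above, converted to per-second rates.
fb_tb_per_day = 500         # Facebook: 500+ TB of posts per day
tweets_per_minute = 98_000  # Twitter: 98,000+ tweets per 60 seconds
yt_hours_per_minute = 60    # YouTube: 60 hours of video uploaded per minute

fb_gb_per_second = fb_tb_per_day * 1024 / (24 * 60 * 60)
tweets_per_second = tweets_per_minute / 60
yt_hours_per_second = yt_hours_per_minute / 60

print(f"Facebook: ~{fb_gb_per_second:.1f} GB of posts per second")
print(f"Twitter:  ~{tweets_per_second:.0f} tweets per second")
print(f"YouTube:  ~{yt_hours_per_second:.0f} hour(s) of video per second")
```

Even these rough numbers show why ingestion speed, not just storage size, is a first-class problem.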




Data Variety



Big Data is very popular for handling different varieties of data. Variety here refers to the type of data to be stored and processed.



Types of Data Variety:

  1. Structured Data
  2. Unstructured Data
  3. Semi-Structured Data



Structured Data

Data which has some structure, or more precisely, data which has a rigid schema.

Example:
Relational Databases   -  IBM DB2, Teradata, Oracle Database, etc.

All the above databases have a rigid structure, and we have to follow that structure while loading the data. If we consider a table in a relational database, it is created with a column definition, and whenever we load data into this table, the column definition has to be followed; we cannot enter character data into an integer column.


Some Useful Terminologies:
When we have to strictly follow the schema, it is termed a Rigid Schema.

When we have to strictly follow the table schema while loading or writing the data, it is termed Schema on Write.

When we have to follow the schema while reading or accessing the data, it is termed Schema on Read.
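The two terms can be sketched in a few lines of plain Python; the function names here are made up for illustration, not a real database API.

```python
# Sketch: Schema on Write vs Schema on Read (illustrative names only).
schema = {"id": int, "name": str}

def write_record(table, record):
    # Schema on Write: validate against the schema BEFORE storing;
    # bad data is rejected at load time.
    for column, expected_type in schema.items():
        if not isinstance(record[column], expected_type):
            raise TypeError(f"column {column!r} expects {expected_type.__name__}")
    table.append(record)

def read_records(raw_lines):
    # Schema on Read: raw text is stored as-is, and the schema is
    # applied only when the data is read back.
    for line in raw_lines:
        id_text, name = line.split(",", 1)
        yield {"id": int(id_text), "name": name}

table = []
write_record(table, {"id": 1, "name": "alpha"})          # accepted
try:
    write_record(table, {"id": "oops", "name": "beta"})  # rejected on write
except TypeError as error:
    print("rejected at write time:", error)

raw = ["2,gamma", "3,delta"]            # stored without any checks
print(list(read_records(raw)))          # schema applied only on read
```

Relational databases follow the first pattern; as we will see, Hadoop-style systems lean on the second.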



Unstructured Data

Data which has no structure, or more precisely, data which has no schema.

Example:
Text data, Facebook posts, Twitter tweets, images, videos, logs (web logs, audit logs, system logs), emails, sensor data, CCTV footage, market events, data from social feeds, mobile phone calls, call-center conversations, etc.


Today, close to 90% of the world's data is unstructured, and it is growing at very high speed.
About 90% of this unstructured data was created within the last 3-4 years alone. This sudden data explosion has given Big Data technologies very high growth.

In 2010, the Big Data market was ~ $3.2 Billion
In 2015, forecasted ~ $17 Billion
In 2017, ~ $20 Billion
(Don't rely on the above forecasted values; collected from some websites, these are approximations only)



Semi-Structured Data

Data which has some structure, or we can say data which has some schema, but where the schema need not be followed rigidly.

Example:  Excel sheets

In Excel sheets, we can store data in the form of rows and columns. We can declare a definition for the columns, as we do in a relational database table declaration, but here we can still enter character data in a numeric column and numeric data in character columns.

Data with a structure or schema that need not be followed rigidly while loading data is termed Semi-Structured data.
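JSON is another common semi-structured format, and it makes the idea easy to see in code. The records below are invented for illustration: each one carries field names (a loose schema), but a field may be missing or change type, and nothing rejects the record at load time.

```python
import json

# Sketch: semi-structured records with a loose, non-rigid schema.
raw_records = [
    '{"id": 1, "name": "alpha", "age": 30}',
    '{"id": "two", "name": "beta"}',   # "id" holds text, "age" is absent
]

for raw in raw_records:
    record = json.loads(raw)             # both records load fine
    age = record.get("age", "unknown")   # tolerate the missing field
    print(record["id"], record["name"], age)
```

This is exactly the Excel behaviour described above: the columns are declared, but the data is not forced to match them.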




Big Data - Volume | Velocity | Variety



3 Vs of Big Data




Volume: Size, Amount or Quantity of Data.


Velocity: Speed of data.
  • Speed at which data must be stored.
  • Speed at which data must be processed.


Variety: Type of data to be stored or processed.

  • Structured Data
  • Unstructured Data
  • Semi-Structured Data

Data Volume



When we talk about Big Data, the first thing that comes to mind is the amount, size, or quantity of data.

Here the problem is -
How to store large amounts of data ?           ...No problem... use... HDFS
How to process large amounts of data ?         ...No problem... use... Hadoop MapReduce


Big Data volume refers to data that is larger than the normal volume of data stored and processed so far by traditional systems.

If the data is larger than the normal amount, it is definitely challenging and expensive to load and process. Let's see how this can be achieved in the Hadoop Framework.
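To preview how Hadoop MapReduce processes large data, here is a minimal single-machine sketch of its word-count model in plain Python. A real Hadoop job would run the same two phases distributed across a cluster; this only illustrates the shape of the computation.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # Shuffle + Reduce: group the pairs by word and sum the counts.
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

lines = ["big data is big", "data velocity and data variety"]
mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = reduce_phase(mapped)
print(word_counts)
```

Because each line can be mapped independently, the map phase scales out across machines; that independence is what makes processing huge volumes feasible.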



Data units in Big Data


Unit       Disk Storage   Processor or Virtual Storage   Bytes
1 GB       1000 MB        1024 MB                        1 × 10⁹
1 TB       1000 GB        1024 GB                        1 × 10¹²
1 PB       1000 TB        1024 TB                        1 × 10¹⁵
1 EB       1000 PB        1024 PB                        1 × 10¹⁸
1 ZB       1000 EB        1024 EB                        1 × 10²¹
1 YB       1000 ZB        1024 ZB                        1 × 10²⁴
1 BB       1000 YB        1024 YB                        1 × 10²⁷
1 GeopB    1000 BB        1024 BB                        1 × 10³⁰
Here,

GB       -     GigaBytes
TB       -     TeraBytes
PB       -     PetaBytes
EB       -     ExaBytes
ZB       -     ZettaBytes
YB       -     YottaBytes
BB       -     BrontoBytes
GeopB    -     GeopBytes