
Wednesday, 6 March 2013

MapReduce


As an ongoing trend, rapidly increasing amounts of data are collected in real-world applications in the life sciences, engineering, telecommunications, business transactions and many other domains. For the management and analysis of these data, many different techniques and algorithms have been developed, ranging from basic database operations to high-level data mining approaches such as clustering, classification and outlier detection. Processing huge data sets with millions or billions of records exceeds the capabilities of a single computing node due to limitations of disk space and/or main memory. It is therefore indispensable to develop distributed approaches that run in parallel on clusters of several computers.

For the development of distributed algorithms, a variety of structured programming models exists. Aside from classic parallel programming, the MapReduce model proposed by Google and its open-source implementation Hadoop have found widespread attention and usage.

In MapReduce, the data is given as a list of records that are represented as (key, value) pairs. Basically, a MapReduce program consists of two phases: In the "Map" phase, the records are arbitrarily distributed to different computing nodes (called "mappers") and each record is processed separately, independently of the other data items. The map phase outputs intermediate (key, value) pairs. In the "Reduce" phase, records having the same key are grouped together and processed by the same computing node ("reducer"). Thus, the reducers combine information from different records having the same key and aggregate the intermediate results of the mappers. The results are stored back to the distributed file system.
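To make the two phases more concrete, here is a minimal sketch of the classic word-count example written against the Hadoop Java API (assuming the newer org.apache.hadoop.mapreduce classes; the class name and the command-line paths below are only illustrative). The mapper emits an intermediate (word, 1) pair for every token it reads, and the reducer sums up the counts it receives for each word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each input line is processed independently of all others;
    // for every token, an intermediate (word, 1) pair is emitted.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: all intermediate values for the same key arrive at the
    // same reducer, which aggregates them into a total count per word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: wires mapper and reducer together and takes the input and
    // output paths (directories in the distributed file system) as arguments.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on the mappers
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged as a jar, such a job would typically be submitted with something like "hadoop jar wordcount.jar WordCount /input /output", where the two paths refer to directories in HDFS.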


On top of this new programming model, Hadoop and other implementations of the MapReduce framework offer a number of non-functional advantages: They scale to clusters of many computing nodes, which can easily be expanded with new nodes. They are fault-tolerant: If one of the computing nodes fails during the execution of the program, the work of the other nodes is neither affected nor discarded; only the records that were being processed on the failing node have to be processed again by another node. This fault tolerance in particular supports running Hadoop on commodity hardware. For example, organizations often have tens or hundreds of desktop computers that are only used at certain times of day and, for the most part, just for office applications. These computers often have substantial unused capacity in terms of processor time and disk space. With Hadoop, these available resources can easily be put to use for distributed computing.

The goal of this research is the development of highly parallelizable data mining techniques within the MapReduce framework.



More to follow on MapReduce:
       1.  The MapReduce process
       2.  MapReduce examples
       3.  What MapReduce can do
       4.  What MapReduce cannot do


