This brief tutorial provides a quick introduction to Big Data, the MapReduce algorithm, and the Hadoop Distributed File System.
What is Big Data?
Big data is a collection of large datasets that cannot be processed using traditional computing techniques. It is not merely a kind of data; it has become a complete subject, which involves various tools, techniques and frameworks.
What Comes under Big Data?
Big data
involves the data produced by different devices and applications. Given below
are some of the fields that come under the umbrella of Big Data.
· Black Box Data: This is a component of helicopters, airplanes, jets, etc. It captures the voices of the flight crew, recordings from microphones and earphones, and the performance information of the aircraft.
· Social Media Data: Social media such as Facebook and Twitter hold the information and views posted by millions of people across the globe.
· Stock Exchange Data: Stock exchange data holds information about the ‘buy’ and ‘sell’ decisions that customers make on the shares of different companies.
· Power Grid Data: Power grid data holds the information consumed by a particular node with respect to a base station.
· Transport Data: Transport data includes the model, capacity, distance and availability of a vehicle.
· Search Engine Data: Search engines retrieve lots of data from different databases.
Thus Big Data includes huge volume, high velocity, and an extensible variety of data. The data in it will be of three types:
· Structured data: Relational data.
· Semi-structured data: XML data.
· Unstructured data: Word documents, PDF files, text, media logs.
Benefits of Big Data
Big data is really critical to our lives and is emerging as one of the most important technologies in the modern world. Following are just a few benefits that are well known to all of us:
· Using the information kept in social networks like Facebook, marketing agencies are learning about the response to their campaigns, promotions, and other advertising media.
· Using information from social media, such as the preferences and product perception of their consumers, product companies and retail organizations are planning their production.
· Using data regarding the previous medical history of patients, hospitals are providing better and quicker service.
Big Data Technologies
Big data
technologies are important in providing more accurate analysis, which may lead
to more concrete decision-making resulting in greater operational efficiencies,
cost reductions, and reduced risks for the business.
To harness the power of big data, you would require an infrastructure that can manage and process huge volumes of structured and unstructured data in real time and can protect data privacy and security.
There are
various technologies in the market from different vendors including Amazon,
IBM, Microsoft, etc., to handle big data. While looking into the technologies
that handle big data, we examine the following two classes of technology:
Operational Big Data
This includes systems like MongoDB that provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored.
NoSQL Big Data
systems are designed to take advantage of new cloud computing architectures
that have emerged over the past decade to allow massive computations to be run
inexpensively and efficiently. This makes operational big data workloads much
easier to manage, cheaper, and faster to implement.
Some NoSQL
systems can provide insights into patterns and trends based on real-time data
with minimal coding and without the need for data scientists and additional
infrastructure.
Analytical Big Data
This includes
systems like Massively Parallel Processing (MPP) database systems and MapReduce
that provide analytical capabilities for retrospective and complex analysis
that may touch most or all of the data.
MapReduce provides a new method of analyzing data that is complementary to the capabilities provided by SQL, and a system based on MapReduce can be scaled up from single servers to thousands of high- and low-end machines.
Big Data Challenges
The major challenges
associated with big data are as follows:
- Capturing data
- Curation
- Storage
- Searching
- Sharing
- Transfer
- Analysis
- Presentation
Hadoop
Doug Cutting, Mike Cafarella and their team took the solution provided by Google and started an open-source project called HADOOP in 2005; Doug named it after his son's toy elephant. Today Apache Hadoop is a registered trademark of the Apache Software Foundation.
Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel on different CPU nodes. In short, the Hadoop framework makes it possible to develop applications that run on clusters of computers and perform complete statistical analysis of huge amounts of data.
Hadoop is an Apache open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop framework application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
Hadoop Architecture
The Hadoop framework includes the following four modules:
· Hadoop Common: These are the Java libraries and utilities required by other Hadoop modules. These libraries provide filesystem and OS level abstractions and contain the necessary Java files and scripts required to start Hadoop.
· Hadoop YARN: This is a framework for job scheduling and cluster resource management.
· Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
· Hadoop MapReduce: This is a YARN-based system for parallel processing of large data sets.
MapReduce
Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
The term
MapReduce actually refers to the following two different tasks that Hadoop
programs perform:
· The Map Task: This is the first task, which takes input data and converts it into a set of data where individual elements are broken down into tuples (key/value pairs).
· The Reduce Task: This task takes the output from a map task as input and combines those data tuples into a smaller set of tuples. The reduce task is always performed after the map task. A minimal Java sketch of both tasks is given below.
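As a concrete illustration of the two tasks, below is a minimal word-count sketch written against the standard Hadoop Java MapReduce API (org.apache.hadoop.mapreduce). The class names WordCountMapper and WordCountReducer are illustrative choices for this example, not names fixed by Hadoop.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map task: break each input line into (word, 1) key/value pairs.
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);              // emit (word, 1)
        }
    }
}

// Reduce task: combine the tuples for each word into a single count.
// (Shown in one listing for brevity; the file would be named WordCountMapper.java.)
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));   // emit (word, total)
    }
}

For a line such as "to be or not to be", the map task would emit (to,1), (be,1), (or,1), (not,1), (to,1), (be,1); the reduce task then collapses these tuples into (to,2), (be,2), (or,1), (not,1).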
How Does Hadoop Work?
Stage 1
A user/application can submit a job to Hadoop (a Hadoop job client) for the required process by specifying the following items (a minimal driver sketch is given after the list):
1. The location of the input and output files in the distributed file system.
2. The Java classes, in the form of a jar file, containing the implementation of the map and reduce functions.
3. The job configuration, set through different parameters specific to the job.
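The three items above map directly onto a small driver class. The following is a hedged sketch that reuses the WordCountMapper and WordCountReducer classes from the earlier example; the class name WordCountDriver and the command-line argument convention are assumptions made for this illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // item 3: job configuration parameters
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);            // item 2: jar containing the map/reduce classes
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // item 1: input location in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // item 1: output location in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);        // submit the job and wait for completion
    }
}

Packed into a jar, such a driver would typically be submitted with a command along the lines of $ $HADOOP_HOME/bin/hadoop jar wordcount.jar WordCountDriver /user/input /user/output, where the jar name and paths are again placeholders.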
Stage 2
The Hadoop job client then submits the job (jar/executable, etc.) and configuration to the JobTracker, which then assumes responsibility for distributing the software/configuration to the slaves, scheduling tasks, monitoring them, and providing status and diagnostic information to the job client.
Stage 3
The TaskTrackers on different nodes execute the task as per the MapReduce implementation, and the output of the reduce function is stored in the output files on the file system.
Advantages of Hadoop
· The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines and, in turn, utilizes the underlying parallelism of the CPU cores.
· Hadoop does not rely on hardware to provide fault tolerance and high availability (FTHA); rather, the Hadoop library itself has been designed to detect and handle failures at the application layer.
· Servers can be added to or removed from the cluster dynamically, and Hadoop continues to operate without interruption.
· Another big advantage of Hadoop is that, apart from being open source, it is compatible with all platforms since it is Java based.
Features of HDFS
- It is suitable for distributed storage and processing.
- Hadoop provides a command interface to interact with HDFS.
- The built-in servers of the namenode and datanode help users easily check the status of the cluster.
- Streaming access to file system data.
- HDFS provides file permissions and authentication.
Starting HDFS
Initially, you have to format the configured HDFS file system, open the namenode (HDFS server), and execute the following command.
$ hadoop namenode -format
After formatting the HDFS, start the distributed file system. The following command will start the namenode as well as the data nodes as a cluster.
$ start-dfs.sh
Listing Files in HDFS
After loading the information into the server, we can find the list of files in a directory, or the status of a file, using ‘ls’. Given below is the syntax of ls; you can pass a directory or a filename as an argument.
$ $HADOOP_HOME/bin/hadoop fs -ls <args>
Inserting Data into HDFS
Assume we have data in a file called file.txt in the local system that ought to be saved in the HDFS file system. Follow the steps given below to insert the required file into the Hadoop file system.
Step 1
You have to
create an input directory.
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input
Step 2
Transfer and store a data file from the local system to the Hadoop file system using the put command.
$ $HADOOP_HOME/bin/hadoop fs -put /home/file.txt /user/input
Step 3
You can verify the file using the ls command.
$ $HADOOP_HOME/bin/hadoop fs -ls /user/input
Retrieving Data from HDFS
Assume we have
a file in HDFS called outfile. Given below is a simple demonstration for
retrieving the required file from the Hadoop file system.
Step 1
Initially, view the data from HDFS using the cat command.
$ $HADOOP_HOME/bin/hadoop fs -cat /user/output/outfile
Step 2
Get the file from HDFS to the local file system using the get command.
$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/
Shutting Down the HDFS
You can shut
down the HDFS by using the following command.
$ stop-dfs.sh
What is MapReduce?
MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The Reduce task takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed after the map job. For example, in a word-count job the map phase emits a (word, 1) pair for every word it reads, and the reduce phase sums the values for each word to produce a single (word, count) pair.
Hadoop streaming is a utility that comes with the
Hadoop distribution. This utility allows you to create and run Map/Reduce jobs
with any executable or script as the mapper and/or the reducer.
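For example, a streaming job can use ordinary Unix programs as the mapper and the reducer. The command below is only an illustrative sketch: the location of the hadoop-streaming jar differs between Hadoop versions, and the input and output paths are the placeholder directories used earlier in this tutorial.
$ $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -input /user/input -output /user/output -mapper /bin/cat -reducer /usr/bin/wc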