Saturday, 29 April 2017

Hadoop - Java-Based Programming Framework

       
Hadoop is an open-source framework that allows you to store and process big data in a distributed environment across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

This brief tutorial provides a quick introduction to Big Data, the MapReduce algorithm, and the Hadoop Distributed File System (HDFS).

What is Big Data?

Big data is a collection of large datasets that cannot be processed using traditional computing techniques. It is not merely data; rather, it has become a complete subject, involving various tools, techniques and frameworks.
What Comes under Big Data?
          Big data involves the data produced by different devices and applications. Given below are some of the fields that come under the umbrella of Big Data.
·        Black Box Data : This is a component of helicopters, airplanes, jets, etc. It captures the voices of the flight crew, recordings from microphones and earphones, and the performance information of the aircraft.
·        Social Media Data : Social media such as Facebook and Twitter hold information and the views posted by millions of people across the globe.
·        Stock Exchange Data : The stock exchange data holds information about the ‘buy’ and ‘sell’ decisions made by customers on shares of different companies.
·        Power Grid Data : The power grid data holds information about the power consumed by a particular node with respect to a base station.
·        Transport Data : Transport data includes model, capacity, distance and availability of a vehicle.
·        Search Engine Data : Search engines retrieve lots of data from different databases.
Thus Big Data includes huge volume, high velocity, and an extensible variety of data. The data in it will be of three types:
·        Structured data : Relational data.
·        Semi Structured data : XML data.
·        Unstructured data : Word, PDF, Text, Media Logs.
Benefits of Big Data:

Big data is really critical to our lives and is emerging as one of the most important technologies in the modern world. Following are just a few benefits that are well known to all of us:
·        Using the information kept in social networks like Facebook, marketing agencies are learning about the response to their campaigns, promotions, and other advertising media.
·        Using information in social media, such as the preferences and product perceptions of their consumers, product companies and retail organizations are planning their production.
·        Using data regarding the previous medical history of patients, hospitals are providing better and quicker service.

Big Data Technologies

       Big data technologies are important in providing more accurate analysis, which may lead to more concrete decision-making resulting in greater operational efficiencies, cost reductions, and reduced risks for the business.
To harness the power of big data, you would require an infrastructure that can manage and process huge volumes of structured and unstructured data in real time and can protect data privacy and security.
       There are various technologies in the market from different vendors including Amazon, IBM, Microsoft, etc., to handle big data. While looking into the technologies that handle big data, we examine the following two classes of technology:

Operational Big Data

This class includes systems like MongoDB that provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored.
        NoSQL Big Data systems are designed to take advantage of new cloud computing architectures that have emerged over the past decade to allow massive computations to be run inexpensively and efficiently. This makes operational big data workloads much easier to manage, cheaper, and faster to implement.
       Some NoSQL systems can provide insights into patterns and trends based on real-time data with minimal coding and without the need for data scientists and additional infrastructure.

Analytical Big Data

      This includes systems like Massively Parallel Processing (MPP) database systems and MapReduce that provide analytical capabilities for retrospective and complex analysis that may touch most or all of the data.
MapReduce provides a method of analyzing data that is complementary to the capabilities provided by SQL, and a system based on MapReduce can be scaled up from single servers to thousands of high- and low-end machines.
Big Data Challenges
     The major challenges associated with big data are as follows:
  • Capturing data
  • Curation
  • Storage
  • Searching
  • Sharing
  • Transfer
  • Analysis
  • Presentation

Hadoop


Doug Cutting, Mike Cafarella, and their team took the solution provided by Google and started an open-source project called Hadoop in 2005; Doug named it after his son's toy elephant. Apache Hadoop is now a registered trademark of the Apache Software Foundation.
Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel on different CPU nodes. In short, the Hadoop framework is capable of developing applications that run on clusters of computers and perform complete statistical analysis of huge amounts of data.
Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. An application built on the Hadoop framework works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Hadoop Architecture:


Hadoop framework includes the following four modules:
·        Hadoop Common: These are Java libraries and utilities required by other Hadoop modules. These libraries provide filesystem and OS-level abstractions and contain the necessary Java files and scripts required to start Hadoop.
·        Hadoop YARN: This is a framework for job scheduling and cluster resource management.
·        Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
·        Hadoop MapReduce: This is a YARN-based system for parallel processing of large data sets.
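As a small illustration of how HDFS is configured, a minimal single-node setup typically points the default file system at a local namenode in core-site.xml (the property name below is the Hadoop 2.x one, and the host and port are only illustrative):
<configuration>
   <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:9000</value>
   </property>
</configuration>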

MapReduce

Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
The term MapReduce actually refers to the following two different tasks that Hadoop programs perform:
·        The Map Task: This is the first task, which takes input data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
·        The Reduce Task: This task takes the output from a map task as input and combines those data tuples into a smaller set of tuples. The reduce task is always performed after the map task.
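To make these two tasks concrete, given below is a minimal sketch of the classic WordCount job written against the org.apache.hadoop.mapreduce API (assuming Hadoop 2.x or later; the class names are illustrative). The map task emits a (word, 1) pair for every word in its input split, and the reduce task sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map task: emit a (word, 1) pair for every word in the input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce task: sum the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // combine locally before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}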

How Does Hadoop Work?

Stage 1
A user/application can submit a job to Hadoop (via a Hadoop job client) for the required process by specifying the following items:
1.     The location of the input and output files in the distributed file system.
2.     The Java classes, in the form of a JAR file, containing the implementation of the map and reduce functions.
3.     The job configuration, set through different parameters specific to the job.
Stage 2
The Hadoop job client then submits the job (JAR/executable, etc.) and the configuration to the JobTracker, which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks, monitoring them, and providing status and diagnostic information to the job client.
Stage 3
The TaskTrackers on different nodes execute the tasks as per the MapReduce implementation, and the output of the reduce function is stored in output files on the file system.
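For example, assuming the WordCount sketch above has been packaged into a JAR file (the JAR name and the HDFS paths here are hypothetical), the job client would submit it as follows:
$ $HADOOP_HOME/bin/hadoop jar wordcount.jar WordCount /user/input /user/output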

Advantages of Hadoop

·        Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines and, in turn, utilizes the underlying parallelism of the CPU cores.
·        Hadoop does not rely on hardware to provide fault-tolerance and high availability (FTHA); rather, the Hadoop library itself has been designed to detect and handle failures at the application layer.
·    Servers can be added or removed from the cluster dynamically and Hadoop continues to operate without interruption.
·        Another big advantage of Hadoop is that, apart from being open source, it is compatible with all platforms, since it is Java-based.
Features of HDFS
  • It is suitable for distributed storage and processing.
  • Hadoop provides a command interface to interact with HDFS.
  • The built-in servers of the namenode and datanode help users easily check the status of the cluster.
  • It provides streaming access to file system data.
  • HDFS provides file permissions and authentication.

Starting HDFS

Initially, you have to format the configured HDFS file system, open the namenode (HDFS server), and execute the following command.
$ hadoop namenode -format 
After formatting the HDFS, start the distributed file system. The following command will start the namenode as well as the data nodes as a cluster.
$ start-dfs.sh 
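Note that on newer Hadoop releases (2.x and later), the same two steps are typically performed with the hdfs script and the sbin directory:
$ hdfs namenode -format
$ $HADOOP_HOME/sbin/start-dfs.sh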

Listing Files in HDFS

After loading the information into the server, we can find the list of files in a directory, or the status of a file, using ‘ls’. Given below is the syntax of ls; you can pass it a directory or a filename as an argument.
$ $HADOOP_HOME/bin/hadoop fs -ls <args>
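For example, to list the contents of a directory (the path below is only illustrative):
$ $HADOOP_HOME/bin/hadoop fs -ls /user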

Inserting Data into HDFS

Assume we have data in a file called file.txt on the local system that ought to be saved in the HDFS file system. Follow the steps given below to insert the required file into the Hadoop file system.
Step 1
You have to create an input directory.
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input 
Step 2
Transfer and store a data file from the local system to the Hadoop file system using the put command.
$ $HADOOP_HOME/bin/hadoop fs -put /home/file.txt /user/input 
Step 3
You can verify the file using the ls command.
$ $HADOOP_HOME/bin/hadoop fs -ls /user/input 
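Note that put behaves like copyFromLocal, which simply restricts the source to the local file system, so the same transfer can also be written as:
$ $HADOOP_HOME/bin/hadoop fs -copyFromLocal /home/file.txt /user/input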

Retrieving Data from HDFS

Assume we have a file in HDFS called outfile. Given below is a simple demonstration of retrieving the required file from the Hadoop file system.
Step 1
Initially, view the data from HDFS using the cat command.
$ $HADOOP_HOME/bin/hadoop fs -cat /user/output/outfile 
Step 2
Get the file from HDFS to the local file system using the get command.
$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/ 
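Similarly, copyToLocal behaves like get, restricting the destination to the local file system:
$ $HADOOP_HOME/bin/hadoop fs -copyToLocal /user/output/outfile /home/hadoop_tp/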

Shutting Down the HDFS

You can shut down the HDFS by using the following command.
$ stop-dfs.sh 

What is MapReduce?

MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce task then takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the name MapReduce implies, the reduce task is always performed after the map job.
        Hadoop streaming is a utility that comes with the Hadoop distribution. This utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.
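For example, an invocation might look like the following (the streaming JAR path varies between Hadoop releases, and the mapper and reducer here are standard Unix utilities used purely for illustration):
$ $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/input \
    -output /user/stream_output \
    -mapper /bin/cat \
    -reducer /usr/bin/wc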
