This Hadoop tutorial provides thorough introduction of Hadoop. The tutorial covers what is Hadoop, what is the need of Hadoop, why hadoop is most popular, Hadoop Architecture, data flow, Hadoop daemons, different flavours, introduction of Hadoop componenets like hdfs, MapReduce, Yarn, etc.
Hadoop is an open source tool from the ASF – Apache Software Foundation. Open source project means it is freely available and even its source code can be changed as per the requirements. If certain functionality does not fulfill your requirement, you can change it according to your need. Most of Hadoop code is written by Yahoo, IBM, Facebook, Cloudera.
It provides an efficient framework for running jobs on multiple nodes of clusters. Cluster means a group of systems connected via LAN. Hadoop provides parallel processing of data as it works on multiple machines simultaneously.
It is inspired by Google, which has written a paper about the technologies it is using like Map-Reduce programming model as well as its file system (GFS). Hadoop was originally written for the Nutch search engine project when Doug cutting and his team were working on it but very soon, it became a top-level project due to its huge popularity.
Hadoop is an open source framework which is written in Java. But this does not mean you can code only in Java. You can code in C, C++, perl, python, ruby etc. You can code in any language but it is recommended to code in java as you will have lower level control of the code.
It efficiently processes large volumes of data on a cluster of commodity hardware. Hadoop is developed for processing of huge volume of data. Commodity hardware are the low end hardware, they are cheap devices which are very economic. So hadoop is very economic.
Hadoop can be setup on single machine (pseudo distributed mode), but real power of Hadoop comes with a cluster of machines, it can be scaled to thousand nodes on the fly ie, without any downtime. We need not make any system down to add more systems in the cluster.
Hadoop consists of three key parts – Hadoop Distributed File System (HDFS), Map-Reduce and YARN. HDFS is the storage layer, Map Reduce is the processing layer and YARN is the resource management layer.
Let us now understand why Hadoop is very popular, why Hadoop has captured more than 90% of big data market.
Hadoop is not only a storage-system but is a platform for data storage as well as processing. It is scalable (more nodes can be added on the fly), Fault tolerant (Even if nodes go down, data can be processed by other node) and Open source (can modify the source code if required).