1. Introduction

Apache Hadoop is a framework designed for the processing of big data sets distributed over large clusters of commodity hardware. Its basic ideas are taken from Google's papers on the Google File System (GFS or GoogleFS) and on MapReduce.

A key advantage of Apache Hadoop is its design for scalability: it is easy to add new hardware to extend an existing cluster in terms of both storage and computation power. In contrast to other solutions, its underlying principles do not assume that the hardware is highly available, but rather accept that single machines can fail and that, in such a case, their work has to be taken over by other machines in the same cluster without any user interaction. This way, huge and reliable clusters can be built without investing in expensive hardware.

The Apache Hadoop project encompasses the following modules:

  • Hadoop Common: Utilities that are used by the other modules.
  • Hadoop Distributed File System (HDFS): A distributed file system similar to the one developed by Google under the name GFS.
  • Hadoop YARN: A framework for job scheduling and cluster resource management, used among others by the MapReduce framework.
  • Hadoop MapReduce: A framework designed to process huge amounts of data in parallel; a minimal example is shown after this list.
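
To give a first impression of the MapReduce programming model, the following sketch shows the classic word-count job written against Hadoop's Java MapReduce API (the class name WordCount and the command-line input/output paths are chosen here purely for illustration): the mapper emits a count of one for every word it reads, and the reducer sums these counts per word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word found in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums up all counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output locations are passed on the command line (illustrative).
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, such a job would typically be submitted to the cluster with a command like hadoop jar wordcount.jar WordCount <input path> <output path>, where both paths refer to directories in HDFS.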

The modules listed above form the core of Apache Hadoop, while the surrounding ecosystem contains a lot of Hadoop-related projects such as Avro, HBase, Hive, or Spark.
