
YARN and MapReduce 2.0 elevate big data Hadoop and scheduled processing

YARN represents the biggest architectural change in Hadoop since its inception over seven years ago. Now, Hadoop goes beyond MapReduce to provide scheduled processing while simultaneously processing big data.

Are you ready to take advantage of the multi-application functionality of Hadoop with MapReduce 2.0, or as it is more affectionately known, YARN? According to Arun Murthy, a big-data architect and a co-founder of Hortonworks, YARN substantially improves the functionality and usability of the Hadoop framework by providing a cluster-based resource manager, making applications built on the Hadoop framework far more versatile and flexible than before.

With YARN, Hadoop moves beyond just a batch processing system.

Arun Murthy, Hortonworks co-founder

By splitting the JobTracker's two responsibilities, resource management and job scheduling and monitoring, into separate daemons (a global ResourceManager and a per-application ApplicationMaster), YARN provides major performance gains in the big data arena, over and above the gains Hadoop has already delivered in its relatively short lifespan. "It really is the first major architectural generation in Hadoop since the beginning of the project. With YARN, Hadoop moves beyond just a batch processing system with a single MapReduce application. Now it’s a platform where you can run all sorts of apps when it comes to graph processing and streaming data," Murthy said.

Ready, set, KNIT!

According to Murthy, making your applications YARN-ready is straightforward. He recently worked with the Apache Giraph graph processing system and found the process surprisingly easy with the YARN-provided interfaces. When the happy path works, YARN-enabling your applications can be as simple as following these three basic steps:

  1. Submit an application
  2. Get resources from the cluster
  3. Use those resources to complete the data processing application
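
To make the three steps concrete, here is a minimal sketch in Python. It is a toy, in-memory model and not the real YARN API: the names ToyResourceManager, Container, submit_application, allocate, release and run_word_count are all hypothetical, and memory is the only resource tracked. The point is only to show the division of labor, where a cluster-wide manager hands out capacity and the application decides what to do with it.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    """A slice of cluster capacity granted to one application."""
    container_id: int
    app_id: int
    memory_mb: int


@dataclass
class ToyResourceManager:
    """Hypothetical stand-in for a cluster-wide resource manager.

    It only tracks capacity and hands out containers; it knows nothing
    about what an application does with them.
    """
    total_memory_mb: int
    used_memory_mb: int = 0
    next_app_id: int = 1
    next_container_id: int = 1
    apps: dict = field(default_factory=dict)

    def submit_application(self, name):
        # Step 1: register the application and return its id.
        app_id = self.next_app_id
        self.next_app_id += 1
        self.apps[app_id] = name
        return app_id

    def allocate(self, app_id, count, memory_mb):
        # Step 2: grant as many containers as free capacity allows.
        if app_id not in self.apps:
            raise KeyError(f"unknown application {app_id}")
        granted = []
        for _ in range(count):
            if self.used_memory_mb + memory_mb > self.total_memory_mb:
                break  # cluster is full; the caller gets a partial grant
            granted.append(Container(self.next_container_id, app_id, memory_mb))
            self.next_container_id += 1
            self.used_memory_mb += memory_mb
        return granted

    def release(self, containers):
        # Return capacity to the cluster once the work is done.
        for c in containers:
            self.used_memory_mb -= c.memory_mb


def run_word_count(rm, lines):
    """Per-application logic, playing the role of an application master."""
    app_id = rm.submit_application("word-count")                          # step 1
    containers = rm.allocate(app_id, count=len(lines), memory_mb=1024)    # step 2
    if not containers:
        raise RuntimeError("no capacity available")
    # Step 3: spread the lines over whatever containers were granted.
    counts = {}
    for i, line in enumerate(lines):
        cid = containers[i % len(containers)].container_id
        counts[cid] = counts.get(cid, 0) + len(line.split())
    rm.release(containers)
    return sum(counts.values()), len(containers)


if __name__ == "__main__":
    rm = ToyResourceManager(total_memory_mb=2048)
    total, used = run_word_count(rm, ["big data", "yarn runs apps", "hadoop"])
    print(total, used)
```

In this sketch the application asks for three containers but the toy cluster only has room for two, so the work is spread over a partial grant. A real application would typically keep asking until its request is satisfied; that retry logic is left out here for brevity.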

Interfacing with YARN

According to Murthy, YARN provides a simple interface for submitting new applications to the scheduler. Once the Hadoop framework starts the submitted application, a second interface lets that application request resources from the cluster's central resource manager. With those resources in hand, you can launch any type of process you need, such as a container service or a Unix process. You can run Java, Python or whatever type of program you want.
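
The idea that an allocated resource can host any kind of process can be illustrated with a short Python sketch. This is an assumption-laden stand-in, not how YARN launches containers: launch_in_container is a hypothetical helper that simply starts a child process on the local machine and captures its output, with none of the memory or CPU isolation a real cluster would apply.

```python
import subprocess
import sys


def launch_in_container(command, env=None, timeout=30):
    """Run `command` as a child process and return (exit_code, stdout).

    Hypothetical helper: a real container launch would also enforce
    resource limits, which this sketch does not attempt.
    """
    completed = subprocess.run(
        command,
        capture_output=True,  # collect stdout and stderr instead of printing
        text=True,            # decode output as str rather than bytes
        env=env,              # None means inherit the parent's environment
        timeout=timeout,
        check=False,          # report the exit code instead of raising
    )
    return completed.returncode, completed.stdout.strip()


if __name__ == "__main__":
    # Any executable would do; the Python interpreter is used here only
    # because it is guaranteed to be present when this script runs.
    code, out = launch_in_container(
        [sys.executable, "-c", "print('hello from a container')"]
    )
    print(code, out)
```

Because the command is just a list of arguments, the same helper could start a shell script, a JVM or any other program installed on the node, which is the language-agnostic behavior the article describes.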

Once you have it all set up, configured and launched, you are running your new data processing system within Hadoop, with full access to the data in the Hadoop framework through the Hadoop file system, HDFS. Keeping all your data together is more efficient, and managing resources centrally improves performance. Essentially, YARN takes Hadoop beyond MapReduce and big data, and turns it into a powerful, distributed job-execution tool that can run programs and trigger a variety of processes, regardless of which language they are coded in or which platform they are hosted on.

It is indeed a revolutionary update to the revolutionary Hadoop platform.


Have you used YARN yet? Has it lived up to your expectations? Let us know.

You can follow Arun Murthy on Twitter: @acmurthy

