This content is part of the Essential Guide: Using big data and Hadoop 2: New version enables new applications

YARN, MapReduce 2.0, Hadoop clusters and the Big Data layer cake

MapReduce has matured, and so has Hadoop. Under the umbrella of YARN, the two now work together better than ever to deliver faster and more flexible big data solutions to the enterprise.

As new technology solutions go, Hadoop has made quite a mark in the Big Data world. It emerged at just the right time to help IT get a handle on NoSQL, and has enjoyed burgeoning popularity ever since. Of course, widespread adoption and a sense of maturity bring a demand for continuous improvement. Fortunately, the platform is rising to the occasion with expanded functionality. According to Arun Murthy, co-founder and architect at Hortonworks, Hadoop 2.0 offers a variety of compelling new features and better performance than ever before. At the center of this success is YARN, the macramé of the platform world that lets you tie in as many data processing methods as you like.


A deeper look at YARN

So what exactly is YARN? If you think about Hadoop as a layer cake, the Hadoop Distributed File System (HDFS) is the base layer, providing reliable, redundant storage. MapReduce used to be the next layer, handling both cluster resource management and data processing. Now resource management has been split off from MapReduce into YARN, a separate layer sitting directly on top of HDFS. MapReduce in turn sits on top of YARN and handles only data processing. In fact, MapReduce isn't even a top layer anymore. It's just a candle stuck in the top of the cake, and there's plenty of room for more candles (alternative data processing models) to be inserted into YARN. According to Murthy, "With YARN, what we've done is separate out the system and made it kind of a generic distributed operating system. Now you have an API handling different applications, one of them just happens to be MapReduce."
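
The layer-cake idea can be sketched in a few lines of Python. This is a toy model only, not the real YARN API: `ToyResourceManager`, `submit`, and both job functions are hypothetical names invented for illustration. The point it tries to show is that the resource layer only tracks capacity and does not care which processing model an application uses.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyResourceManager:
    """Hypothetical stand-in for the resource layer; not the real YARN API."""
    total_memory_mb: int
    used_memory_mb: int = 0
    running: Dict[str, int] = field(default_factory=dict)

    def submit(self, app_name: str, memory_mb: int, work: Callable[[], object]) -> object:
        # The resource layer only checks capacity; it does not care
        # which processing model the application implements.
        if self.used_memory_mb + memory_mb > self.total_memory_mb:
            raise RuntimeError(f"not enough memory for {app_name}")
        self.used_memory_mb += memory_mb
        self.running[app_name] = memory_mb
        try:
            return work()
        finally:
            # Release the resources once the application finishes.
            self.used_memory_mb -= memory_mb
            del self.running[app_name]


def word_count_mapreduce(lines: List[str]) -> Dict[str, int]:
    """One 'candle': a tiny map-then-reduce word count."""
    mapped = [(word, 1) for line in lines for word in line.split()]  # map phase
    counts: Dict[str, int] = {}
    for word, one in mapped:  # reduce phase
        counts[word] = counts.get(word, 0) + one
    return counts


def longest_line(lines: List[str]) -> str:
    """Another 'candle': a non-MapReduce job on the same resource layer."""
    return max(lines, key=len)


data = ["yarn manages resources", "mapreduce processes data", "yarn is generic"]
rm = ToyResourceManager(total_memory_mb=4096)
counts = rm.submit("mapreduce-wordcount", 1024, lambda: word_count_mapreduce(data))
longest = rm.submit("other-model", 512, lambda: longest_line(data))
print(counts["yarn"], longest)
```

Both jobs go through the same `submit` call, which is the sense in which MapReduce becomes just one application among many.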

Does YARN Over-Centralize or Further Fragment the Platform?

Fears that Hadoop might become more brittle or more scattered are apparently ill-founded. Fortunately for Hadoop users, this new model isn't monolithic. It's actually more modular than before, and existing users can deploy it for the same use cases as the original Hadoop without disrupting existing processes. The Hortonworks website touts the new agility of MapReduce in this version: "With MapReduce becoming a user-land library, it can evolve independently of the underlying resource manager layer and in a much more agile manner."

What YARN does bring together is the ability to manage all your various data processing applications more effectively from a single command center. Shaun Connolly, VP of Corporate Strategy at Hortonworks, points out the benefits. "It enables the different styles of applications to plug in natively and deeply into the platform as opposed to just running on top and storing data in HDFS. The key point there is if you're able to get the different applications to run natively, there are resources like memory utilization and CPU requirements that can all be handled centrally so the applications can play well together. That's really the key; it converts Hadoop from just a MapReduce batch processing system with HDFS into a multi-application operating platform."


The goal is to stop segregating workloads onto separate clusters. Centralizing with YARN helps avoid or resolve resource contention issues, such as one app dominating and gobbling up resources. With YARN, users don't have to duplicate clusters or set up more than they need just to access the data with their desired applications. Instead they can scale a single cluster out to 10,000+ nodes, all in one place. This tends to make the quality of service more predictable.
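
To make the contention point concrete, here is a minimal sketch of max-min fair sharing, a generic textbook technique for dividing a shared pool among competing applications. It is offered as an illustration only and is not a description of how YARN's own schedulers work; the function name, the app names, and the numbers are all assumptions made up for the example.

```python
from typing import Dict


def max_min_fair_share(capacity: int, demands: Dict[str, int]) -> Dict[str, int]:
    """Split `capacity` units across apps so that no app exceeds its demand
    and a greedy app cannot crowd out smaller ones."""
    allocation = {app: 0 for app in demands}
    unsatisfied = {app for app, need in demands.items() if need > 0}
    remaining = capacity
    while unsatisfied and remaining > 0:
        # Offer every still-hungry app an equal slice of what is left.
        share = remaining // len(unsatisfied)
        if share == 0:
            break  # fewer units left than apps; stop rather than split units
        for app in sorted(unsatisfied):
            grant = min(share, demands[app] - allocation[app])
            allocation[app] += grant
            remaining -= grant
        # Apps whose demand is fully met drop out; leftovers are re-shared.
        unsatisfied = {a for a in unsatisfied if allocation[a] < demands[a]}
    return allocation


# Hypothetical cluster with 100 GB of memory and three competing applications.
demands = {"batch-etl": 90, "streaming": 20, "graph": 30}
result = max_min_fair_share(100, demands)
print(result)
```

In this made-up scenario the small streaming and graph jobs receive everything they asked for, and the large batch job gets whatever is left instead of starving the others.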

Benefits for Big Data

Since the watchword for today's Big Data is analytics, it's not surprising that this is an important selling point for next-gen Hadoop. Batch processing is no longer sufficient to meet enterprise data handling needs. Businesses don't just need snapshots of data; they need a video feed. The generalization of the Hadoop platform allows it to run other types of programming models such as streaming and graph processing, even allowing near real-time analysis.
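
The snapshot-versus-video-feed distinction can be illustrated with a small Python sketch. It is a simplified analogy with invented function names and sample data, not code for any particular Hadoop framework: the batch version returns one result after seeing all the data, while the streaming version yields an updated result as each event arrives.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List


def batch_count(events: List[str]) -> Dict[str, int]:
    """Batch style: wait for the full data set, then produce one snapshot."""
    return dict(Counter(events))


def streaming_count(events: Iterable[str]) -> Iterator[Dict[str, int]]:
    """Streaming style: emit an updated result after every event arrives."""
    running: Counter = Counter()
    for event in events:
        running[event] += 1
        yield dict(running)


# Hypothetical click events, used only for illustration.
clicks = ["home", "cart", "home", "checkout"]
snapshot = batch_count(clicks)
feed = list(streaming_count(clicks))
print(snapshot)
print(feed[1])
```

The last frame of the feed matches the batch snapshot; the difference is that the streaming consumer had a usable answer after every event rather than only at the end.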

It doesn't stop with what's currently on the market, either. YARN is designed with the thread of future development in mind. Enabling new siblings to join MapReduce on the platform can even spur more innovation in the open source community as another wave of applications is built on YARN to take advantage of Big Data. According to the team at Hortonworks, there's really nothing to lose from moving to next-generation Hadoop. As Murthy points out, adopters aren't giving up anything. They're just gaining new capabilities. "You still have MapReduce, but now you get more from your investment in Hadoop."


Join the conversation

1 comment

Great content about MapReduce 2.0 and a really nice introduction to YARN. MapReduce has been the driving force behind the growth of the big data industry. You’ll hear it mentioned often, along with associated technologies such as Hive and Pig. Hadoop brings the ability to cheaply process large amounts of data, regardless of its structure.