Are you ready to take advantage of the multi-application functionality of Hadoop with MapReduce 2.0, or as it is more affectionately known, YARN? According to Arun Murthy, a big-data architect and a co-founder of Hortonworks, YARN substantially improves the functionality and usability of the Hadoop framework by providing a cluster-based resource manager, which makes applications built on Hadoop far more versatile and flexible than before.
By splitting the two responsibilities of the old JobTracker, resource management and job scheduling and monitoring, into separate daemons, YARN also delivers major performance gains in the big data arena, over and above the gains that Hadoop has already brought to the fore in its relatively short lifespan. "It really is the first major architectural generation in Hadoop since the beginning of the project. With YARN, Hadoop moves beyond just a batch processing system with a single MapReduce application. Now it’s a platform where you can run all sorts of apps when it comes to graph processing and streaming data," Murthy said.
Ready, set, KNIT!
According to Murthy, making your applications YARN-ready is straightforward. He recently worked with the Apache Giraph graph processing system and found the process surprisingly easy with the YARN-provided interfaces. When the happy path works, YARN-enabling an application can be as simple as following these three basic steps:
- Submit an application
- Get resources from the cluster
- Use those resources to complete the data processing application
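The three steps above can be sketched with a toy, in-memory model. This is not YARN's API: `ToyResourceManager`, `submit_application`, `allocate`, `release` and `run_word_count` are all hypothetical names invented for illustration, and the "cluster" is just a memory counter. The sketch only shows the shape of the lifecycle: submit, ask for containers, do the work inside whatever was granted.

```python
from collections import Counter


class Container:
    """A slice of cluster memory granted to one application (toy model)."""

    def __init__(self, container_id, memory_mb):
        self.container_id = container_id
        self.memory_mb = memory_mb


class ToyResourceManager:
    """Hypothetical stand-in for a central resource manager, not YARN's API."""

    def __init__(self, total_memory_mb):
        self.free_memory_mb = total_memory_mb
        self._next_app = 1
        self._next_container = 1

    def submit_application(self, name):
        # Step 1: register the application and hand back an identifier.
        app_id = f"app-{self._next_app}"
        self._next_app += 1
        return app_id

    def allocate(self, app_id, count, memory_mb):
        # Step 2: grant up to `count` containers while capacity remains.
        granted = []
        for _ in range(count):
            if self.free_memory_mb < memory_mb:
                break
            self.free_memory_mb -= memory_mb
            granted.append(Container(self._next_container, memory_mb))
            self._next_container += 1
        return granted

    def release(self, containers):
        # Return finished containers' memory to the shared pool.
        for container in containers:
            self.free_memory_mb += container.memory_mb


def run_word_count(rm, lines, containers_wanted=2, memory_mb=512):
    app_id = rm.submit_application("word-count")
    containers = rm.allocate(app_id, containers_wanted, memory_mb)
    if not containers:
        raise RuntimeError("cluster has no free capacity")
    totals = Counter()
    try:
        # Step 3: give each granted container an interleaved share of the input.
        for index, _container in enumerate(containers):
            for line in lines[index::len(containers)]:
                totals.update(line.split())
    finally:
        rm.release(containers)
    return app_id, dict(totals)
```

Calling `run_word_count(ToyResourceManager(1024), ["yarn hadoop", "yarn"])` walks through all three steps and returns the application id together with the word counts. Note that the application adapts to however many containers it was actually granted, rather than assuming it got everything it asked for.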
Interfacing with YARN
According to Murthy, YARN provides a simple interface for submitting new applications to the scheduler. Once the Hadoop framework starts the submitted application, the application uses a second interface to request resources from the cluster's central resource manager. With those resources in hand, you can launch any type of process that might be needed, such as a container service or a Unix process. You can run Java, Python or whatever type of program you want.
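The point that a granted resource can run any kind of program comes down to the fact that a launch is ultimately a command line plus an environment. A minimal local sketch of that idea follows, using Python's standard `subprocess` module as a stand-in for a cluster node. The function name `launch_in_container`, the `TASK_ID` variable and the sample commands are assumptions made for illustration; none of them come from YARN.

```python
import os
import subprocess
import sys


def launch_in_container(command, extra_env=None, timeout_s=30):
    """Run one command as a child process; return (exit_code, stdout).

    Local stand-in for launching work in a granted container. The command
    can be any executable: a Python script, a JVM, a shell utility.
    """
    # Inherit the parent environment, then layer task-specific settings on top.
    env = {**os.environ, **(extra_env or {})}
    completed = subprocess.run(
        command,
        env=env,
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=False,  # report failures through the exit code instead of raising
    )
    return completed.returncode, completed.stdout.strip()


# A sample task: the current Python interpreter running a one-line program.
python_task = [sys.executable, "-c", "print(sum(range(10)))"]

# A task that reads its configuration from the environment it was launched with.
env_task = [sys.executable, "-c", "import os; print(os.environ['TASK_ID'])"]
```

`launch_in_container(python_task)` returns the child's exit code and its captured output, and `launch_in_container(env_task, extra_env={"TASK_ID": "7"})` shows settings being passed through the environment. Swapping in a different executable changes nothing about the launcher, which is the language independence the article describes.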
Once you have it all set up, configured and launched, you are running your new data processing system within Hadoop, with full access to the data in the Hadoop framework through the Hadoop Distributed File System (HDFS). Keeping all your data together is more efficient, and managing resources centrally improves performance. Essentially, YARN takes Hadoop beyond MapReduce and big data, and turns it into a powerful, distributed job-execution tool that can run programs and trigger a variety of processes, regardless of the language they are written in or the platform they are hosted on.
It is indeed a revolutionary update to the revolutionary Hadoop platform.
You can follow Arun Murthy on Twitter: @acmurthy