Cascading: Simplifying Hadoop MapReduce Applications



  1. Cascading is an API for creating complex, fault-tolerant data processing workflows on top of Hadoop. It abstracts away the cluster topology and configuration, enabling rapid development of complex, distributed applications without having to "think" in MapReduce. Cascading currently relies on Hadoop to provide the storage and execution infrastructure, but the Cascading API insulates developers from the particulars of Hadoop, leaving room for Cascading to target different computational frameworks without changes to existing processing workflow definitions. Cascading uses a 'pipe and filter' model for defining data processes, and it efficiently supports splits, joins, grouping, and sorting; these are the only processing concepts the developer needs to think about. Nathan Marz shows an example in Goodbye MapReduce, Hello Cascading. A technical overview of the Cascading system explains the overall architecture and implementation details.
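The pipe-and-filter model described above can be sketched with the classic word-count assembly from the Cascading 1.x API. The HDFS paths and the word-splitting regular expression below are illustrative assumptions, not taken from the article:

```java
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;

public class WordCount {
  public static void main(String[] args) {
    // Source and sink taps bind the assembly to storage (paths are hypothetical)
    Tap source = new Hfs(new TextLine(new Fields("line")), "hdfs:/input/docs");
    Tap sink = new Hfs(new TextLine(), "hdfs:/output/wordcount", SinkMode.REPLACE);

    // Pipe assembly: emit each whitespace-delimited token as a "word" tuple,
    // group by word, then count each group -- no mappers or reducers in sight
    Pipe assembly = new Pipe("wordcount");
    assembly = new Each(assembly, new Fields("line"),
        new RegexGenerator(new Fields("word"), "\\S+"));
    assembly = new GroupBy(assembly, new Fields("word"));
    assembly = new Every(assembly, new Count(new Fields("count")));

    // The planner translates the assembly into one or more MapReduce jobs
    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass(properties, WordCount.class);
    Flow flow = new FlowConnector(properties)
        .connect("wordcount", source, sink, assembly);
    flow.complete();
  }
}
```

Note that the developer only composes pipes; the translation into MapReduce jobs is left entirely to Cascading's planner. Running this requires the Cascading and Hadoop jars on the classpath and a reachable HDFS.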
  2. We need something like this to simplify MapReduce on data grids. The next project...
  3. I just accidentally blogged about it today... Regards, Nikita Ivanov. GridGain - Grid Computing Made Simple
  4. Cascading is general purpose

    Interesting comments on your linked post, Nikita. A couple of quick points. Cascading is intended to be a general-purpose compute model that will span multiple compute applications. It won't always be a perfect fit, but for specific workload types it will be a huge benefit. I'm not a GridGain user, but I do recognize there are applications best served by more reactive systems than what Hadoop can provide. Still, as I understand GridGain, Coherence, and other data-grid technologies, considering the cost/benefit ratio I probably wouldn't stuff a petabyte of data into them for both scheduled and ad-hoc analysis. For near-real-time responses from a subset of hot data, then yes.

    The bottom line is this: no one tool is a silver bullet, so a mix of applications will be necessary to solve most problems. Providing an easy way for them to integrate and to participate in the same user-level workflows and automation is a major win for all parties involved. This is where Cascading aims to help.

    Also, saying Hadoop and Cascading are for ETL would be missing the big picture. When you take away the data warehouse (and the associated RDBMS databases) and are enabled to perform complex analysis and data mining unrestrained by them, the concept of ETL goes away. Schema-restricted datastores (RDBMSs and data warehouses) are simply caches; ETL exists only to load the cache. Hadoop is an unrestrained platform for performing data analysis (data mining, machine learning, data cleansing, etc.) on raw data. It starts to make very little sense to stuff data into a schema when you can lazily resolve the data into a 'view' for use by other processes. Where performance is necessary, caching of the 'views' becomes an attribute or a switch, not an architectural component of the system. I touch on this concept in Cascading and Hive.

    Look forward to your and Victoria's presentations at the Cloud Computing Group meetup.
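The "lazily resolved view" idea above might be sketched, under assumed paths and field names that are not from the post, as a Cascading flow that joins two raw, schema-less datasets and writes the joined result out as a reusable view:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.regex.RegexSplitter;
import cascading.pipe.CoGroup;
import cascading.pipe.Each;
import cascading.pipe.Pipe;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;

public class UserEventView {
  public static void main(String[] args) {
    // Raw inputs: tab-delimited text files, no schema imposed up front
    Tap usersTap = new Hfs(new TextLine(new Fields("line")), "hdfs:/raw/users");
    Tap eventsTap = new Hfs(new TextLine(new Fields("line")), "hdfs:/raw/events");
    Tap viewTap = new Hfs(new TextLine(), "hdfs:/views/user-events", SinkMode.REPLACE);

    // Resolve structure lazily, at read time, by splitting each line into fields
    Pipe users = new Each(new Pipe("users"), new Fields("line"),
        new RegexSplitter(new Fields("userId", "name"), "\\t"));
    Pipe events = new Each(new Pipe("events"), new Fields("line"),
        new RegexSplitter(new Fields("eventUserId", "event"), "\\t"));

    // Join the two streams on the user id to materialize the 'view'
    Pipe view = new CoGroup(users, new Fields("userId"),
        events, new Fields("eventUserId"));

    Map<String, Tap> sources = new HashMap<String, Tap>();
    sources.put("users", usersTap);
    sources.put("events", eventsTap);

    Flow flow = new FlowConnector(new Properties())
        .connect(sources, viewTap, view);
    flow.complete();
  }
}
```

Whether the resulting view is cached (kept on HDFS and reused) or recomputed per run is then just a scheduling decision, not a fixed piece of the architecture, which is the point being made above.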
Oh, and Billy, I thought you were already doing something like this for IBM? heh chris