News: Article: Why Should You Care About MapReduce

  1. Article: Why Should You Care About MapReduce (24 messages)

    MapReduce is a distributed programming model intended for processing massive amounts of data in large clusters, developed by Jeffrey Dean and Sanjay Ghemawat at Google. MapReduce is implemented as two functions: Map, which applies a function to all the members of a collection and returns a list of results based on that processing, and Reduce, which collates and resolves the results from two or more Maps executed in parallel by multiple threads, processors, or stand-alone systems. Both Map() and Reduce() may run in parallel, though not necessarily in the same system at the same time. This article by Eugene Ciurana explains the MapReduce concept in enough detail that implementers can apply it to their own requirements. Read the article. Other topics relevant to this may be found in the Scalability Knowledge Center.
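    To make the two phases concrete, here is a minimal single-JVM sketch in Java; the class and method names are illustrative only, and in a real deployment the map calls would be fanned out across many machines rather than run in one process.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.stream.Collectors;

// Minimal sketch of the two MapReduce phases on a single JVM.
public class MapReduceSketch {

    // Map: apply a function to every member of a collection.
    static <A, B> List<B> map(List<A> input, Function<A, B> fn) {
        return input.stream().map(fn).collect(Collectors.toList());
    }

    // Reduce: collate the partial results into a single answer.
    static <B> B reduce(List<B> partials, B identity, BinaryOperator<B> combiner) {
        return partials.stream().reduce(identity, combiner);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("map", "reduce", "grid");
        List<Integer> lengths = map(words, String::length); // map phase
        int total = reduce(lengths, 0, Integer::sum);        // reduce phase
        System.out.println(total);                            // prints 13
    }
}
```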

    Threaded Messages (24)

  2. MapReduce vs. OLAP

    It would be interesting to compare the speed and cost efficiency of MapReduce vs. OLAP systems. Granted, OLAP certainly requires more "planning", i.e., you know ahead of time which ways you will want to slice your data, but this is often the case, no?
  3. Re: MapReduce vs. OLAP

    It would be interesting to compare the speed and cost efficiency of MapReduce vs. OLAP systems. Granted, OLAP certainly requires more "planning", i.e., you know ahead of time which ways you will want to slice your data, but this is often the case, no?
    For some reason there's a tendency to look at MapReduce as a competitor to other technologies. It's not really a comprehensive solution to any one problem; it's an approach to distributing work across many systems. I'm not sure it's correct to consider MapReduce to be at odds with OLAP. It may be possible to use MapReduce as part of an OLAP implementation. For example, one of the common uses of MapReduce is to index unstructured data (like the Web).

    Personally, I find calling MapReduce 'efficient', as the article does, a little odd. The approach carries a good bit of overhead compared to standard single-machine solutions. The point of MapReduce is that the single-machine approach has a hard limit on capacity. No matter how much money you pour into your one machine, many lesser machines will have more capacity, and generally at a lower cost. This is becoming more and more important as we (apparently) approach the end of Moore's law (as extended to processor speeds). MapReduce is also not the only way to spread work across machines, of course.

    I don't think this was done using MapReduce, but I think it illustrates the benefit of this kind of approach: a few years back, a team at a university wanted to create more detailed precipitation (IIRC) maps of the United States. They had the data but needed vast amounts of processing power to make it useful. They put in a request for a supercomputer to the university, but it was rejected. So they went around and scavenged old computers from around the school and installed a Linux kernel and Java on them. They then took their best machine and made it the master; this machine would parcel out the blocks of data to the cluster and consolidate the results. I think they ended up processing all the data faster than they would have with the highest-end supercomputer available at the time, on an extremely low budget.
  4. Re: MapReduce vs. OLAP

    It would be interesting to compare the speed and cost efficiency of MapReduce vs. OLAP systems. Granted, OLAP certainly requires more "planning", i.e., you know ahead of time which ways you will want to slice your data, but this is often the case, no?
    From my understanding of OLAP, any comparison wouldn't be apples to apples. Depending on the kind of OLAP (MOLAP, ROLAP), you still need to have a well-defined model. All OLAP does is optimize multi-dimensional queries so they run efficiently. MapReduce is well suited to unstructured data, which isn't easily organized into a schema. In terms of cost, most of the MOLAP products out there employ bitmap indexes, so the cost is constant in many cases. I don't believe MapReduce can make that kind of guarantee. It's just a way to divide work over data partitions. Peter
  5. Not really the right title

    This seems like a pretty good explanation of what MapReduce is and how it works, but I don't see much about why I or anyone else should care. There's a bullet list at the bottom, but not much depth.
  6. Re: Not really the right title

    Personally, it's the best overview of MapReduce I've seen so far. He can't really tell you how to apply it to your environment because 1) he doesn't know it, and 2) MapReduce simply isn't needed for the vast majority of applications. It's pretty specialised stuff. Arguably, people could break down processing tasks into smaller chunks than they typically do today, but these are usually too small to bother with using something like MapReduce.
  7. Re: Not really the right title

    Personally, it's the best overview of MapReduce I've seen so far.
    I agree.
    He can't really tell you how to apply it to your environment because...
    Then why not title it "An overview of MapReduce"? An analogy would be if I wrote an article titled "Why you should consider buying a hybrid automobile" and filled it with the technical details of how hybrids work. It's not that the article is useless; it's just that the title has little to do with the content.
  8. Re: Not really the right title

    2) MapReduce simply isn't needed for the vast majority of applications. It's pretty specialised stuff. Arguably, people could break down processing tasks into smaller chunks than they typically do today, but these are usually too small to bother with using something like MapReduce.
    [emphasis mine] I disagree that MapReduce is very specialized. It's an algorithm (or maybe an algorithm framework), like a binary search. You can apply it to huge systems like Google's, or you can apply it to keeping the two cores in your laptop busy. Just like a binary search, it may or may not be important that you know how to implement MapReduce, but it will be important that you know how to design an algorithm to take advantage of a MapReduce implementation.
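    To illustrate that point, even a humble word count can be designed in the map/reduce shape so the runtime can spread it over however many cores are available; a rough single-JVM sketch (the class name and sample data are made up):

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

// Word count written in a map/reduce shape; the parallel stream lets the
// fork/join pool keep both laptop cores busy.
public class ParallelWordCount {
    public static void main(String[] args) {
        String text = "the quick brown fox jumps over the lazy dog the end";

        Map<String, Long> counts = Arrays.stream(text.split("\\s+"))
                .parallel()                                        // spread the work over the available cores
                .collect(Collectors.groupingByConcurrent(w -> w,   // "map": group occurrences by word
                        Collectors.counting()));                   // "reduce": sum the count per word

        System.out.println(counts.get("the"));                     // prints 3
    }
}
```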
  9. MapReduce simply isn't needed for the vast majority of applications. It's pretty specialised stuff. Arguably, people could break down processing tasks into smaller chunks than they typically do today, but these are usually too small to bother with using something like MapReduce.
    Well, not when it takes about 20 lines of code to implement Map/Reduce ;-) Or what if it only takes one @Gridify annotation attached to a method to get the job done? Take a look at some of the examples GridGain has here. In reality, MapReduce, when done right, is used quite a lot. We at GridGain (a Map/Reduce-based grid computing framework) see a lot of use cases from our clients on a daily basis. Imagine, for example, that you have a report that takes 20 seconds to generate and you need to put it on the web. You, of course, can't make your customers wait 20 seconds for a page to show. In cases like this, a simple and properly designed application of Map/Reduce will allow you to see your results in a matter of maybe 2 to 5 seconds. Or what if you simply need to load-balance a long-running method on the grid? Map/Reduce really helps you split your execution into smaller pieces and achieve much better performance and scalability. Best, Dmitriy Setrakyan GridGain - Grid Computing Made Simple
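    The report example boils down to the same split-and-combine shape. The sketch below uses a plain ExecutorService rather than GridGain's actual API, and computeChunk is a hypothetical stand-in for the expensive part of the report:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: split a slow "report" into independent chunks ("map"),
// run them concurrently, and combine the partial results ("reduce").
public class SplitReport {

    // Hypothetical stand-in for the expensive part of the report.
    static long computeChunk(int from, int to) {
        long sum = 0;
        for (int i = from; i < to; i++) sum += i;
        return sum;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Long>> partials = new ArrayList<>();
        for (int i = 0; i < 4; i++) {                        // "map": one task per chunk
            final int from = i * 250, to = (i + 1) * 250;
            Callable<Long> task = () -> computeChunk(from, to);
            partials.add(pool.submit(task));
        }
        long total = 0;
        for (Future<Long> f : partials) total += f.get();    // "reduce": combine the partial results
        pool.shutdown();
        System.out.println(total);                           // 499500 = 0 + 1 + ... + 999
    }
}
```

    On a grid the chunks would be shipped to other machines instead of other threads, but the split/aggregate structure is the same.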
  10. Given that your product is oriented specifically around grid computing, I would assume that Map/Reduce is something that would certainly be of interest to consumers of your product. :) I'm saying that most web development doesn't involve anything quite that intensive, or when it is intensive, it's for an internal app where a little extra wait isn't a big deal. I'm basing this purely on my experience as a consultant and on talking about projects with others in the industry in the city where I reside. I haven't done a study or anything. :)
  11. I'm saying that most web development doesn't involve anything quite that intensive, or when it is intensive, it's for an internal app where a little extra wait isn't a big deal.
    We have a lot of internal applications (bespoke and off-the-shelf) where the users launch a task and it is processed in the background - anywhere from ~1 minute to maybe ~30 minutes. We have other scenarios where computing is done offline and stored, so the user can't even specify parameters. MapReduce could potentially be beneficial in these scenarios. I say potentially because (1) the problem must be expressible in MapReduce, and more importantly (2) it really has to be compute-bound. If your bottleneck is reading from a normal file system or database, MapReduce probably won't help you.
  12. Hi Mark, I would agree that as long as your web app can reasonably scale and run (both processing- and data-storage-wise) on a single box, you don't need grid computing. And of course many web apps do run sufficiently well on a single box (no sarcasm intended). Best, Nikita Ivanov. GridGain – Grid Computing Made Simple
  13. Ok, again why should I care?

    Well, I'm not sure where you made the point of why I should care. Maybe I missed it, but if you have a title to that effect, it would seem useful to make that point. Mapping and reducing isn't something I use frequently, especially in the same manner that Google uses it. On multi-threading, I admit there is a good bit of work that should be done in that area. Sun added some basic features in JDK 1.5/1.6, but it's just basic stuff... like what .NET had from the start when it launched. What would be useful to me is a clean model for multi-threading a work assignment and tracking its progress. Or maybe a set of classes in the JDK to try a code block, force-kill it if it hangs, and retry it a given number of times. Really, all this should be in the JDK, but like file copy, it isn't. Oh well.
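    For what it's worth, the "try it, kill it if it hangs, retry it N times" helper can be approximated today with ExecutorService and Future, with the caveat that Java can only interrupt a hung task rather than truly force-kill it. A rough sketch (all names hypothetical):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of a "run it, cancel it on timeout, retry N times" helper built from
// the existing concurrency utilities.
public class RetryWithTimeout {

    static <T> T callWithRetry(Callable<T> task, long timeoutSeconds, int attempts) throws Exception {
        ExecutorService pool = Executors.newCachedThreadPool(); // a fresh thread per attempt
        try {
            for (int attempt = 1; attempt <= attempts; attempt++) {
                Future<T> future = pool.submit(task);
                try {
                    return future.get(timeoutSeconds, TimeUnit.SECONDS); // wait, then give up on this attempt
                } catch (TimeoutException e) {
                    future.cancel(true); // interrupts the hung attempt; it must honour interruption to stop
                }
            }
            throw new TimeoutException("gave up after " + attempts + " attempts");
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(callWithRetry(() -> "done", 5, 3));
    }
}
```

    A task that ignores interruption will keep running in the background, which is exactly the gap in the JDK the poster is pointing at.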
  14. Follow Up Article?

    Greetings. Thanks for all your comments. I'm working now on a follow-up article that will have a simple but non-trivial MapReduce implementation. There is tremendous interest in this topic and I would like to expand it further. As Mark Stock mentioned in his posting, the application of MapReduce depends on your problem domain. I'm thinking of writing a simple set of mappers, reducers, and a master, separating each component and the communication subsystems, then carrying out the implementation using Terracotta and Mule for data sharing and I/O, as suggested at the end of the current article. If that works, I plan a third follow-up about Hadoop or a commercial vendor that discusses the advantages and disadvantages of building your own MapReduce infrastructure vs. using a one-size-fits-all approach. What areas of MapReduce would you like to see explored? Do you want to see this in Java? Groovy? A mixed environment? What kinds of data would you process? Is there a problem domain that you suggest? Thanks and cheers, Eugene The Tesla Testament - the most exciting novel of the decade.
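    Purely as a guess at the component separation described above (not taken from the forthcoming article), the mapper, reducer, master, and communication roles might be split along lines like these; all interface names are hypothetical:

```java
import java.util.List;
import java.util.Map;

// Hypothetical interfaces only: a guess at how the mapper, reducer, and master
// roles might be kept separate from the communication layer. Terracotta, Mule,
// or anything else would plug in behind Transport.
interface Mapper<I, K, V> {
    Map<K, V> map(I input);                        // process one chunk, emit intermediate pairs
}

interface Reducer<K, V, O> {
    O reduce(K key, List<V> intermediateValues);   // collate the values emitted for one key
}

interface Transport<K, V> {
    void publish(Map<K, V> intermediate);          // hand partial results to the shared store / bus
    Map<K, List<V>> collect();                     // gather everything once the mappers are done
}

interface Master<I> {
    List<I> split(I wholeInput);                   // partition the work
    void assign(List<I> chunks);                   // farm the chunks out to the mappers
}
```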
  15. Re: Follow Up Article?

    Hi Eugene, One suggestion is to use some well-known problem in grid computing so that other solutions can be compared somehow. One such problem is finding the number of primes in a given range. This is an embarrassingly parallel problem that is trivial enough to fit into the article as a whole, and yet almost every grid computing project already has an example like that built (GigaSpaces and GridGain have them, for example). Since this is primarily about promoting Terracotta, I would be very interested to see how such a solution would compare to other frameworks such as Hadoop, Globus, GridGain, GigaSpaces, and JPPF, to name a few. Once again, having a well-known problem solved helps a lot in providing a less biased comparison. Best, Nikita Ivanov. GridGain – Grid Computing Made Simple
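    For reference, the prime-counting benchmark is easy to express in the map/reduce shape even on a single JVM: the range splits into independent sub-ranges, and the per-range counts simply add up. A sketch, not taken from any of the frameworks mentioned:

```java
import java.util.stream.LongStream;

// Count the primes in a range: an embarrassingly parallel problem where each
// sub-range is checked independently ("map") and the counts are summed ("reduce").
public class PrimeCount {

    static boolean isPrime(long n) {
        if (n < 2) return false;
        for (long d = 2; d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        long count = LongStream.rangeClosed(2, 1_000_000)
                .parallel()                      // workers each take a slice of the range
                .filter(PrimeCount::isPrime)
                .count();                        // partial counts are combined
        System.out.println(count);               // 78498 primes up to 1,000,000
    }
}
```

    On a grid, each node would take one sub-range instead of one stream slice, which is why the problem works so well as a cross-framework benchmark.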
  16. Re: Follow Up Article?

    I'm thinking of writing a simple set of mappers, reducers, and a master, separating each component and the communication subsystems, then carrying out the implementation using Terracotta and Mule for data sharing and I/O, as suggested at the end of the current article.
    Very interesting, Eugene. I would love to talk more about what drove you to use Terracotta in this way. Our own Master/Worker framework doesn't do exactly this, and I think this might be better in some ways. I would love to help out on a 2nd-pass article focused on building out particular use cases for MapReduce. Very kewl, indeed. --Ari
  17. I think this is a new name for the old tuple-based programming model developed by David Gelernter at Yale and implemented in his Linda programming language. This was later adopted by Sun in the JavaSpaces API. In fact, I see MapReduce as just a subset of the tuple-based model, and more specifically very related to the Master-Worker design pattern. We've been doing this for ages with Java, Jini, and JavaSpaces. There are impressive commercial and open-source implementations of it (e.g. GigaSpaces, Blitz, IBM's). I've seen implementations of tuple-based programming in other languages as well; just google it.
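    To make the resemblance concrete, here is a toy master/worker loop in the spirit of the tuple-space model, with a BlockingQueue standing in for the space; a real JavaSpaces solution would write and take entries from a JavaSpace instead, and all names here are made up:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Toy master/worker loop: the master writes task "tuples" into a shared space,
// a worker takes them, and result "tuples" flow back. A BlockingQueue stands in
// for the tuple space here.
public class TupleStyleMasterWorker {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> tasks = new LinkedBlockingQueue<>();
        BlockingQueue<Integer> results = new LinkedBlockingQueue<>();

        // Worker: take a task, process it, write the result back.
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    int n = tasks.take();
                    if (n < 0) break;            // poison pill ends the worker
                    results.put(n * n);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();

        // Master: write the tasks, then collect and consolidate the results.
        for (int i = 1; i <= 5; i++) tasks.put(i);
        tasks.put(-1);
        int sum = 0;
        for (int i = 0; i < 5; i++) sum += results.take();
        worker.join();
        System.out.println(sum);                 // 1 + 4 + 9 + 16 + 25 = 55
    }
}
```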
  18. Alejandro, I think (pretty sure, actually) that the main thrust of Map/Reduce (a.k.a. Split/Aggregate, or BCAST/REDUCE in MPI speak) is the ability to split processing logic and co-locate it with the data it is going to process. Tuple or space considerations are really secondary or implementation-specific to the Map/Reduce idea. Best, Nikita Ivanov. GridGain - Grid Computing Made Simple
  19. Tuples?

    Alejandro, Reading the article, I was also reminded of tuples/Linda/JavaSpaces with regard to the way Terracotta was used as remote shared "memory". It seems to me that tuples and "remote shared memory" all implement a form of the Blackboard pattern. Cheers, Joost
  20. Nice introduction. IIRC from my years in college, this MapReduce technique is the equivalent of Haskell's "map" and "foldr" functions on lists. I think that concurrent and parallel implementations of those functions have surely been studied thoroughly by now (from an academic perspective, at least).
  21. advertorial?

    From reading the article I get the sense that it has been paid for by Terracotta. There's nothing wrong with that, but IMO it behooves TheServerSide to let the readers make an informed judgement about the form of bias that is involved in an article. Personally, I'd be interested in the question of how this could be implemented with a database and a simple locking mechanism as shared memory.
  22. Re: advertorial?

    From reading the article I get the sense that it has been paid for by Terracotta. There's nothing wrong with that, but IMO it behooves TheServerSide to let the readers make an informed judgement about the form of bias that is involved in an article.

    Personally, I'd be interested in the question of how this could be implemented with a database and a simple locking mechanism as shared memory.
    Funny, and here I thought it was paid for by Mule. As for the DB, are you suggesting that the DB can work as shared memory? That we should scale out using the DB? Interesting. --Ari
  23. Re: advertorial?

    From reading the article I get the sense that it has been paid for by Terracotta. There's nothing wrong with that, but IMO it behooves TheServerSide to let the readers make an informed judgement about the form of bias that is involved in an article.

    Personally, I'd be interested in the question of how this could be implemented with a database and a simple locking mechanism as shared memory.
    The article was paid for by TheServerSide.com, where I act as a contributing editor and occasional speaker at the TechTarget conferences. If you google the articles and news posts I produce for TSS, you'll see they give me a pretty wide range of topics to discuss, from the iPhone to GWT to MapReduce to case studies. The choice of Terracotta came from two angles:

    1. I am familiar with the technology, and I wanted to write about an implementation that would be done in-house, much like Google's, not one that uses a pre-canned technology like Hadoop or GridGain. That's for a potential third article on this topic.

    2. There are two different ways of passing data around in MapReduce: signaling and data sharing. My first draft implementation uses threads and a common memory pool for the intermediate results' storage. I'm lazy. If I can use the same data structure for sharing across different instances of the program or different systems, why not? Terracotta would let me do that. So would GigaSpaces. The Terracotta guys are local to me, and I used to work with a bunch of them. My biases come from having shared sushi with them, not from a plug for their tech.

    You could s/Terracotta/Mule/ for the selection of the signaling bus in this hypothetical implementation and apply the same reasoning. These are just technologies I'm familiar with, developed by people whom I know, respect, and have learned things from.

    I asked in the article what people would like to see in the follow-up articles I'm drafting on this topic. The feedback from that is going to enrich the follow-ups. I'm looking into your suggestion for databases and simple locking, for example.

    Cheers! Eugene
    Imagine a cross between Indiana Jones and James Bond: The Tesla Testament. Available from all major booksellers worldwide.
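    Not Eugene's code, but a guess at the "threads plus a common memory pool" shape described in point 2: mapper threads publish intermediate counts into one shared ConcurrentHashMap, and the reduce step reads the pool once they finish. With something like Terracotta, the same map could in principle be shared across JVMs rather than just threads.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the data-sharing style: threads write intermediate results into a
// common memory pool, and the reduce step consolidates from that pool.
public class SharedPoolDraft {
    static final Map<String, Integer> intermediate = new ConcurrentHashMap<>();

    static void mapChunk(List<String> words) {
        for (String w : words) {
            intermediate.merge(w, 1, Integer::sum);   // publish into the shared pool
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> mapChunk(List.of("map", "reduce", "map")));
        Thread t2 = new Thread(() -> mapChunk(List.of("reduce", "map")));
        t1.start(); t2.start();
        t1.join(); t2.join();

        // "Reduce": read the consolidated results out of the shared pool.
        System.out.println(intermediate);             // map -> 3, reduce -> 2 (order not guaranteed)
    }
}
```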
  24. Re: advertorial?

    Thanks for clearing that up.
  25. Re: advertorial?

    I wanted to write about an implementation that would be done in-house, much like Google's, not one that uses a pre-canned technology like Hadoop or GridGain.
    Ok, I got it now too. So the idea was how to implement something like Hadoop or GridGain or GigaSpaces using Terracotta. That indeed sounds like an interesting exercise. Best, Nikita Ivanov. GridGain - Grid Computing Made Simple