Discussions

News: MapReduce Part II

  1. MapReduce Part II (17 messages)

    MapReduce is a distributed programming model intended for parallel processing of massive amounts of data. Eugene Ciurana's MapReduce Part II article describes a MapReduce implementation built with off-the-shelf, open-source software components. The article showcases MapReduce and guides you through how to write resource-oriented applications for the Mule integration platform as a side effect of its implementation. The article compares this approach with other MapReduce frameworks and distributed processing techniques, highlighting its time-to-market, operational and application development advantages.
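
    For readers new to the model, the map and reduce phases can be sketched in a few lines. This is a minimal single-process illustration, not the article's Mule-based implementation. The function names (`map_phase`, `shuffle`, `reduce_phase`, `map_reduce`) and the word-count task are assumptions chosen for illustration, and threads stand in for the separate nodes a real deployment would use.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Mapper: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group intermediate values by key between the map and reduce phases."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single result."""
    return key, sum(values)


def map_reduce(chunks, workers=4):
    # Mappers are independent of each other, so they can run in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    chunks = ["the mule is brown", "the grid is fast", "the mule is fast"]
    print(map_reduce(chunks))
```

    Everything a production framework adds (distribution, fault tolerance, scheduling) sits around these three steps; the steps themselves stay this small.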

    Threaded Messages (17)

  2. A case of masochism

    I am a bit shocked after reading this article. I consider myself an expert in distributed computing, and MapReduce in particular, but even I have a hard time getting through the tons of diagrams and code samples in this article.

    In this day and age, I consider it masochistic to implement MapReduce from scratch. Distributed programming is hard, and compute grid products such as GridGain provide an elegant MapReduce API, fault tolerance, on-demand scalability, load balancing, scheduling and job collision resolution, peer class loading and deployment, node discovery and communication, etc., right out of the box.

    This article begs the question: why not download GridGain and rewrite the whole thing in a page or two?

    Best,
    Dmitriy Setrakyan
    GridGain - Grid Computing Made Simple
  3. Re: A case of masochism

    In this day and age, I consider it masochistic to implement MapReduce from scratch. Distributed programming is hard, and compute grid products such as GridGain provide an elegant MapReduce API, fault tolerance, on-demand scalability, load balancing, scheduling and job collision resolution, peer class loading and deployment, node discovery and communication, etc., right out of the box.

    This article begs the question: why not download GridGain and rewrite the whole thing in a page or two?

    Best,
    Dmitriy Setrakyan
    GridGain - Grid Computing Made Simple

    The purpose of the article is to show how you'd do the same implementation using software that you're likely to have in production already. The explanations were about Mule, not about MapReduce. The actual implementation is very simple; it's significantly simpler than Hadoop, and possibly on par with GridGain, if not simpler. It's easier to sell a framework in production based on things that you already have in-house than to introduce new things; that's all. Plus, this way you can see how all the magic happens.

    As for fault-tolerance, on-demand scalability, load balancing, etc. those are already built into the Mule backbones I have in production. All I have to do is deploy the mapping/reduction .jar file to each and define an entry point for the sample data. Now... peer class loading and deployment and node discovery sound very interesting (communication is a given with what I have, though). Those I'd like to explore in further detail.

    Last, the article explains how the magic portions of MapReduce work, i.e. the kind of stuff that GridGain or Hadoop implement, as well as the application itself. If I take as a given that the end user already knows that, then this is also a one- or two-page deal.

    So... I'm planning a future article about dedicated grid computing networks; Hadoop is part of it... GridGain could be as well, and have a mano-a-mano. I'll let you know how that works out.

    Cheers,
    E
    MapReduce Made Simple
  4. Re: A case of masochism

    I'm planning a future article about dedicated grid computing networks; Hadoop is part of it... GridGain could be as well, and have a mano-a-mano. I'll let you know how that works out.

    I think a good comparison would be great and very beneficial to the community. There are significant differences between the products and they are hardly competing, but there is a small intersection on the MapReduce paradigm that could be worth comparing (I blogged about it here). Also, GridGain already comes with Mule discovery and communication (it can basically run on top of Mule), so it should be a matter of configuration for you.

    Best,
    Dmitriy Setrakyan
    GridGain - Grid Computing Made Simple
  5. Misleading...

    As for fault-tolerance, on-demand scalability, load balancing, etc. those are already built into the Mule backbones I have in production. All I have to do is deploy the mapping/reduction .jar file to each and define an entry point for the sample data.

    I've seen this train of thought literally 3-4 times from our clients... "We already have it!" – they usually proclaim. However, once we start digging into what they “have”, it turns out that fault tolerance, load balancing, etc. mean completely, 100% different things in messaging middleware like Mule (or plain JMS, as it was in a couple of cases) than in grid middleware.

    A good analogy for why this is misleading is saying “Well, we use TCP/IP, which is a reliable network protocol, so our JMS implementation automatically supports guaranteed delivery, transactions, pub/sub, etc.” See the fallacy :)

    Btw, great effort in popularizing grid computing… I wish I had more time for something like that.

    Best,
    Nikita Ivanov.
    GridGain – Grid Computing Made Simple
  6. I smell jealousy

    @Nikita and @Dmitriy: It's kind of hard to take your non-constructive comments seriously when they come from a "grid-stack" vendor that hawks its wares along with support/consulting services - commercial enterprise perhaps? The article was a great look into how a lot of this works at the level of its basic components and was not meant to compete with or threaten your model. It was educational in nature and I applaud the amount of work that went into it.

    Sorry gents, but I think a more constructive approach to commenting on the article would have been better received. We can punch holes in anything, including GridGain, which has its "issues" too. It's mighty easy to point out flaws as opposed to praising what someone was trying to accomplish. Food for thought, gentlemen?
  7. Re: I smell jealousy

    @Jeff, thanks for your comments. I think you got to the core of the article: this is about explaining how a MapReduce system could work and how to deploy it quickly if the infrastructure is already in place. I mentioned in the article that the biggest barrier to entry for Hadoop/GridGain/whatever will be adoption into the existing infrastructure. No system will come through the door without investing evaluation time. Building a working MapReduce system from existing components shifts the focus to dev/QA/deployment by eliminating the 90-180 day sales cycle and POC.

    Cheers,
    Eugene
    MapReduce Made From Scratch - Daily!
  8. Manual Sunset :)

    This is the saying we have in Russia (“Manual Sunset”) for something that... should not be done manually. It fits the article perfectly, in my opinion.

    @Jeff: I don't praise or critique something automatically. This is a technology-oriented forum and not a friendly buddy get-together club. I prefer quick and raw responses over politically correct garbage. At least this way readers will learn something... so, grow a thicker skin. I guarantee that you have learned at least two points of view so far: mine and Eugene's.

    I maintain my point that this article doesn't have any significant educational value or anything else except for showing off that Eugene knows Mule and remembers Awk. “Designs” like that (which is a ridiculous over-complication) are the reason I have to go and speak to user groups and conferences almost twice a month to dispel the very myths they are creating.

    What value does it bring to scotch-tape and chewing-gum together a toy project using something that happened to be lying around the toolbox? It shows that Eugene is a really crafty guy and really knows what he's doing. To the rest of us – dunno. You can build the same thing with GridGain, GigaSpaces, or even Terracotta probably 5-10x quicker and with at least 5x more features.

    And finally, I think the whole article contradicts Eugene's idea that building this POC is an easier sell for grid computing systems. Really?!? In my experience (your mileage may vary) that would kill any follow-up project for sure, as it shows that using a grid computing system is a cumbersome, tedious, convoluted, piecemeal approach – which, in real life, it is not.

    Best,
    Nikita Ivanov.
  9. Re: Manual Sunset :)

    This is a technology-oriented forum and not a friendly buddy get-together club. I prefer quick and raw responses over politically correct garbage. At least this way readers will learn something... so, grow a thicker skin. I guarantee that you have learned at least two points of view so far: mine and Eugene's.

    Good... if you prefer raw... here you go... Well, if it's not a buddy get-together club, bringing your "buddy" over to bash is called... what?

    This was a theoretical paper that gives folks a look under the hood at how this stuff works. As an analogy, it is as if Eugene wrote a paper on how a motor works and showed that you can build one with a magnet, wire, and a few parts. But you and your buddy, who are engine manufacturers, are saying, "Why the hell would you build a motor when you can buy my engine! Building a motor is too manual! Buy mine! I know better because I build them for a living!"

    Yeah, people can use your software, but they lose the ability to really see how something works. Why do it manually, from this article's perspective? Because it teaches people something, which in turn gives them a better angle to understand your stuff... and may actually bring them to you.
  10. Agree

    Hi Jeff,
    I see your point. And I agree, by and large. My disagreement is mostly about the way Eugene went about it.

    Best,
    Nikita Ivanov.
    GridGain Systems
  11. Re: Agree

    Hi Jeff,
    I see your point. And I agree, by and large. My disagreement is mostly about the way Eugene went about it.

    If the goal is to explain how to build a MapReduce system and explain the internals, how would you have gone about it? That's the point of the article. Otherwise I'm suspicious and think that you just want to sell GridGain. That's why I want to see how you'd do this with GridGain vs. Mule, focusing on the business objects.

    Cheers,
    E
  12. Re: I smell jealousy

    Hey,
    Sorry to start with a first post on a semi-flame topic, but as a newcomer to Mule I'm always interested in demonstrations of how flexible it is and what can be achieved with it using minimal third-party dependencies - I think this article is quite successful on that point. So thanks for sharing, and I hope more Mule test cases will be published soon :)

    N.
  13. Re: Misleading...

    Well, this configuration is already running in production, so there is no discussion. We have an aphorism in Mexico: If I say that the mule is brown it's because I have its hairs in my hand.

    Cheers,
    E
    MapReduce Made Simple - open-source!
  14. GridGain invitation

    Howdy,
    This is an invitation to the GridGain gang... would you guys be up for demonstrating all the code that I'd need to build the same map/reduce system used in the example from the article, and posting it here? I'm interested in whether it's possible to build the same thing with as few lines of code and as few GridGain-specific API calls as the Dioscuri sample implementation.

    Your comparison system should be built with only J2SE classes, allowing for one call to the GridGain API for setup (like the Dioscuri sample does in one class). Don't include the generation of the final result -- that's a discussion for the next article, and the Dioscuri sample uses Terracotta for aggregation instead of the file system. Please do include the same number of processing stages using the same algorithms and heuristics as in the sample app.

    Once the system is built, I'd be curious to measure the completion time for the same stages in both systems using the same number of nodes in any topology that we choose.

    Download all the source files for the sample in the article.

    This way we can all put the arguments to rest and learn something in the process. GridGain may outperform Dioscuri -- but is it easier to program the app than just deploying POJOs?

    Thanks!
    Eugene

    Non-stop action. A vulnerable hero. A quest to save the world. It's the most exciting novel of the decade and Amazon best-seller Dec. 2007: The Tesla Testament
    ISBN: 1-4116-7317-4 - BISAC: FIC031000
  15. Accepted

    Well, I was about to suggest the same :) I can’t guarantee we can jump on it right this minute, but let me see who can spend some time to sketch it up. Eugene, I hope you don’t mind if someone from our team contacts you directly should we have specific questions. It could be an interesting blog post (to compare the two approaches) and we can post the results on TSS for everyone to see…

    Best,
    Nikita Ivanov.
    GridGain Systems
  16. Re: Accepted

    Well, I was about to suggest the same :) I can't guarantee we can jump on it right this minute, but let me see who can spend some time to sketch it up. Eugene, I hope you don't mind if someone from our team contacts you directly should we have specific questions. It could be an interesting blog post (to compare the two approaches) and we can post the results on TSS for everyone to see...

    Best,
    Nikita Ivanov.
    GridGain Systems

    Awesome -- I'm looking forward to it.

    Thanks!
    E
  17. Great Article!!

    This article provides insight into MapReduce internals. I was looking for this information and yes, I went through GridGain as well, which I would say is an excellent product. However, what I was looking for, how this really happens, I found in this article. It is true that GridGain provides a sophisticated API with lots of features, including extensible SPIs, but I see value in this article as well, because it tells us (tech folks who want to learn how it happens) what is being done inside such a system.

    Thanks Eugene! And congrats on your excellent article.
  18. Re: MapReduce Part II

    Thanks for this article. I worked in the IT group of a bioinformatics company. The aim of the project was to run a BLAST algorithm for the input gene sequences against huge datasets (each dataset serving a specific purpose: known toxins, known allergens, patented sequences). The aim of the exercise was to eliminate research on genes that might be toxic, allergenic, or have IP infringements.

    I inherited some parts of the system and developed the others. There was a shell script (similar to the AWK job that you mentioned) that ran on the input sequences for some cleanup, and an Orchestrator that did the multithreaded dance to BLAST the input sequences and produce intermediate results. There was a parser that parsed the BLAST results. The Orchestrator and its helper classes were really cumbersome, heavy, and hard to maintain.

    A few months after that project I read the MapReduce article by Sanjay Ghemawat(?) from Google. I was planning to redo the whole engine using MapReduce semantics. However, I moved away.

    It is against this background that I read your article. I appreciate the right choice of tools. I got to know that Mule can be used in a big way to do the heavy lifting of synchronizing multiple mappers and reducers and working with endpoints. We used a processor farm and an NFS mount for sharing the files. It later moved to make use of Sun Grid, I guess.

    Thanks again for such a wonderful article!

    Sridhar Visvanath
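
    The screening pipeline described in this post maps naturally onto the same model. A rough sketch, assuming a plain substring match as a stand-in for BLAST (which it is not) and entirely made-up reference data; `REFERENCE_SETS`, `screen`, `collect_hits`, and `run_screening` are hypothetical names, not part of the poster's system:

```python
from collections import defaultdict
from itertools import product

# Made-up reference data for illustration only; the real datasets would be
# large sequence databases (known toxins, known allergens, patented sequences).
REFERENCE_SETS = {
    "toxins": ["GATTACA"],
    "allergens": ["CCGGA"],
    "patented": ["TTTAAAC"],
}


def screen(task):
    """Map step: check one input sequence against one reference dataset.

    A substring test stands in for a real BLAST alignment.
    """
    sequence, dataset = task
    hit = any(ref in sequence for ref in REFERENCE_SETS[dataset])
    return sequence, dataset, hit


def collect_hits(results):
    """Reduce step: gather, per sequence, the datasets that flagged it."""
    flagged = defaultdict(list)
    for sequence, dataset, hit in results:
        if hit:
            flagged[sequence].append(dataset)
    return dict(flagged)


def run_screening(sequences):
    # Every (sequence, dataset) pair is an independent unit of work, which is
    # what would let a framework spread the map step across many nodes.
    tasks = product(sequences, sorted(REFERENCE_SETS))
    return collect_hits(map(screen, tasks))


if __name__ == "__main__":
    print(run_screening(["AAGATTACATT", "CCGGATTTAAAC", "ACGT"]))
```

    The orchestration that the post calls cumbersome (threading, intermediate results, synchronization) is exactly the part a MapReduce framework takes over; the per-task logic stays in `screen` and `collect_hits`.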