Hadoop is a great piece of software. It is not original but that certainly does not take away its glory. It builds on parallel processing, a concept that’s been around for decades. Although conceptually unoriginal, Hadoop shows the power of being free and open (as in beer!) and most of all shows about what usability is all about. It succeeded where most other parallel processing frameworks failed. So, now you know that I’m not a hater.
On the contrary, I think Hadoop is amazing. But, it does not justify some blatant failures on the part of Hadoop, may it be architectural, conceptual or even documentation wise. Hadoop’s popularity should not shield it from the need to re-enginer and re-work problems in the Hadoop implementation. The point below are based on months of exploring and hacking around Hadoop. Do dig in.
- Did I hear someone say “Data Locality”?
- Hadoop harps over and over again on data locality. In some workshops conducted by Hadoop milkers, they just went on and on about this. They say whenever possible, Hadoop will attempt to start a task on a block of data that is stored locally on that node via HDFS. This sounds like a super feature, doesn’t it? It saves so much of bandwidth without having to transfer TBs of data, right? Hellll, no. It does not. What this means is that first you have to figure out a way of getting data into HDFS, the Hadoop Distributed File System. This is non trivial, unless you live in the last decade and all your data exists as files. Assuming that you do, let’s transfer the TBs of data over to HDFS. Now, it will start doing it’s whole “data locality” thing. Ermm, OK. Am I hit by a wave of brilliance or isn’t it what’s is supposed to do anyway? Let’s get our facts straight. To use Hadoop, our problem should be able to execute in parallel. If the problem or a at least a sub-problem can’t be parallelized it won’t gain much out of Hadoop. This means the task algorithm is independent of any specific part of the data it processes. Further simplifying this would be saying, any task can process any section of the data. So, doesn’t that mean the “data locality” thing is the obvious thing to do? Why, would the Hadoop developers even write some code that would make a task process data in another node unless something goes horribly wrong. The feature would be if it was doing otherwise! If a task has finished operating on the node’s local data and then would transfer data from another node and process this data, that would be a worthy feature of the conundrum. At least that would be worthy of noise.
Read the rest of the article at the following URL: