Big Data Hadoop solutions with Hive, Mahout, HBase and Cassandra
By Jason Tee
Software engineers who are architecting big data solutions know that there is one technology that spans SQL databases, NoSQL databases, unstructured data, document-oriented datastores and mega-processing for business analytics. If you guessed Hadoop, you’d be right. Hadoop is also a common denominator among giants like Amazon, Yahoo!, AOL, Netflix, eBay, Microsoft, Google, Twitter and Facebook. Even IBM is on the bandwagon, promoting Hadoop for enterprise analytics. This open source framework is so ubiquitous that it’s surprising to think it’s only been a real player on the scene for about five years.
The future of Hadoop
To understand what’s been going on in the past couple of years, we turned to Chuck Lam, the author of Hadoop in Action. Chuck says Hadoop hasn’t been resting on its laurels. "The whole ecosystem has definitely evolved and changed a lot. Now, there’s even an official 1.0 version. Even more important, the underlying programming model for MapReduce has been revamped and has changed quite a bit." In general, these changes have been for the better. The direction of development has made the framework easier to use and deploy in the enterprise, addressing issues like security that are always at the forefront of concerns for risk-averse organizations.
Benefits that keep getting better include high levels of scalability. Distributed computing in this framework means adding more and more data without changing how you add it. There’s no need to change formats or mess around with how jobs are written or which apps get the job done. You just add more nodes as you go. You don’t have to be picky about the types of data you store or where it originates. Schema-less is the name of the game. The framework’s parallel computing capability also makes more efficient use of commodity server storage space. This means enterprises can keep and use more of their data. If any single node implodes, that’s OK. The system fails over without losing your data and without degrading performance.
Complementary Hadoop technologies
Hadoop solutions are also more flexible now, allowing businesses to do more things with more types of data. This enrichment is occurring through many of Hadoop’s companion projects, including languages like Pig and scalable solutions such as:
- Hive (data warehousing)
- Mahout (machine learning and data mining)
- HBase (structured storage for large tables)
- Cassandra (multi-master database)
Of course, it’s not always sunshine and roses with this type of solution. Lam says the main pitfalls have to do with making assumptions. In other words, the fault lies not in our system but in ourselves. "New technology is not a panacea for every problem. As easy as these NoSQL things are, you do need to really understand the problem you are trying to solve at a deeper level." That may mean taking a closer look at your algorithms rather than just throwing stuff at MapReduce and expecting Hadoop to scale automatically no matter what. Data usage patterns affect how you scale, especially if usage is uneven. If a few keys account for most of the records, adding nodes may not deliver linear scaling. Again, the issue isn’t with Hadoop itself. Lam believes the tools in place are mature enough for the enterprise. It’s simply a matter of ensuring that IT administrators are familiar with these tools, and that the software architects who are leveraging Hadoop understand how to apply and use the technology effectively.
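To make the point about uneven usage concrete, here is a minimal sketch in plain Python. It is not Hadoop code and does not use any Hadoop API. It assumes records are routed to reducers by hashing their keys, and it uses a CRC32 hash as a stand-in for a real partitioner. The function names (partition, reducer_loads, speedup) and the sample data are hypothetical and chosen only for illustration.

```python
import zlib


def partition(key: str, num_reducers: int) -> int:
    """Pick a reducer for a key by hashing it (illustrative stand-in only)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer_loads(records, num_reducers):
    """Return the number of records each reducer would receive."""
    loads = [0] * num_reducers
    for key, _value in records:
        loads[partition(key, num_reducers)] += 1
    return loads


def speedup(loads):
    """Best-case speedup over a single reducer: total work / busiest reducer."""
    return sum(loads) / max(loads)


# Even usage: 1,000 distinct keys with 10 records each (made-up data).
even = [(f"user{i}", 1) for i in range(1000) for _ in range(10)]

# Uneven usage: one hot key owns 9,000 of the 10,000 records (made-up data).
skewed = [("hot_user", 1)] * 9000 + [(f"user{i}", 1) for i in range(1000)]

for n in (1, 4, 16):
    even_speedup = speedup(reducer_loads(even, n))
    skewed_speedup = speedup(reducer_loads(skewed, n))
    print(f"reducers={n:2d}  even={even_speedup:.2f}  skewed={skewed_speedup:.2f}")
```

In this sketch, every record for the hot key lands on the same reducer, so the busiest reducer always handles at least 9,000 of the 10,000 records and the best-case speedup stays close to 1 however many reducers are added. With evenly spread keys, the speedup grows as reducers are added. Fixing this usually means rethinking the algorithm or the key design rather than adding hardware.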
Have you found the latest versions of Hadoop to be more useful in helping you work with big data? Let us know what you think.
Hadoop in Action By Chuck Lam
NoSQL Distilled By Martin Fowler
MongoDB: The Definitive Guide By Michael Dirolf
MongoDB in Action By Kyle Banker
Taming The Big Data Tidal Wave By Bill Franks
The Well-Grounded Java Developer By Martijn Verburg
23 Apr 2013