An architect's guide: How to use big data
A comprehensive collection of articles, videos and more, hand-picked by our editors
The big data world just keeps getting bigger. As the volume and variety of data expands, TheServerSide readers are wondering what to expect. Dr. Sriram Mohan is an associate professor of computer science and software engineering at Rose-Hulman Institute of Technology. He also currently works as a senior consultant for Avalon Consulting's Big Data Solution practice. With both academic and real-world experience under his belt, he is the right person to ask about 2014 big data trends for the enterprise. Here are some of the insights he had to offer.
Hadoop won't be able to handle big data alone
According to Sriram, "Hadoop and the MapReduce paradigm is definitely one way to address the problem of big data. But one thing you need to keep in mind in all of this is that Hadoop, as it currently stands, is only good for batch processing. Sooner or later, we need to be able to handle this data in real time, as well." Sriram -- a former Hadoop consultant -- isn't claiming this ubiquitous platform is slow. A large chunk of data might be processed in under a minute using such a powerful framework, but that's not always good enough. What's being done to correct this issue?
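The batch-oriented MapReduce model Sriram describes can be sketched in plain Python. This is a toy word count standing in for a real Hadoop job; the function names are illustrative, not Hadoop APIs:

```python
from collections import defaultdict

def map_phase(documents):
    """Emit (word, 1) pairs, as a Hadoop mapper would."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Group pairs by key and sum the counts, as a Hadoop reducer would."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data keeps getting bigger", "big data tools"]
print(reduce_phase(map_phase(docs)))
# {'big': 2, 'data': 2, 'keeps': 1, 'getting': 1, 'bigger': 1, 'tools': 1}
```

The key point for the discussion above: nothing is returned until the entire input has been mapped and reduced, which is exactly why plain MapReduce is a batch tool rather than a real-time one.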
Shaun Connolly, VP of corporate strategy at Hortonworks, noted that Hadoop is getting faster and more versatile all the time. "What we're clearly getting asked for is optimization for NoSQL within Hadoop. Instead of being batch-oriented, it can take advantage of memory processing so requests come back more quickly. With YARN, you can actually do interactive querying that is more memory-based." Beyond this, there's an emerging wave of streaming analytics tools or processes relying on technologies like Storm that developers can plug into Hadoop with the new YARN structure. Today, big data users who work with Hadoop are looking at near real-time performance. However, this isn't 100% real time -- a distinction that matters when organizations use computers to make split-second decisions long before an analytics report could be digested by humans.
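The streaming model Connolly alludes to inverts the batch approach: state is updated per event as data arrives, instead of after a full pass over the dataset. A minimal Python sketch of a running count (this mimics the idea behind a Storm bolt, not Storm's actual API):

```python
from collections import defaultdict

class RunningCount:
    """Toy stream processor: updates its state on every incoming event."""
    def __init__(self):
        self.counts = defaultdict(int)

    def process(self, event):
        # Each event updates state immediately, so the answer is always
        # current -- no waiting for a batch job to finish.
        self.counts[event] += 1
        return self.counts[event]

bolt = RunningCount()
for event in ["click", "view", "click"]:
    bolt.process(event)
print(dict(bolt.counts))  # {'click': 2, 'view': 1}
```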
That's where Lambda architecture comes in. It permits organizations to deal with increments of critical data separately from the bulk of their data. Most of the data goes to the batch processing system, while a separate "speed layer" handles data in real time. The NoSQL databases (in their many flavors) all have their place in the ecosystem as well, since they offer specialized tools for managing data to fit specific use cases.
Integration will be essential, but no one tool will work for everyone
Speaking of giving Hadoop a helping hand, well-designed tools are proliferating in the big data space at a remarkable rate. ElasticSearch, Pentaho, and many other tools cover different niches in the big data ecosystem. But getting them to play well with each other is an important next step. Unless and until this happens, managing big data will be a hit-or-miss proposition.
Of course, this doesn't mean one integrated product will ever fit all business models. Data comes in many forms and every organization wants to do something different with that information. Organizations will need a variety of ways to handle their data, depending on the source of the data, the format, why they are collecting it, how they want to store it, how they want to analyze it, and how fast they need to process it. Hopefully, integration will occur while still maintaining modularity. This will permit enterprises to build the right tools for their use cases without reinventing the wheel every time.
Software engineers with big data know-how will be in high demand
Mohan pointed out that one of the most significant challenges in the big data space is the minuscule talent pool. "The number of people who have experience in these areas is not very high." This doesn't mean software engineers need to go back to school and earn a doctorate; technology workers don't need a PhD to understand big data. However, they do need to acquire knowledge and specialized skills. According to Sriram, this goal is achievable by any software engineer willing to put in the time and effort. The classroom isn't necessarily the only starting place. Experience with trying to scale a relational database, and with making the transition to non-relational databases, both serve as great foundations for grasping the big data problem.
Dr. Mohan is doing his part to prepare today's software engineers for the work world of the future. He will be offering two educational opportunities at Big Data TechCon in Boston: Data Transfer Tools for Hadoop, and Introduction to MapReduce. For those who want to be in high demand in the employment marketplace in the coming years, the time to start is now.