The great thing about the information age is that so many new streams of data are becoming available to organizations. But with all of these high-volume streams, how do organizations manage them and obtain meaningful insights? Big data solutions have made processing massive amounts of data possible, but how are organizations making sense of what they process?
According to Dr. Sriram Mohan, Associate Professor of Computer Science and Software Engineering at Rose-Hulman Institute of Technology, big data is simply a buzzword that was coined to describe the reality that there's a lot of data flowing into organizations quickly and that there are some very real challenges associated with handling it. Overused catchphrase or not, big data is overwhelming the business world right now with a flood of information. What are the key systems or pieces of architecture that the enterprise should put in place to manage the rising tide of data?
Customizing solutions around core concepts
Sriram said there is no one-size-fits-all solution. "Organizations will need different ways of handling their data depending on the source of the data, the data format, why they are collecting it, how they want to store it, and how fast they need to process it." When an organization simply wants to archive data, sending it to a storage solution such as Amazon Glacier might be appropriate. However, most enterprises need to do more with at least some of their big data. After data is captured, there are three main pieces of the data management puzzle:
- Analyzing the data (through a batch processing engine such as Hadoop and various API tools)
- Exposing the results of analysis to Business Intelligence tools for easy viewing
- Making the data searchable so that it can be readily queried to reveal new information
Ideally, organizations should find a way to do this with low latency. Time is of the essence when real-world decisions rely on real-time data.
A new architecture is evolving
NoSQL databases play an important role with their ability to store large volumes of unstructured data for processing. Disambiguation and tagging tools that help organize and structure a wide variety of data for more accurate and complete analysis are helpful as well. Dr. Mohan pointed out that new concepts are also emerging to help with velocity, the third "v" in the big data challenge. Lambda architecture is a prime example. This approach to data processing offers businesses the ability to analyze data incrementally in real time while sending the bulk of data into a batch processing engine like Hadoop.
Nathan Marz, former head of the streaming compute team at Twitter and coauthor of Big Data: Principles and Best Practices of Scalable Realtime Data Systems, said it's all about going back to basics. "Lambda architecture is an approach to building data systems from first principles." The starting point is to ask whether a relational database applies to all data problems. Can relations, tables, and primary keys hold the answers? What exactly is a data problem, anyway?
According to Nathan, that's not as clearly defined as one might suppose. But it all boils down to one thing. "What's the most general possible formulation for a data problem? It's really quite simple. Any data problem can be expressed as a function that takes in every piece of data you have as input. Query equals function of all data." This premise served as the starting point to provide a practical way to build the desired functionality in a scalable, up-to-date architecture that operates with low latency.
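As a rough illustration of "query equals function of all data", the sketch below treats the master dataset as a plain Python list and a query as an ordinary function over that list. The record fields (`url`, `ts`) and the function name `pageviews_per_url` are invented for this example and do not come from Marz's book.

```python
from collections import Counter

# Hypothetical master dataset: every raw event ever recorded.
ALL_DATA = [
    {"url": "/home", "ts": 1},
    {"url": "/pricing", "ts": 2},
    {"url": "/home", "ts": 3},
]


def pageviews_per_url(all_data):
    """A query expressed as a pure function of the entire dataset."""
    return Counter(event["url"] for event in all_data)


result = pageviews_per_url(ALL_DATA)
print(result["/home"])  # prints 2
```

Recomputing a function like this over every record on each request becomes too slow as the dataset grows, which is why the premise is only a starting point rather than a finished design.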
Lambda architecture in a nutshell
Lambda architecture has three layers. The batch layer, typically built on Hadoop, is where the totality of the data is stored and where MapReduce runs for batch processing. The speed layer, which can utilize a solution such as Storm, captures and computes new information in real time as it comes in. Marz put it simply: "Any time you need to look at data historically, that's when you use batch processing, whenever you need to look at all the data at once. But anything you need to do as the data comes in, that's what you use Storm for."
Once the information that was initially processed in Storm has been run through MapReduce and become part of the complete dataset, it is discarded from the real-time layer to make way for new data. The results from the batch layer and the speed layer are combined for querying in the serving layer. This layer might feature a massively parallel processing query engine such as BlinkDB or Cloudera Impala. With this division of labor, enterprises can make decisions based on the most recent data without waiting for the next batch to finish running.
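The division of labor can be sketched in a few lines of Python. This is a toy in-memory model, not how Hadoop, Storm, or Impala are actually programmed: the class `LambdaCounter` and its method names are hypothetical, and a simple pageview count stands in for an arbitrary query.

```python
from collections import Counter


class LambdaCounter:
    """Toy in-memory model of the three layers (hypothetical, for illustration)."""

    def __init__(self):
        self.master = []             # batch layer: the full, append-only dataset
        self.batch_view = Counter()  # result of the most recent batch run
        self.speed_view = Counter()  # incremental counts since that batch run

    def ingest(self, url):
        """Store the raw event and update the real-time view immediately."""
        self.master.append(url)
        self.speed_view[url] += 1

    def run_batch(self):
        """Recompute the batch view from all data, then discard the
        real-time state that the new batch view already covers."""
        self.batch_view = Counter(self.master)
        self.speed_view.clear()

    def query(self, url):
        """Serving step: combine the batch view with the real-time view."""
        return self.batch_view[url] + self.speed_view[url]


counter = LambdaCounter()
counter.ingest("/home")
counter.ingest("/home")
print(counter.query("/home"))  # prints 2, answered from the speed view alone
counter.run_batch()
counter.ingest("/home")
print(counter.query("/home"))  # prints 3: 2 from the batch view, 1 from the speed view
```

Because this sketch is synchronous, no events can arrive while `run_batch` is executing. A production system would have to handle that overlap, which this toy deliberately ignores.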
Big data challenges will never be solved by a single solution. However, divide and conquer is proving to be a winning strategy. Continued integration of data management tools will help enterprises surf their data instead of getting pulled under.
How are you tackling your big data problems? Let us know.