News: Hadoop expands data infrastructure, boosts business intelligence

  1. The big data that companies successfully transform into usable business intelligence (BI) is just the tip of a massive data iceberg, according to Jonathan Seidman, solutions architect at Cloudera. At Big Data Techcon 2014, Seidman hosted a session called “Extending your data infrastructure with Hadoop,” in which he explained how Hadoop could help the enterprise tap into the potential business intelligence below the waterline. “That data that’s getting thrown away can have a lot of value but it can be very difficult to fit that data into your data warehouse,” Seidman explained.

    The problem with big data is that there’s so much of it. Data warehouses simply don’t have the capacity to store it all. “Would you put a petabyte of data in your warehouse?” Seidman asked the audience. “It’s a good way to get fired,” a member shot back. For this reason, enterprises focus their energy on the data points that give a high return-on-byte, to use Seidman’s term. That is, they capture and analyze the data that provides the most insight for the least amount of storage space. For example, a retailer would analyze its transactional dataset, focusing its attention on actual purchases. But Seidman pointed out that valuable data gets left out – behavioral, non-transactional data, in the retail example. “What if you don’t just want to know what the customer bought, but what they did on the site?” Seidman asked.

    Enter Apache Hadoop, an open source framework designed to store and process large data sets. Seidman described this technology as “scalable, fault tolerant and distributed.” With this framework, enterprises can load raw data first and impose a schema on it afterward, at the time the data is read. “This makes it easy for iterative, agile types of development,” Seidman said. He added that it made a good sandbox for more exploratory types of analysis.
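
    The load-first, schema-later approach described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not Seidman's code and not a Hadoop API: the sample events, the `SCHEMA` mapping, and the `read_with_schema` helper are all hypothetical, and an in-memory list stands in for raw files that would sit in HDFS.

```python
import json

# Hypothetical raw clickstream events, kept exactly as they arrived.
# In a Hadoop setup these lines would live in files on HDFS.
RAW_EVENTS = [
    '{"user": "u1", "action": "view", "item": "sku-42", "ts": 1}',
    '{"user": "u1", "action": "add_to_cart", "item": "sku-42", "ts": 2}',
    'not valid json',
    '{"user": "u2", "action": "view", "ts": 3}',
]

# Schema chosen after the data was written: field -> (type, default).
SCHEMA = {
    "user": (str, None),
    "action": (str, None),
    "item": (str, "unknown"),
    "ts": (int, 0),
}


def read_with_schema(lines, schema):
    """Apply the schema at read time, skipping lines that do not parse."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unscrubbed input: drop what cannot be parsed
        if not isinstance(record, dict):
            continue
        row = {}
        for field, (cast, default) in schema.items():
            value = record.get(field, default)
            row[field] = cast(value) if value is not None else None
        yield row


rows = list(read_with_schema(RAW_EVENTS, SCHEMA))
for row in rows:
    print(row)
```

    Because the raw lines are never modified, a different question later only needs a different schema mapping, not a reload of the data. The cost is that every read has to cope with malformed or incomplete records, as the skipped line and the defaulted `item` field show.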

    The idea of storing everything because we might need it one day is appealing, since writing to Hadoop is cheap. However, imposing a schema on the data afterward might not be as trivial as you make it out to be. Making sense of unstructured, unscrubbed data is no small task. If you can pull it off, you might gain a big advantage, but if it goes wrong, you're worse off than with a traditional data warehouse. It's something to keep in mind when you shift the complexity from writing to your EDW to reading from your HDFS.