Twitter data analysis helps beginners learn HBase, expert says

By Caroline de Lacvivier

28 Mar 2014 | TheServerSide.com

HBase is one of the fastest-changing open source projects on the planet, according to Sameer Farooqui, freelance big data consultant and trainer. "The challenge, right now, is [that] when you learn HBase, you've got to keep up with all the rapid changes happening all the time." Developers who want to master HBase will have to commit to an ongoing process with many moving parts and learning materials that go out of date quickly.

Farooqui is helping to bring clarity to this complex distributed database at Big Data TechCon 2014, where he will be hosting a session called "Analyzing Tweets with HBase." Twitter's simple application programming interface and openly available user information make it a convenient teaching tool and an ideal data set for analysis. "You can do a lot of things with it: anything from natural language processing to figuring out the subject of the Tweet to watching some message spread across the planet by looking at the geographic coordinates."

That said, the primary objective of the session is to learn HBase fundamentals, a worthwhile skill for any developer working with big data.

The pros and cons of learning HBase

HBase will be particularly interesting to those working with large amounts of data over a long period of time. Farooqui chose this technology, in part, because its high scalability could withstand Twitter's massive data set. "The largest HBase clusters are more than a thousand nodes tied together, and they're storing more than a petabyte of data in one single HBase cluster," Farooqui explained. "If you do eventually start opening up the Twitter firehose and dumping data into HBase, it would take a really long time before you fill up HBase's capabilities. A lot of other databases crumble long before that."

HBase doesn't actually require a deep relational database background but, according to Farooqui, it does help to have a general systems background. For one thing, HBase does not require SQL, so being a relational database expert won't necessarily serve those starting out with HBase. Because it is a distributed database, it helps for developers to go into the learning process with some operations know-how. "It's a distributed database where hard drives might fail, servers might fail, switches might fail, and I think the background that's best for people to come from is hybrid development and operations."

Farooqui described HBase as a generalist's game, best for developers with wide-ranging experience. It would help to have a background in Java, Python, Linux, TCP, mounting file systems and diagnosing networking issues. "It's best for people who have a diverse IT background, and they're not really intimidated by jumping into any one of the deeper layers of HBase," Farooqui said.

Even so, learning HBase mainly takes a rigorous and ongoing commitment to the learning process. "Read as much as you can. Watch as many YouTube videos as you can, and go to as many in-person meet-ups as you can. Be ready to commit to at least a month or two of pretty much focused effort to learn it," Farooqui advised. He went on to caution developers against simply learning the data model. HBase is more complex than a receptacle in which to dump data. "You really have to understand the engine and its architecture and how the data gets stored all the way down to the disk."

Cassandra is HBase's most prominent competitor. It is one of the few other NoSQL databases that scales as well as HBase, and it has also reached around 1,000 nodes in production. Cassandra's advantage is that it is simpler to use and is therefore more widely adopted. "The problem is, HBase has a lot more moving parts. There's a lot of complexity, more points where it can break." Farooqui advised companies looking to launch a big data project to start with Cassandra and then move to HBase if the need arises.