This version of HBase-Writer has a new runtime dependency: ZooKeeper. This is because HBase-0.20.X now depends on ZooKeeper to manage configuration and connection information. This version has been tested on a few Heritrix2-2.0.2 crawls on Hadoop 0.20.1, HBase 0.20.1 and ZooKeeper 3.2.1, and works fine as far as my tests go. There are two main differences to be aware of when upgrading from 0.19.x to 0.20.x:
1. In the global sheet configuration for your Heritrix job, there is no longer a "master" address for HBaseWriterProcessor. Instead, you need to provide a comma-separated list of ZooKeeper hosts that make up the ZooKeeper quorum (zkquorum). Heritrix will talk to ZooKeeper to determine the master address of HBase. HBase made this change in 0.20.x so that the Master node is no longer a SPOF (single point of failure). Support for an alternate zk client port has been added as well.
2. You need to add the zookeeper.jar to the lib/ folder. The zookeeper jar is included with the HBase distribution, or you can download it from the OSM Archive Repository.
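
To make the zkquorum setting more concrete, here is a minimal sketch of how a comma-separated host list and a client port combine into a ZooKeeper-style "host:port,host:port" connect string. This is not HBase-Writer's actual code: the class name ZkQuorumSketch, the method toConnectString and the example hostnames are hypothetical and exist only for illustration. The port 2181 is ZooKeeper's usual default client port. On the HBase side, the corresponding client properties are hbase.zookeeper.quorum and hbase.zookeeper.property.clientPort.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HBase-Writer: shows how a zkquorum
// value and a client port map onto a ZooKeeper connect string.
public class ZkQuorumSketch {

    // ZooKeeper's conventional default client port.
    static final int DEFAULT_CLIENT_PORT = 2181;

    // Turns "zk1, zk2,zk3" plus a port into "zk1:2181,zk2:2181,zk3:2181".
    static String toConnectString(String zkQuorum, int clientPort) {
        List<String> hostPorts = new ArrayList<String>();
        for (String host : zkQuorum.split(",")) {
            String trimmed = host.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate stray commas and whitespace
            }
            hostPorts.add(trimmed + ":" + clientPort);
        }
        if (hostPorts.isEmpty()) {
            throw new IllegalArgumentException("zkquorum must name at least one host");
        }
        return String.join(",", hostPorts);
    }

    public static void main(String[] args) {
        // Example hostnames are placeholders.
        String quorum = "zk1.example.org, zk2.example.org,zk3.example.org";
        System.out.println(toConnectString(quorum, DEFAULT_CLIENT_PORT));
    }
}
```

Listing every quorum member, rather than a single host, is what lets the client keep working when one ZooKeeper node is down.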
The other changes in this version are under the hood. The BatchUpdate API has been deprecated in HBase-0.20.x, so HBase-Writer now uses HBase's new Put/Get API to write and manage records during crawls. Feel free to create an Issue if you want to see support added for something or if you have a bug to report. Thanks for checking it out, and enjoy! :)
To contribute or help in the development of HBase-Writer, please create an Issue on the project website and upload any patches for review.
Thanks to Questio.com for the support in releasing this project.
* HBase-Writer - Heritrix2 Processor plugin for writing web crawl output to HBase tables.
* Heritrix-HDFS-Writer - Heritrix2 Processor plugin for writing web crawl output to the HDFS filesystem.
* Heritrix2 - The Internet Archive's very own crawler.
* HBase - A distributed 'BigTable' storage engine.
* ZooKeeper - A distributed configuration engine.
* Hadoop - HBase runs on top of the Hadoop distributed filesystem.