ci-bayes, a project hosted on java.net, has released its first stable version. ci-bayes uses trained classifiers to determine which category a given object most likely falls into; it provides multiple classifier implementations, hooks for persistence, and results for multiple classifications for each object tested.
ci-bayes is based on the chapter on Bayesian classification from Toby Segaran's "Programming Collective Intelligence," and has been ported from the original Python with the explicit permission of the author.
ci-bayes is built with Maven 2 and has an explicit runtime dependency on javolution; it provides factories for use with Spring 2, but those aren't required at runtime in the simplest case.
A simple example of how the classifier works might look like this:

FisherClassifier fc = new FisherClassifierImpl();
fc.train("The quick brown fox jumps over the lazy dog's tail", "good");
fc.train("Make money fast!", "bad");
String classification = fc.getClassification("money"); // should be "bad"

Currently, ci-bayes uses the SpamAssassin
testing corpora for performance and accuracy testing. The methodology is fairly simple: ci-bayes first trains itself, following the SpamAssassin conventions, on seven of the ten corpora, then classifies the remaining three corpora and checks whether its results match what SpamAssassin generated.
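The train-then-test loop described above can be sketched as follows. This is not the ci-bayes test harness; the Message record and the naive word-count classifier are illustrative stand-ins for the real API and corpus loading:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TrainTestSketch {
    // One labeled message: text plus its reference category.
    public record Message(String text, String category) {}

    // Minimal stand-in classifier: counts word occurrences per category and
    // classifies by whichever category saw the message's words most often.
    static class NaiveClassifier {
        final Map<String, Map<String, Integer>> counts = new HashMap<>();

        void train(String text, String category) {
            for (String word : text.toLowerCase().split("\\W+")) {
                counts.computeIfAbsent(word, w -> new HashMap<>())
                      .merge(category, 1, Integer::sum);
            }
        }

        String classify(String text) {
            Map<String, Integer> score = new HashMap<>();
            for (String word : text.toLowerCase().split("\\W+")) {
                Map<String, Integer> perCategory = counts.get(word);
                if (perCategory != null) {
                    perCategory.forEach((cat, n) -> score.merge(cat, n, Integer::sum));
                }
            }
            return score.entrySet().stream()
                    .max(Map.Entry.comparingByValue())
                    .map(Map.Entry::getKey)
                    .orElse("unknown");
        }
    }

    // Train on the first trainCount corpora, test on the rest; return the
    // percentage of test messages whose classification matches the label.
    public static double matchRate(List<List<Message>> corpora, int trainCount) {
        NaiveClassifier classifier = new NaiveClassifier();
        for (int i = 0; i < trainCount; i++) {
            for (Message m : corpora.get(i)) classifier.train(m.text(), m.category());
        }
        int total = 0, matched = 0;
        for (int i = trainCount; i < corpora.size(); i++) {
            for (Message m : corpora.get(i)) {
                total++;
                if (classifier.classify(m.text()).equals(m.category())) matched++;
            }
        }
        return total == 0 ? 0.0 : 100.0 * matched / total;
    }

    public static void main(String[] args) {
        // Tiny synthetic corpora standing in for the ten SpamAssassin sets.
        List<List<Message>> corpora = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            corpora.add(List.of(
                new Message("make money fast with this offer", "bad"),
                new Message("meeting notes from the project review", "good")));
        }
        System.out.println("match: " + matchRate(corpora, 7) + "%");
    }
}
```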
It's able to run the classification tests in just over eleven seconds on a single CPU core, with a 98% match with SpamAssassin; given that SpamAssassin and ci-bayes have different classification mechanisms and different functions, this is probably acceptable for most usages. (SpamAssassin uses a neural network to analyze spam rather than a strict Bayesian classifier, so 98% agreement is - in my opinion - a marvelous result.)
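For context on the Fisher method the example classifier is named after: in Segaran's chapter, per-feature category probabilities are multiplied together and the product is fed through an inverse chi-square function to get a combined score. The sketch below illustrates that combining step only; it is an assumption-laden simplification, not the ci-bayes implementation:

```java
public class FisherSketch {
    // Inverse chi-square: converts chi = -2 * sum(ln p) over df degrees of
    // freedom into a probability that the evidence arose by chance.
    public static double invChi2(double chi, int df) {
        double m = chi / 2.0;
        double term = Math.exp(-m);
        double sum = term;
        for (int i = 1; i < df / 2; i++) {
            term *= m / i;
            sum += term;
        }
        return Math.min(sum, 1.0);
    }

    // Combine the per-feature probabilities for one category into a Fisher
    // score; values near 1.0 indicate strong agreement among the features.
    public static double fisherProb(double[] featureProbs) {
        double logProduct = 0.0;
        for (double p : featureProbs) logProduct += Math.log(p);
        return invChi2(-2.0 * logProduct, 2 * featureProbs.length);
    }

    public static void main(String[] args) {
        // Features that each individually favor a category yield a high score.
        System.out.println(fisherProb(new double[]{0.9, 0.8, 0.95}));
    }
}
```

The classifier then compares the Fisher scores computed for each category (e.g. "good" versus "bad") and picks the stronger one, typically subject to a minimum threshold.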
The binary jar for ci-bayes-1.0-SNAPSHOT is available on java.net.