News: New Article: Using CI-Bayes
CI-Bayes uses Bayesian statistical analysis to extract elements from a body of text and then classify those elements based on existing rules and prior analysis. It can be used for lexical analysis in general and, in particular, to better identify spam in website postings. Joe Ottinger describes how to use the CI-Bayes Java project to examine incoming messages and apply a classifier.
- Posted by: Nuno Teixeira
- Posted on: February 17 2009 12:25 EST
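For readers unfamiliar with the idea, the kind of Bayesian text classification described above can be sketched in a few lines of plain Java. This is only an illustrative multinomial naive Bayes with add-one smoothing; it is not the CI-Bayes API, and every name in it (`NaiveBayesSketch`, `train`, `classify`, the sample messages) is made up for this example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: a tiny multinomial naive Bayes text classifier.
// All names here are hypothetical and are NOT taken from CI-Bayes.
class NaiveBayesSketch {
    // category -> (word -> number of times the word was seen in that category)
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    // category -> number of training documents
    private final Map<String, Integer> docCounts = new HashMap<>();
    // category -> total number of words seen in that category
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Lowercase the text and split on anything that is not a letter or digit.
    static String[] tokenize(String text) {
        return text.toLowerCase().split("[^a-z0-9]+");
    }

    // Record one training document for the given category.
    void train(String category, String text) {
        docCounts.merge(category, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts =
                wordCounts.computeIfAbsent(category, c -> new HashMap<>());
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue;
            counts.merge(token, 1, Integer::sum);
            totalWords.merge(category, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    // Return the category with the highest log-probability, or null if untrained.
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : docCounts.keySet()) {
            // Prior: share of training documents that belong to this category.
            double score = Math.log(docCounts.get(category) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(category);
            int total = totalWords.getOrDefault(category, 0);
            for (String token : tokenize(text)) {
                // Words never seen during training carry no evidence; skip them.
                if (token.isEmpty() || !vocabulary.contains(token)) continue;
                int count = counts.getOrDefault(token, 0);
                // Add-one (Laplace) smoothing avoids zero probabilities.
                score += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("spam", "buy cheap pills now");
        nb.train("ham", "java classifier article discussion");
        System.out.println(nb.classify("cheap pills"));
    }
}
```

A real classifier would add persistence, better tokenization, and probability thresholds; the sketch only shows the train-then-classify flow.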
- machine learning algorithms by Robert Starzer on February 23 2009 16:35 EST
Another interesting project on this topic: Apache Mahout, an Apache Lucene subproject -> http://cwiki.apache.org/MAHOUT/index.html
Cool! I didn't know about Mahout. My only concern with Mahout is that it's still incomplete, I guess (no release to download, lots of incomplete algorithms, poor docs so far). I think CI-Bayes has a "cleaner" interface, which is probably the most important aspect of machine learning algorithms; they're easy enough to use if they're attuned to a specific problem domain, but they're madness for the general case. But there's lots of room in the problem domain for multiple solutions; this is cool stuff. Glad to see Mahout!
@Joe, do you know of any good datasets for automatic training? I'm trying different things to automatically categorize web articles (blogs, news, ...), e.g. using http://www.freebase.com data. Thanks
I'm sorry, I don't know of any existing training sets, and I'm not sure what a training set would be for. If you're trying to train for spam, well, SpamAssassin has quite a testing corpus (and it's used to test CI-Bayes), but... other data categories would need their own corpora.
OK, thanks. I'm currently training it with web content/news at http://buzz.it-fabrik.at. I modified your Lucene word lister to support the Lucene stopWordFilter and a custom stop word list. Robert
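As a rough illustration of the stop word filtering mentioned above, here is a minimal sketch in plain Java. It does not use Lucene or the CI-Bayes word lister; the class name, the `listWords` method, and the tiny built-in stop word list are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: drops stop words before text reaches a classifier.
// Neither Lucene nor the CI-Bayes word lister is used; all names are hypothetical.
class StopWordSketch {
    // Deliberately tiny built-in list; a real list would be much longer.
    static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("a", "an", "and", "the", "of", "to", "is", "in"));

    // Tokenize the text and keep only words that are in neither stop word list.
    static List<String> listWords(String text, Set<String> customStopWords) {
        List<String> words = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty()
                    || STOP_WORDS.contains(token)
                    || customStopWords.contains(token)) {
                continue;
            }
            words.add(token);
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> custom = new HashSet<>(Arrays.asList("posted"));
        System.out.println(listWords("Posted in the news: a new article", custom));
    }
}
```

Filtering very common words this way keeps them from dominating the word counts that a Bayesian classifier is trained on.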