-
New Article: Using CI-Bayes (5 messages)
- Posted by: Nuno Teixeira
- Posted on: February 17 2009 12:25 EST
CI-Bayes uses Bayesian statistical analysis to extract elements from a body of text and then classify those elements based on existing rules and prior analysis. It can be used for lexical analysis in general, and in particular to better identify spam in website postings. Joe Ottinger describes how to use the CI-Bayes Java project to examine incoming messages and apply a classifier.
Threaded Messages (5)
- machine learning algorithms by Robert Starzer on February 23 2009 16:35 EST
- Re: machine learning algorithms by Joseph Ottinger on February 24 2009 10:34 EST
- automatic training by Robert Starzer on February 27 2009 05:58 EST
- Re: automatic training by Joseph Ottinger on March 04 2009 08:16 EST
- Re: automatic training by Robert Starzer on March 04 2009 17:26 EST
-
machine learning algorithms
- Posted by: Robert Starzer
- Posted on: February 23 2009 16:35 EST
- in response to Nuno Teixeira
Another interesting project on this topic: Apache Mahout, an Apache Lucene subproject -> http://cwiki.apache.org/MAHOUT/index.html
-
Re: machine learning algorithms
- Posted by: Joseph Ottinger
- Posted on: February 24 2009 10:34 EST
- in response to Robert Starzer
Another interesting project on this topic: Apache Mahout, an Apache Lucene subproject -> http://cwiki.apache.org/MAHOUT/index.html
Cool! I didn't know about Mahout. My only concern with Mahout is that it's still... incomplete, I guess (no release to download, lots of incomplete algorithms, poor docs so far). I think CI-Bayes has a "cleaner" interface, which is probably the most important aspect of machine learning algorithms; they're easy enough to use if they're attuned to a specific problem domain, but they're madness for the general case. But there's lots of room in the problem domain for multiple solutions; this is cool stuff. Glad to see Mahout!
-
automatic training
- Posted by: Robert Starzer
- Posted on: February 27 2009 05:58 EST
- in response to Joseph Ottinger
@Joe, do you know of any good datasets for automatic training? I'm trying different things to automatically categorize web articles (blogs, news, ...), e.g. using http://www.freebase.com data. Thanks.
-
Re: automatic training
- Posted by: Joseph Ottinger
- Posted on: March 04 2009 08:16 EST
- in response to Robert Starzer
I'm sorry, I don't know of any existing training sets - and I'm not sure what a training set would be for. If you're trying to train for spam, well, SpamAssassin has quite a testing corpus (and it's used to test CI-Bayes), but... other data categories would need their own corpora.
-
Re: automatic training
- Posted by: Robert Starzer
- Posted on: March 04 2009 17:26 EST
- in response to Joseph Ottinger
OK, thanks. I'm currently training it with web content/news at http://buzz.it-fabrik.at. I modified your Lucene wordlister to support the Lucene stopWordFilter and a custom stop word list.
Robert
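
For readers following along, the idea of dropping stop words before text reaches the classifier can be sketched in plain Java. This is only an illustration under stated assumptions: the class `StopWordLister`, its `listWords` method, and the sample stop words are hypothetical, and none of this is the actual CI-Bayes or Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of CI-Bayes or Lucene.
public class StopWordLister {
    private final Set<String> stopWords = new HashSet<>();

    // Accepts a custom stop word list; entries are normalized to lower case.
    public StopWordLister(Set<String> customStopWords) {
        for (String word : customStopWords) {
            stopWords.add(word.toLowerCase(Locale.ROOT));
        }
    }

    // Splits text on anything that is not a letter or digit,
    // lower-cases each token, and drops tokens found in the stop word list.
    public List<String> listWords(String text) {
        List<String> words = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> custom = new HashSet<>(Arrays.asList("the", "a", "of", "and"));
        StopWordLister lister = new StopWordLister(custom);
        // Prints: [price, java, book, news]
        System.out.println(lister.listWords("The price of a Java book and the news"));
    }
}
```

The remaining words are what a classifier would be trained on; filtering very common words first keeps them from dominating the word counts for every category.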