News: New Article: Using CI-Bayes

  1. New Article: Using CI-Bayes (5 messages)

    CI-Bayes uses Bayesian statistical analysis to extract elements from a body of text and then classify those elements based on existing rules and prior analysis. It can be used for lexical analysis in general, and in particular to better identify spam in website postings. Joe Ottinger describes how to use the CI-Bayes Java project to examine incoming messages and apply a classifier.
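To make the idea concrete, here is a minimal sketch of a naive Bayes text classifier in plain Java. It is not the CI-Bayes API: the class name `NaiveBayesSketch` and its `train` and `classify` methods are hypothetical names chosen for illustration, and the whitespace-and-punctuation tokenizer and add-one smoothing are assumptions, not something the article specifies.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; this is NOT the CI-Bayes API.
class NaiveBayesSketch {
    // category -> (word -> number of times the word was seen in that category)
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    // category -> total number of words seen in that category
    private final Map<String, Integer> totalWords = new HashMap<>();
    // category -> number of training documents in that category
    private final Map<String, Integer> docCounts = new HashMap<>();
    // every distinct word seen during training
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Lowercase the text and split on anything that is not a letter or digit.
    static String[] tokenize(String text) {
        return text.toLowerCase().split("[^a-z0-9]+");
    }

    // Record the words of one training document under the given category.
    void train(String text, String category) {
        Map<String, Integer> counts =
                wordCounts.computeIfAbsent(category, k -> new HashMap<>());
        for (String word : tokenize(text)) {
            if (word.isEmpty()) continue; // split() can yield a leading empty token
            counts.merge(word, 1, Integer::sum);
            totalWords.merge(category, 1, Integer::sum);
            vocabulary.add(word);
        }
        docCounts.merge(category, 1, Integer::sum);
        totalDocs++;
    }

    // Return the category with the highest log-probability, or null if untrained.
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : docCounts.keySet()) {
            // Prior: fraction of training documents in this category.
            double score = Math.log((double) docCounts.get(category) / totalDocs);
            Map<String, Integer> counts = wordCounts.get(category);
            int total = totalWords.getOrDefault(category, 0);
            for (String word : tokenize(text)) {
                if (word.isEmpty()) continue;
                int count = counts.getOrDefault(word, 0);
                // Add-one (Laplace) smoothing so unseen words do not zero out the score.
                score += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("buy cheap pills now", "spam");
        nb.train("project status and agenda", "ham");
        System.out.println(nb.classify("cheap pills"));
    }
}
```

Working in log space avoids numeric underflow when a message contains many words. A real classifier would also need persistence and a better tokenizer, which this sketch leaves out.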

    Threaded Messages (5)

  2. machine learning algorithms

    Another interesting project on this topic: Apache Mahout, an Apache Lucene subproject -> http://cwiki.apache.org/MAHOUT/index.html
  3. Re: machine learning algorithms

    "Another interesting project on this topic: Apache Mahout, an Apache Lucene subproject -> http://cwiki.apache.org/MAHOUT/index.html"
    Cool! I didn't know about Mahout. My only concern with Mahout is that it's still incomplete, I guess (no release to download, lots of unfinished algorithms, poor docs so far). I think CI-Bayes has a "cleaner" interface, which is probably the most important aspect of a machine learning library: these algorithms are easy enough to use when they're tuned to a specific problem domain, but they're madness for the general case. Still, there's lots of room in the problem domain for multiple solutions; this is cool stuff. Glad to see Mahout!
  4. automatic training

    @Joe, do you know of any good datasets for automatic training? I'm trying different things to automatically categorize web articles (blogs, news, ...), e.g. with data from http://www.freebase.com. Thanks.
  5. Re: automatic training

    I'm sorry, I don't know of any existing training sets, and I'm not sure what a training set would be for. If you're trying to train for spam, well, SpamAssassin has quite a testing corpus (and it's used to test CI-Bayes), but other data categories would need their own corpora.
  6. Re: automatic training

    OK, thanks. I'm currently training it with web content/news at http://buzz.it-fabrik.at. I modified your Lucene word lister to support the Lucene stopWordFilter and a custom stop word list.
    Robert
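For readers unfamiliar with stop word filtering, the following is a rough plain-Java sketch of the general idea: common words are dropped before the remaining tokens are handed to the classifier. It deliberately does not use the Lucene API or the CI-Bayes word lister; the class `StopWordSketch`, its `filter` method, and the tiny default word list are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; this does NOT use the Lucene or CI-Bayes APIs.
class StopWordSketch {
    // A small illustrative default list (an assumption, not Lucene's actual list).
    static final Set<String> DEFAULT_STOP_WORDS = new HashSet<>(
            Arrays.asList("a", "an", "and", "the", "of", "to", "in", "is"));

    // Tokenize the text and keep only tokens found in neither stop word list.
    static List<String> filter(String text, Set<String> customStopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty()
                    || DEFAULT_STOP_WORDS.contains(token)
                    || customStopWords.contains(token)) {
                continue; // drop empty tokens and stop words
            }
            kept.add(token);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Site-specific noise words, such as navigation text, go in the custom list.
        Set<String> custom = new HashSet<>(Arrays.asList("read", "more"));
        System.out.println(filter("The latest news of the day: read more", custom));
    }
}
```

A custom list is useful for web content because navigation text and boilerplate phrases appear on every page and carry no signal about the category.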