Nutch as Platform?



  1. Nutch as Platform? (3 messages)

    Mark Watson has suggested that "Nutch is a platform for building more complex web applications and knowledge management applications." Nutch is described on its wiki as "open source web-search software"; it builds on Lucene and adds web-specific features such as a crawler. Mr. Watson didn't elaborate on how he envisioned it as a platform, but one idea (from your Humble Editor) is that it could serve as a central information-gathering application: people would put content into a repository, expose that content via a URL, and then submit the URL to Nutch. The only problem with this scenario is that Nutch would duplicate many of the features of JCR, so perhaps that's not what Mr. Watson has in mind. What other uses could you see for Nutch as an application platform? What do you think of the idea?

    Threaded Messages (3)

  2. examples?

    It is a nice thought but is pointless unless we get some elaboration.
  3. Not really Nutch, Hadoop

    I thought it was fairly clear that he's really talking about Hadoop. Hadoop grew out of Nutch and implements Google's MapReduce model for distributed computation. As such, it could be the framework for a distributed application. He also mentions the plugin system, but I think that is less important.
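
    For readers unfamiliar with the map-reduce idea mentioned above, here is a minimal single-process sketch in Python. It is not Hadoop's API; the function names (map_phase, shuffle, reduce_phase, run_job) are made up for illustration, and a real framework would run the map and reduce steps on many machines.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit one (word, 1) pair for every word in a document."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key; a framework does this between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Combine all values emitted for one key into a single result."""
    return word, sum(counts)


def run_job(documents):
    """Run map, shuffle, and reduce in sequence over a dict of documents."""
    pairs = []
    for doc_id, text in documents.items():
        pairs.extend(map_phase(doc_id, text))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    docs = {"a": "nutch crawls the web", "b": "hadoop distributes the work"}
    print(run_job(docs))
```

    Because each map call looks at one document and each reduce call looks at one key, the work can be split across machines without the steps needing to know about each other, which is what makes the model attractive as a base for distributed applications.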
  4. An example

    re: my Nutch blog post: I have a keen interest in both knowledge management (KM) and artificial intelligence. I have been thinking of building a KM portal that would be mostly a document repository, with document clustering, semantic analysis, and search as features. Then I thought of turning this around: since Nutch is built on Lucene, and Lucene index entries can contain arbitrary un-indexed fields, I thought of taking advantage of the scalability of Nutch and using it as a platform that provides indexing and stores (as un-indexed fields) all the category and semantic analysis metadata calculated during indexing. A separate data store would be necessary for maintaining sets of index entries with the same categories. Nutch also has a good plugin architecture for customizing queries, lots of plugins for handling different document types, RSS, etc., etc. So, that was my idea for using Nutch as a platform :-) -Mark Watson
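
    The idea in the post above can be sketched as a toy in-memory index in Python. This is only an illustration under stated assumptions, not Lucene or Nutch code: TinyIndex and keyword_categorizer are hypothetical names, and the keyword matching stands in for whatever clustering or semantic analysis would really run at indexing time.

```python
from collections import defaultdict


class TinyIndex:
    """Toy index: text is searchable, metadata is stored but not searchable."""

    def __init__(self):
        self.postings = defaultdict(set)     # term -> doc ids (indexed text)
        self.stored = {}                     # doc id -> stored-only metadata
        self.by_category = defaultdict(set)  # separate store: category -> doc ids

    def add(self, doc_id, text, categorize):
        """Index the text and compute category metadata in the same pass."""
        for term in text.lower().split():
            self.postings[term].add(doc_id)
        categories = categorize(text)
        self.stored[doc_id] = {"categories": categories}
        for category in categories:
            self.by_category[category].add(doc_id)

    def search(self, term):
        """Return (doc id, stored metadata) for documents containing the term."""
        hits = sorted(self.postings.get(term.lower(), ()))
        return [(doc_id, self.stored[doc_id]) for doc_id in hits]


def keyword_categorizer(text):
    """Hypothetical stand-in for real semantic analysis."""
    words = set(text.lower().split())
    categories = []
    if words & {"lucene", "search", "index"}:
        categories.append("search")
    if words & {"cluster", "semantic"}:
        categories.append("analysis")
    return categories


if __name__ == "__main__":
    index = TinyIndex()
    index.add("d1", "Lucene index basics", keyword_categorizer)
    index.add("d2", "semantic search notes", keyword_categorizer)
    print(index.search("lucene"))
    print(sorted(index.by_category["search"]))
```

    The category metadata comes back with each hit but cannot itself be queried through search, which is why the sketch keeps the separate by_category mapping for finding all entries that share a category.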