Discussions

News: Web-Harvest 1.0 Released, pulls data from HTTP pages

  1. Web-Harvest is an open-source web extraction tool written in Java. It combines different technologies for text and XML processing, including XQuery, XSLT, and regular expressions. Besides its set of predefined processors, Web-Harvest integrates scripting capabilities, giving users the power of well-known scripting languages. The new version brings a lot of improvements and fixes. The most important is the introduction of a graphical user interface, which greatly eases the development and testing process. Besides BeanShell, two other scripting languages, JavaScript and Groovy, are now supported. A number of bug fixes and performance improvements are included as well.
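    As a rough sketch of how the scripting integration can be used (assuming that a script element wraps BeanShell code and that context variables defined elsewhere in the configuration are accessible by name; this is an illustration, not the tool's verbatim syntax):

        <!-- Define a variable, then post-process it from a BeanShell scriptlet. -->
        <var-def name="greeting">Hello from Web-Harvest</var-def>
        <script><![CDATA[
            // 'greeting' is assumed to be exposed to the script by name.
            String shout = greeting.toString().toUpperCase();
        ]]></script>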
    Extracting data from web pages is also referred to as web scraping or web data mining. The World Wide Web, as the largest database, often contains data that we would like to consume for our own needs. The problem is that this data is in most cases mixed together with formatting code, making the content human-friendly but not machine-friendly. Manual copy-and-paste is error prone, tedious, and sometimes even impossible. Web software designers usually discuss how to achieve a clean separation between content and style, using various frameworks and design patterns. Still, some kind of merging usually occurs on the server side, so that a bundle of HTML is delivered to the web client. Every web site and every web page is composed using some logic, so the reverse process must be describable as well: how to fetch the desired data from the mixed content.
    Every extraction procedure in Web-Harvest is user-defined through XML-based configuration files. Each configuration file describes a sequence of processors executing some common task in order to accomplish the final goal. Processors execute in the form of a pipeline: the output of one processor execution is the input to the next. This is best explained with a simple configuration fragment; a sketch is given after the steps below. When Web-Harvest executes such a fragment, the following steps occur:
    1. The http processor downloads content from the specified URL.
    2. The html-to-xml processor cleans up that HTML, producing XHTML content.
    3. The xpath processor searches the XHTML from the previous step for specific links, yielding a sequence of URLs as a result.
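    A minimal sketch of such a fragment, matching the three steps above (the URL and the XPath expression are placeholders; the element names follow the processor names used in this description, and processors nest so that the innermost runs first):

        <xpath expression="//a/@href">
            <html-to-xml>
                <http url="http://www.example.com/"/>
            </html-to-xml>
        </xpath>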
    Web-Harvest supports a set of useful processors for variable manipulation, conditional branching, looping, functions, file operations, HTML and XML processing, and exception handling.
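    As a further illustration of the variable-manipulation and looping processors, the following hedged sketch stores the extracted links in a variable and then downloads each one to its own file. The var-def, var, loop, and file elements are drawn from Web-Harvest's configuration vocabulary, but exact attribute names may vary between versions, and all URLs and paths are placeholders:

        <!-- Store the sequence of extracted links in a named variable. -->
        <var-def name="links">
            <xpath expression="//a/@href">
                <html-to-xml>
                    <http url="http://www.example.com/"/>
                </html-to-xml>
            </xpath>
        </var-def>

        <!-- Iterate over the links, fetching each page and writing it to a file. -->
        <loop item="link" index="i">
            <list><var name="links"/></list>
            <body>
                <file action="write" path="page${i}.html">
                    <http url="${link}"/>
                </file>
            </body>
        </loop>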

    Threaded Messages (6)

  2. Finally have GUI

    I used version 0.5 and wished there were a proper GUI so that I would not have to learn the XML syntax. Good that there is a GUI now!!! twit88.com
  3. Alternative Spiders?

    Hello, are there alternative spiders for parsing web sites, maybe ones that also support Web 2.0 (JavaScript-driven) sites?
  4. Eclipse based GUI?

    Congrats to the team! BTW, what was the reason to write a GUI from scratch? Wouldn't it be better/easier to have an Eclipse plug-in that relied on existing Eclipse libraries? That way it would be well integrated into the most popular Java IDE.
  5. Re: Eclipse based GUI?

    Well, the reason was the intention for the tool to be self-contained, not dependent on any other IDE. Of course, the minus is the lack of advanced features that Eclipse already has.
  6. Support for web 2.0 web sites?

    Congratulations, this is a great idea. Does Web-Harvest support sites with JavaScript, or does it only handle pure HTML sites? - Ron
  7. Re: Support for web 2.0 web sites?

    No, it doesn't interpret JavaScript. It works with pure content.