Discussions

News: Web-Harvest, Web extraction tool released

  1. Web-Harvest, Web extraction tool released (15 messages)

    Web-Harvest is an open-source Web data extraction tool written in Java. It offers a way to collect desired Web pages and extract useful data from them. To do that, it leverages well-established techniques and technologies for text/XML manipulation such as XSLT, XQuery and regular expressions. Web-Harvest is designed to better orchestrate these existing technologies using pipeline processing.
    The process of extracting data from Web pages is also referred to as Web scraping or Web data mining. The World Wide Web, as the largest database in existence, contains a wealth of data we would like to consume for our own needs. The problem is that this data is in most cases mixed together with formatting code, making the content human-friendly but not machine-friendly. Manual copy-and-paste is error-prone, tedious and sometimes even impossible. Web software designers usually discuss how to achieve a clean separation between content and style, using various frameworks and design patterns. In practice, however, some kind of merge usually occurs on the server side, so that a blob of HTML is delivered to the web client.
    Web-Harvest works by configuring each site in an XML file that describes a sequence of processors to convert the HTML into usable XML.
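    For illustration, a configuration in this spirit might look like the following. This is only a sketch based on Web-Harvest's nested-processor configuration style; the URL and XPath expression are invented for the example:

    ```xml
    <!-- Illustrative sketch of a Web-Harvest style pipeline configuration.
         The URL and XPath expression below are made-up examples. -->
    <config>
        <!-- Fetch the page, clean it into well-formed XML, then query it. -->
        <var-def name="titles">
            <xpath expression="//h2[@class='title']/text()">
                <html-to-xml>
                    <http url="http://example.com/products"/>
                </html-to-xml>
            </xpath>
        </var-def>
    </config>
    ```

    The processors nest inside out: the output of `http` feeds `html-to-xml`, whose cleaned-up document feeds the `xpath` processor, and the result is bound to the variable `titles`.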

    Threaded Messages (15)

  2. Looks kind of cool ... oops, it's GPL! Too bad, have to keep looking! Regards, Jens
  3. Jens, I'm curious about your response - are you looking at the tool as a user or as a producer of another product that might re-use it? Why is the license so important to your specific case? Floyd
  4. The second of course

    are you looking at the tool as a user or as a producer of another product that might re-use it?
    I'm not Jens, but the answer obviously has to be the latter. Any ISV that wants to sell licenses for a potential product containing the library will only consider LGPL (or more permissive) software.
  5. Re: The second of course

    Yep! Too bad I can't use it for my next project... Seems useful for personal projects anyway. Christian http://www.intelli-core.com
  6. What are you talking about?

    Web-Harvest has a BSD-license, not GPL.
  7. Re: What are you talking about?

    Web-Harvest has a BSD-license, not GPL.
    Oops, I could have sworn that the license page said GPL yesterday... Sorry about the confusion; I will take a look at it now. And Floyd, what Kurt replied is correct; in fact I was thinking of using a tool like Web-Harvest in an IntelliJ IDEA plugin I intend to write for the IDEA Plugin Contest, and plugins submitted there need to have a BSD or Apache license, so I couldn't have used any GPLed libraries. Regards, Jens
  8. good

    http://www.dohave.com offers a data extraction service I used once. It was amazing.
  9. What is it good for?

    I agree that the GPL is not suitable for a tool like this. It might be the right choice for something like JBoss, but this is a tool that is meant to be embedded in a real application, so the GPL or LGPL is a show-stopper. Do you really want your application to be open source just because of this little library?

    Besides that, the tool is yet another example of the abuse of XML. Who on earth is really thinking about using such a tool? Writing some Java code to fetch data from a URL and extract what you need takes much less time than learning yet another XMLish language and writing tons of lines of XML. When I read the examples I thought it must be part of Apache; overengineering is not even the right word here. Only the guys from Apache can build huge, complex applications for the very simplest things.

    This kind of web data extraction is part of my daily business. It takes a talented developer about 4-5 hours to write such a tool. Do I really need the "flexibility" of XML when I want to extract data from a website? What is the advantage of changing a few lines of XML over changing a few lines in a Java application and recompiling? Sometimes I wish XML had never been born.
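    To make the comparison concrete, here is the kind of plain-Java extraction the poster is alluding to - a minimal sketch, not production code; the HTML fragment, regex pattern and class name are all invented for the example:

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class QuickScrape {

        // Pull every <h2 class="title"> heading out of an HTML fragment
        // with a hand-written regular expression.
        static List<String> extractTitles(String html) {
            List<String> titles = new ArrayList<>();
            Matcher m = Pattern.compile("<h2 class=\"title\">([^<]+)</h2>").matcher(html);
            while (m.find()) {
                titles.add(m.group(1).trim());
            }
            return titles;
        }

        public static void main(String[] args) {
            // In real use the HTML would come from java.net.URL / URLConnection.
            String html = "<h2 class=\"title\">Widget A</h2><p>...</p>"
                        + "<h2 class=\"title\">Widget B</h2>";
            System.out.println(extractTitles(html)); // prints [Widget A, Widget B]
        }
    }
    ```

    Whether a few lines like these or an XML pipeline is easier to maintain across dozens of sites is exactly the trade-off being debated here.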
  10. We could consider using RSS/Atom instead where possible. Web scraping requires handling tedious, site-specific details for each website.
  11. Re: What is it good for?

    Who is thinking about using such a tool? Actually, quite a few companies that need to collect (scrape) data off the Web. It's a specialised tool that offers efficient ways of doing this. Agreed, XML is horrible... but it's a (sadly popular) choice, like many others, for storing a configuration. If you don't like it, you can probably also use the library directly from a Groovy script.
  12. I have been using a commercial tool from Kapowtech (http://www.kapowtech.com) for web scraping. It works pretty well, and has a GUI to design your scrapers. The GUI makes your work quick and easy, as you can see the results live, and you can also visually debug your scrapers.
  13. Nice looking project

    I worked for 2 years developing web scrapers, and it's certainly not a pretty job. The project seems to have the simple things down. From my experience, the hard parts of web scraping are:
    1/ Maintenance of the sites you scrape - so an example showing a unit test that verifies a site has not changed would actually be useful. OK, it's more of a pain than actually hard, but it adds up when you maintain scrapers for >50 sites and have clients calling you whenever a site changes.
    2/ JavaScript - as soon as there was some JS, it was often simpler to write a custom processor than to attempt to integrate Rhino (a JS interpreter).
    3/ An environment in which to actually develop the scrapers. This can be automatic, where you give it a bunch of pages and let an algorithm locate what seem to be the dynamic elements of a page, or a graphical tool. Since I hate developing UIs, I used some automatic data extraction algorithms (they actually worked pretty well) like those implemented in RoadRunner (http://citeseer.ist.psu.edu/crescenzi01roadrunner.html).
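    The unit test mentioned in point 1/ can be sketched as a cheap regression check - a hypothetical guard, with the page snapshot and marker pattern invented for the example - that fails fast when a site's layout no longer matches what the scraper expects:

    ```java
    import java.util.regex.Pattern;

    public class ScraperSanityCheck {

        // Return true if the page still contains the structural marker
        // this scraper depends on (here, an invented results table).
        static boolean layoutStillMatches(String html) {
            return Pattern.compile("<table id=\"results\">").matcher(html).find();
        }

        public static void main(String[] args) {
            // In real use this would be freshly fetched HTML, not a literal.
            String snapshot = "<html><body><table id=\"results\">"
                            + "<tr><td>row</td></tr></table></body></html>";
            if (!layoutStillMatches(snapshot)) {
                throw new IllegalStateException("Site layout changed - update the scraper");
            }
            System.out.println("layout OK");
        }
    }
    ```

    Run nightly against each of the >50 sites, a check like this turns the client phone call into a failing build instead.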
  14. Re: Nice looking project

    Hi,
    RoadRunner (http://citeseer.ist.psu.edu/crescenzi01roadrunner.html)
    Guess what? I work with the RoadRunner author, and I can confirm that the cited problems have difficult-to-impossible solutions, especially when scaling to dozens, hundreds or more web sites. About RoadRunner: set up an alert on your favoured news search engine - something might come in the near future.
  15. Web scraping software

    You guys might want to take a look at this web scraping software. It's a tool I've been working on for about a year. You might find it useful, as it approaches the web scraping problem from a slightly different perspective than most web scraping solutions.

  16. It does not support JavaScript

    I think it does not support JavaScript. There are many commercial web collection tools; one choice is Fminer, a web extraction tool, which currently offers a FREE extraction project setup service.