The process of extracting data from Web pages is also referred to as Web scraping or Web data mining. The World Wide Web, as the largest database, often contains data that we would like to consume for our own needs. The problem is that in most cases this data is mixed together with formatting code, which makes the content human-friendly but not machine-friendly. Manual copy-paste is error-prone, tedious and sometimes even impossible. Web software designers usually discuss how to cleanly separate content from style, using various frameworks and design patterns to achieve that. Even so, some kind of merge usually occurs on the server side, so that a bundle of HTML is delivered to the web client.
Web-Harvest works by configuring each site in an XML file that describes a sequence of processors to convert the HTML into usable XML.
Message was edited by: joeo@enigmastation.com
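The idea of a per-site sequence of processors that turns HTML into XML can be illustrated with a minimal plain-Java sketch. This is not Web-Harvest's configuration format or API: the class name `PipelineSketch`, the two processors and the `<li class="item">` markup are all hypothetical, chosen only to show each step feeding its output to the next.

```java
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: none of these names come from Web-Harvest itself.
final class PipelineSketch {

    // Assumed markup for the items of interest on the (imaginary) page.
    static final Pattern ITEM = Pattern.compile("<li class=\"item\">(.*?)</li>");

    // Processor 1: collapse runs of whitespace so later patterns stay simple.
    static String normalize(String html) {
        return html.replaceAll("\\s+", " ").trim();
    }

    // Processor 2: copy the text of every matching element into an XML list.
    // The text is copied as-is, so this assumes it holds no markup or entities.
    static String toXml(String html) {
        StringBuilder xml = new StringBuilder("<items>");
        Matcher m = ITEM.matcher(html);
        while (m.find()) {
            xml.append("<item>").append(m.group(1).trim()).append("</item>");
        }
        return xml.append("</items>").toString();
    }

    // Run the processors in order, passing each result to the next one.
    static String run(String input, List<Function<String, String>> processors) {
        String current = input;
        for (Function<String, String> processor : processors) {
            current = processor.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        String html = "<ul>\n  <li class=\"item\">Tea</li>\n  <li class=\"item\">Coffee</li>\n</ul>";
        List<Function<String, String>> processors =
                List.of(PipelineSketch::normalize, PipelineSketch::toXml);
        System.out.println(run(html, processors));
    }
}
```

In a tool driven by configuration, the list of processors would come from the per-site file rather than being hard-coded as it is here.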
Web-Harvest, Web extraction tool released (15 messages)
- Posted by: Vladimir Nikic
- Posted on: September 04 2006 06:40 EDT
Web-Harvest is an open-source Web data extraction tool written in Java. It offers a way to collect desired Web pages and extract useful data from them. To do that, it leverages well-established techniques and technologies for text/XML manipulation such as XSLT, XQuery and regular expressions. Web-Harvest is designed to better orchestrate existing technologies using pipeline processing.
Threaded Messages (15)
- Re: Web-Harvest, Web extraction tool released by Jens Voss on September 04 2006 10:24 EDT
- Re: Web-Harvest, Web extraction tool released by Floyd Marinescu on September 04 2006 11:02 EDT
- The second of course by Kurt De Grave on September 04 2006 12:07 EDT
- Re: The second of course by Albert Albert on September 04 2006 12:31 EDT
- What are you talking about? by Kai Virkki on September 04 2006 13:52 EDT
- Re: What are you talking about? by Jens Voss on September 05 2006 10:15 EDT
- good by ere wer on February 09 2009 02:44 EST
- What is it good for? by Andreas Mecky on September 04 2006 12:35 EDT
- umm it seems like an example of abused xml by Sutham Rojanusorn on September 07 2006 03:47 EDT
- Re: What is it good for? by Stephane Vaucher on September 11 2006 17:07 EDT
- Re: Web-Harvest, Web extraction tool released by Tero Vaananen on September 05 2006 09:05 EDT
- Nice looking project by Stephane Vaucher on September 05 2006 15:21 EDT
- Re: Nice looking project by Valerio Schiavoni on September 06 2006 08:14 EDT
- Web scraping software by juan soldi on April 19 2011 12:58 EDT
- It not support javascript by lee philips on June 20 2011 21:19 EDT
Re: Web-Harvest, Web extraction tool released
- Posted by: Jens Voss
- Posted on: September 04 2006 10:24 EDT
- in response to Vladimir Nikic
Looks kind of cool ... oops, it's GPL! Too bad, have to keep looking! Regards, Jens
Re: Web-Harvest, Web extraction tool released
- Posted by: Floyd Marinescu
- Posted on: September 04 2006 11:02 EDT
- in response to Jens Voss
Jens, I'm curious about your response - are you looking at the tool as a user or as a producer of another product that might re-use it? Why is the license so important to your specific case? Floyd
The second of course
- Posted by: Kurt De Grave
- Posted on: September 04 2006 12:07 EDT
- in response to Floyd Marinescu
are you looking at the tool as a user or as a producer of another product that might re-use it?
I'm not Jens, but the answer obviously has to be the latter. Any ISV that wants to sell licenses on a potential product containing a lib will only consider LGPL (or better) software.
Re: The second of course
- Posted by: Albert Albert
- Posted on: September 04 2006 12:31 EDT
- in response to Kurt De Grave
Yep! Too bad I can't use it for my next project... Seems useful for personal projects anyway. Christian http://www.intelli-core.com
What are you talking about?
- Posted by: Kai Virkki
- Posted on: September 04 2006 13:52 EDT
- in response to Jens Voss
Web-Harvest has a BSD license, not GPL.
Re: What are you talking about?
- Posted by: Jens Voss
- Posted on: September 05 2006 10:15 EDT
- in response to Kai Virkki
Web-Harvest has a BSD-license, not GPL.
Oops, I could have sworn that the license page said GPL yesterday... Sorry about the confusion; I will take a look at it now. And Floyd, what Kurt replied is correct; in fact I was thinking of using a tool like Web-Harvest in an IntelliJ IDEA plugin I intend to write for the IDEA Plugin Contest, and plugins submitted there need to have a BSD or Apache license, so I couldn't have used any GPLed libraries. Regards, Jens
good
- Posted by: ere wer
- Posted on: February 09 2009 02:44 EST
- in response to Jens Voss
http://www.dohave.com offers a data extraction service that I used once. It was amazing.
What is it good for?
- Posted by: Andreas Mecky
- Posted on: September 04 2006 12:35 EDT
- in response to Vladimir Nikic
I agree that GPL is not suitable for a tool like this. It might be the right choice for something like JBoss or that kind of software. But I think this is a tool which is meant to be included in some real application. Therefore the GPL or LGPL is a show stopper. Do you really want your application to be open source just because of this little tool? Besides this, the tool is yet another example of the abuse of XML. Who on earth is really thinking about using such a tool? I mean, writing some code in Java to read data from a URL and then extract some data takes much less time than trying to learn another XMLish language and writing tons of lines of XML. When I read the examples I was thinking that it must be part of Apache. Overengineering is not even the right word here. Only the guys from Apache can build huge complex applications for very simple things. This kind of web data extraction is part of my daily business. It takes about 4-5 hours for any talented developer to write such a tool. Do I need the "flexibility" of XML here when I want to extract data from a website? What is the advantage of changing a few lines in XML rather than in a Java application and recompiling? Sometimes I wish XML was never born.
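For reference, the hand-written approach described in this post (read a page from a URL, then extract a value) might look like the minimal sketch below. It assumes Java 11 or later for `java.net.http.HttpClient`, which did not exist when the post was written. The class name `PlainJavaScrape`, the title-matching regular expression and the default URL `https://example.com/` are illustrative assumptions, not anything from the thread or from Web-Harvest.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a hand-rolled scraper; all names are illustrative.
final class PlainJavaScrape {

    // Captures the text between <title> and </title>, ignoring case and newlines.
    static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Download the page body as a string (requires network access).
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Return the trimmed page title, or empty if the page has none.
    static Optional<String> extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) throws Exception {
        // The default URL is a placeholder; pass a real one as the first argument.
        String url = args.length > 0 ? args[0] : "https://example.com/";
        System.out.println(extractTitle(fetch(url)).orElse("(no title found)"));
    }
}
```

The extraction rule lives in the `TITLE` pattern, so changing what is scraped means editing and recompiling the class, which is the trade-off against an external configuration file that this post is weighing.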
umm it seems like an example of abused xml
- Posted by: Sutham Rojanusorn
- Posted on: September 07 2006 03:47 EDT
- in response to Andreas Mecky
We could consider using RSS/Atom instead where possible. Web scraping requires handling tedious, site-specific details for each website.
Re: What is it good for?
- Posted by: Stephane Vaucher
- Posted on: September 11 2006 17:07 EDT
- in response to Andreas Mecky
Who is thinking about using such a tool? Actually, quite a few companies who need to collect (scrape) data off the Web. It's a specialised tool that offers efficient ways of doing this. Agreed, XML is horrible... but it's a (sadly popular) choice, like many others, for saving a configuration. If you don't like it, you can probably also use the library directly in a Groovy script.
Re: Web-Harvest, Web extraction tool released
- Posted by: Tero Vaananen
- Posted on: September 05 2006 09:05 EDT
- in response to Vladimir Nikic
I have been using a commercial tool from Kapowtech (http://www.kapowtech.com) for web scraping. It works pretty well, and has a GUI to design your scrapers. The GUI makes your work quick and easy, as you can see the results live, and you can also visually debug your scrapers.
Nice looking project
- Posted by: Stephane Vaucher
- Posted on: September 05 2006 15:21 EDT
- in response to Vladimir Nikic
I worked for 2 years developing web scrapers, and it's certainly not a pretty job. The project seems to have the simple things down. From my experience, the hard parts of web scraping would be:
1/ Maintenance of the sites that you search, so it would help to have an example showing a unit test that verifies a site has not changed. OK, it's more of a pain than actually hard, but it adds up when you maintain scrapers on >50 sites and have clients calling you whenever a site has changed.
2/ JavaScript - As soon as there was some JS, it was often simpler to write a custom processor than to attempt to integrate Rhino (a JS interpreter).
3/ An environment to actually develop the scrapers. This can be automatic, where you give it a bunch of pages and let an algorithm locate what seem to be the dynamic elements of a page, or it can be a graphical tool.
Since I hate developing UIs, I used some automatic data extraction algorithms (which actually worked pretty well) like those implemented in RoadRunner (http://citeseer.ist.psu.edu/crescenzi01roadrunner.html)
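One possible shape for the "has the site changed" check from point 1/ is sketched below. It is only an assumption about how such a test could work, not something taken from Web-Harvest or RoadRunner: the hypothetical class `LayoutCheck` reduces a page to the sequence of its opening tag names and compares that against a fingerprint recorded when the scraper was written. Text changes are ignored, while added, removed or reordered elements are flagged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: a crude layout fingerprint for detecting site changes.
final class LayoutCheck {

    // Matches opening tags such as <td class="x"> or <br/> and captures the name.
    // Closing tags and comments do not match because '<' must be followed by a letter.
    static final Pattern OPEN_TAG = Pattern.compile("<([a-zA-Z][a-zA-Z0-9]*)\\b[^>]*>");

    // Build a fingerprint from the ordered, lower-cased opening tag names.
    static String fingerprint(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = OPEN_TAG.matcher(html);
        while (m.find()) {
            tags.add(m.group(1).toLowerCase(Locale.ROOT));
        }
        return String.join("/", tags);
    }

    // True when the current page no longer matches the recorded fingerprint.
    static boolean layoutChanged(String expectedFingerprint, String currentHtml) {
        return !expectedFingerprint.equals(fingerprint(currentHtml));
    }

    public static void main(String[] args) {
        // In practice the baseline would be stored next to the scraper's tests.
        String baseline = fingerprint("<html><body><p class=\"price\">10</p></body></html>");
        String today = "<html><body><div><p class=\"price\">12</p></div></body></html>";
        System.out.println(layoutChanged(baseline, today) ? "layout changed" : "layout unchanged");
    }
}
```

A fingerprint this strict will also fire on harmless changes such as a new banner, so a real test would probably fingerprint only the region the scraper depends on.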
Re: Nice looking project
- Posted by: Valerio Schiavoni
- Posted on: September 06 2006 08:14 EDT
- in response to Stephane Vaucher
Hi,
RoadRunner (http://citeseer.ist.psu.edu/crescenzi01roadrunner.html)
Guess what? I work with the RoadRunner author, and I can confirm that the cited problems have difficult-to-impossible solutions, especially when scaling to dozens, hundreds or more web sites. About RoadRunner: set up an alert on your favourite news search engine, as something might come in the near future.
Web scraping software
- Posted by: juan soldi
- Posted on: April 19 2011 12:58 EDT
- in response to Vladimir Nikic
You guys might want to take a look at this web scraping software. It is a tool I've been working on for about a year. You might find it useful, as it approaches the web scraping problem from a slightly different perspective than most web scraping solutions.
It not support javascript
- Posted by: lee philips
- Posted on: June 20 2011 21:19 EDT
- in response to Vladimir Nikic
I think it does not support JavaScript. There are many commercial web collection tools; one choice is Fminer, a web extraction tool, which currently offers a FREE extraction project setup service.