Discussions

News: Easy Screen-scraping with XQuery

  1. Easy Screen-scraping with XQuery (21 messages)

    While XQuery was designed for querying large document bases, it also serves as a fine tool for transforming simple documents. This article shows how XQuery offers a fast and easy way to scrape HTML pages for the data you need. XQuery is the perfect tool if your goal is simplifying complex pages for display on small screens, extracting elements from multiple pages to aggregate on a home-grown portal, or simply extracting data from Web pages because there is no other programmatic way to get at it.
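    The article's examples are in XQuery; to give a feel for the "simplify a complex page" use case in plain Java, here is a minimal sketch using only the JDK's built-in XPath support, assuming the page has already been tidied into well-formed XHTML (the class name and the title-plus-links reduction are illustrative, not from the article):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class PageSimplifier {
    // Reduce a tidied XHTML page to its title plus the text of its links --
    // the sort of simplification a small-screen view might want.
    public static String simplify(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        XPath xp = XPathFactory.newInstance().newXPath();
        StringBuilder out = new StringBuilder(xp.evaluate("//title", doc));
        NodeList links = (NodeList) xp.evaluate("//a", doc, XPathConstants.NODESET);
        for (int i = 0; i < links.getLength(); i++) {
            out.append("\n- ").append(links.item(i).getTextContent());
        }
        return out.toString();
    }
}
```

    The same selection logic is a one-line path expression in XQuery; the point of the sketch is only to show how little of the page needs to be kept.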

    Threaded Messages (21)

  2. XSL + TagSoup[ Go to top ]

    I have implemented a "proxy portlet" (in JSR 168 terms) that uses a similar strategy, but with XSL instead of XQuery and TagSoup instead of JTidy. TagSoup exposes a SAX interface, and it appears to be very fast. I can very much recommend it as a way to "scrape" apps or, as in my case, to create a portlet that shows the contents of a web application.

    TagSoup can be found here:
    http://mercury.ccil.org/~cowan/XML/tagsoup/
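    The pipeline described above (a SAX stream feeding an XSL transformation) can be sketched with JDK classes alone. Here the JDK's own XMLReader stands in for TagSoup's; in the real setup you would substitute TagSoup's reader so non-well-formed HTML can be streamed the same way. The class name is illustrative:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class SaxXslPipeline {
    // Run markup through an XSLT stylesheet, with the input supplied as a
    // SAX stream. Swapping in TagSoup's XMLReader here is what makes the
    // same pipeline work on messy real-world HTML.
    public static String transform(String markup, String xslt) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        SAXSource source = new SAXSource(reader, new InputSource(new StringReader(markup)));
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        t.transform(source, new StreamResult(out));
        return out.toString();
    }
}
```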
  3. proxy portlet, nice idea[ Go to top ]

    Besides all the nice projects mentioned, like JTidy, TagSoup and others, I really like the idea of a proxy portlet, which should enable me to put any web application into a portal. Is this a common feature of all portal vendors, or is it just in the product (whose name I've forgotten) written by Rickard and co.?

    Marc
  4. proxy portlet, nice idea[ Go to top ]

    Besides all the nice projects mentioned, like JTidy, TagSoup and others, I really like the idea of a proxy portlet, which should enable me to put any web application into a portal. Is this a common feature of all portal vendors, or is it just in the product (whose name I've forgotten) written by Rickard and co.?
    Marc
    I think this is a fairly common thing to have, but the beauty of this particular approach is that it is very customizable (scraping through XSL can be arbitrarily complex) and performs very well (since TagSoup is a SAX-style parser).

    And yes, it is a very nice thing to have in terms of integration. No matter what kind of app the customer has (PHP, ASP, JSP, Perl, whatever), this approach can be used to integrate it relatively easily; we've adapted it to handle link rewriting, resource rewriting, and JavaScript and CSS rewriting out of the box. We are currently using it, combined with security solutions from Entrust, to accomplish enterprise-wide secure integration of "legacy" webapps (many of which are J2EE webapps) at a large customer (40+ webapps, 50,000+ users). Pretty nifty.
  5. XSL + TagSoup[ Go to top ]

    And how well do JTidy and TagSoup cope with poorly conforming HTML? Yes, we’ve tested JTidy on our own pages :-) and it really works. But having more reliable stats would be great.

    Dmitry
    http://www.servletsuite.com
  6. I embedded JTidy into TestMaker about 6 months ago and have received good feedback that JTidy handles poorly formed HTML well. In my application, I use JTidy and XPath to parse the HTML received from a host for image tags. TestMaker then downloads the image-tag references to emulate the behavior of a browser.

    At first I had problems using JTidy to parse HTML created by Microsoft Word and other Microsoft products; the HTML these applications produce is extremely verbose. I posted a problem report to the JTidy group and found an update a couple of days later that let JTidy work through the MS HTML without exception.

    JTidy is cool stuff. Combined with XPath it's really a breeze to write HTML parsing code.

    By the way, I'm now learning XQuery and really liking what I see. From my perspective, XQuery looks like a much more usable SQL. To learn XQuery, I recently (as in last weekend) embedded the Kawa XQuery compiler into TestMaker 4.3.1. Kawa compiles XQuery statements into Java bytecode. When TestMaker sees a file ending in .xql, it assumes the file is an XQuery and compiles and runs the query. The results of the query appear in the output panel.

    -Frank

    P.S. - Details on TestMaker are found at http://www.pushtotest.com
  7. TagSoup is really good[ Go to top ]

    Dmitry: I integrated TagSoup into TestMaker (my open source test tool; details at http://www.pushtotest.com) about a year ago with good results. In my application, once a Web page is loaded, I use JDOM, Jaxen, and TagSoup to find <img> tags. I parse the tag and then load the image. This has been in TestMaker for about 9 months, and I've heard positive feedback from users.
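    The <img>-finding step can be sketched with the JDK's built-in XPath standing in for JDOM and Jaxen; this assumes tidied XHTML input, and the class name is illustrative:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class ImgFinder {
    // Return the src attribute of every <img> element, i.e. the list of
    // image URLs a browser-emulating tool would go on to download.
    public static List<String> imageSources(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        NodeList attrs = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//img/@src", doc, XPathConstants.NODESET);
        List<String> sources = new ArrayList<>();
        for (int i = 0; i < attrs.getLength(); i++) {
            sources.add(attrs.item(i).getNodeValue());
        }
        return sources;
    }
}
```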

    -Frank
  8. What is a proxy portlet?[ Go to top ]

    What is a proxy portlet? Why would it require link rewriting?
  9. What is a proxy portlet?[ Go to top ]

    What is a proxy portlet? Why would it require link rewriting?
    It's a portlet that fetches the content of a given URL. If that content contains links to other URLs whose content should also go through the proxy, then those URLs have to be rewritten so that when the end client clicks on them, the request goes to the portal/proxy portlet (which then fetches the data) instead of to the original URL. In effect, the proxied URL (or application) and all related resources are never accessed directly by the client, and the client shouldn't even know they exist, since all communication goes through the portlet. This means that links need to be rewritten, resources (like "src" attributes on the img tag) need to be rewritten, and CSS and JavaScript need to be rewritten, so that all requests go through the portal.

    This is a way to automatically portletify legacy webapps, regardless of their implementation language.
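    A minimal sketch of the href/src rewriting described above, done with a regular expression over the markup. The `?url=` parameter convention and the class name are hypothetical; a real portlet would use its vendor's URL-generation API, and a production rewriter would also cover CSS url() references and script-generated links:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkRewriter {
    private static final Pattern HREF = Pattern.compile("(href|src)=\"([^\"]+)\"");

    // Rewrite every href/src attribute so the browser's next request goes
    // back through the proxy, carrying the original URL as a parameter.
    public static String rewrite(String html, String proxyBase) {
        Matcher m = HREF.matcher(html);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            String proxied = proxyBase + "?url="
                    + URLEncoder.encode(m.group(2), StandardCharsets.UTF_8);
            m.appendReplacement(sb,
                    m.group(1) + "=\"" + Matcher.quoteReplacement(proxied) + "\"");
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```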
  10. XSL + TagSoup[ Go to top ]

    Hi Rickard,

    I am interested in using a proxy portlet as I need to embed external sites to my application. Are you sharing?

    Mathieu
  11. Would y'all say that XQuery would be an excellent choice for writing automated tests? One advantage I see is that my tests would be implementation-agnostic.
  12. We recently did an XQuery developer survey indicating that a significant number of developers are using, or plan to use, XML Query even before it becomes a standard, because of the language's ability to simplify data extraction. This is discussed in today's edition of eWeek.

    XQuery Developer Survey
  13. We used HTTPUNIT for many things.[ Go to top ]

    I once tried to write an article for IBM developerWorks on using HttpUnit for more than just this purpose, but it was rejected.

    At our company, we have an application that needs to go to many systems, such as billing, the data warehouse, etc., to retrieve information and put it together as reports (text, PDF, HTML). Some of these systems don't have APIs, but they do have Web access. We use HttpUnit to access these systems, and it is easy to use the Table object from the package. The advantages of HttpUnit are that it can be driven programmatically (handling login/password, navigating to specific pages) and that it can handle JavaScript. I don't know how the method introduced in this article handles these things.

    I think it is probably a good idea to use HttpUnit to retrieve the pages first, and then use the method introduced here, combined with some of HttpUnit's APIs, to tackle more complicated problems.
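    The two steps (fetching a page, then pulling data out of its tables) can be roughly sketched with JDK classes alone; HttpUnit layers session handling, login, and JavaScript support on top of a plain fetch like this. The class name is illustrative, and the cell extraction assumes tidied XHTML:

```java
import java.io.ByteArrayInputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class ReportScraper {
    // Plain HTTP fetch; this is the part HttpUnit enriches with cookies,
    // form submission, and login handling.
    public static String fetch(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
        return HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Extract the text of every table cell -- a rough stand-in for
    // HttpUnit's Table object.
    public static List<String> tableCells(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        NodeList cells = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//td", doc, XPathConstants.NODESET);
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < cells.getLength(); i++) {
            texts.add(cells.item(i).getTextContent());
        }
        return texts;
    }
}
```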
  14. More Advanced Use Cases[ Go to top ]

    Interesting use of XQuery. I'm wondering, however, if anyone has done any more advanced scraping (not necessarily using XQuery).

    Specifically, what happens when the data is spread over multiple pages? To use the example from the article, say you have a page that lists all the publishers, and there is a link for each publisher. Each publisher link takes you to a list of all books for that publisher. How do you combine that data?

    Also, what happens when the data is contained in a web app that makes heavy use of DHTML and JavaScript? In that case it's no longer sufficient to get an InputStream and feed it to JTidy, because the JavaScript itself is supposed to modify the DOM.

    I'm just interested in knowing if anyone else has encountered these issues, and if so, how they were dealt with. For the record, we solved the first issue by building a custom scraping framework. We solved the second issue by building said framework on top of one of the web app unit testing frameworks (HtmlUnit).
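    The multi-page case described above can be sketched as a loop that follows each link on an index page and merges the results. The fetch step is pluggable, so a framework like HtmlUnit (or, as in the test below, a canned map of pages) can supply the content; the class name and the deliberately crude regexes are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultiPageScraper {
    private static final Pattern LINK = Pattern.compile("<a href=\"([^\"]+)\"");
    private static final Pattern ITEM = Pattern.compile("<li>([^<]+)</li>");

    // Follow every link on the index page (e.g. one per publisher) and
    // collect the list items (e.g. book titles) from each linked page.
    public static List<String> aggregate(String indexPage, Function<String, String> fetch) {
        List<String> all = new ArrayList<>();
        Matcher links = LINK.matcher(indexPage);
        while (links.find()) {
            Matcher items = ITEM.matcher(fetch.apply(links.group(1)));
            while (items.find()) {
                all.add(items.group(1));
            }
        }
        return all;
    }
}
```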
  15. Take a look at SWExplorerAutomation (SWEA): http://home.comcast.net/~furmana/SWIEAutomation.htm. SWEA creates an object model (automation interface) for any Web application running in Internet Explorer. It uses XPath expressions to extract data from the Web pages, and the expressions can be visually defined using the SWEA designer.
  16. SWExplorerAutomation[ Go to top ]

    Very interesting stuff. It looks to be very close to the custom solution we came up with, except that our solution had to be platform-agnostic and therefore is implemented in Java rather than C# (or whatever .NET language it's implemented in) and uses HtmlUnit rather than IE. Since we don't use IE, we don't have the snazzy interface, either ;-)
  17. SWExplorerAutomation[ Go to top ]

    IE can be easily hidden; in SWExplorerAutomation you can specify whether to run IE visibly or not. Java is an excellent platform, but not everything can be implemented in Java. There are many situations in which HttpUnit will not work. SWExplorerAutomation works with almost all sites developed using HTML and DHTML.
  18. how about embedding mozilla[ Go to top ]

    Very interesting stuff. It looks to be very close to the custom solution we came up with, except that our solution had to be platform-agnostic and therefore is implemented in Java rather than C# (or whatever .NET language it's implemented in) and uses HtmlUnit rather than IE. Since we don't use IE, we don't have the snazzy interface, either ;-)

    I have been using the Java-based HttpClient (http://jakarta.apache.org/commons/httpclient/) and HTMLParser (http://htmlparser.sourceforge.net/) to do screen-scraping for a while; not bad, though. But I'm thinking of embedding Mozilla to get a DOM structure and applying XSLT or XQuery to it. Has anyone done this so far?
  19. Here's a list I put together some time ago:

    http://www.manageability.org/blog/stuff/screen-scraping-tools-written-in-java/view

    Of course, I didn't consider the combination of XQuery and JTidy.

    Carlos
  20. Easy Screen-scraping with XQuery[ Go to top ]

    This, by the way, is the kind of thing that is trivial to do with the open source Orbeon PresentationServer (OPS) without writing a line of Java: combine the URL generator (which incorporates JTidy), the XQuery processor (based on Qexo, though in CVS we just switched the default to Saxon's XQuery processor), and an HTML serializer, and off you go! It's all XML; there is not a single line of Java to write. We've long had an example that gets the latest CNN headline using XSLT instead of XQuery:

      http://www.orbeon.com/ops/goto-example/url

    And another one implementing a "spell checker" by querying Google and checking what words Google provides suggestions for:

      http://www.orbeon.com/ops/goto-example/spellchecker

    I just committed to CVS the "Yahoo! Weather" example based on XQuery.

    The usual info about OPS:

      http://www.orbeon.com/software/
      http://forge.objectweb.org/projects/ops/
  21. Orbeon, JSF, JSR 208[ Go to top ]

    Thanks for the pointer to Orbeon. The demo showed me how I can do forms, page flow, and some workflow in a browser/HTML environment. Since you seem to be in the "integration" space, I'm wondering if you would give your opinion on Orbeon, JSF, JSR 208, and the other integration systems. Anything that helps me get a handle on these projects would be much appreciated.

    -Frank
  22. Orbeon, JSF, JSR 208[ Go to top ]

    Frank, OPS mostly addresses the presentation layer aspect of integration. I cannot discuss here JSF (except to say that it is not at all related to integration) or JSR 208, but feel free to start a discussion in ops-users:

      http://forge.objectweb.org/mail/?group_id=168

    Oh, and I should mention that I posted a blog entry as a follow-up to Brian Goetz's article discussed here, complete with an illustration of how you can do the same with XSLT (not that one way or the other is better, in this case the two solutions really are very similar):

      http://www.orbeon.com/blog/index.php?p=15