HtmlCleaner, open-source HTML parser released (5 messages)
- Posted by: Vladimir Nikic
- Posted on: November 27 2006 05:49 EST
HTML found on the Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring order to the tags, attributes and ordinary text. HtmlCleaner is an open-source HTML parser written in Java. For a given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows rules similar to those that most web browsers use to create the document object model. However, the user may provide a custom tag and rule set for tag filtering and balancing.

When development of this tool began, several open-source Java solutions had already existed for a long time. However, in the author's experience, they are either no longer maintained or fail to produce well-formed XML in all cases. A few of them sometimes produce XML with an unexpected or unstable structure. This was the main motive for starting this project: to create a small (JAR file below 30K), fast and reliable tool that will always produce well-formed XML.

Threaded Messages (5)
- Re: HtmlCleaner, open-source HTML parser released by Kit Davies on November 27 2006 06:28 EST
- Re: HtmlCleaner, open-source HTML parser released by Sergey Smirnov on November 27 2006 13:49 EST
- Re: Re: HtmlCleaner, open-source HTML parser released by Robert Schmelzer on November 27 2006 16:26 EST
- Re: Re: HtmlCleaner, open-source HTML parser released by Rickard Oberg on November 28 2006 02:27 EST
- HTMLCleaner Question by Natalie Hill on December 04 2006 15:44 EST
Re: HtmlCleaner, open-source HTML parser released
- Posted by: Kit Davies
- Posted on: November 27 2006 06:28 EST
- in response to Vladimir Nikic
Looks useful. It seems to have a few more features than TagSoup, which has served me well. Are there any comparisons of the output the two produce? One feature that is missing, though, is an org.xml.sax.XMLReader adapter. This would make it much easier to use in XSLT operations. Other than that, congratulations! Kit
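To illustrate why an org.xml.sax.XMLReader adapter matters for XSLT: the JDK's transformation API accepts any XMLReader through javax.xml.transform.sax.SAXSource, so a cleaner exposed that way could feed a stylesheet directly, without an intermediate file or string. The sketch below is only an illustration under assumptions. It uses the JDK's built-in parser as a stand-in for such an adapter (TagSoup's org.ccil.cowan.tagsoup.Parser is an XMLReader of this kind), and the class name XmlReaderPipeline, the method name and the sample markup are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper: shows how any XMLReader plugs into the JDK's XSLT API.
final class XmlReaderPipeline {

    /** Runs the identity transform over whatever SAX events the given reader emits. */
    static String transform(XMLReader reader, String input) throws Exception {
        // SAXSource pairs a reader with its input; the transformer drives the reader itself.
        SAXSource source = new SAXSource(reader, new InputSource(new StringReader(input)));

        // newTransformer() without a stylesheet yields the identity transform.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        identity.transform(source, new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in reader: the JDK parser only accepts well-formed input.
        // An HTML-cleaning XMLReader would be passed in exactly the same way.
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        System.out.println(transform(reader, "<p>Hello <b>world</b></p>"));
    }
}
```

Passing a real stylesheet to newTransformer(...) instead of using the identity transform would not change how the reader is wired in.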
Re: HtmlCleaner, open-source HTML parser released
- Posted by: Sergey Smirnov
- Posted on: November 27 2006 13:49 EST
- in response to Vladimir Nikic
Ajax4jsf uses a filter that cleans up the HTML on the fly. It is built on the Tidy code base. We could probably switch to HtmlCleaner in the future. You are right: pages under development are usually dirty. That is a problem for Ajax, because the browser does not correct the code when it is inserted into the DOM directly. What is interesting is that people say: "your filter has broken our HTML layout!" The browser interprets the ill-formed code differently from the same code after it has been cleaned up. So, you have a dilemma here:
1. make the resulting code comply with the W3C standards, or
2. emulate the browser's corrections.
Be prepared for people to blame the result if you take the first route. (Only a few guys can accept that they might be wrong :-) Unfortunately, the second route is unpredictable: the result of the interpretation varies from browser to browser.
Re: Re: HtmlCleaner, open-source HTML parser released
- Posted by: Robert Schmelzer
- Posted on: November 27 2006 16:26 EST
- in response to Sergey Smirnov
I think your filter should at least warn the user, or even better throw an exception, when it receives output that is not well-formed. Basically, this is a major development error and should be found during system testing, so an exception would help a lot here. I also have to take a look at HtmlCleaner; I run into the same troubles quite often. Does anybody know if something similar exists for PHP?
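The strict behaviour suggested here can be sketched with the JDK's own XML parser: one method that only reports whether markup is well-formed, and one that throws so the problem surfaces during testing. This is an illustration under assumptions, not the Ajax4jsf filter or HtmlCleaner; the class name WellFormedCheck and both method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: lenient and strict well-formedness checks.
final class WellFormedCheck {

    /** Lenient variant: reports the problem and leaves the decision to the caller. */
    static boolean isWellFormed(String markup) {
        try {
            // DefaultHandler ignores all content but rethrows fatal parse errors.
            // Input with a DOCTYPE may trigger DTD loading; this sketch does not handle that.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the markup
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException("XML parser unavailable", e);
        }
    }

    /** Strict variant: fails loudly on markup that is not well-formed. */
    static String requireWellFormed(String markup) {
        if (!isWellFormed(markup)) {
            throw new IllegalArgumentException("markup is not well-formed XML");
        }
        return markup;
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<p>closed</p>"));       // true
        System.out.println(isWellFormed("<p><b>unclosed</p>"));  // false
        requireWellFormed("<p><b>unclosed</p>");                 // throws IllegalArgumentException
    }
}
```

Whether the strict variant is usable depends on who controls the markup: it suits pages an application generates itself, and much less so markup fetched from arbitrary sites.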
Re: Re: HtmlCleaner, open-source HTML parser released
- Posted by: Rickard Oberg
- Posted on: November 28 2006 02:27 EST
- in response to Robert Schmelzer
"I think your filter should at least warn the user or even better - throw an exception - in the case it gets not well formed output. Because basically this is a major development error and should be found during system testing."
The fact that (for example) TagSoup does *not* throw an exception on HTML errors is what makes it usable in the first place. We use it to proxy any HTML website out there (similar to what portletbridge.org does), and if the HTML had to be well-formed, it would be useless. That's just the way it is...
HTMLCleaner Question
- Posted by: Natalie Hill
- Posted on: December 04 2006 15:44 EST
- in response to Vladimir Nikic
What is wrong here? Why do I get an exception on this line: HtmlCleaner cleaner = new HtmlCleaner("Some String"); ?

    public String getHtmlCleanedElementValue(){
        String returnVal="";
        HtmlCleaner cleaner = new HtmlCleaner("Some String");
        try{
            cleaner = new HtmlCleaner(elementValue);
            cleaner.clean();
            returnVal = cleaner.getPrettyXmlAsString();
            System.out.println("********* getHtmlCleanedElementValue: " + returnVal);
        }catch(java.io.IOException e){
            System.out.println("********* getHtmlCleanedElementValue: Unable to save clean HTML");
        }finally{
            return returnVal;
        }
    }
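The post does not say which exception is thrown, so its cause cannot be determined from the thread. One thing the structure of the method does explain is why only that line can fail visibly: it sits outside the try block, whereas a return inside finally discards any exception raised inside the try. The sketch below shows that Java behaviour in isolation. It does not use HtmlCleaner; the class FinallyReturnDemo and its clean method are hypothetical stand-ins.

```java
// Hypothetical demo of how a return in a finally block hides exceptions.
final class FinallyReturnDemo {

    /** Stand-in for a cleaning step; throws an unchecked exception on null input. */
    static String clean(String html) {
        if (html == null) {
            throw new IllegalArgumentException("no input");
        }
        return html.trim();
    }

    /** Same shape as the posted method: the failure never reaches the caller. */
    @SuppressWarnings("finally")
    static String swallowing(String html) {
        String result = "";
        try {
            result = clean(html);
        } finally {
            return result; // discards any exception thrown in the try block
        }
    }

    /** Returning after the try lets unexpected exceptions propagate. */
    static String propagating(String html) {
        String result = "";
        try {
            result = clean(html);
        } finally {
            // cleanup only, no return here
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("[" + swallowing(null) + "]"); // prints "[]", the error is lost
        try {
            propagating(null);
        } catch (IllegalArgumentException e) {
            System.out.println("caught: " + e.getMessage()); // prints "caught: no input"
        }
    }
}
```

Moving the return out of finally and declaring the cleaner inside the try would at least make every failure in the posted method visible in one place.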