News: HtmlCleaner, open-source HTML parser released

  1. HtmlCleaner, open-source HTML parser released (5 messages)

    HTML found on the Web is usually dirty, ill-formed, and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring order to the tags, attributes, and ordinary text. HtmlCleaner is an open-source HTML parser written in Java. For a given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows rules similar to those that most web browsers use to create the document object model. However, the user may provide a custom tag and rule set for tag filtering and balancing. When development of this tool began, some open-source Java solutions had already existed for a long time. However, in the author's experience, they are either no longer maintained or fail to produce well-formed XML in all cases. A few of them sometimes produce XML with an unexpected or unstable structure. This was the main motive for starting this project: to create a small (JAR file below 30K), fast, and reliable tool that always produces well-formed XML.
  2. Looks useful. Seems to have a few more features than TagSoup, which has served me well. Any comparisons on produced output? One feature that is missing though is an org.xml.sax.XMLReader adapter. This would make it much easier to use in XSLT operations. Other than that, congratulations! Kit
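
To illustrate the point about an XMLReader adapter: the JDK's XSLT API accepts any `org.xml.sax.XMLReader` through `javax.xml.transform.sax.SAXSource`, so a parser that exposes this interface can feed a stylesheet directly, without an intermediate string or DOM. The following is only a minimal sketch and not HtmlCleaner code. According to the post, HtmlCleaner has no such adapter, so the JDK's own SAX parser stands in for it here (it accepts only input that is already well-formed). The class and method names (`XmlReaderXsltSketch`, `standInReader`, `transform`) are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class XmlReaderXsltSketch {

    // Assumption: stand-in for a hypothetical HtmlCleaner XMLReader adapter.
    // The JDK's SAX parser only handles input that is already well-formed.
    static XMLReader standInReader() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException("Could not create XMLReader", e);
        }
    }

    // Applies the stylesheet to the SAX events the reader produces for the input.
    static String transform(XMLReader reader, String input, String stylesheet) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter out = new StringWriter();
            transformer.transform(
                    new SAXSource(reader, new InputSource(new StringReader(input))),
                    new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Stylesheet that extracts the text of the first paragraph.
        String stylesheet =
                "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
              + "<xsl:output method='text'/>"
              + "<xsl:template match='/'><xsl:value-of select='/html/body/p'/></xsl:template>"
              + "</xsl:stylesheet>";
        String page = "<html><body><p>hello</p></body></html>";
        System.out.println(transform(standInReader(), page, stylesheet));
    }
}
```

With an adapter of this kind, only `standInReader()` would need to change; the transformation code stays the same whichever parser produces the SAX events.
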
  3. Ajax4jsf uses a filter that cleans up the HTML on the fly. It is based on the Tidy code base. We could probably switch to HtmlCleaner in the future. You are right: pages under development are usually dirty. That is a problem for Ajax behavior, because the browser does not correct the code when it is inserted into the DOM directly. What is interesting is that people say: "Your filter has broken our HTML layout!" The browser interprets the ill-formed code differently from the same code after cleanup. So you have a dilemma here: 1. make the resulting code comply with the W3C standards, or 2. emulate the browser's corrections. Be prepared for people to blame the result if you choose the first way. (Only a few people can accept that they might be wrong :-) Unfortunately, the second way is unpredictable: the result of the interpretation varies from browser to browser.
  4. I think your filter should at least warn the user or, even better, throw an exception when it receives output that is not well-formed. Basically, this is a major development error and should be found during system testing, so an exception would help a lot here. I also have to take a look at HtmlCleaner; I run into the same troubles quite often. Does anybody know if something similar exists for PHP?
  5. "I think your filter should at least warn the user or, even better, throw an exception when it receives output that is not well-formed. Basically, this is a major development error and should be found during system testing."
    The fact that (for example) TagSoup does *not* throw an exception on HTML errors is what makes it usable in the first place. We use it to proxy any HTML website out there (similar to what portletbridge.org does), and if the HTML had to be well-formed it would be useless. Just the way it is...
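
The two positions in posts 4 and 5 can coexist if the check is separate from the cleanup: a lenient mode that only reports a problem, and a strict mode for system testing that throws. The following is a minimal sketch using only the JDK's XML parser. It is not part of Ajax4jsf, Tidy, TagSoup, or HtmlCleaner, and the names `WellFormedCheck`, `isWellFormed`, and `requireWellFormed` are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class WellFormedCheck {

    // Lenient mode: returns true if the markup parses as well-formed XML.
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Rethrow parse errors instead of letting the parser print them.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the markup
        } catch (Exception e) {
            throw new IllegalStateException("XML parser unavailable or I/O failure", e);
        }
    }

    // Strict mode, e.g. for system testing: fail loudly on ill-formed markup.
    static String requireWellFormed(String markup) {
        if (!isWellFormed(markup)) {
            throw new IllegalArgumentException("Markup is not well-formed XML");
        }
        return markup;
    }

    public static void main(String[] args) {
        String page = "<html><body><p>unclosed</body></html>";
        if (!isWellFormed(page)) {
            System.err.println("Warning: page is not well-formed");
        }
    }
}
```

A proxy that has to accept arbitrary pages would use only the lenient check (or none), while a development build could call `requireWellFormed` so that broken pages surface during testing.
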
  6. HTMLCleaner Question

    What is wrong here? Why do I get an exception on this line: HtmlCleaner cleaner = new HtmlCleaner("Some String"); ?

        public String getHtmlCleanedElementValue() {
            String returnVal = "";
            HtmlCleaner cleaner = new HtmlCleaner("Some String");
            try {
                cleaner = new HtmlCleaner(elementValue);
                cleaner.clean();
                returnVal = cleaner.getPrettyXmlAsString();
                System.out.println("********* getHtmlCleanedElementValue: " + returnVal);
            } catch (java.io.IOException e) {
                System.out.println("********* getHtmlCleanedElementValue: Unable to save clean HTML");
            } finally {
                return returnVal;
            }
        }