HtmlCleaner is a Java library used to safely parse and transform any HTML found on the web into well-formed XML. It is designed to be small, fast, flexible and independent. HtmlCleaner may be used in Java code, as a command-line tool or as an Ant task. The result of parsing is a lightweight document object model which can easily be transformed to standards like DOM or JDom, or serialized to XML output in various ways (compact, pretty-printed and so on). The new version brings some important improvements:
- HtmlCleaner's document object model now has a number of methods for node and attribute manipulation, so it is easy to search or modify it before serialization.
- Basic XPath is supported on the HtmlCleaner DOM.
- Creating a custom tag set and rules for tag balancing is now much easier with an XML configuration file.
- A number of fixes and API improvements.
- what about NekoHTML? by Jaap Beetstra on July 15 2008 15:04 EDT
- Re: HtmlCleaner 2.0 release by Istvan Soos on July 16 2008 03:10 EDT
- Re: HtmlCleaner 2.0 release by Jose Maria Arranz on July 16 2008 03:43 EDT
- jericho by greg matthews on July 17 2008 18:42 EDT
How does HtmlCleaner compare to NekoHTML?
"How does HtmlCleaner compare to NekoHTML?"
HtmlCleaner was initially developed as a helper project for another open-source project called Web-Harvest. At the time I was evaluating a few of the best-known Java-based HTML parsers (JTidy, TagSoup, NekoHTML) and wasn't happy at all. They simply didn't do the job. I needed a reliable tool that will always produce well-formed XML. And here is the statement from NekoHTML's home page: "There are HTML documents for which NekoHTML cannot properly generate a well-formed XML document event stream..."
I *love* XPath. All this get child of child of child method stuff...bleah. XPath!
Version 2.0 has very poor XPath support.
Great news, thanks! I've always liked using it whenever I've needed such a tool!
What DOM implementation is used? It would be nice if Xerces was used as NekoHTML does.
"What DOM implementation is used?"
It has an internal DOM which is as light as possible, since performance was a very important issue. However, if it doesn't satisfy your needs, it can easily be transformed to a standard DOM or JDom.
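To illustrate why a standard DOM can be useful here, below is a minimal sketch that uses only JDK classes, not HtmlCleaner's own API. It assumes the cleaned page has already been serialized to a well-formed XML string; the sample markup, the class name `CleanedXmlXPath` and the helper `select` are hypothetical and exist only for this example. Once the output is in a W3C DOM, the JDK's XPath engine can be applied to it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class CleanedXmlXPath {

    // Hypothetical sample standing in for well-formed XML produced by a cleaner.
    static final String CLEANED_XML =
        "<html><head><title>Demo</title></head><body>"
        + "<a href=\"http://example.org/a\">A</a>"
        + "<p><a href=\"http://example.org/b\">B</a></p>"
        + "<a name=\"anchor-only\">no href</a>"
        + "</body></html>";

    /**
     * Parses well-formed XML into a standard W3C DOM and returns the text
     * content of every node matched by the given XPath expression.
     */
    static List<String> select(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // For attribute and text nodes this is simply the node's value.
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            // Wrap checked parser/XPath exceptions to keep the sketch short.
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Collect the href of every link that actually has one.
        for (String href : select(CLEANED_XML, "//a[@href]/@href")) {
            System.out.println(href);
        }
    }
}
```

The trade-off is the one described above: a standard DOM is heavier than a lightweight internal model, so the conversion is probably only worth it when richer queries are needed.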
Have you checked out Jericho? http://jerichohtml.sourceforge.net/doc/index.html It's pretty awesome, dealing with all sorts of HTML nonsense, and it allows update/replace of elements, attributes, etc., while retaining non-well-formedness.