HtmlCleaner 2.0 release (8 messages)
- Posted by: Vladimir Nikic
- Posted on: July 15 2008 04:35 EDT
HtmlCleaner is a Java library used to safely parse and transform any HTML found on the web into well-formed XML. It is designed to be small, fast, flexible and independent. HtmlCleaner may be used in Java code, as a command-line tool or as an Ant task. The result of parsing is a lightweight document object model which can easily be transformed to standards like DOM or JDom, or serialized to XML output in various ways (compact, pretty-printed and so on). The new version brings some important improvements:
- HtmlCleaner's document object model now has a number of methods for node and attribute manipulation, so it is easy to search or modify it before serialization.
- Basic XPath is supported on the HtmlCleaner DOM.
- Creating a custom tag set and rules for tag balancing is now much easier with an XML configuration file.
- A number of fixes and API improvements.
Threaded Messages (8)
- what about NekoHTML? by Jaap Beetstra on July 15 2008 15:04 EDT
- Re: what about NekoHTML? by Vladimir Nikic on July 15 2008 17:34 EDT
- Good news about XPath, that's very handy by Cole Thompson on July 19 2008 19:41 EDT
- Good news about XPath, that's very handy by Angel Cervera Claudio on December 15 2010 07:45 EST
- Re: HtmlCleaner 2.0 release by Istvan Soos on July 16 2008 03:10 EDT
- Re: HtmlCleaner 2.0 release by Jose Maria Arranz on July 16 2008 03:43 EDT
- Re: HtmlCleaner 2.0 release by Vladimir Nikic on July 16 2008 04:28 EDT
- jericho by greg matthews on July 17 2008 18:42 EDT
what about NekoHTML?
- Posted by: Jaap Beetstra
- Posted on: July 15 2008 15:04 EDT
- in response to Vladimir Nikic
How does HtmlCleaner compare to NekoHTML?
Re: what about NekoHTML?
- Posted by: Vladimir Nikic
- Posted on: July 15 2008 17:34 EDT
- in response to Jaap Beetstra
How does HtmlCleaner compare to NekoHTML?
HtmlCleaner was initially developed as a helper project for another open-source project called Web-Harvest. At the time I was evaluating a few of the best-known Java-based HTML parsers (JTidy, TagSoup, NekoHTML) and wasn't happy at all. They simply didn't do the job. I needed a reliable tool that would always produce well-formed XML. And here is a statement from NekoHTML's home page: "There are HTML documents for which NekoHTML cannot properly generate a well-formed XML document event stream..."
Good news about XPath, that's very handy
- Posted by: Cole Thompson
- Posted on: July 19 2008 19:41 EDT
- in response to Jaap Beetstra
I *love* XPath. All this get child of child of child method stuff...bleah. XPath!
Good news about XPath, that's very handy
- Posted by: Angel Cervera Claudio
- Posted on: December 15 2010 07:45 EST
- in response to Cole Thompson
Version 2.0 has very poor XPath support.
Re: HtmlCleaner 2.0 release
- Posted by: Istvan Soos
- Posted on: July 16 2008 03:10 EDT
- in response to Vladimir Nikic
Great news, thanks! I've always liked using it whenever I've needed such a tool!
Re: HtmlCleaner 2.0 release
- Posted by: Jose Maria Arranz
- Posted on: July 16 2008 03:43 EDT
- in response to Vladimir Nikic
What DOM implementation is used? It would be nice if Xerces was used as NekoHTML does.
Re: HtmlCleaner 2.0 release
- Posted by: Vladimir Nikic
- Posted on: July 16 2008 04:28 EDT
- in response to Jose Maria Arranz
What DOM implementation is used?
It has an internal DOM which is as light as possible, since performance was a very important issue. However, if it doesn't satisfy your needs, it can easily be transformed to standard DOM or JDom.
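To illustrate what working with the cleaned result as standard DOM can look like, here is a minimal sketch that uses only JDK classes. It does not call any HtmlCleaner API: the XML string simply stands in for well-formed output a cleaner might produce, and the class name `CleanedXmlXPathSketch`, its helper methods and the sample markup are all hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; not part of HtmlCleaner.
public class CleanedXmlXPathSketch {

    // Parses well-formed XML text into a standard org.w3c.dom.Document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Counts the nodes matched by an XPath expression.
    static int count(Document doc, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    // Returns the string value of the first node matched by an XPath expression.
    static String text(Document doc, String expression) throws Exception {
        return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample: stands in for XML serialized by an HTML cleaner.
        String cleaned = "<html><head><title>Demo</title></head>"
                + "<body><p>one</p><p>two <a href=\"http://example.com/\">link</a></p></body></html>";
        Document doc = parse(cleaned);
        System.out.println(count(doc, "//p"));      // number of paragraphs
        System.out.println(text(doc, "//a/@href")); // target of the first link
    }
}
```

The sketch only works because the input is well-formed XML; raw HTML from the web would typically make the JDK parser fail, which is the gap a cleaning step is meant to close.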
It would be nice if Xerces was used as NekoHTML does.
jericho
- Posted by: greg matthews
- Posted on: July 17 2008 18:42 EDT
- in response to Vladimir Nikic
Have you checked out Jericho? http://jerichohtml.sourceforge.net/doc/index.html It's pretty awesome: it deals with all sorts of HTML nonsense and allows update/replace of elements, attributes, etc., while retaining non-well-formedness.