HtmlCleaner 2.0 release

  1. HtmlCleaner 2.0 release (8 messages)

    HtmlCleaner is a Java library used to safely parse and transform any HTML found on the web into well-formed XML. It is designed to be small, fast, flexible and independent. HtmlCleaner may be used in Java code, as a command-line tool or as an Ant task. The result of parsing is a lightweight document object model which can easily be transformed to standards like DOM or JDOM, or serialized to XML output in various ways (compact, pretty-printed and so on). The new version brings some important improvements:
    • HtmlCleaner's document object model now has a number of methods for node and attribute manipulation, so it is easy to search or modify it before serialization.
    • Basic XPath is supported on the HtmlCleaner DOM.
    • Creating a custom tag set and rules for tag balancing is now much easier with an XML configuration file.
    • A number of fixes and API improvements.
    Since its first release, HtmlCleaner has been tested and used by a lot of developers, who have contributed valuable suggestions, bug reports and code patches. The new version tries to satisfy most of those requests and will hopefully be useful for anyone who needs structured HTML.
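
    To make the idea of tag balancing a little more concrete, here is a minimal sketch of a stack-based balancer. It is only an illustration, not HtmlCleaner's actual algorithm or API: the class name TagBalancerSketch, the method balance and the EMPTY_TAGS rule set are all hypothetical, and it handles only simple tags without attributes and does not escape text.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagBalancerSketch {

    // Matches a simple opening or closing tag without attributes, e.g. <b> or </b>.
    private static final Pattern TAG = Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)>");

    // Hypothetical rule set: tags that never take a closing tag.
    private static final Set<String> EMPTY_TAGS = Set.of("br", "hr", "img");

    static String balance(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>();   // currently open tags, innermost first
        Matcher m = TAG.matcher(html);
        int pos = 0;
        while (m.find()) {
            out.append(html, pos, m.start());      // copy the text before this tag
            pos = m.end();
            String name = m.group(2).toLowerCase();
            boolean closing = !m.group(1).isEmpty();
            if (!closing) {
                if (EMPTY_TAGS.contains(name)) {
                    out.append('<').append(name).append("/>");
                } else {
                    open.push(name);
                    out.append('<').append(name).append('>');
                }
            } else if (open.contains(name)) {
                // Close every tag still open inside the one being closed, then the tag itself.
                String top;
                do {
                    top = open.pop();
                    out.append("</").append(top).append('>');
                } while (!top.equals(name));
            }
            // A closing tag with no matching open tag is silently dropped.
        }
        out.append(html.substring(pos));           // copy any trailing text
        while (!open.isEmpty()) {                  // close whatever is still open
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(balance("<p><b>bold<i>both</b> tail"));
    }
}
```

    With the input shown in main, the sketch prints <p><b>bold<i>both</i></b> tail</p>. A real cleaner also has to deal with attributes, comments, entities and per-tag nesting rules, which is where a configurable tag set comes in.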

    Threaded Messages (8)

  2. what about NekoHTML?

    How does HtmlCleaner compare to NekoHTML?
  3. Re: what about NekoHTML?

    How does HtmlCleaner compare to NekoHTML?
    HtmlCleaner was initially developed as a helper project for another open-source project called Web-Harvest. At the time I evaluated a few of the best-known Java-based HTML parsers (JTidy, TagSoup, NekoHTML) and wasn't happy at all. They simply didn't do the job. I needed a reliable tool that would always produce well-formed XML. And here is a statement from NekoHTML's home page: "There are HTML documents for which NekoHTML cannot properly generate a well-formed XML document event stream..."
  4. I *love* XPath. All this get child of child of child method stuff...bleah. XPath!
  5. Version 2.0 has very poor XPath support.

  6. Re: HtmlCleaner 2.0 release

    Great news, thanks! I've always liked using it whenever I needed such a tool!
  7. What DOM implementation is used? It would be nice if Xerces were used, as in NekoHTML.
  8. What DOM implementation is used?
    It would be nice if Xerces were used, as in NekoHTML.
    It has an internal DOM which is as light as possible, since performance was a very important issue. However, if it doesn't satisfy your needs, it can easily be transformed to standard DOM or JDOM.
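
    As a rough illustration of what working with a standard DOM looks like once the markup is well-formed, the sketch below uses only JDK classes (JAXP and javax.xml.xpath). It does not call HtmlCleaner itself: the input string is a hypothetical example of cleaned, well-formed output, and the names CleanedXmlXPath, toDom and linkTargets are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class CleanedXmlXPath {

    // Parses a well-formed XML string into a standard W3C DOM document.
    static Document toDom(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Collects the value of every href attribute found on <a> elements.
    static List<String> linkTargets(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate("//a/@href", toDom(xml), XPathConstants.NODESET);
        List<String> hrefs = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            hrefs.add(nodes.item(i).getNodeValue());
        }
        return hrefs;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical cleaner output: every tag is closed and properly nested.
        String cleaned = "<html><head><title>Demo</title></head><body>"
                + "<p>See <a href=\"a.html\">A</a> and <a href=\"b.html\">B</a>.</p>"
                + "</body></html>";
        System.out.println(linkTargets(cleaned));
    }
}
```

    The point of the design is that once the output is well-formed, any standard XML tooling (full XPath, XSLT, a Xerces-backed DOM) can take over where the lightweight internal model stops.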
  9. jericho

    Have you checked out Jericho? http://jerichohtml.sourceforge.net/doc/index.html It's pretty awesome: it deals with all sorts of HTML nonsense and allows update/replace of elements, attributes, etc., while retaining non-well-formedness.