As we were (OK: I am;-) running into trouble integrating HTML-to-XML
parsing into the Bison-based parser, I have written a specialized C++
class that can do this prior to the actual parsing. It outputs only
correct XML *structure* and (as far as I can tell) follows the XHTML
nesting rules (<tr> in <table> etc.) as well.
"Broken" HTML will be changed into &lt; / &gt; entities, so only valid
XML will reach the output. However, I took some care to automagically
fix the "usual suspects" (obligatory 21C3 reference) of HTML ugliness,
like not-closed <li> and various table chaos. Even a lonely <caption>
(not closed) somewhere in the text will generate a full table. It might
not be pretty, but it will be valid XML.
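To illustrate the idea (this is only a rough sketch, NOT the code in
html2xml.cpp): pass known tags through, escape everything else that
uses < or >, close a dangling <li> when the next one starts, and close
whatever is still open at the end. The function name sanitize() and the
tiny tag whitelist are made up for this example; attributes, '&' and
the table handling are ignored here.

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <string>
#include <vector>

// Hypothetical sketch, not the html2xml.cpp implementation.
// Assumption: only these four attribute-less tags are "known".
std::string sanitize(const std::string& in) {
    static const std::set<std::string> known = {"b", "i", "ul", "li"};
    std::vector<std::string> open;  // stack of currently open tags
    std::string out;

    for (std::size_t i = 0; i < in.size(); ++i) {
        const char c = in[i];
        if (c == '>') { out += "&gt;"; continue; }  // stray '>'
        if (c != '<') { out += c; continue; }       // ordinary text

        // Try to read "<name>" or "</name>" starting at position i.
        std::size_t j = i + 1;
        const bool closing = (j < in.size() && in[j] == '/');
        if (closing) ++j;
        std::string name;
        while (j < in.size() &&
               std::isalpha(static_cast<unsigned char>(in[j]))) {
            name += static_cast<char>(
                std::tolower(static_cast<unsigned char>(in[j])));
            ++j;
        }
        const bool is_tag = !name.empty() && j < in.size() &&
                            in[j] == '>' && known.count(name) > 0;

        // Not a recognised tag: escape the '<' and let the loop handle
        // the remaining characters (a later '>' becomes "&gt;").
        if (!is_tag) { out += "&lt;"; continue; }

        if (closing) {
            // A closing tag without a matching open tag is escaped too.
            if (std::find(open.rbegin(), open.rend(), name) == open.rend()) {
                out += "&lt;";
                continue;
            }
            // Close everything opened after the matching tag, then the
            // tag itself.
            while (true) {
                const std::string top = open.back();
                open.pop_back();
                out += "</" + top + ">";
                if (top == name) break;
            }
        } else {
            // A new <li> directly inside an open <li> closes the old one.
            if (name == "li" && !open.empty() && open.back() == "li") {
                open.pop_back();
                out += "</li>";
            }
            open.push_back(name);
            out += "<" + name + ">";
        }
        i = j;  // skip past the tag; the loop's ++i moves beyond '>'
    }

    // Close whatever is still open at the end of the input.
    while (!open.empty()) {
        out += "</" + open.back() + ">";
        open.pop_back();
    }
    return out;
}
```

With this sketch, sanitize("<ul><li>a<li>b</ul> 1 < 2 </i>") returns
"<ul><li>a</li><li>b</li></ul> 1 &lt; 2 &lt;/i&gt;".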
While this is primarily intended for the wiki-to-XML parser, it might
work for enforcing XML output for the current parser as well. We'd only
have to run the wiki source through it before actually parsing.
Source: CVS HEAD, Module "flexbisonparse", file "html2xml.cpp". (GPL,
of course)
Magnus