On 2/11/06, Brion Vibber <brion(a)pobox.com> wrote:
> If you'd like to try rewriting [Sanitizer::removeHTMLtags()] so it balances end tags
> properly, detects illegal nesting cases, and understands MathML, that would be
> super awesome.
I thought a bit about it and I came to the conclusion that I don't
quite understand what Sanitizer::removeHTMLtags() is supposed to do.
Firstly, I was wrong about the MathML. Sanitizer::removeHTMLtags()
does not need to understand MathML, because by the point where it is
called, Parser::strip() has already replaced the <math> tags with
placeholder strings.
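To make the placeholder point concrete, here is a rough sketch in Python. It is not MediaWiki's actual PHP code: the names strip_math() and unstrip() and the @@MATHn@@ placeholder format are invented for illustration. It only shows why a later pass never sees the math markup.

```python
import re

def strip_math(text):
    """Replace each <math>...</math> span with an opaque placeholder.

    Hypothetical stand-in for the stripping step; the placeholder
    format is made up for this sketch.
    """
    stash = {}

    def repl(match):
        key = "@@MATH%d@@" % len(stash)
        stash[key] = match.group(0)  # remember the original markup
        return key

    stripped = re.sub(r"<math>.*?</math>", repl, text, flags=re.DOTALL)
    return stripped, stash

def unstrip(text, stash):
    """Put the stashed <math> spans back in place of their placeholders."""
    for key, original in stash.items():
        text = text.replace(key, original)
    return text
```

Anything that runs between strip_math() and unstrip() only ever sees the placeholders, so it has no need to understand what is inside <math>.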
But the important point is: How can Sanitizer::removeHTMLtags()
balance tags? Consider the input
----
First line. <s> Struck through
More text.

Another paragraph.
----
There is an unclosed <s> tag here, so removeHTMLtags() should close
it. If it does the same as HTML Tidy, it adds </s> after "More text.",
before the </p> implied by the empty line.
Okay, that's fine. But now consider
----
* First line. <s> Struck through
* More text.

Another paragraph.
----
In this case, the </s> has to be put after "Struck through".
I think this means that removeHTMLtags() can only work if it parses
the text according to (a subset of) the wiki grammar. But that seems a
bit messy from a design point of view. The alternative is to call
removeHTMLtags() later, at the same point where HTML Tidy is called
(which is what I wrongly thought was happening).
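To illustrate what "parsing a subset of the wiki grammar" would mean, here is a minimal Python sketch. It is not the Sanitizer code and not a proposal for its real design: the function names, the tag list, and the block rules (each "*" line is its own block, other lines form paragraphs separated by blank lines) are all assumptions of mine. It closes any inline tag still open at the end of each block.

```python
import re

INLINE_TAGS = {"s", "b", "i", "u"}  # illustrative subset only
TAG_RE = re.compile(r"<(/?)([a-zA-Z]+)\s*>")

def close_open_tags(chunk):
    """Append end tags for inline tags still open at the end of chunk."""
    stack = []
    for m in TAG_RE.finditer(chunk):
        closing, name = m.group(1), m.group(2).lower()
        if name not in INLINE_TAGS:
            continue
        if not closing:
            stack.append(name)
        elif name in stack:
            # An end tag closes its nearest opener and anything inside it.
            while stack.pop() != name:
                pass
    return chunk + "".join("</%s>" % n for n in reversed(stack))

def split_blocks(wikitext):
    """Split into blocks: list items, paragraphs, and blank separators."""
    blocks, para = [], []

    def flush():
        if para:
            blocks.append("\n".join(para))
            para.clear()

    for line in wikitext.split("\n"):
        if line.startswith("*"):
            flush()
            blocks.append(line)   # each list item is its own block
        elif line.strip() == "":
            flush()
            blocks.append(line)   # keep the blank line as a separator
        else:
            para.append(line)
    flush()
    return blocks

def balance(wikitext):
    """Close unclosed inline tags at the end of each block."""
    return "\n".join(close_open_tags(b) for b in split_blocks(wikitext))
```

With the two inputs above, balance() puts the </s> after "More text." in the first case and after "Struck through" in the second, which is exactly why the balancing step has to know about the block structure.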
By the way, I'm still interested to hear why you want to get rid of
HTML Tidy. Is it performance?
Cheers,
Jitse