I'm in the process of rewriting the old Sanitizer::removeHTMLTags to work much
better. The new code properly closes implied end-tags and obeys some additional
HTML rules about what can go where.
In-progress patches posted at:
http://bugzilla.wikimedia.org/show_bug.cgi?id=5497
Before I finish this up, though, it would be good if we can agree on how to
handle a few things.
** HTML across template boundaries
Right now there's a big behavior difference between the regular mode and the
behavior with Tidy enabled. In regular mode, the HTML nesting and closing rules
are applied separately to every transcluded text chunk. In Tidy mode, only the
allowed-HTML check is applied at that stage, and nesting and closing are left for
Tidy to fix up at the very end.
An example of a construct that breaks is a template that defines a table header
like:
<table class="fooba">
and is included like this:
{{cool-table-start}}
{{cool-row|blah}}
{{cool-table-end}}
In current non-Tidy mode, this breaks violently as the <table> gets closed in
the first template, and then all the following <tr>, <td> etc are rejected as
they're not allowed in body text.
In current Tidy mode this is allowed to pass on through just fine; the pieces
are assembled and then checked for nesting later.
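To make the difference concrete, here is a rough Python sketch (not the actual Sanitizer code) of a toy normalizer run once per chunk versus once over the assembled text. The REQUIRED_PARENT table and the normalize() function are hypothetical simplifications, and the toy simply drops rejected tags, which may not match what the real code does with them.

```python
import re

TAG_RE = re.compile(r"<(/?)([a-zA-Z]+)[^>]*>")
# Hypothetical, heavily simplified content model: tag -> required parent.
REQUIRED_PARENT = {"tr": "table", "td": "tr"}


def normalize(text):
    """Toy normalizer: drop misplaced tags, close whatever is left open."""
    out, stack, pos = [], [], 0
    for m in TAG_RE.finditer(text):
        out.append(text[pos:m.start()])
        pos = m.end()
        closing, name = m.group(1) == "/", m.group(2).lower()
        if closing:
            if name in stack:
                # Emit implied end tags down to the matching opener.
                while True:
                    top = stack.pop()
                    out.append("</%s>" % top)
                    if top == name:
                        break
            # A close tag with no matching opener is dropped.
        else:
            parent = REQUIRED_PARENT.get(name)
            if parent is not None and (not stack or stack[-1] != parent):
                continue  # not allowed in this context: reject the tag
            stack.append(name)
            out.append(m.group(0))
    out.append(text[pos:])
    while stack:
        out.append("</%s>" % stack.pop())
    return "".join(out)


# Stand-ins for the expanded text of the three templates above.
chunks = ['<table class="fooba">', "<tr><td>blah</td></tr>", "</table>"]

# Non-Tidy style: every transcluded chunk is normalized on its own.
per_chunk = "".join(normalize(c) for c in chunks)

# Tidy style: assemble first, check nesting once at the end.
assembled = normalize("".join(chunks))
```

In this toy, per_chunk loses the row markup because the table is already closed by the time the second chunk is seen, while assembled comes through unchanged.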
I really don't like this kind of construct as it makes it harder to treat
transclusions at the abstract-parse-tree level in the future; in order to
understand the markup _following_ the transclusion you need to have already
expanded it. Yucky!
However, the current system allows the same thing to work with wiki tables (e.g.
{|class="fooba") in either mode. I'm pretty sure at least the wiki-table form is
in fairly common use on Wikipedia.
So we either need to decide to Kill Them All, or accept the sacrifice for
compatibility.
** Inline HTML across wiki blocks
Currently, removeHTMLTags is applied before most other parsing steps, most
notably doBlockLevels which handles paragraph splitting, wiki lists, etc.
A consequence of this is that bad nesting / illegal overlapping can occur with a
construct like this:
<b>First paragraph

Second paragraph
The HTML normalizer adds the missing close tag:
<b>First paragraph

Second paragraph</b>
and later the wiki block-level pass adds <p> tags:
<p><b>First paragraph
</p><p>Second paragraph</b>
</p>
This is fairly obviously incorrect; it _probably_ would make a reasonable amount
of sense to rework how the block levels interact with the HTML normalization, so
that block handling happens either before it or in concert with it.
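One possible reading of "in concert", sketched in Python purely for illustration: split into paragraphs first, then balance inline tags inside each paragraph, closing anything still open at the paragraph end and reopening it at the start of the next. The function name paragraphs_with_inline() is made up, only <b> and <i> without attributes are recognized, and nothing here reflects how doBlockLevels actually works.

```python
import re

TAG_RE = re.compile(r"<(/?)(b|i)>")  # only two inline tags, for brevity


def paragraphs_with_inline(text):
    """Split on blank lines, then close and reopen inline tags per paragraph."""
    carried = []  # inline tags left open by earlier paragraphs
    result = []
    for para in re.split(r"\n\s*\n", text.strip()):
        stack = list(carried)
        out = ["<%s>" % t for t in stack]  # reopen carried-over tags
        pos = 0
        for m in TAG_RE.finditer(para):
            out.append(para[pos:m.start()])
            pos = m.end()
            closing, name = m.group(1) == "/", m.group(2)
            if not closing:
                stack.append(name)
                out.append(m.group(0))
            elif stack and stack[-1] == name:
                stack.pop()
                out.append(m.group(0))
            # Any other close tag is stray or misnested: drop it.
        out.append(para[pos:])
        carried = list(stack)
        # Close whatever is still open before the paragraph ends.
        out.extend("</%s>" % t for t in reversed(stack))
        result.append("<p>%s</p>" % "".join(out))
    return "\n".join(result)


fixed = paragraphs_with_inline("<b>First paragraph\n\nSecond paragraph")
```

With the example above, each paragraph ends up with its own properly nested <b>...</b> pair inside its <p>, rather than one <b> overlapping two paragraphs.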
** Mixing of HTML and wiki tables
Running tests on pages from French Wikipedia, I found a cute bugger that does
something like this:
{|
<caption>A table caption</caption>
|-
|blah
|}
Since wiki tables haven't been converted to HTML in the output yet, this <caption>
is in a <body> context as far as the HTML normalizer can see, so it is rejected.
But the old code let it through, in both Tidy and non-Tidy mode.
While this kind of admixture looks *supremely ugly* to me, do we have any reason
to disallow it?
Should we think of the wiki table syntax as just a shortcut/transformation to
HTML table tags, or should they be entirely separate entities?
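The two readings can be contrasted with a small Python sketch. It is only an illustration: caption_allowed() and its wiki_tables_open_context flag are hypothetical, and the check is line-based, which the real normalizer is not.

```python
def caption_allowed(lines, wiki_tables_open_context):
    """Return True if every <caption> line sits inside some table context."""
    depth = 0  # how many tables are open, as this checker sees it
    for line in lines:
        s = line.strip()
        if s.startswith("<table"):
            depth += 1
        elif s.startswith("</table"):
            depth = max(0, depth - 1)
        elif wiki_tables_open_context and s.startswith("{|"):
            depth += 1  # treat {| as a shortcut for <table>
        elif wiki_tables_open_context and s.startswith("|}"):
            depth = max(0, depth - 1)
        elif s.startswith("<caption") and depth == 0:
            return False  # caption seen in body context
    return True


page = ["{|", "<caption>A table caption</caption>", "|-", "|blah", "|}"]

# Separate entities: {| is opaque to the HTML check, so the caption fails.
separate = caption_allowed(page, wiki_tables_open_context=False)

# Shortcut for HTML tables: {| opens a table context, so the caption passes.
shortcut = caption_allowed(page, wiki_tables_open_context=True)
```

If wiki tables are treated as a shortcut for HTML tables, the normalizer would need to know about {| and |} in roughly this way before it judges tags like <caption>.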
-- brion vibber (brion @ pobox.com)