[Wikitech-l] Sanitizer::removeHTMLtags()

16 Feb 2006

On 2/11/06, Brion Vibber <brion(a)pobox.com> wrote:
...

> If you'd like to try rewriting [Sanitizer::removeHTMLtags()] so it balances end tags
> properly, detects illegal nesting cases, and understands MathML, that would be
> super awesome.

I thought a bit about it and I came to the conclusion that I don't
quite understand what Sanitizer::removeHTMLtags() is supposed to do.

Firstly, I was wrong about the MathML. Sanitizer::removeHTMLtags()
does not need to understand MathML, because by the time it is
called, Parser::strip() has already replaced the <math> tags with
placeholder strings.
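
As a rough illustration of that placeholder step, here is a minimal Python sketch. It is not MediaWiki's PHP code: the function names `strip_math` and `unstrip` and the `@@MATH0@@` placeholder format are made up for this example, and the real Parser::strip() may work quite differently. The sketch only shows why a later sanitizing pass never sees the <math> markup.

```python
import re


def strip_math(text):
    """Replace each <math>...</math> span with an opaque placeholder.

    Returns the rewritten text and a dict mapping each placeholder to
    the span it replaced. The placeholder format is hypothetical.
    """
    saved = {}

    def stash(match):
        key = "@@MATH%d@@" % len(saved)
        saved[key] = match.group(0)
        return key

    stripped = re.sub(r"<math>.*?</math>", stash, text, flags=re.DOTALL)
    return stripped, saved


def unstrip(text, saved):
    """Put the original spans back in place of their placeholders."""
    for key, original in saved.items():
        text = text.replace(key, original)
    return text


source = "Euler: <math>e^{i\\pi} + 1 = 0</math> and <s>old text</s>."
stripped, saved = strip_math(source)
# A sanitizer run on `stripped` would see the <s> tags but no <math> tags.
restored = unstrip(stripped, saved)
```

Any tag handling done between `strip_math` and `unstrip` works on text that contains ordinary HTML tags and placeholders only.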

But the important point is: How can Sanitizer::removeHTMLtags()
balance tags? Consider the input
----
First line. <s> Struck through
More text.

Another paragraph.
----
There is an unclosed <s> tag here, so removeHTMLtags() should close
it. If it does the same as HTML Tidy, it adds </s> after "More text.",
before the </p> implied by the empty line.

Okay, that's fine. But now consider
----
* First line. <s> Struck through
* More text.

Another paragraph.
----
In this case, the </s> has to be put after "Struck through", because
each bullet becomes its own <li> element and the <s> cannot span two
list items.

I think this means that removeHTMLtags() can only work if it parses
the text according to (a subset of) the wiki grammar. But that seems a
bit messy from a design point of view. The alternative is to call
removeHTMLtags() later, at the same point where HTML Tidy is called
(which is what I wrongly assumed was happening).
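
To make the dependence on wiki structure concrete, here is a small Python sketch of a balancer that knows about just two block types: bullet list items and blank-line-separated paragraphs. It is an illustration under simplifying assumptions, not how removeHTMLtags() or HTML Tidy actually works. It only handles attribute-free tags, treats every tag as one that needs closing, and leaves stray end tags untouched. The names `close_open_tags` and `balance` are hypothetical.

```python
import re

# Matches simple tags without attributes, e.g. <s> or </s>.
TAG = re.compile(r"<(/?)([a-zA-Z]+)>")


def close_open_tags(fragment):
    """Append end tags for any tags still open at the end of fragment."""
    stack = []
    for match in TAG.finditer(fragment):
        is_end_tag, name = match.group(1), match.group(2).lower()
        if not is_end_tag:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        # A stray or mismatched end tag is ignored in this sketch.
    return fragment + "".join("</%s>" % name for name in reversed(stack))


def balance(wikitext):
    """Balance tags block by block.

    Each line starting with "*" is its own block (a list item); runs of
    other non-empty lines form a paragraph that ends at a blank line.
    """
    out = []
    paragraph = []

    def flush():
        if paragraph:
            out.append(close_open_tags("\n".join(paragraph)))
            paragraph.clear()

    for line in wikitext.split("\n"):
        if line.startswith("*"):
            flush()
            out.append(close_open_tags(line))
        elif line.strip() == "":
            flush()
            out.append(line)
        else:
            paragraph.append(line)
    flush()
    return "\n".join(out)


paragraph_input = "First line. <s> Struck through\nMore text.\n\nAnother paragraph."
list_input = "* First line. <s> Struck through\n* More text.\n\nAnother paragraph."
paragraph_output = balance(paragraph_input)
list_output = balance(list_input)
```

For the first input the </s> lands after "More text.", and for the second it lands after "Struck through". The tag-matching code is the same in both cases. Only the block splitting differs, and that is the part which needs to understand the wiki grammar.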

By the way, I'm still interested to hear why you want to get rid of
HTML Tidy. Is it performance?

Cheers,
Jitse
