On Tue, Oct 14, 2003 at 11:16:19AM +1300, Richard Grevers wrote:
On Mon, 13 Oct 2003 17:21:15 -0400, David Friedland <david(a)nohat.net>
gave utterance to the following:
There seems to be a lot of disjointed discussion on Meta about this. Viz:
* There is work that has been done by Taw on an OCaml lexer at
<http://meta.wikipedia.org/wiki/Wikipedia_lexer>
My suggestions would be "the broken wikitext language", or the "invalid
wikitext language".
Because of its UseMod ancestry, the current parser produces some very bad
HTML code*, and in particular handles lists and nesting of blocks really
badly.
* not so bad if HTML 3.2 or 4 is our target, but it would be nice to be
able to produce clean XHTML.
A few months back I started work on a ValidWiki parser, which has a much
stronger concept of block and line elements, and uses both block and line
stacks to open and close all elements correctly.
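
To make the block-stack idea concrete, here is a minimal sketch in Python. It is not the ValidWiki code or Taw's lexer; the names render_blocks and LIST_TAGS are made up for illustration, and it only assumes the usual "*" and "#" list prefixes. The stack records which lists are open, so every <li> and list element is closed in the right order when the nesting changes.

```python
from html import escape

# Hypothetical mapping from wiki list markers to XHTML list elements.
LIST_TAGS = {"*": "ul", "#": "ol"}


def render_blocks(lines):
    """Render wiki list lines as well-nested XHTML using a block stack."""
    out = []
    stack = []  # open list tags, outermost first; each has one open <li>

    def close_to(depth):
        # Close the open <li> and its list until only `depth` lists remain.
        while len(stack) > depth:
            out.append("</li></%s>" % stack.pop())

    for line in lines:
        # Count the leading list markers, then split markers from text.
        n = 0
        while n < len(line) and line[n] in LIST_TAGS:
            n += 1
        wanted = [LIST_TAGS[c] for c in line[:n]]
        text = escape(line[n:].strip())

        if not wanted:
            # Ordinary text ends every open list.
            close_to(0)
            if text:
                out.append("<p>%s</p>" % text)
            continue

        # Keep the part of the stack that matches the new prefix.
        common = 0
        while (common < len(stack) and common < len(wanted)
               and stack[common] == wanted[common]):
            common += 1
        close_to(common)

        if len(wanted) == common:
            # Same list as before: finish the previous item, start a new one.
            out.append("</li><li>")
        else:
            # Deeper (or different) lists: open each inside the current item.
            for tag in wanted[common:]:
                stack.append(tag)
                out.append("<%s><li>" % tag)
        out.append(text)

    close_to(0)
    return "".join(out)
```

For example, render_blocks(["* a", "** b", "* c", "text"]) puts the inner list inside the first item rather than beside it, which is the nesting the UseMod-derived parser gets wrong.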
I think I'm about 2/3 of the way through the block parser, and haven't yet
written the line parser. I have no idea how the code would compare for
efficiency.
Unfortunately the only language I know how to code in is MivaScript, so it
would need porting. (Miva performs okay for your mid-level merchant
application, but doesn't have the efficiency for something with the
workload of Wikipedia.)
Uhm, my parser has a block stack + line stack architecture too.
But the sources at http://meta.wikipedia.org/wiki/Wikipedia_lexer aren't
the most recent.
Newer sources attached.
It's not complete but it wasn't really meant to be.
It was meant to be a proof of concept that a mix of wiki markup and HTML can
be parsed in an XHTML-correct and DWIM way extremely efficiently.
Concept proven, but integrating the parser with the rest of Wikipedia would
take much more time than I'm willing to spend right now.
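
For readers who want to see what a line stack does, here is a minimal sketch in Python. It is not the attached OCaml source; render_line, INLINE_TAGS and TOKEN are made-up names, and it only handles the '' and ''' quote markup, not embedded HTML. The stack tracks open inline elements, so mis-nested markup is closed and reopened in a valid order, and anything still open at the end of the line is closed (the DWIM part).

```python
import re
from html import escape

# Hypothetical token table: wiki quote runs mapped to XHTML inline elements.
INLINE_TAGS = {"'''": "strong", "''": "em"}
# Try the longer run first so ''' is not read as '' followed by '.
TOKEN = re.compile(r"('''|'')")


def render_line(line):
    """Render inline wiki markup as well-nested XHTML using a line stack."""
    out = []
    stack = []  # open inline tags, outermost first

    for piece in TOKEN.split(line):
        tag = INLINE_TAGS.get(piece)
        if tag is None:
            # Plain text between markers (possibly empty).
            out.append(escape(piece, quote=False))
        elif tag not in stack:
            # Marker for an element that is not open yet: open it.
            stack.append(tag)
            out.append("<%s>" % tag)
        else:
            # Marker closes an open element. Close anything opened after
            # it first, then reopen those so the nesting stays valid.
            reopen = []
            while stack[-1] != tag:
                inner = stack.pop()
                out.append("</%s>" % inner)
                reopen.append(inner)
            out.append("</%s>" % stack.pop())
            for inner in reversed(reopen):
                stack.append(inner)
                out.append("<%s>" % inner)

    # End of line: close whatever the author left open.
    while stack:
        out.append("</%s>" % stack.pop())
    return "".join(out)
```

With this sketch, render_line("''a '''b'' c'''") closes the bold run before the italic one and reopens it afterwards, instead of emitting overlapping tags.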