Krzysztof Kowalczyk wrote:
>>I therefore suggest a new structure:
>>1. Preprocessor
>>2. Wiki markup to XML
>>3. XML to (X)HTML
>>
>>
>Why XML for the intermediate part? As I understand it, speed is important
>for Wikipedia code, given that it needs to scale to very high usage.
>Servers are always overloaded.
My reasons for XML:
* Standardized (usable by other applications, proven parsers available,
easier for new developers to deal with)
* IMHO the best way to pass data between an external parser for #2 and
MediaWiki
* Basing #3 on guaranteed valid XML will make generating valid XHTML a lot
easier
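
To illustrate the third point, here is a minimal sketch in Python, not the proposed implementation. It assumes the intermediate format uses a `wikilink` element with a `page` attribute, as in the example further down in this mail. The function name `wikilink_to_xhtml` and the `/wiki/` URL prefix are made up for illustration. Because the input is parsed as XML first, malformed input is rejected before any XHTML is produced.

```python
import xml.etree.ElementTree as ET


def wikilink_to_xhtml(xml_text, base="/wiki/"):
    """Hypothetical step #3: rewrite <wikilink> elements as XHTML <a> elements."""
    # fromstring() raises ET.ParseError if the input is not well-formed XML.
    root = ET.fromstring(xml_text)
    # Collect the elements first, since their tags are changed in the loop.
    for el in list(root.iter("wikilink")):
        page = el.get("page", "")
        el.attrib.clear()  # drop wiki-specific attributes in this sketch
        el.tag = "a"
        el.set("href", base + page.replace(" ", "_"))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(wikilink_to_xhtml("<p>See <wikilink page='a link'>a link</wikilink>.</p>"))
```

The serializer only ever emits a well-formed tree, so the output of this step cannot contain unclosed or badly nested tags.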
>For example:
><wikiPage>
><title>Title of the page</title>
><body>This is body</body>
></wikiPage>
>can be represented with an object wikiPage() that has members title and
>body. You can extrapolate this example to any XML with a schema known
>up front. Then you can ship those objects around. It's just as clean an
>approach (the abstraction provided is the same; it's just a different,
>more efficient way of providing this abstraction to clients).
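
The object representation described above could look roughly like this. It is a sketch in Python rather than PHP, and the class name `WikiPage` and the methods `from_xml` and `to_xml` are assumptions for illustration only. It shows that the object and the XML carry the same information and can be converted in both directions.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class WikiPage:
    """In-memory equivalent of the <wikiPage> example: members title and body."""
    title: str
    body: str

    @classmethod
    def from_xml(cls, xml_text):
        # Build the object from the XML form; missing elements become "".
        root = ET.fromstring(xml_text)
        return cls(
            title=root.findtext("title", default=""),
            body=root.findtext("body", default=""),
        )

    def to_xml(self):
        # Serialize back to the XML form used in the example.
        root = ET.Element("wikiPage")
        ET.SubElement(root, "title").text = self.title
        ET.SubElement(root, "body").text = self.body
        return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    page = WikiPage.from_xml(
        "<wikiPage><title>Title of the page</title>"
        "<body>This is body</body></wikiPage>"
    )
    print(page.title, "/", page.body)
    print(page.to_xml())
```

Passing the object around avoids re-parsing inside one process, while the XML form is what can cross a process or language boundary.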
Yes, that could be done, but it would require staying in PHP. Also, I
(personally) prefer to keep #2 a text-to-text conversion, but that's
just my taste.
>But as I write it, I see even less reason to use XML: it's just not a
>good format to represent wiki markup structure. Can you give an
>example of what this XML would look like for some simple wiki markup?
>Further discussion of speed/space/code cleanliness trade-offs is a
>bit hard without knowing more details about the proposed approach. It's
>vague enough to have more than one interpretation.
Well, something like
[[image:bla.jpg|thumb=bla_small.jpg|150px|Text and [[a link]]]]trail
would become
<wikilink page='image:bla.jpg' thumb='bla_small.jpg' width='150px'>Text and
<wikilink page='a link'>a link</wikilink>trail</wikilink>
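
For the simplest case only, the text-to-text conversion in #2 can be sketched as follows. This is not the parser discussed in this mail. It is a hypothetical Python function, `links_to_xml`, that handles flat `[[target]]` and `[[target|label]]` links and folds any letters directly after `]]` into the link text. Nested links and image options such as those in the example above need a real recursive parser and are left untouched here. Attribute values come out in double quotes.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Matches [[target]] or [[target|label]], plus lowercase letters right after ]].
LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]*))?\]\]([a-z]*)")


def links_to_xml(text):
    """Convert flat wiki links to <wikilink> elements, escaping everything else."""
    out = []
    pos = 0
    for m in LINK.finditer(text):
        # Plain text between links must be escaped to keep the XML well-formed.
        out.append(escape(text[pos:m.start()]))
        target, label, trail = m.group(1), m.group(2), m.group(3)
        shown = (label if label is not None else target) + trail
        out.append(
            "<wikilink page=%s>%s</wikilink>" % (quoteattr(target), escape(shown))
        )
        pos = m.end()
    out.append(escape(text[pos:]))
    return "".join(out)


if __name__ == "__main__":
    print(links_to_xml("See [[a link]]s & [[Main Page|the start]]."))
```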
That's the actual output of my parser. I will put the source on the net
somewhere once it has all major functions. Still missing at this point are:
* Handling of <nowiki> (and <pre>, respectively)
* XML validation (that's going to be the hardest part ;-)
* table markup
* external links and "brute force" markup like ISBNs
Otherwise, it is already functional. It even gets the
''italics '''italics bold''''' thingy right ;-)
I haven't tested it on "real" Wikipedia pages yet, though.
Magnus