On 5/2/11 5:28 PM, Tim Starling wrote:
> How many wikitext parsers does the world really need?
That's a tricky question. What MediaWiki calls parsing, the rest of the
world calls
1. Parsing
2. Expansion (e.g. templates, magic words)
3. Applying local state, preferences, context (e.g. $n, prefs)
4. Emitting
And phases 2 and 3 depend heavily upon the state of the local wiki at
the time the parse is requested. If you've ever tried to set up a test
wiki that works like Wikipedia or Wikimedia Commons you'll know what I'm
talking about.
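To make the four-phase split concrete, here's a runnable toy sketch in JavaScript. Every name in it is an illustrative assumption, not MediaWiki's real internals, and the "wikitext" it handles is a trivial {{template}}-only subset; the point is only that each phase is a separate function, and that phase 2 takes the wiki's current state as an explicit input.

```javascript
// Toy sketch of the four phases as separate stages. All function names and
// data shapes are illustrative assumptions, not MediaWiki's actual parser.

// 1. Parsing: split source into text and {{template}} nodes.
function parse(src) {
  return src.split(/(\{\{[^}]*\}\})/)
    .filter(function (s) { return s !== ''; })
    .map(function (s) {
      return /^\{\{/.test(s)
        ? { type: 'template', name: s.slice(2, -2) }
        : { type: 'text', value: s };
    });
}

// 2. Expansion: resolve templates against the wiki's *current* state --
//    this is the phase that depends on the local wiki at parse time.
function expand(nodes, wiki) {
  return nodes.map(function (n) {
    return n.type === 'template'
      ? { type: 'text', value: wiki.templates[n.name] || '' }
      : n;
  });
}

// 3. Local state: apply per-user context (here, a toy $user substitution).
//    Runs after expand, so every node is a text node by now.
function localize(nodes, user) {
  return nodes.map(function (n) {
    return { type: 'text', value: n.value.replace(/\$user/g, user.name) };
  });
}

// 4. Emitting: any output format can plug in here; this one is plain HTML.
function emitHtml(nodes) {
  return '<p>' + nodes.map(function (n) { return n.value; }).join('') + '</p>';
}

function render(src, wiki, user) {
  return emitHtml(localize(expand(parse(src), wiki), user));
}
```

Swapping `emitHtml` for a PDF or LaTeX emitter would leave phases 1-3 untouched, which is exactly the separation the current mashed-together system makes hard.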
As for whether the rest of the world needs another wikitext parser:
well, they keep writing them, so there must be some reason why this
keeps happening. It's true that language chauvinism plays a part, but
the inflexibility of the current approach is probably a big factor as
well. The current system mashes parsing and emitting to HTML together,
very intimately, and a lot of people would like those to be separate:
- if they're doing research or stats, and want a more "pure", more
normalized form than HTML or wikitext;
- if they're Google, and they want to get all the city infobox data
and reuse it (this is a real request we've gotten);
- if they're OpenStreetMap, and want the same thing;
- if they're emitting to a different format (PDF, LaTeX, books);
- if they're emitting to HTML but with different needs (like mobile).
And then there's the stuff which you didn't know you wanted, but which
becomes easy once you have a more flexible parser.
A couple of months ago I wrote a mini PEG-based wikitext parser in
JavaScript, which Special:UploadWizard is using today, live on Commons.
http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/UploadWizard/res…
http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/UploadWizard/res…
While it was a bit of a heavy download (7K compressed), this gave me the
ability to do pluralizations in the frontend (e.g. "3 out of 5 uploads
complete") even for difficult languages like Arabic. Great!
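The core trick can be sketched in a few lines. To be clear, this is a regex-based toy rather than a real PEG grammar, and it is not UploadWizard's actual code: the {{PLURAL:...}} parameter syntax is genuine wikitext i18n, but the function names and the English-only plural rule are illustrative assumptions (real per-language rules are much richer; Arabic, for instance, has six plural forms).

```javascript
// Toy wikitext-style message formatter. The {{PLURAL:...}} syntax is real
// wikitext i18n; the function names and the hardcoded English plural rule
// are illustrative assumptions, not UploadWizard's actual implementation.

// Substitute $1, $2, ... with the supplied arguments.
function substituteParams(msg, args) {
  return msg.replace(/\$(\d+)/g, function (m, n) {
    var i = parseInt(n, 10) - 1;
    return i < args.length ? String(args[i]) : m;
  });
}

// Resolve {{PLURAL:count|singular|plural}} via a per-language rule that
// maps a number to a form index. English: 1 -> form 0, otherwise form 1.
function resolvePlural(msg, pluralRule) {
  return msg.replace(/\{\{PLURAL:(\d+)\|([^|}]*)\|([^|}]*)\}\}/g,
    function (m, count, one, other) {
      return [one, other][pluralRule(parseInt(count, 10))];
    });
}

// Substitute parameters first, so $1 inside {{PLURAL:$1|...}} becomes a
// number before the plural form is chosen.
function parseMsg(msg, args) {
  var englishRule = function (n) { return n === 1 ? 0 : 1; };
  return resolvePlural(substituteParams(msg, args), englishRule);
}
```

With this, a translator writes one string per language, e.g. "$1 out of $2 {{PLURAL:$1|upload|uploads}} complete", and the frontend picks the right form at runtime; only the plural rule function changes per language.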
But the unexpected benefit was that it also made it a snap to add very
complicated interface behaviour to our message strings. Actually, right
now, with this library + the ingenious way that wikitext does i18n, we
may have one of the best libraries out there for internationalized user
interfaces. I'm considering splitting it off; it could be useful for any
project that uses translatewiki.
But I don't actually want to use JavaScript for anything but the final
rendering stages (I'd rather move most of this parser to PHP) so stay tuned.
Anyway, I think it's obviously possible for us to do some rich-text
editing (RTE), and some
of this stuff, with the current parser. But I'm optimistic that a new
parsing strategy will be a huge benefit to our community, and our
partners, and partners we didn't even know we could have. Imagine doing
RTE with an implementation in a JS frontend that is generated from some
of the same sources that the PHP backend uses.
For what it's worth: whenever I meet with Wikia employees, the topic is
always what MediaWiki and the WMF can do to make their RTE hacks
obsolete. That doesn't mean that their RTE isn't the right way forward,
but the people who wrote it don't seem to be very strong advocates for
it. But I don't want to put words in their mouths; maybe one of them can
add more to this thread?
--
Neil Kandalgaonkar <neilk(a)wikimedia.org>