Eric Astor wrote:
I've actually been working on a Python-based wikitext parser, using some
techniques that should make the system a bit faster and cleaner... With a
lot of luck, I should start making progress on that again in the next month
or so.
For anyone who cares, I'll probably be trying to implement a PEG-based
parser using mxTextTools, since I think that should be able to parse all of
MediaWiki's wikitext, and should be about twice as fast as the current
Parser.php (which is about as fast as wiki2xml)... Or I might just end up
using ANTLR, if I can bully my current semi-grammar into working in that
framework... If anyone knows of a decent PEG parser with a Python API (a
packrat parser might be ideal), that'd be great too. *shrugs*
- Eric Astor
Yes! I also believe that PEGs and [[packrat parser]]s are the way to go
with parsing wikitext, because of the very ad-hoc definition of wikitext.
A basic packrat parser is pretty easy to implement; it's simply a
brute-force recursive-descent parser with memoization of (offset, term)
-> production mappings. Scheme is a pretty good language to write a
packrat parser in, since the grammar itself can be written as an
S-expression, and is easy to use for program transformation (see below).
A simple implementation just interprets the grammar tree, matching as it
goes.
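To make that concrete, here is a minimal sketch of that interpretive approach in Python (Python rather than Scheme, since that's the target language of the project). The grammar encoding and operator names are purely illustrative, not any particular library's API: a grammar is a dict of rules, each rule a nested tuple, and (rule, offset) results are memoized so each rule is tried at most once per position.

```python
# Minimal packrat sketch: a PEG interpreted over a tuple-encoded grammar,
# with memoization of (rule, offset) -> result. Illustrative names only.

FAIL = object()  # sentinel for a failed match

def parse(grammar, start, text):
    memo = {}  # (rule_name, offset) -> (node, new_offset) or FAIL

    def apply(rule_name, pos):
        """Match a named rule at pos, consulting the memo table first."""
        key = (rule_name, pos)
        if key in memo:
            return memo[key]
        result = match(grammar[rule_name], pos)
        memo[key] = result
        return result

    def match(expr, pos):
        op = expr[0]
        if op == 'lit':                      # literal string
            s = expr[1]
            if text.startswith(s, pos):
                return (s, pos + len(s))
            return FAIL
        if op == 'seq':                      # sequence: all parts, in order
            out, cur = [], pos
            for sub in expr[1:]:
                r = match(sub, cur)
                if r is FAIL:
                    return FAIL
                node, cur = r
                out.append(node)
            return (out, cur)
        if op == 'alt':                      # PEG ordered choice: first wins
            for sub in expr[1:]:
                r = match(sub, pos)
                if r is not FAIL:
                    return r
            return FAIL
        if op == 'rule':                     # reference to a named rule
            return apply(expr[1], pos)
        raise ValueError('unknown operator: %r' % (op,))

    return apply(start, 0)

# Toy grammar: greeting <- "hi" " there" / "hi"
grammar = {
    'greeting': ('alt',
                 ('seq', ('lit', 'hi'), ('lit', ' there')),
                 ('lit', 'hi')),
}
```

Brute force plus the memo table is the whole trick: without memoization this is exponential in the worst case, with it each (rule, offset) pair is computed once, giving linear time in the input size.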
You can achieve considerable speedups by:
1. using the grammar to generate code, and compiling and executing that
instead of interpreting the grammar by hand;
2. allowing the grammar to contain both PEG expressions and regexps for
low-level lexical matching: regexps will be at least an order of
magnitude faster than even compiled PEGs for matching low-level lexical
tokens like numbers and names, without removing the ability of PEGs to
blur the distinction between lexical and syntactic analysis, which is
important for parsing strange things like wikitext.
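Point 2 amounts to letting grammar leaves be compiled regexes instead of only literals. A sketch of what such a leaf handler might look like, added alongside the PEG operators (the 're' node shape and the token names are assumptions for illustration):

```python
import re

# Sketch: a regex leaf for a PEG grammar. Inside a match() dispatcher, a
# ('re', compiled_pattern) node anchors the pattern at the current offset,
# so one C-level regex call replaces many character-by-character PEG steps.

def match_regex(expr, text, pos):
    m = expr[1].match(text, pos)   # Pattern.match(text, pos) anchors at pos
    if m is None:
        return None                # no match at this offset
    return (m.group(0), m.end())

# Low-level lexical tokens expressed as regex leaves:
NUMBER = ('re', re.compile(r'\d+'))
NAME   = ('re', re.compile(r'[A-Za-z_]\w*'))
```

The grammar still decides *where* a number or name is legal, so the lexical/syntactic blur that wikitext needs is preserved; only the innermost character matching is handed to the regex engine.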
I've implemented packrat parsing in both Python and Scheme: Scheme was
faster, and ultimately more natural.
The one awkward bit is left-recursion removal, which breaks packrat
parsers unless you rewrite the grammar into an equivalent form without left
recursion. I did this by hand on my input grammars, but it could easily be
done programmatically at grammar-generation time.
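The standard rewrite is to turn left recursion into iteration: a rule of the shape A <- A alpha / beta becomes A <- beta alpha*, with left associativity restored when folding the repetition. A small illustrative sketch (the token-list representation here is an assumption, not the project's actual tree format):

```python
# Left recursion breaks packrat parsing: apply('A') at the same offset
# recurses forever. Rewrite, e.g. for a left-associative expression grammar:
#
#     expr <- expr "-" term / term        (left-recursive, broken)
# becomes
#     expr <- term ("-" term)*            (iterative, packrat-safe)
#
# The parse result of the iterative form is a flat list; left associativity
# is recovered by folding it leftward:

def fold_left(tokens):
    """Fold [term, op, term, op, term, ...] into a left-nested tuple tree."""
    it = iter(tokens)
    node = next(it)                 # the leading term (the old 'beta')
    for op in it:                   # each (op term) repetition (an 'alpha')
        right = next(it)
        node = (node, op, right)    # ((a - b) - c), not (a - (b - c))
    return node
```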
I'm not sure about the best way to implement an API: have you considered
just using the parser to convert from wikitext to something like PYX,
which is a very simple-to-parse and Python-friendly representation of an
XML data structure, and can then be used either to build an in-core
parse tree, or drive something like a SAX API, or whatever other form of
post-processing you like (for example, direct procedural text-to-text
generation, which could be very simple indeed, since the output of a
successful parse is guaranteed by definition to _exactly_ conform to the
grammar specification).
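To sketch what that intermediate step might look like: PYX is line-oriented, with `(tag` for a start tag, `)tag` for an end tag, `Aname value` for an attribute, and `-text` for character data. A hypothetical flattener from an in-core parse tree to PYX lines (the (tag, attrs, children) tree shape is an assumption for illustration):

```python
# Hypothetical sketch: flatten a parse tree into PYX lines, which downstream
# code can consume line-by-line (build a tree, drive SAX-style handlers, or
# do direct text-to-text generation).

def to_pyx(node):
    lines = []

    def emit(n):
        if isinstance(n, str):                    # text node
            lines.append('-' + n.replace('\n', '\\n'))
            return
        tag, attrs, children = n
        lines.append('(' + tag)                   # start tag
        for name, value in attrs:
            lines.append('A%s %s' % (name, value))
        for child in children:
            emit(child)
        lines.append(')' + tag)                   # end tag

    emit(node)
    return lines

# Toy tree: <p class="lead">Some <b>bold</b> text</p>
tree = ('p', [('class', 'lead')],
        ['Some ', ('b', [], ['bold']), ' text'])
```

Each PYX line is trivially parsed by its first character, which is what makes it such a Python-friendly hand-off format between the parser and whatever back end consumes it.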
-- Neil
[PYX:
http://www.xml.com/pub/a/2000/03/15/feature/index.html]