Eric Astor wrote:
I've actually been working on a Python-based wikitext parser, using some
techniques that should make the system a bit faster and cleaner... With a
lot of luck, I should start making progress on that again in the next month
or so.
For anyone who cares, I'll probably be trying to implement a PEG-based
parser using mxTextTools, since I think that should be able to parse all of
MediaWiki's wikitext, and should be about twice as fast as the current
Parser.php (which is about as fast as wiki2xml)... Or I might just end up
using ANTLR, if I can bully my current semi-grammar into working in that
framework... If anyone knows of a decent PEG parser with a Python API (a
packrat parser might be ideal), that'd be great too. *shrugs*
- Eric Astor
Yes! I also believe that PEGs and [[packrat parser]]s are the way to go
with parsing wikitext, because of the very ad-hoc definition of wikitext.
A basic packrat parser is pretty easy to implement; it's simply a
brute-force recursive-descent parser with memoization of (offset, term)
-> production mappings. Scheme is a pretty good language to write a
packrat parser in, since the grammar itself can be written as an
S-expression, and is easy to use for program transformation (see below).
A simple implementation just interprets the grammar tree, matching as it
goes.
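To make that concrete, here is a minimal sketch of that interpretive approach in Python (Python rather than Scheme, since that's the target language of the project). The grammar encoding and operator names are purely illustrative, not any particular library's API: a grammar is a dict of rules, each rule a nested tuple, and (rule, offset) results are memoized so each rule is tried at most once per position.

```python
# Minimal packrat sketch: a PEG interpreted over a tuple-encoded grammar,
# with memoization of (rule, offset) -> result. Illustrative names only.

FAIL = object()  # sentinel for a failed match

def parse(grammar, start, text):
    memo = {}  # (rule_name, offset) -> (node, new_offset) or FAIL

    def apply(rule_name, pos):
        """Match a named rule at pos, consulting the memo table first."""
        key = (rule_name, pos)
        if key in memo:
            return memo[key]
        result = match(grammar[rule_name], pos)
        memo[key] = result
        return result

    def match(expr, pos):
        op = expr[0]
        if op == 'lit':                      # literal string
            s = expr[1]
            if text.startswith(s, pos):
                return (s, pos + len(s))
            return FAIL
        if op == 'seq':                      # sequence: all parts, in order
            out, cur = [], pos
            for sub in expr[1:]:
                r = match(sub, cur)
                if r is FAIL:
                    return FAIL
                node, cur = r
                out.append(node)
            return (out, cur)
        if op == 'alt':                      # PEG ordered choice: first wins
            for sub in expr[1:]:
                r = match(sub, pos)
                if r is not FAIL:
                    return r
            return FAIL
        if op == 'rule':                     # reference to a named rule
            return apply(expr[1], pos)
        raise ValueError('unknown operator: %r' % (op,))

    return apply(start, 0)

# Toy grammar: greeting <- "hi" " there" / "hi"
grammar = {
    'greeting': ('alt',
                 ('seq', ('lit', 'hi'), ('lit', ' there')),
                 ('lit', 'hi')),
}
```

Brute force plus the memo table is the whole trick: without memoization this is exponential in the worst case, with it each (rule, offset) pair is computed once, giving linear time in the input size.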
You can achieve considerable speedups by:
1. using the grammar to generate code, and compiling and executing that
instead of interpreting the grammar by hand;
2. allowing the grammar to contain both PEG expressions and regexps for
low-level lexical matching: regexps will be at least an order of
magnitude faster than even compiled PEGs for matching low-level lexical
tokens like numbers and names, without removing the ability of PEGs to
blur the distinction between lexical and syntactic analysis, which is
important for parsing strange things like wikitext.
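Point 2 amounts to letting grammar leaves be compiled regexes instead of only literals. A sketch of what such a leaf handler might look like, added alongside the PEG operators (the 're' node shape and the token names are assumptions for illustration):

```python
import re

# Sketch: a regex leaf for a PEG grammar. Inside a match() dispatcher, a
# ('re', compiled_pattern) node anchors the pattern at the current offset,
# so one C-level regex call replaces many character-by-character PEG steps.

def match_regex(expr, text, pos):
    m = expr[1].match(text, pos)   # Pattern.match(text, pos) anchors at pos
    if m is None:
        return None                # no match at this offset
    return (m.group(0), m.end())

# Low-level lexical tokens expressed as regex leaves:
NUMBER = ('re', re.compile(r'\d+'))
NAME   = ('re', re.compile(r'[A-Za-z_]\w*'))
```

The grammar still decides *where* a number or name is legal, so the lexical/syntactic blur that wikitext needs is preserved; only the innermost character matching is handed to the regex engine.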
I've implemented packrat parsing in both Python and Scheme: Scheme was
faster, and ultimately more natural.
The one awkward bit is left-recursion removal, which breaks packrat
parsers unless you rewrite the grammar into an equivalent form without left
recursion. I did this by hand on my input grammars, but it could easily be
done programmatically at grammar-generation time.
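The standard rewrite is to turn left recursion into iteration: a rule of the shape A <- A alpha / beta becomes A <- beta alpha*, with left associativity restored when folding the repetition. A small illustrative sketch (the token-list representation here is an assumption, not the project's actual tree format):

```python
# Left recursion breaks packrat parsing: apply('A') at the same offset
# recurses forever. Rewrite, e.g. for a left-associative expression grammar:
#
#     expr <- expr "-" term / term        (left-recursive, broken)
# becomes
#     expr <- term ("-" term)*            (iterative, packrat-safe)
#
# The parse result of the iterative form is a flat list; left associativity
# is recovered by folding it leftward:

def fold_left(tokens):
    """Fold [term, op, term, op, term, ...] into a left-nested tuple tree."""
    it = iter(tokens)
    node = next(it)                 # the leading term (the old 'beta')
    for op in it:                   # each (op term) repetition (an 'alpha')
        right = next(it)
        node = (node, op, right)    # ((a - b) - c), not (a - (b - c))
    return node
```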
I'm not sure about the best way to implement an API: have you considered
just using the parser to convert from wikitext to something like PYX,
which is a very simple-to-parse and Python-friendly representation of an
XML data structure, and can then be used either to build an in-core
parse tree, or drive something like a SAX API, or whatever other form of
post-processing you like (for example, direct procedural text-to-text
generation, which could be very simple indeed, since the output of a
successful parse is guaranteed by definition to _exactly_ conform to the
grammar specification).
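To sketch what that intermediate step might look like: PYX is line-oriented, with `(tag` for a start tag, `)tag` for an end tag, `Aname value` for an attribute, and `-text` for character data. A hypothetical flattener from an in-core parse tree to PYX lines (the (tag, attrs, children) tree shape is an assumption for illustration):

```python
# Hypothetical sketch: flatten a parse tree into PYX lines, which downstream
# code can consume line-by-line (build a tree, drive SAX-style handlers, or
# do direct text-to-text generation).

def to_pyx(node):
    lines = []

    def emit(n):
        if isinstance(n, str):                    # text node
            lines.append('-' + n.replace('\n', '\\n'))
            return
        tag, attrs, children = n
        lines.append('(' + tag)                   # start tag
        for name, value in attrs:
            lines.append('A%s %s' % (name, value))
        for child in children:
            emit(child)
        lines.append(')' + tag)                   # end tag

    emit(node)
    return lines

# Toy tree: <p class="lead">Some <b>bold</b> text</p>
tree = ('p', [('class', 'lead')],
        ['Some ', ('b', [], ['bold']), ' text'])
```

Each PYX line is trivially parsed by its first character, which is what makes it such a Python-friendly hand-off format between the parser and whatever back end consumes it.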
-- Neil
[PYX:
http://www.xml.com/pub/a/2000/03/15/feature/index.html]