Neil Harris wrote:
Yes! I also believe that PEGs and [[packrat parser]]s are the
way to go with parsing wikitext, because of the very ad-hoc
definition of wikitext.
Absolutely agreed. I only wish PEGs could support backreference matches, as
it would clean up list, allowed HTML, and extension handling. In fact, I'm
not quite sure how to handle lists without backreferences.
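
To make the list problem concrete: a nested item such as "*#" only belongs to a sublist if it repeats the marker of its parent "*" item, which is exactly what a backreference would express. The sketch below is not a PEG; it is a hand-written recursive-descent workaround in which the expected prefix is passed down as a parameter and so plays the role of the backreference. The names LIST_LINE and parse_list are made up for this illustration, and skipped nesting levels are simply not handled.

```python
import re

# One list line: a run of '*' / '#' markers, optional space, then the item text.
LIST_LINE = re.compile(r"([*#]+)\s*(.*)")

def parse_list(lines, pos=0, prefix=""):
    """Parse items whose marker is `prefix` plus exactly one more character.

    Returns (items, next_pos); each item is (marker_char, text, children).
    """
    items = []
    while pos < len(lines):
        m = LIST_LINE.fullmatch(lines[pos])
        if not m:
            break
        marker, text = m.groups()
        # The "backreference": the line must repeat the parent's marker.
        if len(marker) != len(prefix) + 1 or not marker.startswith(prefix):
            break
        children, pos = parse_list(lines, pos + 1, marker)
        items.append((marker[-1], text, children))
    return items, pos

tree, consumed = parse_list(["* a", "** b", "*# c", "* d"])
```

Here "** b" and "*# c" both end up as children of "a", and "d" becomes a second top-level item.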
You can achieve considerable speedups by:
1 using the grammar to generate code, and compiling and
executing that instead of interpreting the grammar by hand
Definitely - come to think of it, I bet this could be done VERY nicely with
Python. Or most other sufficiently self-exposed languages... Hm.
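
A minimal sketch of what that could look like in Python, assuming a toy grammar written as nested tuples rather than real PEG syntax. The tuple forms ("lit", "seq", "alt", "star", "ref") and the function names are invented for this example; the only built-ins relied on are compile() and exec(). Each rule and sub-expression becomes a generated function taking (s, i) and returning the new position or None.

```python
import itertools

def generate(grammar):
    """Return Python source with one function per rule and sub-expression."""
    counter = itertools.count()
    out = []

    def emit(expr):
        """Emit a function for `expr` and return that function's name."""
        kind = expr[0]
        if kind == "ref":                      # reference to a named rule
            return "rule_" + expr[1]
        name = "e%d" % next(counter)
        if kind == "lit":
            body = ["    return i + %d if s.startswith(%r, i) else None"
                    % (len(expr[1]), expr[1])]
        elif kind == "seq":                    # all parts, in order
            body = []
            for part in [emit(e) for e in expr[1:]]:
                body += ["    i = %s(s, i)" % part,
                         "    if i is None: return None"]
            body.append("    return i")
        elif kind == "alt":                    # ordered choice: first match wins
            body = []
            for part in [emit(e) for e in expr[1:]]:
                body += ["    j = %s(s, i)" % part,
                         "    if j is not None: return j"]
            body.append("    return None")
        elif kind == "star":                   # greedy zero-or-more
            part = emit(expr[1])
            body = ["    while True:",
                    "        j = %s(s, i)" % part,
                    "        if j is None or j == i: return i",
                    "        i = j"]
        else:
            raise ValueError("unknown expression kind: %r" % kind)
        out.append("def %s(s, i):\n%s\n" % (name, "\n".join(body)))
        return name

    for rule, expr in grammar.items():
        out.append("def rule_%s(s, i):\n    return %s(s, i)\n" % (rule, emit(expr)))
    return "\n".join(out)

def compile_grammar(grammar):
    """Compile the generated source and return the namespace holding the rules."""
    namespace = {}
    exec(compile(generate(grammar), "<generated parser>", "exec"), namespace)
    return namespace

# bits <- bit bit*        bit <- "0" / "1"
GRAMMAR = {
    "bits": ("seq", ("ref", "bit"), ("star", ("ref", "bit"))),
    "bit": ("alt", ("lit", "0"), ("lit", "1")),
}

parser = compile_grammar(GRAMMAR)
end = parser["rule_bits"]("1011x", 0)   # position just past the matched bits
```

Printing generate(GRAMMAR) shows the plain Python functions that end up being executed, which is also a convenient way to debug a grammar.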
2 allowing the grammar to contain both PEG expressions and
regexps for low-level lexical matching: regexps will be at
least an order of magnitude faster than even compiled PEGs
for matching low-level lexical tokens like numbers and names,
without removing the ability of PEGs to blur the distinction
between lexical and syntactic analysis, which is important
for parsing strange things like wikitext.
This sounds like a great idea for extended PEGs anyway... I'll remember that
if I end up building an mxTextTools frontend for PEGs, since mxTextTools can
easily hook into arbitrary matching functions (including regex).
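
As a rough illustration of the mixed approach (independent of mxTextTools, whose API is not used here), the sketch below builds PEG-style combinators in plain Python where the terminals are compiled regexps from the standard re module. The combinator names regex, seq and choice are made up for this example; every parser takes (s, i) and returns the end position or None.

```python
import re

def regex(pattern):
    """Terminal: match a regexp anchored at position i."""
    compiled = re.compile(pattern)
    def parse(s, i):
        m = compiled.match(s, i)
        return m.end() if m else None
    return parse

def seq(*parsers):
    """PEG sequence: every part must match, one after another."""
    def parse(s, i):
        for p in parsers:
            i = p(s, i)
            if i is None:
                return None
        return i
    return parse

def choice(*parsers):
    """PEG ordered choice: the first alternative that matches wins."""
    def parse(s, i):
        for p in parsers:
            j = p(s, i)
            if j is not None:
                return j
        return None
    return parse

# Low-level tokens are regexps; the structure above them is PEG-like.
name = regex(r"[A-Za-z_]\w*")
number = regex(r"\d+")
assignment = seq(name, regex(r"\s*=\s*"), choice(number, name))

end = assignment("width = 200px", 0)   # stops after "200"
```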
I've implemented packrat parsing in both Python and Scheme:
Scheme was faster, and ultimately more natural.
That's quite possible - the problem would be that I don't know Scheme, and I
am going to be extremely busy for the foreseeable future at school. I'd
rather not have to write a packrat parser myself, anyway... However simple
they may be, they improve drastically with optimizations, and I don't
anticipate having the time to implement a proper system.
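
For reference, the unoptimized core of the idea really is small. The sketch below shows only the memoization that makes a parser "packrat": each (rule, position) pair is evaluated at most once, so backtracking into a second alternative reuses earlier work. The class name Packrat, the rules word and pair, and the calls counter are all invented for this illustration; none of the optimizations a serious implementation would want are included.

```python
class Packrat:
    """Memoizing rule dispatcher: results are cached per (rule, position)."""

    def __init__(self, text):
        self.text = text
        self.memo = {}    # (rule name, position) -> end position or None
        self.calls = 0    # number of times a rule body actually ran

    def apply(self, rule, pos):
        key = (rule.__name__, pos)
        if key not in self.memo:
            self.calls += 1
            self.memo[key] = rule(self, pos)
        return self.memo[key]

def word(p, pos):
    """word <- [A-Za-z]+"""
    end = pos
    while end < len(p.text) and p.text[end].isalpha():
        end += 1
    return end if end > pos else None

def pair(p, pos):
    """pair <- word "=" word / word"""
    end = p.apply(word, pos)
    if end is not None and p.text.startswith("=", end):
        rhs = p.apply(word, end + 1)
        if rhs is not None:
            return rhs
    # Second alternative: `word` at `pos` is served from the memo table.
    return p.apply(word, pos)

p = Packrat("abc")
result = p.apply(pair, 0)
```

After this run the rule bodies have executed only twice (pair once, word once), even though word is requested at position 0 by both alternatives.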
Unless a good Python-accessible packrat parser already exists, I'm most
likely to just build a solid PEG frontend for mxTextTools. It's a very
powerful text parser, and tends to be fast (the module's mostly written in
C). I think it could easily support all PEG features. Actually, I think
SimpleParse (another mxTextTools frontend) already supports at least 90% of
PEG features, so maybe the best idea is simply to rework SimpleParse to use
standard PEG syntax instead of its extremely extended BNF variant.
I'm not sure about the best way to implement an API: have you
considered just using the parser to convert from wikitext to
something like PYX, which is a very simple-to-parse and
Python-friendly representation of an XML data structure...
Something like that would probably be ideal, although I'd tend to prefer a
more abstract data structure that's programmatically accessible - maybe an
mxTextTools tag list (its normal output format) is closer to what I mean.
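
To show how close the two representations are, here is a small sketch that walks a tag-list-shaped structure and emits PYX-style lines ("(" for a start tag, "-" for character data, ")" for an end tag). The nested (tag, left, right, subtags) tuples are modelled on the mxTextTools tag list format, but the list below is built by hand rather than produced by the library, and the tag names "bold" and "text" are assumptions made for the example.

```python
def to_pyx(text, taglist):
    """Convert nested (tag, left, right, subtags) tuples to PYX-style lines."""
    lines = []
    for tag, left, right, subtags in taglist:
        lines.append("(" + tag)
        if subtags:
            lines.extend(to_pyx(text, subtags))    # recurse into children
        else:
            lines.append("-" + text[left:right])   # leaf: emit the matched slice
        lines.append(")" + tag)
    return lines

source = "'''bold''' text"
# Hand-built stand-in for parser output; offsets index into `source`.
taglist = [
    ("bold", 0, 10, [("text", 3, 7, None)]),
    ("text", 10, 15, None),
]

pyx_lines = to_pyx(source, taglist)
```

Going the other way is just as mechanical, which suggests the tag list could serve as the programmatic structure while PYX stays available as a serialization.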
- Eric Astor
mxTextTools:
http://www.egenix.com/files/python/mxTextTools.html
SimpleParse:
http://simpleparse.sourceforge.net/simpleparse_grammars.html