Magnus Manske wrote:
Or we could use one of these
weird compiler-generating languages, or parser-generating ones, if there
is such a thing. The point is, it will be separate (read: independent)
from the rest of the software, which will simplify things enormously, IMHO.
Personally, I am very much in favour of using such a parser generator.
Some time ago, someone here on this mailing list already proposed this,
but I can't find it now. One major advantage of this would be that we
can extend the grammar and have the parser be generated from it. We
would no longer have to actually tweak the parser. Other advantages
include efficiency, and simply the assurance that this is the "correct
way", because all other professional applications do it this way.
The process is usually broken down into four phases:
(1) lexing -- turn raw text into a series of tokens
(2) parsing -- turn the series of tokens into a parse tree
(3) processing -- transform or annotate the parse tree
(4) compiling -- turn the processed parse tree into the requested output format
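
To make the four phases concrete, here is a minimal sketch in Python
(chosen only for brevity, not as a proposal for the real thing). It
handles a made-up subset of wiki text: '''bold''' and [[links]]. The
function names, the tuple-based tree and the existing_pages set are all
assumptions for illustration.

```python
import re
from html import escape
from urllib.parse import quote

# Hypothetical token rules: bold marker, [[link]], anything else as text.
TOKEN_RE = re.compile(
    r"(?P<BOLD>''')|\[\[(?P<LINK>[^\]]+)\]\]|(?P<TEXT>[^'\[]+|.)", re.S)

def lex(text):
    """Phase 1: raw text -> list of (kind, value) tokens."""
    return [(m.lastgroup, m.group(m.lastgroup)) for m in TOKEN_RE.finditer(text)]

def parse(tokens):
    """Phase 2: tokens -> parse tree of (kind, payload) tuples."""
    root = ("doc", [])
    stack = [root]
    for kind, value in tokens:
        if kind == "BOLD":
            if stack[-1][0] == "bold":
                stack.pop()                    # closing marker
            else:
                node = ("bold", [])            # opening marker
                stack[-1][1].append(node)
                stack.append(node)
        elif kind == "LINK":
            stack[-1][1].append(("link", value))
        else:
            stack[-1][1].append(("text", value))
    return root

def process(node, existing_pages):
    """Phase 3: annotate each link with whether its target exists."""
    kind = node[0]
    if kind in ("doc", "bold"):
        return (kind, [process(child, existing_pages) for child in node[1]])
    if kind == "link":
        return ("link", node[1], node[1] in existing_pages)
    return node

def compile_html(node):
    """Phase 4: processed parse tree -> HTML string."""
    kind = node[0]
    if kind == "doc":
        return "".join(compile_html(child) for child in node[1])
    if kind == "bold":
        return "<b>" + "".join(compile_html(child) for child in node[1]) + "</b>"
    if kind == "link":
        _, target, exists = node
        css = "" if exists else ' class="new"'
        return '<a href="/wiki/%s"%s>%s</a>' % (quote(target), css, escape(target))
    return escape(node[1])

def render(text, existing_pages):
    return compile_html(process(parse(lex(text)), existing_pages))
```

Each phase only sees the output of the previous one, so a different
output format would mean replacing compile_html and nothing else.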
This is extremely general; this whole procedure can apply to programming
language compilers (e.g. gcc), markup processors (e.g. browsers, LaTeX)
and pretty much anything else that turns a text file in one format into
some other format (not necessarily text: in the case of a compiler, it
would be executable code). Because of this generality, many tools to
perform these tasks already exist. In the case of step 1, this is what
"lex" does. Step 2 is the field of expertise of parser generators such
as "yacc" or its free-software equivalent "bison". These are C-centric
in the sense that they output C code; I'm sure PHP ones exist, but maybe
we want to use C for efficiency anyway. Steps 3 and 4 are
application-dependent, so they are programmed manually, but given a
parse tree, they are easy.
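
As a rough illustration of the generator idea (rules go in, a tokenizer
comes out), here is a small Python sketch. It is not how lex works
internally: lex picks the longest match, whereas this simply tries the
rules in the order given. make_lexer and the rule names are made up for
the example.

```python
import re

def make_lexer(rules):
    """Build a tokenizer from (name, regex) rules, tried in the order given."""
    pattern = re.compile(
        "|".join("(?P<%s>%s)" % (name, rx) for name, rx in rules), re.S)

    def lexer(text):
        pos, tokens = 0, []
        while pos < len(text):
            m = pattern.match(text, pos)
            if m is None or m.end() == pos:
                raise ValueError("no rule matches at offset %d" % pos)
            tokens.append((m.lastgroup, m.group()))
            pos = m.end()
        return tokens

    return lexer

# Hypothetical rule table for a tiny wiki subset.
RULES = [
    ("BOLD", r"'''"),
    ("LINK_OPEN", r"\[\["),
    ("LINK_CLOSE", r"\]\]"),
    ("TEXT", r"[^'\[\]]+|."),
]

# Extending the language means adding a rule; the engine is untouched.
EXTENDED = RULES[:1] + [("ITALIC", r"''")] + RULES[1:]

basic_lexer = make_lexer(RULES)
extended_lexer = make_lexer(EXTENDED)
```

With the basic rules, two apostrophes come out as two TEXT tokens; with
the extended rules they become a single ITALIC token, and nothing in
make_lexer had to change.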
The "process" step is particularly application-dependent; in the case of
a programming language compiler, for example, it might perform
optimisations. In our case, it means:
(a) find template inclusions, recursively call this entire process with
the template's wiki text and replace the template inclusion with the
resulting parse tree;
(b) find links and determine if the page they point to is non-existent,
a stub, etc., and "annotate" the parse tree accordingly;
(c) probably other little things I haven't thought of.
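
A minimal Python sketch of (a) and (b), under several assumptions:
parse_wikitext is a regex stand-in for the real lexing and parsing
phases, templates and page_sizes are plain dicts standing in for
database lookups, "stub" is taken to mean "shorter than stub_limit
bytes", and the guard against a template including itself is an
addition of mine, not something settled above.

```python
import re

NODE_RE = re.compile(r"\{\{([^{}]+)\}\}|\[\[([^\[\]]+)\]\]")

def parse_wikitext(text):
    """Stand-in for phases 1 and 2: returns a flat list of node dicts."""
    nodes, pos = [], 0
    for m in NODE_RE.finditer(text):
        if m.start() > pos:
            nodes.append({"type": "text", "value": text[pos:m.start()]})
        if m.group(1) is not None:
            nodes.append({"type": "template", "name": m.group(1).strip()})
        else:
            nodes.append({"type": "link", "target": m.group(2).strip()})
        pos = m.end()
    if pos < len(text):
        nodes.append({"type": "text", "value": text[pos:]})
    return nodes

def process(nodes, templates, page_sizes, stub_limit=100, active=()):
    """Expand template inclusions recursively and annotate links."""
    out = []
    for node in nodes:
        if node["type"] == "template":
            name = node["name"]
            if name in active:
                # Assumed policy: leave a self-including template unexpanded.
                out.append(dict(node, error="loop"))
            elif name not in templates:
                out.append(dict(node, error="missing"))
            else:
                # (a) run the whole process on the template's own wiki text
                subtree = parse_wikitext(templates[name])
                out.extend(process(subtree, templates, page_sizes,
                                   stub_limit, active + (name,)))
        elif node["type"] == "link":
            # (b) annotate the link with the state of its target page
            size = page_sizes.get(node["target"])
            if size is None:
                status = "missing"
            elif size < stub_limit:
                status = "stub"
            else:
                status = "ok"
            out.append(dict(node, status=status))
        else:
            out.append(node)
    return out
```

The compile step can then choose a link's colour or class purely from
the "status" annotation, without doing any lookups of its own.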
I would be more than willing to help with this, especially steps 3 and 4
:-), but since I have absolutely no experience with lex or bison, I
would need some help with those.
Have I mentioned yet that this is the only correct way to do this? :-)
Timwi