On 11/7/07, Steve Bennett <stevagewp(a)gmail.com> wrote:
What exactly is the "goal"? If it's just "formally defining whatever it
is that the code currently does", is that a worthy goal?
Probably not. The best we can hope for is likely something like:
1) A BNF grammar is developed that covers almost all the commonly used
features. This will probably require unlimited lookahead, but I do
think (without, admittedly, much of any formal grounding in the theory
of all this) it's possible if that's allowed, keeping in mind the
"almost all" caveat.
2) Now that we have a grammar, a yacc parser is compiled, and
appropriate rendering bits are added to get it to render to HTML.
3) The stuff the BNF grammar doesn't cover is tacked on with some
other methods. In practice, it seems like a two-pass parser would be
ideal: one recursive pass to deal with templates and other
substitution-type things, then a second pass with the actual grammar
of most of the language. The first pass is of necessity recursive, so
there's probably no point in having it spend the time to repeatedly
parse italics or whatever, when it's just going to have to do it again
when it substitutes stuff in. Further rendering passes are going to
be needed, e.g., to insert the table of contents. Further parsing
passes may or may not be needed.
4) All of this breaks a thousand different corner cases and half the
parser tests. The implementers carefully go through every failed
parser test, rewrite it to the actual output, and carefully justify
why this is the correct course of action. Or just assume it is,
depending on the level of care.
5) A PHP implementation of the exact same grammar is written. How
practical this is, I don't know, but it's critical unless we want
pretty substantially different behavior for people using the PHP
module versus not. It is not acceptable to force third parties to use
a PHP module, nor to grind their parser to a halt (which a naive
compilation of the grammar into PHP would probably do).
6) Everything is rolled out live. Pages break left and right. Large
complaint threads are started on the Village Pump, people fix it, and
everyone forgets about it. Developers get a warm fuzzy feeling for
having finally succeeded at destroying Parser.php.
This is if it's to be done properly. A semi-formal specification
that's not directly useful for parsing pages would involve a lot less
work and perhaps correspondingly less benefit. It could still improve
interoperability with third parties dramatically; perhaps that's the only
goal other people have in mind, not the ability to compile a parser
with some yacc equivalent. I don't know.
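
To make the two-pass idea from step 3 concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the TEMPLATES dict, the function names, and the single ''italics'' rule are hypothetical stand-ins, and none of it reflects what Parser.php actually does. The point is only that the recursive pass substitutes templates without parsing anything else, so the italics get parsed exactly once, after substitution.

```python
import re

# Hypothetical template store, for illustration only.
TEMPLATES = {
    "greeting": "''hello'' {{name}}",
    "name": "world",
}

# Matches innermost, argument-free {{name}} inclusions only.
TEMPLATE_RE = re.compile(r"\{\{([^{}|]+)\}\}")


def expand_templates(text, depth=0, max_depth=10):
    """Pass 1: recursively substitute {{name}}, leaving other markup alone."""
    if depth >= max_depth:  # guard against templates that include themselves
        return text

    def substitute(match):
        name = match.group(1).strip()
        if name not in TEMPLATES:
            return match.group(0)  # unknown template: leave it untouched
        return expand_templates(TEMPLATES[name], depth + 1, max_depth)

    return TEMPLATE_RE.sub(substitute, text)


def render_inline(text):
    """Pass 2: stand-in for the grammar-driven pass; handles only ''italics''."""
    return re.sub(r"''(.+?)''", r"<i>\1</i>", text)


def render(text):
    """Run both passes in order: substitution first, then markup."""
    return render_inline(expand_templates(text))


if __name__ == "__main__":
    print(render("{{greeting}}, ''again''"))
```

A real second pass would be the generated parser rather than one regex, and a real first pass would have to handle template arguments, parser functions and so on.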
On 11/7/07, Steve Bennett <stevagewp(a)gmail.com> wrote:
Not to mention that BNF is not really suited to the task. BNF is supposed
to answer the question "does text A match grammar B?" However, essentially
all wikitext is "valid" - so we're really looking for something that
answers the question "how should text A be rendered" or "what is the
meaning of text A" or even "how should text A be converted into a
decorated* syntax tree".
BNF does that. The *language* generated by a grammar is distinct from
the grammar itself: two grammars can be different but generate the
same language. In this case, the language might be the set of all
strings, but applying the grammar to a string gets us a parse tree,
which is what we want. Specifically, yacc and similar programs (e.g.,
bison) will execute provided code snippets every time they match a
particular rule (production) of the grammar, or something like that, I
gather. This should be able to include appending to an HTML output
string.
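
As a rough illustration of that point, here is a small hand-written Python parser (not yacc output) for a toy grammar I made up for the purpose; it is not MediaWiki's real rules. The grammar accepts every possible string, because an unmatched '' just falls through to the plain-character alternative, yet parsing still produces a tree, and a per-node action can append HTML to the output.

```python
import html


def parse(text):
    """Return a parse tree for any input string; nothing is ever rejected.

    Toy grammar (an assumption for illustration):
        text   := (italic | char)*
        italic := "''" char* "''"
    An unmatched '' is consumed by the `char` alternative instead.
    """
    tree = []
    i = 0
    while i < len(text):
        if text.startswith("''", i):
            end = text.find("''", i + 2)
            if end != -1:
                tree.append(("italic", text[i + 2:end]))
                i = end + 2
                continue
        # Fallback alternative: a plain character, merged into a text node.
        if tree and tree[-1][0] == "text":
            tree[-1] = ("text", tree[-1][1] + text[i])
        else:
            tree.append(("text", text[i]))
        i += 1
    return tree


def to_html(tree):
    """Play the role of the per-rule actions: emit HTML for each node."""
    out = []
    for kind, value in tree:
        escaped = html.escape(value, quote=False)
        out.append(f"<i>{escaped}</i>" if kind == "italic" else escaped)
    return "".join(out)


if __name__ == "__main__":
    for sample in ["a ''b'' c", "''oops"]:
        print(parse(sample), "->", to_html(parse(sample)))
```

Both sample strings are "valid"; they differ only in the tree they produce, which is the part we care about.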