On Feb 8, 2008 3:31 PM, Steve Bennett <stevagewp(a)gmail.com> wrote:
On 2/9/08, Magnus Manske
<magnusmanske(a)googlemail.com> wrote:
My http://tools.wikimedia.de/~magnus/wiki2xml/w2x.php
parses it correctly as well, but it's still manual PHP hacks, while
yours is a real parser - respect!
Not too much respect. I think I have only *just* worked out why I need
all these syntactic predicates and what backtracking is used for.
Everywhere in my grammar I have had to place predicates like this:
((LEFT_BRACKET LEFT_BRACKET LEFT_BRACKET) => literal_left_bracket
// try and save it some time on [[[foo]]]?
|(literal_left_bracket bracketed_url) => literal_left_bracket
|(image) => image
|(category) => category
|(external_link) => external_link
|(internal_link) => internal_link
|(magic_link) => magic_link
|pre_block
|(formatted_text_elem) => formatted_text_elem
)
The bit before the => on each line basically says "look ahead, and if
the syntax matches the bit in brackets, then go ahead and parse it as
the bit after the =>".
I never knew why I needed them to make it work, but now I see: in the
case of an image, if it just dove straight into trying to parse a
string like [[image:foo]] (not a valid image), it would hit the first
[[, think the image rule matched, and keep going. Eventually it would
realise the rule didn't match but it would be too late: because the
grammar is blatantly not LL(k) (I think?), it would just fail (unless
it could backtrack, which I'm not using). By using the syntactic
predicate, it's able to prevent itself from falling in a hole - it
looks ahead, sees "that looks like an image...oh wait, no it's not!",
and tries the next rule instead.
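
The lookahead-then-commit behaviour described above can be sketched in
Python. This is only a toy illustration, not the actual ANTLR grammar: the
two regex rules, the function names, and the assumption that an image needs
a file extension (which is why [[image:foo]] gets rejected here) are all
made up for the example.

```python
import re

# Assumed toy rules; the real image and link rules are more involved.
# Order matters: earlier alternatives are tried first, as in the grammar.
RULES = [
    ("image", re.compile(
        r"\[\[image:[^\[\]|]+\.(?:jpg|png|gif)(?:\|[^\[\]]*)?\]\]",
        re.IGNORECASE)),
    ("internal_link", re.compile(r"\[\[[^\[\]]+\]\]")),
]


def parse_element(text, pos=0):
    """Return (rule_name, matched_text, next_pos) for the element at pos."""
    for name, pattern in RULES:
        # The "predicate": check the whole alternative by lookahead.
        # Nothing is consumed unless the full pattern matches.
        m = pattern.match(text, pos)
        if m:
            return name, m.group(0), m.end()
    # No predicate matched: treat a single character as literal text.
    return "literal", text[pos], pos + 1


def parse_all(text):
    """Parse the whole string into a list of (rule_name, matched_text)."""
    pos, out = 0, []
    while pos < len(text):
        name, matched, pos = parse_element(text, pos)
        out.append((name, matched))
    return out
```

With these rules, parse_element("[[image:foo]]") falls through the image
alternative and is picked up by internal_link instead of failing halfway
through the image rule.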
There's a huge amount of messiness in the grammar so far caused by me
not really understanding this stuff. I also haven't been very clean
about where newlines and whitespace are handled exactly.
Anyway, my latest rant about tables (sorry Magnus :)): in the following
table, which part is the style attribute for a table cell, and which
part is the cell contents?
{|
|an [[image:foo.jpg|thumb|blah|]] or [[blaah|moo|wah]] floop | moop
|}
(Reminder: cell definitions with style attributes look like this:
| style | contents ||...)
Buggered if I know. I might have to impose a rule involving the range
of possible characters that could appear in the style attribute. I
didn't really want to have to actually parse that bit properly...
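
One way the "restricted range of characters" rule might look, sketched in
Python. The character set and the split_cell helper are assumptions made
for illustration, not anything the grammar or MediaWiki defines; the only
point is that text before the first | counts as a style attribute only if
it contains nothing but attribute-like characters (so no [ or ]).

```python
import re

# Assumption: a style attribute contains only word characters, whitespace,
# '=', quotes and a little CSS punctuation. Notably no '[' or ']'.
STYLE_CHARS = re.compile(r"""[\w\s="'#%;:.,-]*""")


def split_cell(cell):
    """Split the text after a cell's leading '|' into (style, contents).

    Returns (None, contents) when the part before the first '|' does not
    look like a style attribute.
    """
    head, sep, tail = cell.partition("|")
    if sep and STYLE_CHARS.fullmatch(head):
        return head.strip(), tail.strip()
    return None, cell.strip()
```

Under this rule the cell from the table above has no style attribute at
all: the text before its first | is "an [[image:foo.jpg", which contains
brackets, so the whole cell is treated as contents.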
That's exactly what I did in wiki2xml, and it works (yesss, still ahead;-)
Of course, I cheap out in another regard there: wiki2xml parses images
and links alike, and parses even links with "too many" parameters. My
reasons for that:
* Laziness
* No need to know the language/wiki settings (which make "Image:"
special for en)
* Flexible for "add-ons" (who knows, we might use three-part links someday...)
* Not much additional burden for the next level (XML-to-something)
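
A rough Python sketch of the "parse images and links alike" idea, for
illustration only. This is not the wiki2xml code (which is PHP), and
parse_link is a hypothetical helper: it splits any [[...]] into a target
plus however many parameters follow, and leaves it to the next level to
decide whether the target is an image.

```python
def parse_link(markup):
    """Parse '[[target|p1|p2|...]]' into (target, [params]).

    No special-casing of images and no limit on the number of parameters.
    Nested links inside parameters are not handled in this sketch.
    """
    if not (markup.startswith("[[") and markup.endswith("]]")):
        raise ValueError("not a double-bracket link")
    target, *params = markup[2:-2].split("|")
    return target.strip(), [p.strip() for p in params]
```

For example, parse_link("[[blaah|moo|wah]]") gives the target "blaah" with
the two parameters "moo" and "wah", even though a plain link would normally
take only one.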
Cheers,
Magnus