On Fri, 09 Nov 2007 15:24:10 +1100, Steve Bennett wrote:
On 11/9/07, Simetrical <Simetrical+wikilist(a)gmail.com> wrote:
According to flex documentation, it's perfectly happy to accept any
regex for tokens, and will use unlimited lookahead and backtracking if
necessary. It provides debug info allowing you to check for and
eliminate backtracking, if you want to speed it up, but that's optional.
Clearly it's not possible to tokenize MW markup with one-character
lookahead: you sure can't tell the difference between a second- and
sixth-level heading, and of course that's even ignoring
Yes you can, if ====== is a token. Which at first glance, it should be.
The fact that == looks like === looks like ==== is neither here nor there
to the grammar - it's a handy mnemonic for humans, that's all.
Well, that's exactly the point. At first glance, === is obviously a token,
which will perfectly handle 99% of the headings out there. But if we want
a complete grammar, we really need sane handling for the last 1%.
To get those into one token would require the tokenizer to do a bit of
parsing to match things up. If the tokenizer instead just recognizes a run
of "=" as a token and passes its length to the parser as a value, the
parser can do the matching, which would probably be a cleaner
implementation.
I'm not sure if there's a notation for values in EBNF, so to invent one
for this example: treating
===head==
as "==" TEXT("=head") "==" would be nice, but tricky.
Treating it as "="(3) TEXT("head") "="(2) would make for a cleaner lexer,
and the parser should be able to handle that without too much trouble.
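
To make the split concrete, here is a rough Python sketch of the second
approach. It is only an illustration, not MediaWiki's actual behaviour: the
token names EQ and TEXT, the functions lex and parse_heading, and the rule
that the heading level is the smaller of the two "=" counts (with the
surplus folded back into the text) are all assumptions made up for this
example.

```python
import re

# Hypothetical token shapes: ("EQ", n) for a run of n "=" signs,
# ("TEXT", s) for any run of other characters.
TOKEN_RE = re.compile(r"(=+)|([^=]+)")

def lex(line):
    """Split a line into EQ(count) and TEXT(value) tokens; no matching is done here."""
    tokens = []
    for m in TOKEN_RE.finditer(line):
        if m.group(1) is not None:
            tokens.append(("EQ", len(m.group(1))))
        else:
            tokens.append(("TEXT", m.group(2)))
    return tokens

def parse_heading(tokens):
    """Return (level, text) if the tokens form a heading, else None.

    Assumed rule: level = min(opening count, closing count); surplus "="
    signs on the longer side become part of the heading text.
    """
    if len(tokens) < 3 or tokens[0][0] != "EQ" or tokens[-1][0] != "EQ":
        return None
    open_n, close_n = tokens[0][1], tokens[-1][1]
    level = min(open_n, close_n)
    # Any "=" runs in the middle of the line are treated as plain text.
    inner = "".join("=" * v if k == "EQ" else v for k, v in tokens[1:-1])
    text = "=" * (open_n - level) + inner + "=" * (close_n - level)
    return level, text

print(lex("===head=="))
print(parse_heading(lex("===head==")))
```

The lexer stays a single regex with no lookahead beyond the current run,
and the unbalanced case is resolved entirely in parse_heading, which
yields the same result as the "==" TEXT("=head") "==" reading.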