2010-09-26 20:57, Aryeh Gregor wrote:
> On Thu, Sep 23, 2010 at 8:47 AM, Andreas Jonsson
> <andreas.jonsson(a)kreablo.se> wrote:
> . . . You can come up with thousands of situations like this, and
> without a consistent plan on how to handle them, you will need to
> add thousands of border cases to the code to handle them all.
I have avoided this by simply disabling all HTML block tokens inside
wikitext list items. Of course, someone may actually be relying on
being able to mix markup in this way, but that seems unlikely, as the
result tends to be strange.
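Roughly, the idea is a context flag that the token productions
consult. The following C fragment is only an illustration of that
idea; the names are made up here and are not taken from the grammar:

    #include <stdbool.h>

    /* Hypothetical parser context; the field name is invented for
     * this illustration. */
    typedef struct {
        bool in_list_item;  /* set while a wikitext list item is open */
    } ParserContext;

    /* Gate for HTML block token productions (<div>, <table>, ...):
     * inside a list item the production is disabled, so the markup
     * falls through to the plain-text rules instead of opening a
     * block. */
    static bool html_block_token_enabled(const ParserContext *ctx)
    {
        return !ctx->in_list_item;
    }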
> The way the parser is used in real life is that people just write
> random stuff until it looks right. They wind up hitting all sorts of
> bizarre edge cases, and these are propagated to thousands of pages
> by templates.
Yes, this is a problem.
> A pure-PHP parser is needed for end users who can't install
> binaries, and any replacement parser must be compatible with it in
> practice, not just on the cases where the pure-PHP parser behaves
> sanely. In principle, we might be able to change parser behavior in
> lots of edge cases and let users fix the broken stuff, if the
> benefit is large enough. But we'd have to have a pure-PHP parser
> that implements the new behavior too.
ANTLR is a multi-language parser generator. Unfortunately, PHP is not
currently on the list of target languages. Porting the back end to
PHP is a feasible task, however, and so is porting my parser
implementation to PHP. The question then becomes whether you want to
maintain two language versions in order to also keep the performance
advantage of the C parser.
> The parts you considered to be the hard parts are not that hard.
What support do you have for this claim? Parsing wikitext is
difficult because of the any-input-is-valid-wikitext philosophy, and
parsing MediaWiki wikitext in particular is very difficult, since it
was never designed to be parsable.
I consider the parts I pointed out to be hard because they cannot be
implemented with standard parser techniques. I have developed a state
machine for enabling and disabling individual token productions
depending on context, and I have employed speculative execution in
the lexical analyzer to support context-sensitive lookahead. I don't
believe that you will find these techniques in any textbook on
compiler design. So I consider these items hard in the sense that,
before I started working on my implementation, it was not at all
clear that I would be able to find a workable algorithm.
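To give a rough idea of the speculative lookahead, here is an
illustrative C fragment (again with invented names, not the generated
code): the lexer checkpoints its position, scans ahead, and rolls
back before deciding which token to emit.

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical lexer state; the real lexer would also carry the
     * token-enable state machine mentioned above. */
    typedef struct {
        const char *input;  /* NUL-terminated buffer being scanned */
        size_t      pos;    /* current scan position */
    } Lexer;

    /* Speculatively scan ahead on the current line for a closing ''
     * pair. The position is checkpointed, the scan is run, and the
     * position is rolled back, so the token that gets emitted can
     * depend on context that a fixed-length lookahead cannot see. */
    static bool closing_quotes_ahead(Lexer *lx)
    {
        size_t saved = lx->pos;  /* checkpoint */
        bool found = false;
        while (lx->input[lx->pos] != '\0' && lx->input[lx->pos] != '\n') {
            if (lx->input[lx->pos] == '\'' && lx->input[lx->pos + 1] == '\'') {
                found = true;
                break;
            }
            lx->pos++;
        }
        lx->pos = saved;  /* roll back the speculation */
        return found;
    }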
As for the apostrophe heuristics, as much as 30% of the CPU time
seems to be spent on them, regardless of whether there are any
apostrophes in the text. So I consider that hard in the sense that it
is a very high cost for very little functionality. It might be
possible to get rid of this overhead, though at the cost of higher
implementation complexity.
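For those not familiar with the problem: the heuristics must classify
every run of apostrophes ('' is italic, ''' is bold, ''''' is bold
italic) and then decide which apostrophes on an ambiguous line are
literal text. As an illustration only (this reflects the documented
wikitext convention, not my implementation):

    #include <stddef.h>

    enum QuoteKind { QK_NONE, QK_ITALIC, QK_BOLD, QK_BOLD_ITALIC };

    /* Classify a run of n consecutive apostrophes: 2 is italic, 3 is
     * bold, 5 is bold italic; a run of 4 is treated as bold with one
     * literal apostrophe, and longer runs push the extra apostrophes
     * into the text. Balancing these runs across a whole line is the
     * expensive part. */
    static enum QuoteKind classify_quote_run(size_t n)
    {
        if (n >= 5) return QK_BOLD_ITALIC;
        if (n >= 3) return QK_BOLD;
        if (n == 2) return QK_ITALIC;
        return QK_NONE;
    }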
> We've had lots of parser projects, and I'm sure some have handled
> those.
Point me to one that has.
> The hard part is coming up with a practical way to integrate the new
> and simplified parser into MediaWiki in such a way that it can
> actually be used at some point on sites like Wikipedia. Do you have
> any plans for how to do that?
Developing a fully featured integration would require a large amount
of work, but I wouldn't call it hard. I haven't analysed all of the
hooks, and it is possible that some difficulties will turn up when
implementing emulation of these. But other than that, I cannot see
that the integration work would consist of anything but a long list
of relatively simple tasks. If a project were launched to perform the
integration, I would feel confident that it would reach its goal.
Considering the large body of information that is encoded in
MediaWiki syntax, I would guess that there is a strong interest in
actually spending effort on this.
/Andreas