2010-09-26 20:57, Aryeh Gregor wrote:
> On Thu, Sep 23, 2010 at 8:47 AM, Andreas Jonsson
> <andreas.jonsson(a)kreablo.se> wrote:
> . . . You can come up with thousands of situations like this, and
> without a consistent plan on how to handle them, you will need to
> add thousands of border cases to the code to handle them all.
I have avoided this by simply disabling all HTML block tokens inside
wikitext list items. Of course, someone may actually be relying on
being able to mix markup in this way, but that seems unlikely, as the
result tends to be strange.
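Roughly, the idea is a context flag that the token productions
consult. The following C fragment is only an illustration of that
idea; the names are made up here and are not taken from the grammar:

    #include <stdbool.h>

    /* Hypothetical parser context; the field name is invented for
     * this illustration. */
    typedef struct {
        bool in_list_item;  /* set while a wikitext list item is open */
    } ParserContext;

    /* Gate for HTML block token productions (<div>, <table>, ...):
     * inside a list item the production is disabled, so the markup
     * falls through to the plain-text rules instead of opening a
     * block. */
    static bool html_block_token_enabled(const ParserContext *ctx)
    {
        return !ctx->in_list_item;
    }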
> The way the parser is used in real life is that people just write
> random stuff until it looks right. They wind up hitting all sorts of
> bizarre edge cases, and these are propagated to thousands of pages
> by templates.
Yes, this is a problem.
> A pure-PHP parser is needed for end users who can't install
> binaries, and any replacement parser must be compatible with it in
> practice, not just on the cases where the pure-PHP parser behaves
> sanely. In principle, we might be able to change parser behavior in
> lots of edge cases and let users fix the broken stuff, if the
> benefit is large enough. But we'd have to have a pure-PHP parser
> that implements the new behavior too.
ANTLR is a multi-language parser generator. Unfortunately, PHP is not
currently on the list of target languages. Porting the back end to
PHP is a feasible task, however, and so is porting my parser
implementation to PHP. The question then becomes whether you want to
maintain two language versions in order to also keep the performance
advantage of the C parser.
> The parts you considered to be the hard parts are not that hard.
What support do you have for this claim? Parsing wikitext is
difficult because of the any-input-is-valid-wikitext philosophy, and
parsing MediaWiki wikitext in particular is very difficult, since it
was never designed to be parsable.
I consider the parts I pointed out to be hard because they cannot be
implemented with standard parser techniques. I have developed a state
machine for enabling and disabling individual token productions
depending on context, and I have employed speculative execution in
the lexical analyzer to support context-sensitive lookahead. I don't
believe that you will find these techniques in any textbook on
compiler design. So I consider these items hard in the sense that,
before I started working on my implementation, it was not at all
clear that I would be able to find a workable algorithm.
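To give a rough idea of the speculative lookahead, here is an
illustrative C fragment (again with invented names, not the generated
code): the lexer checkpoints its position, scans ahead, and rolls
back before deciding which token to emit.

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical lexer state; the real lexer would also carry the
     * token-enable state machine mentioned above. */
    typedef struct {
        const char *input;  /* NUL-terminated buffer being scanned */
        size_t      pos;    /* current scan position */
    } Lexer;

    /* Speculatively scan ahead on the current line for a closing ''
     * pair. The position is checkpointed, the scan is run, and the
     * position is rolled back, so the token that gets emitted can
     * depend on context that a fixed-length lookahead cannot see. */
    static bool closing_quotes_ahead(Lexer *lx)
    {
        size_t saved = lx->pos;  /* checkpoint */
        bool found = false;
        while (lx->input[lx->pos] != '\0' && lx->input[lx->pos] != '\n') {
            if (lx->input[lx->pos] == '\'' && lx->input[lx->pos + 1] == '\'') {
                found = true;
                break;
            }
            lx->pos++;
        }
        lx->pos = saved;  /* roll back the speculation */
        return found;
    }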
As for the apostrophe heuristics, as much as 30% of the CPU time
seems to be spent on them, regardless of whether there are any
apostrophes in the text. So I consider that hard in the sense that it
is a very high cost for very little functionality. It might be
possible to get rid of this overhead, though at the cost of higher
implementation complexity.
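For those not familiar with the problem: the heuristics must classify
every run of apostrophes ('' is italic, ''' is bold, ''''' is bold
italic) and then decide which apostrophes on an ambiguous line are
literal text. As an illustration only (this reflects the documented
wikitext convention, not my implementation):

    #include <stddef.h>

    enum QuoteKind { QK_NONE, QK_ITALIC, QK_BOLD, QK_BOLD_ITALIC };

    /* Classify a run of n consecutive apostrophes: 2 is italic, 3 is
     * bold, 5 is bold italic; a run of 4 is treated as bold with one
     * literal apostrophe, and longer runs push the extra apostrophes
     * into the text. Balancing these runs across a whole line is the
     * expensive part. */
    static enum QuoteKind classify_quote_run(size_t n)
    {
        if (n >= 5) return QK_BOLD_ITALIC;
        if (n >= 3) return QK_BOLD;
        if (n == 2) return QK_ITALIC;
        return QK_NONE;
    }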
> We've had lots of parser projects, and I'm sure some have handled
> those.
Point me to one that has.
> The hard part is coming up with a practical way to integrate the new
> and simplified parser into MediaWiki in such a way that it can
> actually be used at some point on sites like Wikipedia. Do you have
> any plans for how to do that?
Developing a fully featured integration would require a large amount
of work, but I wouldn't call it hard. I haven't analysed all of the
hooks, and it is possible that some difficulties will turn up when
implementing emulation of these. But other than that, I cannot see
that the integration work would consist of anything but a long list
of relatively simple tasks. If a project were launched to perform the
integration, I would feel confident that it would reach its goal.
Considering the large body of information that is encoded in
MediaWiki syntax, I would guess that there is a strong interest in
actually spending effort on this.
/Andreas