Tue Dec 30 21:05:05 UTC 2008


> We do not even have a parser. I am sure you know that MediaWiki does not
> actually parse. It is 5000 lines worth of regexes, for the most part.

"Parser" is a convenient and short name for it.

I've reviewed all of the regexes, and I stand by the vast majority of
them. The PCRE regular expression module is a versatile text scanning
language, which is compiled to bytecode and executed in a VM, very much
like PHP. It just so happens that for most text processing tasks where
there is a choice between PHP and PCRE, PCRE is faster. In certain special
cases, it's possible to gain extra performance by using primitive text
scanning functions like strpos() which are implemented in C. Where this is
possible, I have done so. But if you want to, say, find the first match
from a list of strings in a single subject, searching from a given offset,
then the fastest way to do it in standard PHP is a regex with the /S modifier.
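
To make that concrete, here is a minimal sketch of the idiom (my own
illustration with a made-up set of needle strings, not code lifted from
MediaWiki), assuming $subject and $offset are already in hand:

    // Find the earliest of several literal strings in $subject,
    // starting the scan at byte $offset. The /S modifier tells PCRE to
    // spend extra time studying the pattern, which pays off when the
    // same compiled regex is reused on many subjects.
    $re = '/\[\[|\{\{|<nowiki>/S';
    if ( preg_match( $re, $subject, $m, PREG_OFFSET_CAPTURE, $offset ) ) {
        list( $token, $pos ) = $m[0]; // matched string and its byte offset
    }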

In two cases, I found the algorithms accessible from standard PHP to be
inconveniently slow, so I wrote the FSS and wikidiff2 extensions
in C and C++ respectively.

Perhaps, like so many computer science graduates, you are enamored with
the taxonomy of formal grammars and the parsers that go with them. There
are a number of problems with these traditional solutions.

Firstly, there are theoretical problems. The concept of a regular grammar
is not expressive enough to describe languages such as XML, whose
arbitrarily nested tags lie beyond any regular language, and not
descriptive enough to allow unambiguous parse tree production from a
language like wikitext. Conversely, it's trivial to invent irregular
grammars which can nonetheless be processed in linear time. My aims for
wikitext, namely
that it be easy for humans to write but fast to convert to HTML, do not
coincide well with the taxonomy of formal grammars.
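
As a toy illustration of the linear-time point (mine, not anything from
MediaWiki): the language of properly nested "{{" ... "}}" pairs is not
regular, since it requires unbounded counting, yet a single linear pass
with a counter recognises it:

    function isBalanced( $text ) {
        $depth = 0;
        $len = strlen( $text );
        for ( $i = 0; $i < $len - 1; $i++ ) {
            if ( $text[$i] === '{' && $text[$i + 1] === '{' ) {
                $depth++;
                $i++; // skip the second brace of the pair
            } elseif ( $text[$i] === '}' && $text[$i + 1] === '}' ) {
                if ( --$depth < 0 ) {
                    return false; // closed more pairs than were opened
                }
                $i++;
            }
        }
        return $depth === 0;
    }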

Secondly, there are practical problems. Past projects attempting to parse
wikitext using flex/bison or similar schemes have failed to achieve the
performance of the present parser, which is surprising because I didn't
think I was setting the bar very high. You can bet that if I ever rewrote
it in C++ myself, it would be much faster. The PHP developers themselves
are currently migrating their scanner from flex to re2c, a regex-based
lexer generator that compiles down to C, mostly for performance reasons.

Thirdly, there is the fact that certain phases of MediaWiki's parser are
already very similar to the textbook parsers and can be analysed in those
terms. The main difference is that our parser is better optimised. For
example, the preprocessor acts like a recursive descent parser, but with a
non-recursive frontend (using an internal stack), a caching phase, and a
parse tree expansion phase with special-case recursive-to-iterative
transformations to minimise stack depth.
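
For the flavour of that non-recursive frontend, here is a toy sketch of
the same idea (my own simplification, not the real Preprocessor code):
template delimiters are tracked with a plain array used as a stack, so
deeply nested input never deepens the PHP call stack:

    function templateSpans( $text ) {
        $stack = array(); // byte offsets of unmatched '{{'
        $spans = array(); // completed ( start, end, depth ) triples
        $offset = 0;
        $re = '/\{\{|\}\}/S';
        while ( preg_match( $re, $text, $m, PREG_OFFSET_CAPTURE, $offset ) ) {
            list( $token, $pos ) = $m[0];
            if ( $token === '{{' ) {
                $stack[] = $pos;
            } elseif ( count( $stack ) ) {
                $start = array_pop( $stack );
                $spans[] = array( $start, $pos + 2, count( $stack ) + 1 );
            }
            $offset = $pos + 2;
        }
        return $spans;
    }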

Yet another post:
> I don't believe a computer scientist would have a huge problem writing 
> a proper parser. Are any of the core developers computer scientists?

Frankly, as an ex-physicist, I don't find the field of computer science
particularly impressive, either in terms of academic rigour or practical
applications. I think my time would be best spent working as a software
engineer for a cause that I believe in, rather than going back to
university and studying another socially-disconnected field.

-- Tim Starling



