On Thu, May 5, 2011 at 3:21 AM, Andreas Jonsson
<andreas.jonsson(a)kreablo.se> wrote:
2011-05-04 08:13, Tim Starling wrote:
On 04/05/11 15:52, Andreas Jonsson wrote:
The time it takes to execute the code that glues
together the regexps
will be insignificant compared to actually executing the regexps for any
article larger than a few hundred bytes. This is at least the case for
the articles that are the easiest for the core parser: articles that
contain no markup. The more markup, the slower it will run. It is
possible that this slowdown will be lessened if compiled with HipHop,
but the top speed of the parser (in bytes/second) will be largely
unaffected.
PHP execution dominates for real test cases, and HipHop provides a
massive speedup. See the previous HipHop thread.
http://lists.wikimedia.org/pipermail/wikitech-l/2011-April/052679.html
Unfortunately, users refuse to write articles consisting only of
hundreds of kilobytes of plain text, they keep adding references and
links and things. So we don't really care about the parser's "top
speed".
We are talking about different things. I don't consider callbacks made
when processing "magic words" or "parser functions" to be part of the
actual parsing. The reference case of markup-free input is interesting
to me because it marks the maximum throughput of the MediaWiki parser,
and is what you would compare alternative implementations against. But,
obviously, if the Barack Obama article takes 22 seconds to render, there
are more severe problems than parser performance at the moment.
It's a little more complicated than that, and obviously you haven't
spent a lot of time looking at profiling output from parsing the
Barack Obama article if you say that. What, if not the parser, is
slowing down the processing of that article?
Consider the following:
1. Many things that you would exclude from "parsing", like reference
tags and what-not, call the parser themselves.
2. Regardless of whether you include the actual callback in your
measurements of parser run time, you need to consider them.
Identifying structures that require callbacks, as well as structures
that don't (such as links, templates, images, and what-not), takes
time. While you might reasonably exclude ifexist calls and so on from
parser run time, you most certainly cannot reasonably exclude template
calls, link processing, or the extra time taken by the preprocessor
to identify such structures.
As Domas says, real world data is king. As far as I know, in the case
of 'a a a a', even if you repeat it for a few MB, virtually no PHP
code is run, because the preprocessor uses strcspn to identify
structures requiring preprocessing. That's implemented in C; in fact,
for 'a a a a' repeated for a few MB, it's my (probably totally wrong)
understanding that the PHP code runs in more or less constant time.
It's the structures that appear in real articles that make the parser
slow.
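The strcspn idea can be sketched as follows (a hypothetical Python
analogue, not MediaWiki's actual code; the character set is
illustrative): one pass finds the length of the leading run that
contains no markup-start characters, so markup-free input costs almost
no interpreted-code time.

```python
import re

# Hypothetical analogue of the preprocessor's strcspn call: measure how
# far the input runs before any character that could open a
# preprocessor structure ("{", "<", "[", ...).
PLAIN_RUN = re.compile(r"[^{}<\[\|]*")

def plain_prefix_len(text: str) -> int:
    """Length of the leading run containing no markup-start characters."""
    return PLAIN_RUN.match(text).end()

print(plain_prefix_len("a a a a " * 3))     # 24: the whole string is plain
print(plain_prefix_len("a a a a {{tpl}}"))  # 8: stops at the first "{"
```

In the real preprocessor this scan is done by C's strcspn, which is why
repeating 'a a a a' for megabytes keeps nearly all the work inside the
C library rather than the PHP interpreter.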
I'm sorry, I misunderstood the original statement that HipHop would
make _parsing_ significantly faster and questioned that on false
premises, because I'm thinking of the parser and the preprocessor as
distinctly different components.
Let me explain: as I see it, the first step in formalizing wikitext
syntax is to analyze the language and write a parser that can be used
as a drop-in replacement after preprocessing. The constructions that
are preprocessed cannot be integrated with the parser without
sacrificing compatibility. Preprocessing is problematic: it breaks the
one-to-one relationship between the wikitext and the syntax tree
(i.e., it is impossible to serialize a syntax tree back to the same
wikitext that generated it). Therefore, in a second step, it should be
analyzed how the preprocessed constructions can be integrated with the
parser and how to minimize the damage from this change.
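The round-trip problem can be illustrated with a toy sketch (all names
are hypothetical; plain substitution stands in for MediaWiki's real
template engine):

```python
# Toy template store and expander; names are hypothetical.
TEMPLATES = {"greeting": "Hello"}

def preprocess(wikitext: str) -> str:
    """Expand {{name}} template calls by plain substitution."""
    for name, body in TEMPLATES.items():
        wikitext = wikitext.replace("{{%s}}" % name, body)
    return wikitext

# Two different wikitext sources collapse to the same preprocessed
# form, so no serializer can map the result back to a unique source.
a = preprocess("{{greeting}} world")  # template call expanded
b = preprocess("Hello world")         # literal text, untouched
print(a == b)                         # True: the distinction is lost
```

Once expansion has happened, the syntax tree built from the result
carries no trace of which parts came from templates, which is exactly
the one-to-one relationship that preprocessing destroys.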
I had not analyzed the parts of the core parser that I consider
"preprocessing", and it came as a surprise to me that they are as slow
as the Barack Obama benchmark shows. But integrating template
expansion with the parser would solve this performance problem, and is
therefore in itself a strong argument for working towards replacing
it. I will write about this on wikitext-l.
Best Regards,
Andreas Jonsson