On Thu, May 5, 2011 at 3:21 AM, Andreas Jonsson
<andreas.jonsson(a)kreablo.se> wrote:
2011-05-04 08:13, Tim Starling wrote:
On 04/05/11 15:52, Andreas Jonsson wrote:
The time it takes to execute the code that glues
together the regexps
will be insignificant compared to actually executing the regexps for any
article larger than a few hundred bytes. This is at least the case for
the articles that are the easiest for the core parser: articles that
contain no markup. The more markup, the slower it will run. It is
possible that this slowdown will be lessened if compiled with HipHop,
but the top speed of the parser (in bytes/second) will be largely
unaffected.
PHP execution dominates for real test cases, and HipHop provides a
massive speedup. See the previous HipHop thread.
http://lists.wikimedia.org/pipermail/wikitech-l/2011-April/052679.html
Unfortunately, users refuse to write articles consisting only of
hundreds of kilobytes of plain text, they keep adding references and
links and things. So we don't really care about the parser's "top
speed".
We are talking about different things. I don't consider callbacks made
when processing "magic words" or "parser functions" to be part of the
actual parsing. The reference case of markup-free input is interesting
to me because it marks the maximum throughput of the MediaWiki parser,
and is what you would compare alternative implementations against. But,
obviously, if the Barack Obama article takes 22 seconds to render, there
are more severe problems than parser performance at the moment.
It's a little more complicated than that, and obviously you haven't
spent a lot of time looking at profiling output from parsing the
Barack Obama article if you say that. What, if not the parser, is
slowing down the processing of that article?
Consider the following:
1. Many things that you would exclude from "parsing", like reference
tags and what-not, call the parser themselves.
2. Regardless of whether you include the actual callback in your
measurements of parser run time, you need to consider them.
Identifying structures that require callbacks, as well as structures
that don't (such as links, templates, images, and what-not), takes
time. While you might reasonably exclude ifexist calls and so on from
parser run time, you most certainly cannot reasonably exclude template
calls, link processing, or the extra time taken by the preprocessor
to identify such structures.
As Domas says, real world data is king. As far as I know, in the case
of 'a a a a', even if you repeat it for a few MB, virtually no PHP
code is run, because the preprocessor uses strcspn to identify
structures requiring preprocessing. That's implemented in C; in fact,
for 'a a a a' repeated for a few MB, it's my (probably totally wrong)
understanding that the PHP code runs in more or less constant time.
It's the structures that appear in real articles that make the parser
slow.
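The strcspn idea can be sketched as follows (a hypothetical Python
analogue, not MediaWiki's actual code; the character set is
illustrative): one pass finds the length of the leading run that
contains no markup-start characters, so markup-free input costs almost
no interpreted-code time.

```python
import re

# Hypothetical analogue of the preprocessor's strcspn call: measure how
# far the input runs before any character that could open a
# preprocessor structure ("{", "<", "[", ...).
PLAIN_RUN = re.compile(r"[^{}<\[\|]*")

def plain_prefix_len(text: str) -> int:
    """Length of the leading run containing no markup-start characters."""
    return PLAIN_RUN.match(text).end()

print(plain_prefix_len("a a a a " * 3))     # 24: the whole string is plain
print(plain_prefix_len("a a a a {{tpl}}"))  # 8: stops at the first "{"
```

In the real preprocessor this scan is done by C's strcspn, which is why
repeating 'a a a a' for megabytes keeps nearly all the work inside the
C library rather than the PHP interpreter.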
I'm sorry, I misunderstood the original statement that HipHop would
make _parsing_ significantly faster and questioned that on false
premises, because I'm thinking of the parser and the preprocessor as
distinctly different components.
Let me explain: as I see it, the first step in formalizing wikitext
syntax is to analyze the language and write a parser that can be used
as a drop-in replacement after preprocessing. The constructions that
are preprocessed cannot be integrated with the parser without
sacrificing compatibility. Preprocessing is problematic: it breaks the
one-to-one relationship between the wikitext and the syntax tree
(i.e., it is impossible to serialize a syntax tree back to the same
wikitext that generated it). Therefore, in a second step, it should be
analyzed how the preprocessed constructions can be integrated with the
parser and how to minimize the damage from this change.
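The round-trip problem can be illustrated with a toy sketch (all names
are hypothetical; plain substitution stands in for MediaWiki's real
template engine):

```python
# Toy template store and expander; names are hypothetical.
TEMPLATES = {"greeting": "Hello"}

def preprocess(wikitext: str) -> str:
    """Expand {{name}} template calls by plain substitution."""
    for name, body in TEMPLATES.items():
        wikitext = wikitext.replace("{{%s}}" % name, body)
    return wikitext

# Two different wikitext sources collapse to the same preprocessed
# form, so no serializer can map the result back to a unique source.
a = preprocess("{{greeting}} world")  # template call expanded
b = preprocess("Hello world")         # literal text, untouched
print(a == b)                         # True: the distinction is lost
```

Once expansion has happened, the syntax tree built from the result
carries no trace of which parts came from templates, which is exactly
the one-to-one relationship that preprocessing destroys.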
I had not analyzed the parts of the core parser that I consider
"preprocessing", and it came as a surprise to me that they are as slow
as the Barack Obama benchmark shows. But integrating template
expansion with the parser would solve this performance problem, and is
therefore in itself a strong argument for working towards replacing
it. I will write about this on wikitext-l.
Best Regards,
Andreas Jonsson