Quoting various people:
Changes in 'existence' (yes or no) come up infrequently. When
someone creates a brand new article, all other cached articles
that are affected by the change could be updated at that time. ...
More than that, you don't have to re-generate the cached pages, you
only have to invalidate them.
Thus, when a page is created, you have to "touch" all pages with a
link to that not-yet-existing article,
and when a page is deleted, you do the same to all pages that link to
that article.
Yes. If stored in a filesystem, the caches of generated HTML for
the linking articles can be simply removed when the existence state
of an article is changed. This would be relatively rare; looking
at the "recent changes" log shows clearly that most edits are
of EXISTING articles. Editing a previously existing article shouldn't
modify any caches except that of the article being edited.
...
By not updating articles until they are accessed, you can defer a lot
of work that would otherwise bog the system down at update time.
Yes. It also means that if someone creates a number of
related "previously non-existing" articles, cached HTML files are
only created when they're needed.
Lazy evaluation is much nicer: when the page is demanded, the code
should first look for a cached page, and generate it if necessary,
before using that data to generate the article output.
Agreed.
Potential downsides:
* some articles are linked by 1000s+ of pages (like the pages linked
by auto-generated articles). Hitting these could cause a significant
pause as the cache is invalidated -- or will it?
No, it shouldn't be a problem in normal circumstances.
An HTML cache should only be changed if
the EXISTENCE of a cached page changes. If an article is
linked to by 1000s+ of pages, it almost certainly already exists,
so no cache (other than that of the one edited article) would be invalidated.
If an article is created (and didn't exist before), or completely
deleted from the database, I think the usual case is that it would
have relatively few links (say, 0-10).
Thus, you'd only have a few cache invalidations.
The NON-normal case would actually be interesting.
If an article was widely linked - but didn't exist - then it's
likely that someone finally created "the article everybody wanted".
If that's so, a small hiccup to store a widely-desired article would
be reasonable - people would be glad for the article!
In practice, I doubt there'd be many - the people who monitor the
"most requested" article lists will create the article before too many
people link to the non-existent article.
Alternatively, perhaps all those links are the creation of a
vandalizing bot - in which case, you'd like to know about it.
If an article is deleted, and MANY pages refer to it, I'd
worry - that could signal serious vandalism.
Just to be clear: I'm specifically talking about
caches of stored generated HTML in this email.
This could be done by "front end" web servers inside their
filesystem, without touching MySQL, as long as the web servers
were told when to invalidate their caches.
Say, via a separate process that simply gets told via broadcast to
"invalidate cache of article X" - it then removes the
corresponding file (there's a risk of getting "old" articles in
some circumstances - whether or not that's a problem is worth
discussion). You can even imagine, say, 4-5 front-end
webservers, with caches of articles to serve read requests,
and only talking to the database when updates occur.
A separate issue is the possibility of storing _all_ of the article
text in the filesystem, instead of MySQL.
If that's done, a cache may or may not be useful; it depends on
how long it takes to render the HTML from the wikitext.
If the rendering to HTML is sufficiently fast, it's possible that
the memory pressure from having both cached & original text will
cause the system to be less efficient than if it regenerated the
text each time. I suspect caching would still be useful, even
if the original text is in the filesystem; sendfile() and friends
are blazingly fast, while PHP working to render the HTML
simply can't be.
It'd be possible to store both the original wikitext and the
rendered HTML in the filesystem. If you did that, perhaps they
should be in separate directories,
to simplify moving to multiple front-ends if that's desired later.