David A. Wheeler wrote:
Potential downsides:
* some articles are linked by 1000s+ of pages (like the pages linked by
  auto-generated articles). Hitting these could cause a significant
  pause as the cache is invalidated -- or will it?
No, it shouldn't be a problem in normal circumstances.
A page's cached HTML should only need to change if
the EXISTENCE of an article it links to changes. If an article is
linked to by 1000s+ of pages, it almost certainly already exists,
so no caches (other than that of the one edited article) would be
invalidated.
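To make the rule concrete, here's a rough sketch - just my own
illustration, not actual MediaWiki code - of which caches would need
invalidating; the function name and the links_here parameter are
assumptions:

def pages_to_invalidate(title, change, links_here):
    # links_here: titles of pages linking to `title` (assumed to come
    # from something like a link table).
    if change == "edit":
        # An ordinary edit only affects the edited page's own cached HTML.
        return [title]
    if change in ("create", "delete"):
        # Creation or deletion changes how links to this page render,
        # so the linking pages' cached HTML must be regenerated too.
        return [title] + list(links_here)
    return []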
By "hitting" in this case, I meant "deleting or creating".
Sorry for the
ambiguity.
Take a look at something like [[census]], which is linked by about
36,000 articles.
If an article is created (and didn't exist before), or completely
deleted from the database, I think the usual case is that it would
have relatively few links (say, 0-10).
Thus, you'd only have a few cache invalidations.
The NON-normal case would actually be interesting.
If an article was widely linked - but didn't exist - then it's
likely that someone finally created "the article everybody wanted".
If that's so, a small hiccup to store a widely-desired article would
be reasonable - people would be glad for the article!
In practice, I doubt there'd be many - the people who monitor the
"most requested" article lists will create the article before too many
people link to the non-existent article.
Alternatively, perhaps all those links were created by a
vandalizing bot - in which case, you'd like to know about it.
If an article is deleted, and MANY pages refer to it, I'd
worry - that could signal serious vandalism.
I agree. This is interesting to think about.
Just to be clear: I'm specifically talking about
caches of stored generated HTML in this email.
These caches could be held by "front end" web servers in their own
filesystems, without touching MySQL, as long as the web servers
were told when to invalidate them.
Say, via a separate process that simply gets told via broadcast to
"invalidate cache of article X" - it then removes the
corresponding file (there's a risk of getting "old" articles in
some circumstances - whether or not that's a problem is worth
discussing). You can even imagine, say, 4-5 front-end
webservers with caches of articles to serve read requests,
talking to the database only when updates occur.
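Here's a minimal sketch of what such an invalidation listener could
look like - purely illustrative; the UDP broadcast, the port number,
and the cache directory are all assumptions of mine, not anything in
the current code:

import os
import socket

CACHE_DIR = "/var/cache/wiki-html"   # assumed location of rendered HTML files
PORT = 4040                          # assumed invalidation port

def cache_path(title):
    # One file per article; real code would have to sanitize titles properly.
    return os.path.join(CACHE_DIR, title.replace("/", "_") + ".html")

def listen_for_invalidations():
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))
    while True:
        data, _addr = sock.recvfrom(4096)
        title = data.decode("utf-8", "replace").strip()
        try:
            os.unlink(cache_path(title))   # drop the stale copy
        except OSError:
            pass                           # nothing cached; nothing to do

if __name__ == "__main__":
    listen_for_invalidations()

Each front-end would run something like this next to its web server,
and the editing server would broadcast the title whenever a page is
created or deleted (or edited, for that page's own cache).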
This is how I think things will end up eventually: but there are lots of
data integrity and race condition / transaction issues to be thought
about before any of this can be implemented. Let's finish splitting the
system into two machines, DB and WWW, before any re-architecture is
performed.
A separate issue is the possibility of storing _all_ of the article
text in the filesystem, instead of MySQL.
If that's done, a cache may or may not be useful; it depends on
how long it takes to render the HTML from the wikitext.
If the rendering to HTML is sufficiently fast, it's possible that
the memory pressure from having both cached & original text will
cause the system to be less efficient than if it regenerated the
text each time. I suspect caching would still be useful, even
if the original text is in the filesystem; sendfile() and friends
are blazingly fast, while PHP working to render the HTML
simply can't be.
It'd be possible to store both the original wikitext and the
rendered HTML in the filesystem. If you did that, perhaps they
should be in separate directories,
to simplify moving to multiple front-ends if that's desired later.
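Roughly, the read path I'm picturing would look something like the
sketch below - an illustration only; the directory names and the
render_html() stand-in are invented, not how the PHP code actually
works:

import os

WIKITEXT_DIR = "/var/lib/wiki/wikitext"    # assumed: raw article source
HTML_CACHE_DIR = "/var/cache/wiki-html"    # assumed: rendered output

def render_html(wikitext):
    # Stand-in for the (expensive) wikitext-to-HTML rendering step.
    return "<pre>" + wikitext + "</pre>"

def serve_article(title):
    html_file = os.path.join(HTML_CACHE_DIR, title + ".html")
    if os.path.exists(html_file):
        # Cache hit: a web server could ship this file directly
        # (this is where sendfile() would shine).
        with open(html_file, encoding="utf-8") as f:
            return f.read()
    # Cache miss: render from the stored wikitext and fill the cache.
    with open(os.path.join(WIKITEXT_DIR, title + ".txt"), encoding="utf-8") as f:
        wikitext = f.read()
    html = render_html(wikitext)
    with open(html_file, "w", encoding="utf-8") as f:
        f.write(html)
    return html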