Jimmy Wales wrote:
Erik Moeller wrote:
Nice to see you here -- I enjoyed reading your Linux/OSS-related papers.
I have to say that disabling link checking on the live Wikipedia, even
for a short time, is hardly acceptable. It is essential for both readers
and authors.
I know I expressed sympathy for disabling link checking yesterday or the
day before, but several developers have come out against it, and I'm now
swayed to the view that it's an essential, not a frill.
I now wonder about our usage patterns and whether it would be best to
update a cache at edit-time, rather than at read-time.
Changes in 'existence' (yes or no) come up infrequently. When someone
creates a brand new article, all other cached articles that are
affected by the change could be updated at that time. So a hundred
times a day (if that), we have to do a fancy cache update to change
the other affected articles. But most edits don't affect other articles,
because they are edits to already existing pages.
More than that, you don't have to re-generate the cached pages; you only
have to invalidate them.
Thus, when a page is created, you have to "touch" all pages that link to
the previously nonexistent article, and when a page is deleted, you do
the same to all pages that link to that article.
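As a rough illustration of the invalidate-on-existence-change idea, here
is a minimal in-memory sketch. The class and method names (LinkCache,
record_links, existence_changed) are made up for this example and are
not taken from the wiki code; a real version would presumably keep the
reverse link index in the database.

```python
from collections import defaultdict


class LinkCache:
    """Toy model: rendered-page cache plus a reverse link index."""

    def __init__(self):
        self.rendered = {}                   # title -> cached HTML
        self.linked_from = defaultdict(set)  # target title -> titles linking to it

    def record_links(self, source, targets):
        """Remember that `source` contains a link to each title in `targets`."""
        for target in targets:
            self.linked_from[target].add(source)

    def store(self, title, html):
        """Cache the rendered HTML for `title`."""
        self.rendered[title] = html

    def existence_changed(self, title):
        """Call when `title` is created or deleted.

        Drops the cached HTML of every page linking to `title` and returns
        the titles that were actually invalidated. Nothing is re-rendered.
        """
        invalidated = []
        for source in sorted(self.linked_from.get(title, ())):
            if self.rendered.pop(source, None) is not None:
                invalidated.append(source)
        return invalidated
```

Pages that do not link to the new or deleted title keep their cached
copy, which is the common case for ordinary edits.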
This is good, because page access patterns are so sparse: many pages go
days or weeks without being accessed, but traffic is high because there
are so many pages in aggregate.
By not updating articles until they are accessed, you can defer a lot of
work that would otherwise bog the system down at update time. Lazy
evaluation is much nicer: when the page is demanded, the code should
first look for a cached page, and generate it if necessary, before using
that data to generate the article output.
Note that editing the page itself would go through the same code-path:
just store the new content, invalidate the cached page, and then serve
the page, forcing a re-render.
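The lazy path could look something like the sketch below. This is only
an illustration under assumptions: LazyPageStore and the render callable
are hypothetical names, and the cache is a plain dict rather than
anything persistent.

```python
class LazyPageStore:
    """Toy model: render on demand, invalidate on edit."""

    def __init__(self, render):
        self.render = render   # hypothetical function: wikitext -> HTML
        self.source = {}       # title -> wikitext
        self.cache = {}        # title -> rendered HTML
        self.render_count = 0  # how many times we actually rendered

    def view(self, title):
        """Serve from cache if possible; otherwise render and cache."""
        html = self.cache.get(title)
        if html is None:
            html = self.render(self.source[title])
            self.cache[title] = html
            self.render_count += 1
        return html

    def invalidate(self, title):
        """Drop the cached copy; the next view pays for the re-render."""
        self.cache.pop(title, None)

    def edit(self, title, text):
        """Same code path as a read: store, invalidate, then serve."""
        self.source[title] = text
        self.invalidate(title)
        return self.view(title)
```

An invalidated page that nobody requests for weeks costs nothing beyond
the invalidation itself; the render happens only in view().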
What can be cached:
* wiki parsing
* link lookup
* article content HTML generation
What can't be cached:
* page skin (changes per user)
* user details (ditto)
* menu links (ditto)
* things like {{NUMBEROFARTICLES}}
However, applying these as a final pass should be much cheaper than
current page serving.
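To make the "final pass" concrete, here is a small sketch assuming the
cached article body keeps {{NUMBEROFARTICLES}} as a literal placeholder
and the per-user parts are wrapped around it at serve time. The function
name, its parameters, and the HTML wrapper are invented for illustration
and say nothing about how the skins actually work.

```python
import html


def final_pass(cached_body, skin, username, article_count):
    """Apply the uncacheable parts to an already-rendered article body."""
    # Site-wide variable: substituted on every request, never cached.
    body = cached_body.replace("{{NUMBEROFARTICLES}}", str(article_count))
    # Per-user parts: skin and user details wrapped around the cached body.
    user = html.escape(username)
    return f'<div class="skin-{skin}"><p class="user">{user}</p>{body}</div>'
```

This is one string replace and one string format per request, with no
parsing and no link lookups.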
Potential downsides:
* some articles are linked from thousands of pages (like the pages
linked by auto-generated articles). Hitting these could cause a
significant pause as the cache is invalidated -- or will it?
-- Neil