After a very informative exchange with Tim Starling, I've thought a bit
more about my proposal last night about making Wikipedia cacheable,
which in the light of day seems excessively complex. Here's a simpler
version:
At the moment, according to Tim, all Wikipedia pages are served as
uncacheable pages, thus preventing any intermediate proxy caches from
caching them -- a quick packet dump shows that they are served with
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
whether or not I am logged in. Clearly, if Wikipedia content was
cacheable, there would be massive bandwidth gains, but the current
policy is designed to prevent caches serving out-of-date content, or the
same page to anons and logged-in users.
I'd be interested in the effects if we were to serve most ordinary pages
with
Cache-Control: public, must-revalidate
with (say) a max-age of a week, except for the following three cases,
which need different or rapidly-changing data (I'll call this "dynamic
content") served to different users for the same URL:
(a) pages for logged-in users
(b) pages for anon users who have a pending message
(c) pages with auto-generated dynamic content (Special: pages, and any
others with similar behaviour)
which would be served with the anti-caching cache control header as before.
Since all pages would be must-revalidate, the Wikimedia cluster would
still get a conditional GET request per hit, so that it could check
freshness, then decide which header to generate, based on source IP and
any user cookies. The twist would be that the page would be reported as
outdated by the server if _either_ it had been changed since the cache
stored it, _or_ dynamic content was needed, thus serving the desired
dynamic content to those users who need it, whilst preventing that
content from being cached for other users.
Since 95%+ of all hits are presumably from anons without pending
messages, this should, in an ideal world, result in a very large number
of pages being successfully served by hits on ISPs' proxy caches,
without stopping dynamic content from being served to those users who
need it, or affecting the freshness of pages for anons.
The hit rate would not be quite as high as it could be, since every hit
from a dynamic-content user would "wash out" any static version of the
page in question from the cache; but since these users would account
for only about 1 in 20 page accesses, roughly 19 out of 20 requests
would still be cache hits.
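That wash-out effect can be sanity-checked with a toy single-page
simulation. The model and its names are mine, not anything Wikipedia
actually does: I assume a dynamic (private) response evicts the proxy's
stored copy, and a static request counts as a hit only if the copy is
still there.

```python
import random

def simulate(n_requests, p_dynamic, seed=0):
    """Toy model: one page, one shared proxy cache.

    A 'dynamic' request (logged-in user, pending message, Special: page)
    gets Cache-Control: private, which we assume evicts any cached
    public copy.  A static request is a cache hit (304 from the origin,
    body served from the cache) only if the public copy is still there.
    """
    rng = random.Random(seed)
    cached = False
    hits = 0
    for _ in range(n_requests):
        if rng.random() < p_dynamic:
            cached = False   # private response washes out the stored copy
        elif cached:
            hits += 1        # conditional GET -> 304, body from the cache
        else:
            cached = True    # full fetch; the proxy stores the page
    return hits / n_requests

# With ~1 in 20 dynamic hits, roughly 90% of bodies still come from the
# cache (a request hits only when the previous request was also static).
rate = simulate(200_000, 0.05)
```

Under this model the expected hit rate is about (19/20)^2, a little
below the naive 19-in-20 figure, because each wash-out also turns the
next static request into a miss.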
I'd be interested to hear what others think. Is there an obvious flaw in
my reasoning? Is this worth a try?
-- Neil
-------------------------------------------------------------
Pseudo-code:
if (logged_in_user) or (user_has_messages) or (special_page):
    say has changed; serve with Cache-Control: private, must-revalidate
else:
    if (modification_date > if_modified_since_date):
        say has changed; serve with Cache-Control: public, must-revalidate
    else:
        say has not changed
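For concreteness, the same decision logic as a runnable Python sketch;
the function and parameter names are my own illustration, not
MediaWiki's actual interface:

```python
from datetime import datetime, timezone

def cache_decision(logged_in, has_messages, special_page,
                   modified, if_modified_since):
    """Pick an HTTP status and Cache-Control header for one request.

    Dynamic content (logged-in user, pending message, Special: page) is
    always regenerated and marked private; everything else is public
    but must be revalidated with the origin on every hit.
    """
    if logged_in or has_messages or special_page:
        return 200, "private, must-revalidate"
    if if_modified_since is None or modified > if_modified_since:
        return 200, "public, must-revalidate"
    return 304, "public, must-revalidate"

# Example: an anon with no messages, page unchanged since the cache
# stored it -> the origin answers 304 and the proxy serves its copy.
modified = datetime(2004, 1, 1, tzinfo=timezone.utc)
cached_at = datetime(2004, 1, 2, tzinfo=timezone.utc)
status, cc = cache_decision(False, False, False, modified, cached_at)
```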