Neil Harris wrote:
I'm not suggesting the log files should be polled in real-time: that
would be silly.
That was my understanding also. I mentioned polling in the context of
constructing log files, not reading them.
Rather, the URL invalidation file needs to be pulled
down only once per batch pre-fetch session, a total of perhaps 6 Mbytes
per day for each client site, assuming 60 chars per log entry and
100,000 edits a day. This assumes no compression: this file should gzip
extremely efficiently due to the strong repetition in both URLs and
timestamp strings, so the file might be only 2 or 3 Mbytes long if
downloaded with gzip on-the-fly compression.
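As a rough check of the figures above, here is a minimal sketch. The 60 chars per entry and 100,000 edits a day come from the estimate itself; the log line format, the timestamps, and the page URLs are made up for illustration, so the compression ratio it prints says nothing definite about the real file.

```python
import gzip
import random

CHARS_PER_ENTRY = 60      # figure quoted above
EDITS_PER_DAY = 100_000   # figure quoted above

# 60 * 100,000 = 6,000,000 bytes, i.e. about 6 Mbytes uncompressed
raw_bytes = CHARS_PER_ENTRY * EDITS_PER_DAY


def make_fake_log(n_entries, seed=0):
    """Build a synthetic invalidation log; the line format is hypothetical."""
    rng = random.Random(seed)
    lines = []
    for i in range(n_entries):
        # Hypothetical entry: a timestamp followed by the URL of the edited page.
        ts = f"2004-01-01T{(i // 3600) % 24:02d}:{(i // 60) % 60:02d}:{i % 60:02d}Z"
        url = f"http://en.wikipedia.org/wiki/Page_{rng.randrange(1_000_000)}"
        lines.append(f"{ts} {url}\n")
    return "".join(lines).encode("ascii")


log = make_fake_log(10_000)
compressed = gzip.compress(log)
ratio = len(compressed) / len(log)

print(f"uncompressed estimate: {raw_bytes / 1e6:.1f} Mbytes per day")
print(f"synthetic sample: {len(log)} -> {len(compressed)} bytes (ratio {ratio:.2f})")
```

Real URLs and timestamps may repeat more or less than this synthetic sample, so the 2 or 3 Mbytes figure remains an estimate.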
For a remote site, Wikipedia traffic will only form a small proportion
of overall traffic. An HTTP HEAD request issued by a cache to check
freshness will return a tiny fraction of the number of bytes that an
HTTP GET will. If the data is stale, a full GET will be required,
regardless of whether the cache invalidation is done on-demand at page
fetch time or by real-time streaming.
Caches use ICP, not HTTP HEAD. The only clients that use HEAD are link
checkers. Browsers use a conditional GET with an If-Modified-Since
header, carrying the Last-Modified value they received earlier.
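To illustrate the browser-style revalidation mentioned above (a GET made conditional on the page's Last-Modified time, which the client sends back in the If-Modified-Since request header), here is a minimal sketch of the server-side decision. The function name `revalidate` and its arguments are hypothetical; this is not MediaWiki's or Squid's actual code.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime


def revalidate(request_headers, last_modified):
    """Return (status, send_body) for a GET, honouring If-Modified-Since.

    last_modified is the timezone-aware time the page was last edited.
    """
    ims = request_headers.get("If-Modified-Since")
    if ims is not None:
        try:
            since = parsedate_to_datetime(ims)
        except (TypeError, ValueError):
            since = None  # unparseable date: fall through to a full response
        if since is not None:
            if since.tzinfo is None:
                # No usable zone in the header: treat the date as UTC.
                since = since.replace(tzinfo=timezone.utc)
            # HTTP dates have one-second resolution, so drop microseconds.
            if last_modified.replace(microsecond=0) <= since:
                return 304, False  # Not Modified: headers only, no body
    return 200, True  # stale copy or unconditional request: full response


edited = datetime(2004, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
header = format_datetime(edited, usegmt=True)

print(revalidate({"If-Modified-Since": header}, edited))                       # (304, False)
print(revalidate({"If-Modified-Since": header}, edited + timedelta(hours=1)))  # (200, True)
print(revalidate({}, edited))                                                  # (200, True)
```

The fresh case costs only response headers, which is the same saving the HEAD suggestion is after. The stale case costs a full response either way.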
The advantage of the streaming approach is (as I understand it)
* to eliminate the need for HEAD requests
No, intermediate caches can't be relied on to do that kind of
revalidation, only browsers can. Wikipedia sends "Cache-Control:
private,must-revalidate" which disables intermediate caches entirely.
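To show why that header shuts out intermediate caches, here is a simplified sketch of the storability check a shared cache applies. The helper names are hypothetical, the parser does not handle quoted arguments that contain commas, and a real cache checks more than these two directives.

```python
def parse_cache_control(value):
    """Split a Cache-Control header value into a {directive: argument} dict."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        # Directive names are case-insensitive; arguments are optional.
        directives[name.strip().lower()] = arg.strip().strip('"') or None
    return directives


def shared_cache_may_store(cache_control):
    """True if a shared (intermediate) cache may store the response.

    Simplified: only the "private" and "no-store" directives are considered.
    """
    d = parse_cache_control(cache_control)
    return "private" not in d and "no-store" not in d


print(shared_cache_may_store("private,must-revalidate"))  # False
print(shared_cache_may_store("public, max-age=60"))       # True
```

A browser's own cache is a private cache, so it may still store the page and revalidate it on each use.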
The point of the streaming approach is to allow an intermediate cache at
all; that's why it was developed concurrently with our initial
deployment of Squid.
Maybe implementing the kind of revalidation you're talking about would
be a useful step.
* to push some of the GET requests into the off-peak time, which is a
win providing the page is not touched between the off-peak fetch and a
user's on-peak access to the same page (but a loss if the page is edited
between then and the user access, as it simply wastes off-peak bandwidth
to no useful effect).
[...]
That's ancillary, and it hasn't been developed yet.
-- Tim Starling