Tim Starling wrote:
> Neil Harris wrote:
>> The advantage of the streaming approach is (as I
>> understand it)
>> * to eliminate the need for HEAD requests
>
> No, intermediate caches can't be relied on to do that kind of
> revalidation, only browsers can. Wikipedia sends "Cache-Control:
> private,must-revalidate" which disables intermediate caches entirely.
> The point of the streaming approach is to allow an intermediate cache at
> all, that's why it was developed concurrently with our initial
> deployment of squid.
>
> Maybe implementing the kind of revalidation you're talking about would
> be a useful step.
Ah. There's more to this than I had realised -- but I can see that it
makes sense, since Wikipedia may serve up different content to
different users for the same URL.
How about _eliminating_ this behaviour, by having two possible URLs
for each page, "/wiki/X" and "/dynwiki/X", both serving exactly the
same content as a /wiki/ URL serves now?
/wiki/ URLs would be for general readers, marked "public,
must-revalidate", and would serve the same content to every user.
/dynwiki/ URLs would be for readers who may receive content that
differs from the normal appearance sent to anons (that is, anons with
messages, and all logged-in users), and would be marked "private,
must-revalidate".
Both classes of URL would internally be rewritten to exactly the same
internal URLs, and call the same code, as at present: the difference
is that /dynwiki/ pages would be non-cacheable versions of the same
content. Effectively, the difference between the two URLs is only a
hint to any caches along the way as to whether the page is cacheable.
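
In rough pseudocode (Python purely for illustration -- none of this is
real MediaWiki code, and the function name is made up), the mapping
would be something like:

    def classify(path):
        # Both URL forms resolve to the same internal page title; the
        # only difference is the Cache-Control header we attach.
        if path.startswith('/wiki/'):
            return path[len('/wiki/'):], 'public, must-revalidate'
        if path.startswith('/dynwiki/'):
            return path[len('/dynwiki/'):], 'private, must-revalidate'
        return None, None

    # classify('/wiki/Foo')    -> ('Foo', 'public, must-revalidate')
    # classify('/dynwiki/Foo') -> ('Foo', 'private, must-revalidate')
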
When we get the conditional GET which every hit will generate, we can
work out which page to serve based on the message flag for anons, and
on the presence of user cookies for logged-in users. If you access a
/wiki/ page but should be getting /dynwiki/ content, you will be
redirected to the corresponding /dynwiki/ URL; similarly, if you
access dynamic content but should be getting the static content, you
will be redirected back to the /wiki/ URL. All of the links on a page
would use the same base URL as the transmitted page, so the dynamic
state would be "sticky", and there would not need to be many
redirects: generally, only one for each change of state from dynamic
to static or vice versa for a given user.
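
Again in illustrative Python (is_dynamic is just shorthand for "has
user cookies, or is an anon with new messages", however we end up
testing that), the routing decision is simply:

    def route(path, is_dynamic):
        on_dynamic_url = path.startswith('/dynwiki/')
        if is_dynamic and not on_dynamic_url:
            # dynamic reader on a static URL: bounce to the /dynwiki/ twin
            return 'redirect', '/dynwiki/' + path[len('/wiki/'):]
        if not is_dynamic and on_dynamic_url:
            # plain anon on a dynamic URL: bounce back to the cacheable twin
            return 'redirect', '/wiki/' + path[len('/dynwiki/'):]
        return 'serve', path    # state matches the URL, no redirect needed

    # route('/wiki/Foo', True)     -> ('redirect', '/dynwiki/Foo')
    # route('/dynwiki/Foo', False) -> ('redirect', '/wiki/Foo')
    # route('/wiki/Foo', False)    -> ('serve', '/wiki/Foo')
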
Web crawlers, and the rest of the world, will generally see only the /wiki/ URLs and
content. Only logged-in users and anons with messages would see the /dynwiki/ content.
If this works as I imagine, it would have the effect of rendering the
whole of Wikipedia cacheable for the (I would guess) 90%+ of readers
who are not logged in. Conditional GETs would still be needed for
every page, but the bulk of the data would not need to be shifted
whenever there is a cache hit. If this works, it could substantially
reduce the average number of bytes shifted per page hit.
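
The saving comes from the revalidation step itself: roughly (again
only a sketch, with both timestamps assumed to be already parsed), the
server-side check on each conditional GET is just:

    def revalidate(if_modified_since, page_last_touched):
        # If the cache's copy is still current, answer 304 with no body,
        # so only headers cross the wire; otherwise do a full render.
        if if_modified_since is not None and page_last_touched <= if_modified_since:
            return 304, b''
        return 200, None    # caller falls through to rendering the page
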
This would also have the effect of making third-world cached sites behind thin pipes far
more efficient.
It's late here, and I'm tired, and this seems too good to be true, so it probably
isn't. I'll think about it again in the morning.
-- Neil