Tim Starling wrote:
> Neil Harris wrote:
>> The advantage of the streaming approach is (as I
>> understand it)
>> * to eliminate the need for HEAD requests
>
> No, intermediate caches can't be relied on to do that kind of
> revalidation, only browsers can. Wikipedia sends "Cache-Control:
> private,must-revalidate" which disables intermediate caches entirely.
> The point of the streaming approach is to allow an intermediate cache at
> all, that's why it was developed concurrently with our initial
> deployment of squid.
>
> Maybe implementing the kind of revalidation you're talking about would
> be a useful step.
Ah. There's more to this than I had realised -- but I can see that it
makes sense, since Wikipedia may serve up different content to
different users for the same URL.
How about _eliminating_ this behaviour, by having two possible URLs
for each page, "/wiki/X" and "/dynwiki/X", both serving exactly the
same content as a /wiki/ URL serves now?
/wiki/ URLs would be for general readers, marked "public,
must-revalidate", and would serve the same content to every user.
/dynwiki/ URLs would be for readers who may receive content that
differs from the normal appearance sent to anons (that is, anons with
messages, and all logged-in users), and would be marked "private,
must-revalidate".
Both classes of URL would internally be rewritten to exactly the same
internal URLs, and call the same code, as at present: the difference
is that /dynwiki/ pages would be non-cacheable versions of the same
content. Effectively, the difference between the two URLs is only a
hint to any caches along the way as to whether the page is cacheable.
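
In rough pseudocode (Python purely for illustration -- none of this is
real MediaWiki code, and the function name is made up), the mapping
would be something like:

    def classify(path):
        # Both URL forms resolve to the same internal page title; the
        # only difference is the Cache-Control header we attach.
        if path.startswith('/wiki/'):
            return path[len('/wiki/'):], 'public, must-revalidate'
        if path.startswith('/dynwiki/'):
            return path[len('/dynwiki/'):], 'private, must-revalidate'
        return None, None

    # classify('/wiki/Foo')    -> ('Foo', 'public, must-revalidate')
    # classify('/dynwiki/Foo') -> ('Foo', 'private, must-revalidate')
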
When we get the conditional GET which every hit will generate, we can
work out which page to serve based on the message flag for anons, and
on the presence of user cookies for logged-in users. If you access a
/wiki/ page but should be getting /dynwiki/ content, you will be
redirected to the corresponding /dynwiki/ URL; similarly, if you
access dynamic content but should be getting the static content, you
will be redirected back to the /wiki/ URL. All of the links on a page
would use the same base URL as the transmitted page, so the dynamic
state would be "sticky", and there would not need to be many
redirects: generally, only one for each change of state from dynamic
to static or vice versa for a given user.
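
Again in illustrative Python (is_dynamic is just shorthand for "has
user cookies, or is an anon with new messages", however we end up
testing that), the routing decision is simply:

    def route(path, is_dynamic):
        on_dynamic_url = path.startswith('/dynwiki/')
        if is_dynamic and not on_dynamic_url:
            # dynamic reader on a static URL: bounce to the /dynwiki/ twin
            return 'redirect', '/dynwiki/' + path[len('/wiki/'):]
        if not is_dynamic and on_dynamic_url:
            # plain anon on a dynamic URL: bounce back to the cacheable twin
            return 'redirect', '/wiki/' + path[len('/dynwiki/'):]
        return 'serve', path    # state matches the URL, no redirect needed

    # route('/wiki/Foo', True)     -> ('redirect', '/dynwiki/Foo')
    # route('/dynwiki/Foo', False) -> ('redirect', '/wiki/Foo')
    # route('/wiki/Foo', False)    -> ('serve', '/wiki/Foo')
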
Web crawlers, and the rest of the world, will generally see only the /wiki/ URLs and
content. Only logged-in users and anons with messages would see the /dynwiki/ content.
If this works as I imagine, it would have the effect of rendering the
whole of Wikipedia cacheable for the (I would guess) 90%+ of readers
who are not logged in. Conditional GETs would still be needed for
every page, but the bulk of the data would not need to be shifted
whenever there is a cache hit. If this works, it could substantially
reduce the average number of bytes shifted per page hit.
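
The saving comes from the revalidation step itself: roughly (again
only a sketch, with both timestamps assumed to be already parsed), the
server-side check on each conditional GET is just:

    def revalidate(if_modified_since, page_last_touched):
        # If the cache's copy is still current, answer 304 with no body,
        # so only headers cross the wire; otherwise do a full render.
        if if_modified_since is not None and page_last_touched <= if_modified_since:
            return 304, b''
        return 200, None    # caller falls through to rendering the page
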
This would also have the effect of making third-world cached sites behind thin pipes far
more efficient.
It's late here, and I'm tired, and this seems too good to be true, so it probably
isn't. I'll think about it again in the morning.
-- Neil