Hello Herbert.
Herbert Van de Sompel wrote:
2. Let me describe the actual status and challenges faced in the
Memento plug-in work:
2.1. The plug-in detects a client's X-Accept-Datetime header, and
returns the mediawiki page that was active at the datetime specified
in the header. Same for images, actually.
2.2. Display history pages with the template that was active at
the time the history page acted as the current one. [Snip] So, we are
looking at the mediawiki code to see whether a history page, when
rendered, could itself retrieve the appropriate (old) template from
the database. If we are successful, we will share that code also at
http://www.mediawiki.org/wiki/Extension:Memento
once available. It will obviously be up to the mediawiki community
whether they are willing to adopt the proposed change to the codebase.
Obviously it's a server issue.
2.3. We have looked into another issue raised by Jakob: Display
deleted pages as they existed at the datetime expressed in
X-Accept-Datetime. We have actually implemented this. There are 2 caveats:
- as is the case with mediawiki in general, deleted pages are only
accessible by those with appropriate permissions;
- as is the case with mediawiki in general, deleted pages show up in
Edit mode.
This code will soon be included at
http://www.mediawiki.org/wiki/Extension:Memento
Deleted pages are not always shown in edit mode, since they can be
rendered (albeit not with the old templates, which would be an
interesting enhancement from your work).
It is impressive how far you have gone. However, I don't think you can
do a *complete* implementation.
First, you should be aware that timemachining the pages has been tried
in the past. The discussions about FlaggedRevs are also relevant to
your project.
FlaggedRevs is an extension which allows marking the status of a page
(eg. not vandalised) at a point in time. A naive implementation would
store the timestamp and get the old version from the archive. They ended
up storing in a table specific to the extension the page content with
templates transcluded.
However, FlaggedRevs is a tool to fight vandalism. Yours is an archival
one. You could accept imperfect results under certain circumstances.
Problematic aspects:
Page moves/image moves:
*You want to see the content of Foo at epoch, but the history now at Foo
is the wrong one. Instead you need to look at the history of the page
now at Foo_(disambiguation).
You need to follow the move logs (perhaps through many moves) to find
the real page.
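
To illustrate the move-log walk, here is a minimal sketch in Python. It
assumes the move log is available as a simple list of (timestamp, old
title, new title) entries; the Move type and current_title_of() are
hypothetical names for this illustration, not MediaWiki APIs, and it
ignores deletions and moves over existing pages.

```python
from typing import NamedTuple, Sequence


class Move(NamedTuple):
    """Hypothetical, simplified move-log entry (not the MediaWiki schema)."""
    timestamp: int   # seconds since the Unix epoch
    old_title: str
    new_title: str


def current_title_of(title: str, epoch: int, move_log: Sequence[Move]) -> str:
    """Return the title that now holds the history of the page which
    was named `title` at time `epoch`."""
    tracked = title
    # Replay, in chronological order, every move that happened after epoch.
    for move in sorted(move_log, key=lambda m: m.timestamp):
        if move.timestamp > epoch and move.old_title == tracked:
            tracked = move.new_title
    return tracked


moves = [
    Move(100, "Foo", "Foo_(disambiguation)"),
    Move(101, "Foo_(band)", "Foo"),
]
# The page called Foo at time 50 now lives at Foo_(disambiguation).
print(current_title_of("Foo", 50, moves))
```

Every request has to replay the log like this before the right history
can even be looked up, which is part of why it gets expensive.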
Page merges:
*When two pages have been merged, you will want to show the revision
which was originally at the page the user wants to timemachine. You can
no longer just rely on the timestamps. You may be able to get that by
splitting the sources at the merge time and going back via
rev_parent_id. Needless to say, this is very inefficient; this piece
wouldn't be put live on Wikipedia.
Partial undeletions:
*When a page is undeleted, the summary shows how many revisions were
undeleted, but not *which* ones.
Case:
*Page A has two edits (#1 and #2).
*A vandal adds obscene content to it (#3).
*Admin deletes the page and restores the two first revisions.
*Several months later, the page is completely deleted.
When an admin wants to view what the page looked like during those
months, an application is unable to determine whether the two revisions
which had been shown were #1 and #2 or perhaps #2 and #3.
revdelete may have similar issues.
2.4. We do not feel that all pages should necessarily be subject to
datetime content negotiation, in the same way that not all URIs are
subject to content negotiation in other dimensions. We feel that the
Special Pages fall under this category, as they do not have History.
2.5. We have ideas regarding how to address the issue raised by
Daniel: the timestamp isn't a unique identifier, multiple revisions
*might* have the same timestamp. From the perspective of Memento, a
datetime is
obviously the only "globally" recognizable value that can be used for
negotiation. If cases occur where multiple versions of a page exist
for the same second, the thing to do according to RFC 2295 would be to
return a "300 Multiple Choices", listing the URIs (and metadata) of
those versions in an Alternates header. The client then has to take it
from there.
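
A rough sketch of that decision in Python, assuming revisions carry
whole-second timestamps. The Revision type and negotiate() are
hypothetical names, the 404 for "nothing that old" is only a placeholder
choice, and building the actual Alternates header is left out.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Revision(NamedTuple):
    """Hypothetical, simplified revision record."""
    rev_id: int
    timestamp: int   # whole seconds since the Unix epoch


def negotiate(revisions: Sequence[Revision],
              requested: int) -> Tuple[int, List[int]]:
    """Pick the revision(s) active at `requested`.

    Returns (HTTP status, revision ids). Status 300 means that several
    revisions share the winning second and the client has to choose.
    """
    eligible = [r for r in revisions if r.timestamp <= requested]
    if not eligible:
        return 404, []   # placeholder status, an assumption of this sketch
    latest = max(r.timestamp for r in eligible)
    matches = sorted(r.rev_id for r in eligible if r.timestamp == latest)
    return (200 if len(matches) == 1 else 300), matches


revs = [Revision(1, 100), Revision(2, 200), Revision(3, 200), Revision(4, 300)]
print(negotiate(revs, 250))   # revisions 2 and 3 share second 200
```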
2.6. The caching issue is a general problem arising from introducing
Memento in a web that does not (yet) do Memento: when in datetime
content negotiation mode all caches between client and server (both
included) need to be bypassed. As described in our paper, we currently
address this problem by adding the following client headers:
Cache-Control: no-cache => to force cache revalidation, and
If-Modified-Since: Thu, 01 Jan 1970 00:00:00 GMT => to enforce
validation failure
We very much understand this is not elegant but it tends to work ;-) .
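
For concreteness, a small Python sketch of a client assembling those
headers. memento_headers() is a hypothetical helper, and sending
X-Accept-Datetime as an HTTP-date is an assumption made here.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict


def memento_headers(when: datetime) -> Dict[str, str]:
    """Build request headers for datetime negotiation with caches bypassed.

    `when` must be timezone-aware and in UTC; format_datetime() raises
    ValueError otherwise when usegmt=True.
    """
    return {
        # Assumption: the header value is formatted as an HTTP-date.
        "X-Accept-Datetime": format_datetime(when, usegmt=True),
        # Force intermediate caches to revalidate with the origin.
        "Cache-Control": "no-cache",
        # Any stored copy is newer than this, so validation fails.
        "If-Modified-Since": "Thu, 01 Jan 1970 00:00:00 GMT",
    }


headers = memento_headers(datetime(2009, 11, 1, 12, 0, 0, tzinfo=timezone.utc))
for name, value in headers.items():
    print(f"{name}: {value}")
```

The resulting dict can be passed as the headers argument of
urllib.request.Request or a similar HTTP client.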
The caching issue is IMHO the bigger problem in your approach using the
new header.
Disabling the cache on the request kind of works (although not in the
long term), but you also need to disable caching at the server, so that
someone requesting the current page through the same proxy (ignorant of
X-Accept-Datetime) doesn't get the cached page you were served earlier.
RFC 2145 states very clearly that "A proxy MUST forward an unknown
header", but in your case it'd have been preferable that the header
wasn't forwarded if the proxy isn't memento aware.
Which leads us to another issue, which is that it seems your server
implementation doesn't "acknowledge" memento, so given a response to an
X-Accept-Datetime request, you don't know if what you're getting is the
version you requested or the current one (because the server ignored it).
It can be as simple as requiring a Last-Modified <= X-Accept-Datetime on
Accept-Datetime responses (that would allow the server to explicitly
tell since when it is valid), but extended to all response codes.
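
On the client side, that check could look roughly like the following
Python sketch. honoured() is a hypothetical helper; it assumes both
values are HTTP-dates in GMT and treats a missing Last-Modified as "not
acknowledged".

```python
from email.utils import parsedate_to_datetime
from typing import Mapping


def honoured(response_headers: Mapping[str, str], accept_datetime: str) -> bool:
    """Return True if the response plausibly honoured X-Accept-Datetime,
    i.e. its Last-Modified is not later than the requested datetime."""
    last_modified = response_headers.get("Last-Modified")
    if last_modified is None:
        return False   # the server told us nothing, so assume it ignored us
    return (parsedate_to_datetime(last_modified)
            <= parsedate_to_datetime(accept_datetime))


requested = "Mon, 02 Nov 2009 00:00:00 GMT"
old_page = {"Last-Modified": "Sun, 01 Nov 2009 12:00:00 GMT"}
new_page = {"Last-Modified": "Tue, 03 Nov 2009 09:30:00 GMT"}
print(honoured(old_page, requested), honoured(new_page, requested))
```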