Miranche wrote:
Greetings Wikitechies,
I'm working on a research project on Wikipedia, and I'd like to create or obtain
a historical snapshot of Wikipedia on or about a given date. I'm familiar
enough with mediawiki that I could hack my own script recreating the contents of
the "cur" table from the corresponding history (e.g. by calling
getRevisionText() a couple of 10^5 times). However, since I'd hate to reinvent
the wheel, I'd appreciate it if you could let me know if this has been done
before, if there are
archives of old snapshots, or if there's an easier way to approach it
technically.
I seem to remember someone looking into this, but I don't know if they
completed it.
There are a few difficulties which mean you can't produce a completely
accurate past snapshot from a recent dump, in particular:
* Pages which have been renamed may appear at different locations; it's
difficult to track back to what the title was at a given past time.
* Pages which have since been deleted will not appear at all.
* In rarer cases, histories have been merged after being accidentally
separated during editing, or individual revisions have been removed due
to copyright or other legalish issues.
* Image file uploads suffer similarly; there are no renames there, but
older versions may be deleted (usually for vandalism).
You can however extract a reasonable approximation of the contents of
the wiki at a given time, depending on your needs.
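As a rough illustration of that approximation, here is a minimal Python
sketch that picks, for each page, the newest revision at or before a cutoff.
It is not MediaWiki code: the Revision record and snapshot_at() are
hypothetical names, and it assumes you have already pulled the revisions
out of a dump into memory with timestamps as 14-digit YYYYMMDDHHMMSS
strings, which sort correctly as plain text. It inherits all the caveats
above: titles are whatever the dump records, and deleted pages or removed
revisions are simply absent.

```python
from typing import Dict, Iterable, NamedTuple


class Revision(NamedTuple):
    """Hypothetical in-memory record for one revision taken from a dump."""
    page_id: int
    title: str       # title as recorded in the dump, not the historical title
    timestamp: str   # assumed format: YYYYMMDDHHMMSS, UTC
    text: str


def snapshot_at(revisions: Iterable[Revision], cutoff: str) -> Dict[int, Revision]:
    """Return, per page_id, the newest revision with timestamp <= cutoff."""
    latest: Dict[int, Revision] = {}
    for rev in revisions:
        if rev.timestamp > cutoff:
            continue  # revision was made after the snapshot date
        best = latest.get(rev.page_id)
        if best is None or rev.timestamp > best.timestamp:
            latest[rev.page_id] = rev
    return latest
```

Pages whose first surviving revision is later than the cutoff are left out
of the result, which is about the best you can do for pages that did not
exist yet.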
-- brion vibber (brion @ pobox.com)