Hi,
I'm trying to get hold of the Wikipedia dump, in particular
enwiki-latest-pages-meta-history.xml.bz2.
On the page where it's supposed to be
(http://download.wikipedia.org/enwiki/latest/) it's only 0.6 KB,
whereas it used to be around 147 GB.
What happened to the data, and where did it go?
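For reference, one way to see what the server reports without pulling anything
down is an HTTP HEAD request; a rough Python sketch (the URL is the one above,
everything else is just illustrative):

import urllib.request

url = ("http://download.wikipedia.org/enwiki/latest/"
       "enwiki-latest-pages-meta-history.xml.bz2")

# Ask for headers only and look at Content-Length.
req = urllib.request.Request(url, method="HEAD")
with urllib.request.urlopen(req) as resp:
    print("Content-Length:", resp.headers.get("Content-Length"), "bytes")
    # A full-history dump should be on the order of 100+ GB compressed;
    # a few hundred bytes suggests a placeholder or error page instead.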
Also, on the Wikipedia database page
(http://en.wikipedia.org/wiki/Wikipedia_database) I read:
"As of January 17 </wiki/January_17>, 2009 </wiki/2009>, it seems that
all snapshots of pages-meta-history.xml.7z hosted
at http://download.wikipedia.org/enwiki/ are missing. The developers at
Wikimedia Foundation are working to address this issue
(http://lists.wikimedia.org/pipermail/wikitech-l/2009-January/040841.html).
There are other ways to obtain this file"
I checked the other ways of obtaining the file that they describe; none
worked.
Why did the dumps vanish, and how can I download a copy of them?
Thank you
Greetings,
I noticed that this enwiki dump (http://dumps.wikimedia.org/enwiki/20090520/)
was marked as completed on the 25th, but in fact it is not complete: it is
missing the behemoth (pages-meta-history.xml).
bilal
While chatting with various people about data retention, the question came up
of whether to keep the bz2-compressed files of pages-meta-history.xml
alongside their 7z equivalents.
I'm curious about the usage of bz2 vs. 7z for the full page history. If
we can keep 7za from being a bottleneck in the build, would anyone be
crushed if we dropped support for the bz2 version?
It would save a significant amount of space.
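For context on how people typically consume the two formats: bz2 can be
stream-decompressed straight from most standard libraries, while 7z usually
means piping the external 7za binary to stdout. A rough Python sketch of both
paths (the file names are just illustrative, not actual dump names):

import bz2
import subprocess

# Reading the bz2 dump: the standard library streams the decompression.
with bz2.open("enwiki-pages-meta-history.xml.bz2", "rt", encoding="utf-8") as f:
    titles = 0
    for line in f:
        if "<title>" in line:
            titles += 1

# Reading the 7z dump: no stdlib support, so pipe "7za e -so" (extract to stdout).
proc = subprocess.Popen(
    ["7za", "e", "-so", "enwiki-pages-meta-history.xml.7z"],
    stdout=subprocess.PIPE,
)
for raw in proc.stdout:
    line = raw.decode("utf-8", errors="replace")
    # hand each line to the same parsing as above
proc.wait()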
I know the initial decision to serve both was made at a time when the
availability of 7zip for multiple OSes was questionable at best. Today
there are supported releases for Windows and Linux (src) and a
fragmented but active set of OS X ports.
Thoughts?
--tomasz
Now that we are reliably generating dumps for all but the biggest wikis, I'd
like to start a discussion about retention of older database dumps.
If we can reliably stick to a two-week window for each wiki's dump
iteration, how many dumps back would it make sense to keep?
Most clients that I've talked to only need the latest dump and only fall back
to an older one if the newest dump failed a step.
If there are other retention cases, I'd love to hear them and figure out
what's feasible to do.
Operations-wise, I'd be thinking of keeping somewhere between one and five of
the previous dumps, and archiving a copy of each dump at six-month intervals
for permanent storage. Doing that for all of the current dumps takes far more
space than we currently have available, but that's also why we're working on
funding for those storage servers.
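To make the proposal concrete, here is a rough sketch of the pruning logic I
have in mind (the directory layout and thresholds are just assumptions,
nothing that's actually deployed): keep the last few runs outright, plus one
run per six-month window.

import os
from datetime import datetime

DUMP_ROOT = "/data/dumps/enwiki"   # hypothetical layout: one YYYYMMDD dir per run
KEEP_RECENT = 5                    # keep this many of the newest runs outright

def prune_plan(root=DUMP_ROOT, keep_recent=KEEP_RECENT):
    runs = sorted(d for d in os.listdir(root) if d.isdigit() and len(d) == 8)
    keep = set(runs[-keep_recent:])

    # Also keep the first run seen in each six-month window for permanent storage.
    seen_windows = set()
    for d in runs:
        dt = datetime.strptime(d, "%Y%m%d")
        window = (dt.year, 0 if dt.month <= 6 else 1)
        if window not in seen_windows:
            seen_windows.add(window)
            keep.add(d)

    delete = [d for d in runs if d not in keep]
    return sorted(keep), delete

if __name__ == "__main__":
    keep, delete = prune_plan()
    print("keep:", keep)
    print("candidates for deletion:", delete)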
Is that overkill or simply not enough? Let me know.
--tomasz