Cc'ing xmldatadumps-l on this.
Phil Adams wrote:
> hi tomasz,
>
> phil (philadams) here from #wikimedia-tech earlier today.
>
> i'm interested in looking at user behaviour on wikipedia, so i figured
> that the en wiki stub-meta-history would be a good place to start. i
> grabbed and uncompressed the 2009 07/02 version, and started just
> exploring it a little. i had a few questions:
>
> * is this dump supposed to contain ALL revisions to each en wiki page
> (articles and user pages in particular)? i ask b/c when i look at the
> revision history for (say) AmericanSamoa, the meta dump shows only 5
> or 6 revisions for that page, spread across time from 2001 to 2007.
> the en wiki history page online
> (http://en.wikipedia.org/w/index.php?title=American_Samoa&action=history)
> shows far more edits. what am i missing?
The XML files available for download are snapshots in time of our data
set. When each snapshot runs, the stub step gets a consistent view of
our database at that exact moment. Any newer revisions will only be
available in the next run.
AmericanSamoa is showing up just as it should in the snapshot because
it's a redirect. If you take a look at
http://en.wikipedia.org/w/index.php?title=AmericanSamoa&action=history
you will notice that it has only had a handful of edits compared to
http://en.wikipedia.org/w/index.php?title=American%20Samoa&action=history
(note the space between the two words).
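If you'd rather check that sort of thing programmatically, here is a
minimal sketch against api.php (just an illustration, not an official
tool; the prop=info output carries a "redirect" key only when the page
is a redirect):

import json
import urllib.parse
import urllib.request

API = "http://en.wikipedia.org/w/api.php"

def is_redirect(title):
    # prop=info includes a "redirect" key only for redirect pages.
    url = "%s?action=query&prop=info&format=json&titles=%s" % (
        API, urllib.parse.quote(title))
    req = urllib.request.Request(url,
                                 headers={"User-Agent": "dump-check-sketch"})
    with urllib.request.urlopen(req) as resp:
        page = next(iter(json.load(resp)["query"]["pages"].values()))
    return "redirect" in page

print(is_redirect("AmericanSamoa"))   # True: it's a redirect
print(is_redirect("American Samoa"))  # False: the actual article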
>
> * is there any sort of ordering to the history dump? it appears
> nominally alphabetic, although isn't strictly alphabetic.
The ordering is by page id.
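If you want to confirm that for yourself, here is a rough sketch that
streams the stub file and checks the ordering (the first <id> child of
each <page> is the page id):

import xml.etree.ElementTree as ET

def local(tag):
    # Strip any XML namespace: "{http://...}page" -> "page".
    return tag.rsplit("}", 1)[-1]

def check_page_id_order(path):
    last_id = -1
    for _, elem in ET.iterparse(path, events=("end",)):
        if local(elem.tag) == "page":
            # The first <id> child of <page> is the page id; revision
            # ids live deeper, inside <revision> elements.
            pid = int(next(c.text for c in elem if local(c.tag) == "id"))
            assert pid > last_id, "out of order at page %d" % pid
            last_id = pid
            elem.clear()  # free the subtree; vital for multi-GB files
    return last_id

# check_page_id_order("enwiki-20090702-stub-meta-history.xml")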
>
> * if i have misunderstood the purpose of the meta dumps, but still
> wanted the same information, is my best recourse simply to d/l the
> entire en wiki dump? does that contain complete revision histories for
> all pages?
The only difference between a stub and the full-history dump is the page
content: the full dump carries the text of every revision, while the stub
omits it. If you don't need the content, they are effectively the same.
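Concretely, a stub revision's <text> element is just an empty reference,
while the full dump carries the wikitext inline. A quick sketch that
tells the two apart, assuming that stub convention:

import xml.etree.ElementTree as ET

def dump_kind(path):
    # Look at the first <text> element: a stub's has no body, a full
    # dump's carries the revision wikitext inline.
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag.rsplit("}", 1)[-1] == "text":
            return "full" if (elem.text or "").strip() else "stub"
    return "unknown"

print(dump_kind("enwiki-20090702-stub-meta-history.xml"))  # "stub"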
--tomasz
Why was the newest copy of enwiki with the full history removed from
the downloads site? I checked around and was only able to find one
place with it:
http://www.archive.org/details/enwiki-20080103
You'll want the "enwiki-20080103-pages-meta-history.xml.7z" file,
which is about 17GB. There is another file that is 130GB, but it is
the SAME data, just compressed with bz2 instead of 7z, which makes it
larger, so don't get that one.
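Also, since the uncompressed XML is enormous, you probably don't want
to expand it to disk at all. 7za can stream to stdout with -so, so you
can pipe it straight into a parser; a sketch (assumes the p7zip "7za"
binary is on your PATH):

import subprocess
import xml.etree.ElementTree as ET

# "7za e -so" extracts to stdout (listing chatter goes to stderr), so
# the uncompressed XML never has to touch the disk.
proc = subprocess.Popen(
    ["7za", "e", "-so", "enwiki-20080103-pages-meta-history.xml.7z"],
    stdout=subprocess.PIPE)
pages = 0
for _, elem in ET.iterparse(proc.stdout, events=("end",)):
    if elem.tag.rsplit("}", 1)[-1] == "page":
        pages += 1
        elem.clear()  # discard each page subtree to keep memory flat
proc.stdout.close()
proc.wait()
print(pages, "pages")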
Tomasz, I am willing to volunteer my services as a programmer to help
with the problem of producing full-history enwiki dumps, if that is
possible (I can't donate hardware or money). What issues are causing
the dumps to be so slow, and what methods are you employing to improve
them?
I know that LiveJournal has some sort of live backup system using
MySQL and Perl, but I couldn't find any details in their presentations.
You might be able to ask one of their developers for help on their LJ
blog. Can Wikimedia afford a snapshot server? It wouldn't need to be as
fast as the others.
In the long run, whatever this system is, it will probably need to be
integrated into some sort of backup, because it would be a huge pain
if something happened at the data center and you needed to restore
from the partial quasi-backups in the current systems.
How does the current dump method work? Are the dumps incremental, in
the sense that they build on previous dumps instead of re-dumping all
of the data?
For future dumps, we might have to resort to some form of snapshot
server that is fed all updates from either memcached or MySQL. That
would allow a live backup to be performed, so it would be useful for
more than just dumps.
Is it possible to suspend individual slaves temporarily during off-peak
hours to flush the database to disk and then copy the database files to
another computer? If not, we may still be able to use "stale" database
files copied to another computer, as long as we only use data from them
that is at least a few days old, so we know it has been flushed to disk
(I'm not sure how MySQL flushes the data...).
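To make the idea concrete, here is roughly what that suspend/flush/copy
cycle could look like. This is purely a sketch with made-up hostnames
and paths; the one real subtlety is that FLUSH TABLES WITH READ LOCK
only holds while the issuing connection stays open, so the copy has to
happen under that same connection:

import os
import subprocess
import MySQLdb  # the MySQL-python bindings

conn = MySQLdb.connect(read_default_file=os.path.expanduser("~/.my.cnf"))
cur = conn.cursor()
cur.execute("STOP SLAVE")                   # pause replication
cur.execute("FLUSH TABLES WITH READ LOCK")  # flush and block writes
try:
    # The data files are now quiescent for MyISAM; InnoDB would still
    # need a clean shutdown or a hot-backup tool to be truly copy-safe.
    subprocess.check_call(
        ["rsync", "-a", "/var/lib/mysql/", "snapshot-host:/backups/mysql/"])
finally:
    cur.execute("UNLOCK TABLES")
    cur.execute("START SLAVE")              # resume replication
    conn.close()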
Of course, this may all be totally off, since I don't know a lot about
the current configuration and issues, so I'll take whatever input you
have to help work on something better.
Sebastian Graf wrote:
> Hello Tomasz,
>
> thanks for your quick response.
>
> Unfortunately I am in need not only of *text* but of *English text*,
> since we are currently working on a revisioned indexer.
>
> Are there any english dumps available except the enwiki?
Yup, you can grab
enwikisource
enwikiversity
enwikinews
enwiktionary
enwikiquote
metawiki
commonswiki
--tomasz
Hello everybody,
I work in the computer science department at the University of
Konstanz in Germany. We are working on a revisioned native XML
database. Wikipedia is therefore the optimal playground when it comes
to huge amounts of data, since the XML dump is perfect for our
application.
At the moment I am looking for a new dump of the enwiki which
contains all revisions. I know that this XML has to be really huge,
but that's why we want to use it. Unfortunately I couldn't find any
file called "pages-meta-history" in the enwiki download section. Can
you help me with a dump, or an idea of how to get the data?
greetings
sebastian
--------------------------------------------------
Sebastian Graf
Distributed Systems Lab
University of Konstanz
Phone: +49 7531 88 4319
Mail: sebastian.graf(a)uni-konstanz.de
Hi,
I'm trying to get hold of the Wikipedia dump, in particular
enwiki-latest-pages-meta-history.xml.bz2.
It seems that on the page where it's supposed to be
(http://download.wikipedia.org/enwiki/latest/) it weighs in at 0.6KB,
whereas it used to be 147GB.
What happened to the data, and where did it go?
Also, on the Wikipedia database page
(http://en.wikipedia.org/wiki/Wikipedia_database) I read:
"As of January 17, 2009, it seems that all snapshots of
pages-meta-history.xml.7z hosted at http://download.wikipedia.org/enwiki/
are missing. The developers at Wikimedia Foundation are working to
address this issue
(http://lists.wikimedia.org/pipermail/wikitech-l/2009-January/040841.html).
There are other ways to obtain this file."
I checked the other ways of obtaining the file that they describe;
none worked.
Why did the dumps vanish, and how can I download a copy of them?
Thank you
Greetings,
I noticed that this enwiki dump (http://dumps.wikimedia.org/enwiki/20090520/)
was marked complete on the 25th, but in fact it is not complete: it is
missing the behemoth (pages-meta-history.xml).
bilal
While chatting with various people about data retention, the question of
keeping the bz2-compressed files of pages-meta-history.xml vs. their 7z
equivalents came up.
I'm curious about the usage of bz2 vs. 7z for the full page history. If
we can get 7za to not be a bottleneck for the build, would anyone be
crushed if we dropped support for the bz2 version?
It would be a significant space savings: for the enwiki full history,
the bz2 runs to well over 100GB where the 7z is under 20GB.
I know the initial decision to serve both was made at a time when the
availability of 7zip for multiple OSes was questionable at best. Today
there are supported releases for Windows and Linux (src) and a
fragmented but active set of OS X ports.
Thoughts?
--tomasz
Now that we are generating all but the biggest of the wikis reliably, I'd
like to start the discussion of retention for older database dumps.
If we can reliably stick to a two-week window for each wiki's dump
iteration, how many dumps back would it make sense to keep?
Most clients that I've talked to only need the latest, and simply look at
the older ones in case the newest dump failed a step.
If there are other retention cases then I'd love to hear them and figure
out what's feasible to do.
Operations-wise, I'd be thinking of keeping somewhere between one and
five of the previous dumps and then archiving a copy of each dump at
six-month intervals for permanent storage. Doing that for all of the
current dumps takes way more space than we currently have available, but
that's also why we're working on funding for those storage servers.
Is that overkill or simply not enough? Let me know.
--tomasz