Thankfully, due to an awesome volunteer, we'll be able to get that 2008
snapshot into our archive. I'll mail out when it shows up in our snail mail.
--tomasz
Erik Zachte wrote:
> I'm thrilled. Big thanks to Tim and Tomasz for pulling this off.
> For the record, the 2008-10-03 dump existed for a short while only.
> It evaporated before wikistats and many others could parse it,
> so now we can finally catch up on a 3.5 (!) year backlog.
>
> Erik Zachte
>
>> -----Original Message-----
>> From: wikitech-l-bounces(a)lists.wikimedia.org [mailto:wikitech-l-
>> bounces(a)lists.wikimedia.org] On Behalf Of Tomasz Finc
>> Sent: Thursday, March 11, 2010 4:11
>> To: Wikimedia developers; xmldatadumps-admin-l(a)lists.wikimedia.org;
>> xmldatadumps(a)lists.wikimedia.org
>> Subject: [Wikitech-l] 2010-03-11 01:10:08: enwiki Checksumming pages-
>> meta-history.xml.bz2 :D
>>
>> A new full-history enwiki snapshot is hot off the presses!
>>
>> It's currently being checksummed, which will take a while for 280GB+
>> of compressed data, but for those brave souls willing to test, please
>> grab it from
>>
>> http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-pages-meta-history.xml.bz2
>>
>> and give us feedback about its quality. This run took just over a
>> month and gained a huge speed-up after Tim's work on re-compressing ES
>> (external storage). If we see no hiccups with this data snapshot, I'll
>> start mirroring it to other locations (Internet Archive, Amazon Public
>> Data Sets, etc.).
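>>
>> For anyone testing, one way to verify the download once the checksums
>> are up would be something like the following (assuming the checksum
>> file follows the usual enwiki-20100130-md5sums.txt naming; the exact
>> name is a guess):
>>
>> wget http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-md5sums.txt
>> grep pages-meta-history.xml.bz2 enwiki-20100130-md5sums.txt | md5sum -c -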
>>
>> For those not familiar, the last successful run of this data goes all
>> the way back to 2008-10-03. That's over 1.5 years of people waiting to
>> get access to these data bits.
>>
>> I'm excited to say that we seem to have it :)
>>
>> --tomasz
Hi guys,
I know there are plans to optimize the dump engine so that it can
handle the complete English history dump. Is it going to happen any
time soon? I personally need it for research, and I've seen quite a
few people around waiting for it as well.
Thanks,
Lev
Tomasz, can you grep the old logging dumps for an upload entry of
File:Olympic Highway - Moorong.jpg (uploaded 16 Jul 2009
<http://commons.wikimedia.org/w/index.php?title=File:Olympic_Highway_-_Mooro…>)
or File:Renoir, Pierre-Auguste - The Two Sisters, On the Terrace.jpg (14
Jul 2009
<http://commons.wikimedia.org/w/index.php?title=File:Renoir,_Pierre-Auguste_…>,
not the one for 15 Jul 2009)?
Dumps prior to 20090804 are not publicly available. The objective is to
find evidence about the vanished upload log entries for those files
(bug 20744).
It'd be something like:

zcat commonswiki-200907*-pages-logging.xml.gz | grep -A 10 -B 10 Moorong.jpg
Presence of Olympic Highway - Moorong.jpg in image.sql would also be
interesting.
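A similar check against the image table dump could look like this (the
dump file name here is a guess, and note that the image table stores
titles with underscores):

zcat commonswiki-200907*-image.sql.gz | grep -c 'Olympic_Highway_-_Moorong.jpg'

A count of zero would mean the row is missing from image as well.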
> For example, I had to manually increase the number of
> threads for 7ZIP to speed it up, as you can see. It will
Sorry, I meant PIGZ :-). Fire fingers.
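For reference, pigz takes its thread count via -p, so on an 8-core box
that would look something like this (the file name is purely
illustrative):

pigz -p 8 -9 enwiki-pages-meta-history.xml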
F.
Greetings,
I am trying to import the French wiki (full-history XML) on an Ubuntu
machine with a modern quad-core CPU and 16 GB RAM. The import command is
the following:
java -Xmn256M -Xms396M -Xmx512M -XX:+DisableExplicitGC -verbose:gc
-XX:NewSize=32m -XX:MaxNewSize=64m -XX:SurvivorRatio=6 -XX:+UseParallelGC
-XX:GCTimeRatio=9 -XX:AdaptiveSizeDecrementScaleFactor=1 -jar mwdumper.jar
--format=sql:1.5 frwiki-20090810-pages-meta-history.xml.bz2 | mysql -u wiki
-p frwikiLatest
I have disabled autocommit for MySQL, disabled foreign key checks and
unique checks, and set the buffer pool size, log buffer size, and
related buffer sizes to large values, as recommended for good MySQL
performance.
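One caveat: SET autocommit=0 and friends are per-connection settings, so
to be sure they apply to the import connection itself they can be
prepended to the same stream that mwdumper feeds into mysql. A sketch:

{ echo "SET autocommit=0; SET unique_checks=0; SET foreign_key_checks=0;";
  java -jar mwdumper.jar --format=sql:1.5 frwiki-20090810-pages-meta-history.xml.bz2;
} | mysql -u wiki -p frwikiLatest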
After around 3 minutes of running the above command, I got:
6 pages (0.083/sec), 1,000 revs (13.889/sec)
8 pages (0.038/sec), 2,000 revs (9.378/sec)
13 pages (0.041/sec), 3,000 revs (9.458/sec)
The source file is on its own physical disk, and the MySQL data folder
is on another physical disk. Both disks are very fast.
Any suggestions on how to improve the speed?
Another issue is that the InnoDB tables (page, revision, text) do not
show the number of records, although the size of the tables is
non-zero. I think this might be related to the disable-keys query.
Is that correct?
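(For what it's worth, InnoDB only maintains an estimated row count, so
tools that read SHOW TABLE STATUS can report zero or fluctuating
numbers even while data is present. An exact figure needs a full count,
e.g.:

mysql -u wiki -p frwikiLatest -e "SELECT COUNT(*) FROM revision;"

which can itself be slow on a big table.)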
bilal
We've started marking redirects in each of the archive snapshots.
Starting on 2009-07-28, each history and article snapshot will contain a
<page>
  ..
  <redirect />
  <revision>
    ..
  </revision>
</page>
entry so that everyone can easily identify which articles are in fact
simply redirects.
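This makes redirects easy to pick out with a plain text match. For
example, to count them in a snapshot (a sketch; the file name is
schematic and the dump's exact whitespace may differ):

bzcat enwiki-YYYYMMDD-pages-articles.xml.bz2 | grep -c '<redirect />'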
This came as a request from Erik Zachte to further improve our stats
collection, and it has allowed us to surface user contribution stats
that are not skewed by articles lacking significant content.
--tomasz
Tomasz Finc wrote:
> Tomasz Finc wrote:
>> Brion Vibber wrote:
>>> Tomasz Finc wrote:
>>>> Looks like we aren't getting the replacement drives until Monday or
>>>> Tuesday of next week, so the array will continue to be in a degraded
>>>> state until then. Thankfully it's still under warranty, so the
>>>> turnaround won't be too bad. Tentatively putting the work to happen
>>>> on Tuesday now.
>>> We were able to put in new disks today, but the RAID array didn't
>>> fully recover. We got lots of I/O errors, and have been unable to run
>>> JFS recovery successfully so far.
>>>
>>> In the meantime we're running http://download.wikimedia.org/ off the
>>> copy of the last couple of dumps that had been copied to another server.
>>> The dump _files_ are there but currently the index is not.
>>>
>>> We're not 100% sure whether we'll be able to recover the earlier dumps
>>> or not, but of course more will be made soon enough. :)
>>>
>>> Some additional files such as the MediaWiki release download and DVD ISO
>>> downloads are still in process of being restored.
>>>
>>> -- brion
>> Thanks for the update, Brion. I'll be checking in with Rob tomorrow to
>> see how ready the new set of drives is and whether we are set to start
>> generating the snapshots anew.
>>
>> --tomasz
>>
>
> Sadly, Fred and Rob took a look at the JFS storage and were not able to
> salvage any of the existing file system. We've gone ahead and started
> clean with the archives I made last week as the seeds.
>
> There will be one more day of testing tomorrow for drive removal, and
> I expect to have the system back up and running by the end of the
> week. It should take about a week after that to get a full cycle of
> all wikis.
>
Everything has been looking really good so far, and I'm finally
comfortable starting the snapshots back up. The only bit left to do is
to test by pulling a drive, but that will have to wait till we have
RobH on site again.
We're currently running five snapshot processes, and if nothing weird
happens I'll dial it up to eight.
--tomasz