Thankfully, due to an awesome volunteer, we'll be able to get that 2008
snapshot into our archive. I'll mail out when it shows up in our snail mail.
--tomasz
Erik Zachte wrote:
> I'm thrilled. Big thanks to Tim and Tomasz for pulling this off.
> For the record, the 2008-10-03 dump existed for a short while only.
> It evaporated before wikistats and many others could parse it,
> so now we can finally catch up on a 3.5 (!) year backlog.
>
> Erik Zachte
>
>> -----Original Message-----
>> From: wikitech-l-bounces(a)lists.wikimedia.org [mailto:wikitech-l-
>> bounces(a)lists.wikimedia.org] On Behalf Of Tomasz Finc
>> Sent: Thursday, March 11, 2010 4:11
>> To: Wikimedia developers; xmldatadumps-admin-l(a)lists.wikimedia.org;
>> xmldatadumps(a)lists.wikimedia.org
>> Subject: [Wikitech-l] 2010-03-11 01:10:08: enwiki Checksumming pages-
>> meta-history.xml.bz2 :D
>>
>> A new full-history enwiki snapshot is hot off the presses!
>>
>> It's currently being checksummed, which will take a while for 280GB+
>> of compressed data, but for those brave souls willing to test, please
>> grab it from
>>
>> http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-pages-meta-history.xml.bz2
>>
>> and give us feedback about its quality. This run took just over a
>> month and gained a huge speed-up after Tim's work on re-compressing ES
>> (external storage). If we see no hiccups with this data snapshot, I'll
>> start mirroring it to other locations (Internet Archive, Amazon Public
>> Data Sets, etc.).
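>>
>> For anyone testing, one way to verify the download once the checksums
>> are up would be something like the following (assuming the checksum
>> file follows the usual enwiki-20100130-md5sums.txt naming; the exact
>> name is a guess):
>>
>> wget http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-md5sums.txt
>> grep pages-meta-history.xml.bz2 enwiki-20100130-md5sums.txt | md5sum -c -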
>>
>> For those not familiar, the last successful run of this data goes all
>> the way back to 2008-10-03. That's over 1.5 years of people waiting to
>> get access to these data bits.
>>
>> I'm excited to say that we seem to have it :)
>>
>> --tomasz
Hi guys,
I know there are plans to optimize the dump engine so that it can
handle the complete English history dump. Is it going to happen any
time soon? I personally need it for research, and I've seen quite a
few people around waiting for it as well.
Thanks,
Lev
Tomasz, can you grep the old logging dumps for an upload entry of
File:Olympic Highway - Moorong.jpg (uploaded 16 Jul 2009
<http://commons.wikimedia.org/w/index.php?title=File:Olympic_Highway_-_Mooro…>)
or File:Renoir, Pierre-Auguste - The Two Sisters, On the Terrace.jpg (14
Jul 2009
<http://commons.wikimedia.org/w/index.php?title=File:Renoir,_Pierre-Auguste_…>,
not the one for 15 Jul 2009)?
Dumps prior to 20090804 are not publicly available. The objective is to
find evidence about the vanished upload log entries for those files
(bug 20744).
It'd be something like:

zcat commonswiki-200907*-pages-logging.xml.gz | grep -A 10 -B 10 Moorong.jpg
Presence of Olympic Highway - Moorong.jpg in image.sql would also be
interesting.
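A similar check against the image table dump could look like this (the
dump file name here is a guess, and note that the image table stores
titles with underscores):

zcat commonswiki-200907*-image.sql.gz | grep -c 'Olympic_Highway_-_Moorong.jpg'

A count of zero would mean the row is missing from image as well.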
> For example, I had to manually increase the number of
> threads for 7ZIP to speed it up, as you can see. It will
Sorry, I meant PIGZ :-). Fire fingers.
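For reference, pigz takes its thread count via -p, so on an 8-core box
that would look something like this (the file name is purely
illustrative):

pigz -p 8 -9 enwiki-pages-meta-history.xml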
F.
Greetings,
I am trying to import the French wiki (full-history XML) on an Ubuntu
machine with a modern quad-core CPU and 16 GB RAM. The import command is
the following:
java -Xmn256M -Xms396M -Xmx512M -XX:+DisableExplicitGC -verbose:gc
-XX:NewSize=32m -XX:MaxNewSize=64m -XX:SurvivorRatio=6 -XX:+UseParallelGC
-XX:GCTimeRatio=9 -XX:AdaptiveSizeDecrementScaleFactor=1 -jar mwdumper.jar
--format=sql:1.5 frwiki-20090810-pages-meta-history.xml.bz2 | mysql -u wiki
-p frwikiLatest
I have disabled autocommit for MySQL, disabled foreign key checks and
unique checks, and set the buffer pool size, log buffer size, and
related buffer sizes to large values, as recommended for good MySQL
performance.
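One caveat: SET autocommit=0 and friends are per-connection settings, so
to be sure they apply to the import connection itself they can be
prepended to the same stream that mwdumper feeds into mysql. A sketch:

{ echo "SET autocommit=0; SET unique_checks=0; SET foreign_key_checks=0;";
  java -jar mwdumper.jar --format=sql:1.5 frwiki-20090810-pages-meta-history.xml.bz2;
} | mysql -u wiki -p frwikiLatest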
After around 3 minutes of running the above command, I got:
6 pages (0.083/sec), 1,000 revs (13.889/sec)
8 pages (0.038/sec), 2,000 revs (9.378/sec)
13 pages (0.041/sec), 3,000 revs (9.458/sec)
The source file is on its own physical disk, and the MySQL data folder
is on another physical disk. Both disks are very fast.
Any suggestions on how to improve the speed?
Another issue is that the InnoDB tables (page, revision, text) do not
show the number of records, although the size of the tables is
non-zero. I think this might be related to the disable-keys query.
Is that correct?
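(For what it's worth, InnoDB only maintains an estimated row count, so
tools that read SHOW TABLE STATUS can report zero or fluctuating
numbers even while data is present. An exact figure needs a full count,
e.g.:

mysql -u wiki -p frwikiLatest -e "SELECT COUNT(*) FROM revision;"

which can itself be slow on a big table.)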
bilal
We've started marking redirects in each of the archive snapshots.
Starting on 2009-07-28, each history and article snapshot will contain a
<page>
  ..
  <redirect />
  <revision>
    ..
  </revision>
</page>
entry so that everyone can easily identify which articles are in fact
simply redirects.
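This makes redirects easy to pick out with a plain text match. For
example, to count them in a snapshot (a sketch; the file name is
schematic and the dump's exact whitespace may differ):

bzcat enwiki-YYYYMMDD-pages-articles.xml.bz2 | grep -c '<redirect />'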
This came as a request from Erik Zachte to further improve our stats
collection, and it has allowed us to surface user contribution stats
that are not skewed by articles lacking significant content.
--tomasz
Tomasz Finc wrote:
> Tomasz Finc wrote:
>> Brion Vibber wrote:
>>> Tomasz Finc wrote:
>>>> Looks like we aren't getting the replacement drives until Monday or
>>>> Tuesday of next week, so the array will continue to be in a degraded
>>>> state until then. Thankfully it's still under warranty, so the
>>>> turnaround won't be too bad. Tentatively putting the work to happen
>>>> on Tuesday now.
>>> We were able to put in new disks today, but the RAID array didn't
>>> fully recover. We got lots of I/O errors, and have been unable to run
>>> JFS recovery successfully so far.
>>>
>>> In the meantime we're running http://download.wikimedia.org/ off the
>>> copy of the last couple of dumps that had been copied to another server.
>>> The dump _files_ are there but currently the index is not.
>>>
>>> We're not 100% sure whether we'll be able to recover the earlier dumps
>>> or not, but of course more will be made soon enough. :)
>>>
>>> Some additional files such as the MediaWiki release download and DVD ISO
>>> downloads are still in process of being restored.
>>>
>>> -- brion
>> Thanks for the update, Brion. I'll be checking in with Rob tomorrow to
>> see how ready the new set of drives is and whether we are set to start
>> generating the snapshots anew.
>>
>> --tomasz
>>
>
> Sadly, Fred and Rob took a look at the JFS storage and were not able to
> salvage any of the existing file system. We've gone ahead and started
> clean with the archives I made last week as the seeds.
>
> There will be one more day of testing tomorrow for drive removal, and
> I expect to have the system back up and running by the end of the
> week. It should take about a week after that to get a full cycle of
> all wikis.
>
Everything has been looking really good so far, and I'm finally
comfortable starting the snapshots back up. The only bit left to do is
to test by pulling a drive, but that will have to wait till we have
RobH on site again.
We're currently running five snapshot processes, and if nothing weird
happens I'll dial it up to eight.
--tomasz