Chad and I have been playing around with an SVN->Git conversion of
MediaWiki. After running into some odd issues with git-svn, and since
it takes around 3 weeks to do a complete git-svn import (with
branches) of the MediaWiki SVN repository, I'd like to get access to
`svnadmin dump' output as run on mayflower.
Are these dumps perhaps being made already as part of some backup procedure
but not published? If there's interest, I'd like to write any
scripts / hacks required to get SVN dumps up on
http://download.wikimedia.org/
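For reference, a minimal sketch of what such a server-side job could look like (a hedged example: the repository path and revision range below are placeholders, not the actual layout on mayflower):

  # Full dump of the repository, compressed on the fly
  svnadmin dump /svnroot/mediawiki | gzip > mediawiki-full.svndump.gz

  # Later runs could dump only new revisions incrementally
  svnadmin dump /svnroot/mediawiki --incremental -r 60001:HEAD \
      | gzip > mediawiki-incr.svndump.gz

A consumer could then `svnadmin load' the dump into a local repository and point git-svn at a file:// URL, which keeps the weeks-long import off the public server.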
Hi folks,
We've been accepted again for another Google Summer of Code! What this
means:
* Mentors: please go to this page to formally apply to be a mentor:
http://socghop.appspot.com/gsoc/mentor/request/google/gsoc2010/wikimedia
Note: you can't officially be a mentor until you do this, and we can't do it
for you (part of it involves agreeing to the mentor agreement).
Question for the group: how many student slots do you think we should
request? The "advice for mentors" page says:
"A good rule of thumb when finding and assigning mentors is to have two
mentors per student. It is also a good idea to have a spare mentor or two
who can pay attention to many students and keep track of the big picture."
Given our current list of mentors (we have 9 listed, plus 1 "maybe"), that
would give us 4 as the number of slots. Does that seem like a number
that's low enough that we can be reasonably confident we'll do a good
job mentoring, yet high enough that we're not selling ourselves short?
* Students: it's not yet formally time to apply, but now is a really
good time to start brainstorming ideas and getting clarification on what's
already been suggested:
http://www.mediawiki.org/wiki/Summer_of_Code_2010
While you may be tempted (from a competitive perspective) not to reveal what
your ideas are early, it is almost certainly going to be to your benefit to
engage now. By "engage", I mean "demonstrate that you're really thinking
about how to improve MediaWiki and other Wikimedia project technologies, and
have the wherewithal to do it", not merely "impress us with what skills you
have". The more specific and thoughtful your ideas, questions, and
suggestions are, the more comfortable we'll all feel in selecting you.
You might want to take a peek at the GSoC student agreement now, since
you'll be required to agree to it as a precondition for being part of this
year's program:
http://socghop.appspot.com/document/show/gsoc_program/google/gsoc2010/stude…
Rob
Date: Wed, 17 Mar 2010 15:15:24 +0100
From: Platonides <Platonides(a)gmail.com>
Subject: Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
To: wikitech-l(a)lists.wikimedia.org
Message-ID: <hnqo49$itc$1(a)dough.gmane.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Jamie Morken wrote:
> Also I wonder if it is possible to convert from 7z to bz2 without having
> to make the 5469GB file first? If this can be done then having only 7z
> files would be fine, as the bz2 file could be created with a "normal"
> PC (ie one without a 6TB+ harddrive). This would be a good solution,
> but not sure if it can be done. If it could though, might as well get
> rid of all the large wikis' bz2 pages-meta-history files!
Sure.
7z e -so DatabaseDump.7z | bzip2 -9 > DatabaseDump.bz2
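The same streaming approach works when you just want to consume the data rather than re-compress it; for instance, a rough way to check that the XML is well-formed without ever writing the 5469GB file (a sketch, assuming xmllint from libxml2 is available):

  # Decompress to stdout and stream-parse; nothing touches the disk
  7z e -so DatabaseDump.7z | xmllint --stream --noout -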
Hi,
Thanks for the info, I think 7z is the way to go :)
cheers,
Jamie
About 40% of our text storage has been recompressed into
DiffHistoryBlob format, which uses a combination of binary diffs and
gzip to reduce storage space.
Approximately 1.9TB of text storage, mostly revisions compressed
individually with gzip, was recompressed to about 140GB, a saving of 93%.
-- Tim Starling
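That's not the actual DiffHistoryBlob code, but the underlying idea is easy to sketch: keep one revision in full, store the rest as diffs against their predecessor, and compress the whole bundle (a toy illustration with made-up filenames, using plain text diffs where MediaWiki uses binary diffs):

  # rev1.txt rev2.txt rev3.txt: consecutive revisions of one page
  mkdir blob
  cp rev1.txt blob/base.txt                # first revision kept in full
  diff rev1.txt rev2.txt > blob/r2.diff    # later revisions stored as diffs
  diff rev2.txt rev3.txt > blob/r3.diff
  tar czf page-history.tar.gz blob/        # compress the bundle

Consecutive revisions are usually near-identical, so the diffs are tiny and compress extremely well, which is where savings on the order of 93% come from.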
On top of that, for some of us outside the USA (and even with a good connection to the EU research network), the download process takes, so to speak, rather more time than expected (and is prone to errors as the file gets larger).
So, another +1 for replacing bzip with 7zip.
F.
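Kevin's 7x decompression figure quoted below is easy to sanity-check locally; a rough sketch, assuming both tools are installed and the same dump exists in both formats:

  # Time a single-threaded decompression of each format, discarding the output
  time bzip2 -dc pages-meta-history.xml.bz2 > /dev/null
  time 7z e -so pages-meta-history.xml.7z > /dev/null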
--- On Tue, 16/3/10, Kevin Webb <kpwebb(a)gmail.com> wrote:
> From: Kevin Webb <kpwebb(a)gmail.com>
> Subject: Re: [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
> To: "Lev Muchnik" <levmuchnik(a)gmail.com>
> CC: "Wikimedia developers" <wikitech-l(a)lists.wikimedia.org>, xmldatadumps-admin-l(a)lists.wikimedia.org, Xmldatadumps-l(a)lists.wikimedia.org
> Date: Tuesday, 16 March 2010, 22:35
> Yeah, same here. I'm totally fine with replacing bzip with 7zip as the
> primary format for the dumps. Seems like it solves the space and speed
> problems together...
>
> I just did a quick benchmark and got a 7x improvement on decompression
> speed using 7zip over bzip using a single core, based on actual dump
> data.
>
> kpw
>
> On Tue, Mar 16, 2010 at 4:54 PM, Lev Muchnik <levmuchnik(a)gmail.com> wrote:
> >
> > I am entirely for 7z. In fact, once released, I'll be able to test the XML
> > integrity right away - I process the data on the fly, without unpacking it
> > first.
> >
> > On Tue, Mar 16, 2010 at 4:45 PM, Tomasz Finc <tfinc(a)wikimedia.org> wrote:
> >>
> >> Kevin Webb wrote:
> >> > I just managed to finish decompression. That took about 54 hours on an
> >> > EC2 2.5x unit CPU. The final data size is 5469GB.
> >> >
> >> > As the process just finished I haven't been able to check the
> >> > integrity of the XML, however, the bzip stream itself appears to be
> >> > good.
> >> >
> >> > As was mentioned previously, it would be great if you could compress
> >> > future archives using pbzip2 to allow for parallel decompression. As I
> >> > understand it, the pbzip files are reverse compatible with all
> >> > existing bzip2 utilities.
> >>
> >> Looks like the trade off is slightly larger files due to pbzip2's
> >> algorithm for individual chunking. We'd have to change the
> >> buildFilters function in http://tinyurl.com/yjun6n5 and install the new
> >> binary. Ubuntu already has it in 8.04 LTS making it easy.
> >>
> >> Any takers for the change?
> >>
> >> I'd also like to gauge everyone's opinion on moving away from the large
> >> file sizes of bz2 and going exclusively 7z. We'd save a huge amount of
> >> space doing it at a slightly larger cost during compression.
> >> Decompression of 7z these days is wicked fast.
> >>
> >> Let us know.
> >>
> >> --tomasz
> >>
--- On Tue, 16/3/10, Kevin Webb <kpwebb(a)gmail.com> wrote:
> From: Kevin Webb <kpwebb(a)gmail.com>
> Subject: Re: [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
> To: "Tomasz Finc" <tfinc(a)wikimedia.org>
> CC: "Wikimedia developers" <wikitech-l(a)lists.wikimedia.org>, xmldatadumps-admin-l(a)lists.wikimedia.org, Xmldatadumps-l(a)lists.wikimedia.org
> Date: Tuesday, 16 March 2010, 21:10
> I just managed to finish decompression. That took about 54 hours on an
> EC2 2.5x unit CPU. The final data size is 5469GB.
>
> As the process just finished I haven't been able to check the
> integrity of the XML, however, the bzip stream itself appears to be
> good.
>
> As was mentioned previously, it would be great if you could compress
> future archives using pbzip2 to allow for parallel decompression. As I
> understand it, the pbzip files are reverse compatible with all
> existing bzip2 utilities.
>
Yes, they are :-).
Regards,
F.
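A quick way to convince yourself of that compatibility (a sketch; pbzip2 and bzip2 must both be installed, and the filename is a placeholder):

  # Compress with pbzip2 (multi-stream), decompress with plain bzip2
  pbzip2 -c sample.xml > sample.xml.bz2
  bzip2 -dc sample.xml.bz2 | cmp - sample.xml && echo "round-trip OK"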
> Thanks again for all your work on this!
> Kevin
>
> On Tue, Mar 16, 2010 at 4:05 PM, Tomasz Finc <tfinc(a)wikimedia.org> wrote:
> > Tomasz Finc wrote:
> >> New full history en wiki snapshot is hot off the presses!
> >>
> >> It's currently being checksummed which will take a while for 280GB+ of
> >> compressed data, but for those brave souls willing to test, please grab it
> >> from
> >>
> >> http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-pages-meta-hi…
> >>
> >> and give us feedback about its quality. This run took just over a month
> >> and gained a huge speed-up after Tim's work on re-compressing ES. If we
> >> see no hiccups with this data snapshot, I'll start mirroring it to other
> >> locations (Internet Archive, Amazon public data sets, etc).
> >>
> >> For those not familiar, the last successful run that we've seen of this
> >> data goes all the way back to 2008-10-03. That's over 1.5 years of
> >> people waiting to get access to these data bits.
> >>
> >> I'm excited to say that we seem to have it :)
> >
> > So now that we've had it for a couple of days .. can I get a status
> > report from someone about its quality?
> >
> > Even if you had no issues, please let us know so that we can start mirroring.
> >
> > --tomasz
This past weekend, at the SXSW conference, a new initiative
was launched to "get video on Wikipedia",
http://videoonwikipedia.org/
That sounds like a great idea.
(I wasn't there, but I was told.)
But among the first videos to be uploaded since the
announcement are two that show some construction
equipment and both break my browser every time I try
to watch them. How can this be possible with a fully
updated Mozilla Firefox 3.5.8 on Ubuntu Linux?
I suppose something went wrong in the OGG encoding,
but still, browsers should not be fooled by this,
and/or Wikimedia Commons needs to make sure videos
are correctly encoded so they can be safely watched.
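One way Commons (or uploaders) could catch such files before readers hit them is to attempt a full decode and check whether the decoder reports errors; a rough sketch, assuming ffmpeg is installed and using one of the files in question:

  # Decode the whole file to a null sink; any stream errors are printed
  ffmpeg -v error -i 6hpPowerTrowel.ogv -f null - && echo "decodes cleanly"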
I have asked that these two broken videos be removed:
http://commons.wikimedia.org/wiki/File:6hpPowerTrowel.ogv
http://commons.wikimedia.org/wiki/File:13hpBoren.ogv
We discussed for a long time why OpenOffice documents
can't be uploaded to Wikimedia Commons: the ZIP
encoding wasn't considered safe and could explode in the
face of the user. Well, maybe OGG isn't safe either?
Should we just ban video altogether?
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se