[Foundation-l] thoughts on leakages

Jerome Banal jerome.banal at gmail.com
Sun Jan 13 22:50:54 UTC 2008


Scraping:

Just a few months ago, Jeff Merkey downloaded all images used by enwiki
in a day and a half using 16 workstations and his wikix tool, so this is
definitely possible.

Not something too many people should do too often, of course, but few
have enough bandwidth anyway.
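
As a rough sanity check on those numbers (the one-request-per-second pace
is my assumption, not Jeff's measurement), the arithmetic in Python:

# Back-of-envelope scraping time for ~2 million images (Robert's figure
# below), assuming about one image fetched per second per machine.
images = 2000000
seconds_per_day = 86400

one_machine_days = images / seconds_per_day        # ~23 days
sixteen_machines_days = one_machine_days / 16      # ~1.4 days

# Pulling the full edit history too (~35 million edits) at the same
# polite single-machine pace is a different story:
full_history_days = (images + 35000000) / seconds_per_day   # ~430 days

print(one_machine_days, sixteen_machines_days, full_history_days)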

He was actually redistributing this image dump as a torrent, but it was
taking a week to download that way. Since it was faster to fetch the images
directly from WP, he killed the tracker.

There is some info in this mailing list's archives (look around March/April)
and on the net.
The Linux executable is here:
<ftp://www.wikigadugi.org/wiki/MediaWiki/wikix.tar.gz.bz2>
and it requires only an XML dump to work.
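
I have not read the wikix source, but the basic idea is simple: walk the XML
dump, collect every image name referenced, then fetch each file. A minimal
sketch of that first step in Python (the dump filename and the regex are my
assumptions, not wikix's actual behaviour):

# Rough sketch, not wikix itself: list image names referenced in an XML dump.
import bz2
import re

image_link = re.compile(r'\[\[(?:Image|File):([^|\]]+)', re.IGNORECASE)
names = set()

with bz2.open('enwiki-latest-pages-articles.xml.bz2', 'rt',
              encoding='utf-8', errors='replace') as dump:
    for line in dump:
        for name in image_link.findall(line):
            names.add(name.strip())

print(len(names), 'distinct image references')
# A downloader would then fetch each file from the image servers,
# pausing between requests to stay friendly.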

If you want a torrent of the image dump, maybe he can provide one
again if you ask him politely.

Jerome


2008/1/13, Robert Rohde <rarohde at gmail.com>:
>
> On Jan 13, 2008 5:56 AM, Anthony <wikimail at inbox.org> wrote:
>
> > On Jan 13, 2008 6:51 AM, Robert Rohde < rarohde at gmail.com> wrote:
> > > On 1/13/08, David Gerard <dgerard at gmail.com> wrote:
> > > >
> > > > <snip>
> > > > One of the best protections we have against the Foundation being
> taken
> > > > over by insane space aliens is good database dumps.
> > >
> > > And how long has it been since we had good database dumps?
> > >
> > > We haven't had an image dump in ages, and most of the major projects
> > > (enwiki, dewiki, frwiki, commons) routinely fail to generate full
> > history
> > > dumps.
> > >
> > > I assume it's not intentional, but at the moment it would be very
> > difficult
> > > to fork the major projects in anything approaching a comprehensive
> way.
> > >
> > You don't really need the full database dump to fork.  All you need is
> > the current database dump and the stub dump with the list of authors.
> > You'd lose some textual information this way, but not really that
> > much.  And with the money and time you'd have to put into creating a
> > viable fork it wouldn't be hard to get the rest through scraping
> > and/or a live feed purchase anyway.
> >
> > <snip>
>
>
>
> For several months enwiki's stub-meta-history has also failed (albeit
> silently, you don't notice unless you try downloading it).  There is no
> dump
> at all that contains all of enwiki's contribution history.
>
> As for scraping, don't kid yourself and think that is easy.  I've run
> large
> scale scraping efforts in the past.  For enwiki you are talking about >2
> million images in 2.1 million articles with 35 million edits.  A friendly
> scraper (e.g. one that paused a second or so between requests) could
> easily
> be running a few hundred days if it wanted to grab all of the images and
> edit history.  An unfriendly, multi-threaded scraper could of course do
> better, but it would still likely take a couple weeks.
>
> -Robert Rohde
> _______________________________________________
> foundation-l mailing list
> foundation-l at lists.wikimedia.org
> Unsubscribe: http://lists.wikimedia.org/mailman/listinfo/foundation-l
>

