On Fri, Jan 8, 2010 at 8:25 PM, Robert Rohde <rarohde(a)gmail.com> wrote:
While I certainly can't fault your good will, I do find it disturbing
that it was necessary. Ideally, Wikimedia should have internal
backups of sufficient quality that we don't have to depend on what
third parties happen to have saved for any circumstance short of
meteors falling from the heavens.
Yea, well, you can't easily eliminate all the internal points of
failure. "someone with root loses control of their access and someone
nasty wipes everything" is really hard to protect against with online
systems.
Avoiding the case where some failure is reliably replicated among all
of WMF's copies (which is what happened with the deletions I recovered:
they were redundant copies, and they were deleted too) can best be
accomplished with an air-gap.
And meteors *do* fall, if rarely. WMF can be robust against that, for
only the price of making all the data available, which is something
worth doing for other principled and practical reasons.
Within Wikimedia means that Wikimedia remains a single point of
failure. This is too easy to avoid. Disk space is cheap, and not your
problem. At least a few third parties will create and maintain full
copies and this is a good thing.
Moreover it allowed things like image hashing before
we had that in the database, and it would allow perceptual lossy hash
matching if I ever got around to implementing tools to access the
output.
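
To make the image hashing point concrete, here is a minimal sketch of
exact-duplicate detection over a local copy of the files. It is an
illustration only, not the tooling that was actually used: the choice of
SHA-1, the function names, and the directory layout are all assumptions.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group files under root by digest; keep only digests shared by 2+ files.

    `root` is a hypothetical path to a local mirror of the media files.
    """
    by_digest = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_digest[sha1_of_file(path)].append(path)
    return {d: paths for d, paths in by_digest.items() if len(paths) > 1}
```

This only catches byte-identical files. Perceptual (lossy) matching would
need an image decoder and a different hash, which is not shown here.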
If the goal is some version of "do something useful for Wikimedia",
then it actually seems rather bizarre to have the first step be "copy
X TB of gradually changing data to privately owned and managed
servers". For Wikimedia applications, it would seem much more natural
to make tools and technology available to do such things within
Wikimedia. That way developers could work on such problems without
having to worry about how much disk space they can personally afford.
Again, there is nothing wrong with you generously doing such things
with your own resources, but ideally running duplicate repositories
for the benefit of Wikimedia should be unnecessary.
Within Wikimedia means within Wikimedia's means, priorities, and
politics. Having it locally means that if I decide that I want to
saturate a dozen cores computing perceptual hashes for a
week I don't have to convince anyone else that it's a good use of
resources. I don't have to convince Wikimedia to fund a project, I
don't have to take up resources which might be better used by someone
else, I don't have to set any expectations that I might not live up
to.
Of course, it's great to have public resources 'locally' (which is what
the toolserver is for), but that doesn't cover all cases.
There really are use cases. Moreover, making complete copies of the
public data available as dumps to the public is a WMF board supported
initiative.
I agree with the goal of making WMF content available, but given that
we don't offer any image dump right now and a comprehensive dump as
such would be usable to almost no one, I don't think a classic
dump is where we should start. Even you don't seem to want that. If
I understand correctly, you'd like to have an easier way to reliably
download individual image files. You wouldn't actually want to be
presented with some form of monolithic multi-terabyte tarball each
month.
No one wants the monolithic tarball. The way I got updates previously
was via a rsync push.
No one sane would suggest a monolithic tarball: it's too much of a
pain to produce!
Image dump != monolithic tarball.
But I think producing subsets is pretty much worthless. I can't think
of a valid use for any reasonably sized subset. ("All media used on
big wiki X" is a useful subset I've produced for people before, but
it's not small enough to be a big win vs a full copy.)
[snip]
The general point I am trying to make is that if we think about what
people really want, and how the files are likely to be used, then
there may be better delivery approaches than trying to create huge
image dumps.
If all is made available then everyone's wants can be satisfied. No
subset is going to get us there. Of course, there are a lot of
possibilities for the means of transmission, but I think it would be
most useful to assume that at least a few people are going to want to
grab everything.