Andy Rabagliati wrote:
> It will take me a week or so to get a good look at these - but -
> a question for the developers - am I right to only accept files
> matching ./en/[0-9a-f]/../* from the archive ?
> Presumably uploads are just hashed into these dirs ?
Yes, that's correct. The directory name is derived from the MD5 hash of
the filename.
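For illustration, the hashing scheme described above can be sketched like this (an assumption-laden sketch, not MediaWiki's exact code - in particular, MediaWiki normalizes the filename, e.g. spaces to underscores, before hashing):

```python
import hashlib

def upload_path(filename, lang="en"):
    # The first one and first two hex digits of the MD5 of the
    # filename become the directory names, giving paths that match
    # the ./en/[0-9a-f]/../* pattern mentioned above.
    h = hashlib.md5(filename.encode("utf-8")).hexdigest()
    return f"{lang}/{h[0]}/{h[:2]}/{filename}"

print(upload_path("Example.jpg"))
```

Note that the two-character directory always begins with the one-character directory, since both are prefixes of the same hash.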
> There are a few pics that come with the MediaWiki software that I
> would, naturally, leave alone.
> In the first (Jun) archive /thumb/* was about 700Meg, and /archive/*
> was similar. There were also a lot of encyclopedia pics in the
> root dir - I threw them all away without noticing anything untoward.
In the real root directory there are symlinks to images in the other
directories, apparently left there to avoid breaking URLs used by an
earlier version of the software. Obviously tar has converted them from
symlinks to duplicates. You can delete them.
> I might run a script over the archive and convert large images
> to ones of the same size but, say, 70% quality. I imagine I
> could easily halve the archive size that way.
Quite likely.
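One way to sketch such a pass (a dry run only - it prints the ImageMagick commands it would run rather than executing them; `convert -quality 70` is the relevant ImageMagick option, and the size threshold is an arbitrary example):

```python
import os

def recompress_commands(root, min_bytes=100 * 1024, quality=70):
    # Walk the archive and collect the ImageMagick invocations that
    # would recompress each large JPEG in place. Nothing is executed
    # here; pipe the output to a shell once you've reviewed it.
    cmds = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.lower().endswith((".jpg", ".jpeg")):
                continue
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) >= min_bytes:
                cmds.append(f"convert -quality {quality} '{path}' '{path}'")
    return cmds
```

Printing the commands first, instead of converting in place immediately, makes it easy to spot-check a sample before committing to a lossy rewrite of the whole archive.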
> If there are other regexes that would catch files resized by the
> server I would be very grateful for the hint.
The thumb directory contains all the images resized automatically,
although the ./en/[0-9a-f] directories will contain some duplicate
images resized by hand.
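As a sketch of such filters: MediaWiki names server-generated thumbnails "<width>px-<original name>", so a pattern on that convention catches most automatic resizes that stray outside ./thumb/, though hand-uploaded duplicates that don't follow the naming will slip through (the patterns below are illustrative, not exhaustive):

```python
import re

# Accept files under the hashed upload tree, mirroring ./en/[0-9a-f]/../*
ACCEPT = re.compile(r"^\./en/[0-9a-f]/[0-9a-f]{2}/[^/]+$")
# Reject server-style thumbnails, which are named "<width>px-<name>".
THUMB = re.compile(r"(^|/)\d+px-[^/]+$")

def keep(path):
    """Keep hashed-upload files that don't look like generated thumbnails."""
    return bool(ACCEPT.match(path)) and not THUMB.search(path)
```

A hand-resized copy saved under an arbitrary name would still pass this filter, which matches the caveat above about duplicates in the ./en/[0-9a-f] directories.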
-- Tim Starling