Well, I think Nick's proposal would indeed be a big improvement.
At present, the Python tool I'm developing for quantitative analysis based on db
dumps has to loop to find the latest valid dump for any given Wikipedia (trying every
possible date in the URL until it finds the correct file).
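
For illustration, the date-probing loop described above could look roughly like the sketch below. It is only a sketch under assumptions: the URL layout is copied from the site_stats example URL quoted later in this message, the file name "pages-articles.xml.bz2" in the usage note is just an example, and `exists` is a hypothetical caller-supplied check (for instance an HTTP HEAD request) that is not part of any existing tool.

```python
from datetime import date, timedelta

BASE = "http://download.wikipedia.org"  # host taken from the URLs in this thread


def candidate_urls(wiki, filename, start, max_days=60):
    """Yield candidate dump URLs, walking backwards one day at a time."""
    for offset in range(max_days):
        stamp = (start - timedelta(days=offset)).strftime("%Y%m%d")
        # Assumed layout: <base>/<wiki>/<date>/<wiki>-<date>-<filename>
        yield f"{BASE}/{wiki}/{stamp}/{wiki}-{stamp}-{filename}"


def find_latest(wiki, filename, start, exists, max_days=60):
    """Return the first URL for which exists(url) is true, or None.

    `exists` is a hypothetical callable supplied by the caller; it is
    injected so the search logic can be tried without network access.
    """
    for url in candidate_urls(wiki, filename, start, max_days):
        if exists(url):
            return url
    return None
```

A call such as find_latest("eswiki", "pages-articles.xml.bz2", date(2006, 9, 25), exists) tries up to max_days URLs, which is exactly the guessing that a published index of valid dumps would make unnecessary.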
Besides that, reading Erik's comments I've realized that I should also check the
size of the dumps, looking for odd values. But who knows the "correct" size of a
given dump (other than enwiki)?
So information about the date, size, and md5 sum of every valid dump would be *really* useful.
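
If the size and md5 sum were published for each dump, a downloaded file could be checked along these lines. This is a minimal sketch, assuming the expected values come from wherever such information ends up being published; the function names are my own and purely illustrative.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_dump(path, expected_size, expected_md5):
    """Return a list of problems found; an empty list means the file looks fine."""
    problems = []
    actual_size = os.path.getsize(path)
    if actual_size != expected_size:
        problems.append(f"size {actual_size} != expected {expected_size}")
    actual_md5 = md5_of_file(path)
    if actual_md5 != expected_md5.lower():
        problems.append(f"md5 {actual_md5} != expected {expected_md5}")
    return problems
```

Reading in chunks keeps memory use flat even for very large dumps.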
Nick Jenkins <nickpj(a)gmail.com> wrote:
> > The "latest" directory is not useful for this purpose (e.g.
> > http://download.wikipedia.org/enwiki/latest/ points to files from approx
> > Aug-17, which looks to be the latest dump where everything reported as succeeded;
Right, for consistency.
Yes, but how often does somebody intentionally download and use every single file
from a dump? Most people need either one or two of the dump files; the rest are
simply irrelevant to them.
The latest directory is using a lowest-common-denominator approach (latest run
where everything succeeded). This file would essentially be a
highest-common-denominator approach (latest successful version of each individual
file). Maybe both have their place.
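
To make the difference between the two approaches concrete, here is a small sketch. The data structure (a mapping from run date to per-file success flags) and the function names are assumptions made purely for illustration; they do not describe how the dump scripts actually record status.

```python
def latest_complete_run(runs):
    """Lowest common denominator: newest run in which every file succeeded."""
    complete = [run for run, files in runs.items() if files and all(files.values())]
    # Run ids are YYYYMMDD strings, so the largest string is the newest run.
    return max(complete) if complete else None


def latest_per_file(runs):
    """Highest common denominator: newest successful run for each file."""
    latest = {}
    for run in sorted(runs):  # oldest first, so newer successes overwrite older ones
        for name, succeeded in runs[run].items():
            if succeeded:
                latest[name] = run
    return latest
```

With made-up status data in which "pages-articles" failed in the 20060925 run but succeeded in 20060817, the first function returns "20060817" for everything, while the second reports "20060925" for every file that did succeed in the newer run.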
However, I've realised it would be useful to include for each data type the date on
which the dump run was started, e.g.:
---------------------------------------
A few statistics such as the page count.
http://download.wikipedia.org/enwiki/20060925/enwiki-20060925-site_stats.sq…
+ 20060925
451
2006-09-24T16:29:01Z
e4defa79c36823c67ed4d937f8f7013c
---------------------------------------
... that way, anyone who needs multiple files can hold off downloading them until
all the "dump_run" fields match up, so as to more easily avoid the problems of
mixing files from different dumps. (It's true that this value can currently be
pulled from the directory in the url field, but if a separate field is used then
the url can point just about anywhere, such as potentially using different
hostnames for different dumps, or a changed directory structure.)
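
A client-side check of the "dump_run" fields might look like the sketch below. It assumes the per-file records have already been parsed into dictionaries keyed by a file name; that parsed form, the key names other than "dump_run", and the function name are hypothetical.

```python
def ready_to_download(records, wanted):
    """Return True only if every wanted file is listed and all share one dump_run."""
    runs = set()
    for name in wanted:
        record = records.get(name)
        if record is None:
            return False  # a file we need is not listed yet
        runs.add(record["dump_run"])
    # Exactly one distinct run means the files come from the same dump.
    return len(runs) == 1
```

A tool would poll the published information and start downloading only once this returns True for the files it needs.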
Anyway, it's just a suggestion, and if you don't like it, well, there's not
much I can do about it ;-)
All the best,
Nick.