On Sun, Jun 29, 2008 at 10:03 AM, Lars Aronsson <lars(a)aronsson.se> wrote:
> I'd like to propose a quality metric: The difference in rank
> between the article count and the compressed database size.
I think this is a good metric, especially because it's a relative
metric (since it's effectively comparing projects against their peers
to see how mature they are).
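
For anyone who wants to play with the metric, here is a minimal sketch of how I read it: rank the projects by article count, rank them again by compressed dump size, and subtract. The function name and the three projects in the usage note are made up for illustration, and ties are simply broken by input order, which the proposal doesn't address.

```python
def rank_difference(article_counts, dump_sizes):
    """Return {project: rank by article count - rank by dump size}.

    Rank 1 is the largest. A positive result means the project ranks
    higher by compressed size than by article count.
    """
    def ranks(values):
        # Largest value first; sorted() is stable, so ties keep input order.
        ordered = sorted(values, key=values.get, reverse=True)
        return {name: position + 1 for position, name in enumerate(ordered)}

    by_articles = ranks(article_counts)
    by_size = ranks(dump_sizes)
    return {name: by_articles[name] - by_size[name] for name in article_counts}
```

With made-up inputs such as article counts {"a": 300, "b": 200, "c": 100} and dump sizes {"a": 50, "b": 10, "c": 30}, project "b" scores -1 and project "c" scores +1.
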
Someone earlier was discussing article sizes, so I hacked up a script
to graph the distribution of article sizes:
http://www.toolserver.org/~thebainer/articlesizes/
Most graphs share the same basic shape, with a roughly logarithmic
distribution once you get past the initial peak (see the English
Wikipedia graph for an example of what I mean), but some are
different, and the differences tend to coincide with what has already
been observed.
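
To give a rough idea of what goes into such a graph, the sketch below buckets article sizes into fixed-width bins and counts them. This is not the actual script behind the graphs linked above. The function name and the 250-byte bin width are assumptions chosen for illustration.

```python
from collections import Counter


def size_histogram(sizes, bin_width=250):
    """Count articles per size bin; returns sorted (bin_start, count) pairs.

    `sizes` is an iterable of article sizes in bytes. The bin width is an
    arbitrary illustrative choice.
    """
    counts = Counter((size // bin_width) * bin_width for size in sizes)
    return sorted(counts.items())
```

Plotting the counts against the bin starts gives the kind of curve described above, and a second peak such as one caused by bot-created stubs would show up as a second local maximum in the counts.
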
> The Swedish Wikipedia was (when this table was compiled) the 10th
> biggest by article count, but the 12th biggest by compressed
> database size, so its quality is 10 - 12 = -2.
The Swedish Wikipedia's article sizes are distributed in almost exactly
the same way as the English Wikipedia's, except that its average
article size, at around 1900 bytes, is less than half of En's.
> The Russian Wikipedia was the 11th by article count, but 9th by
> compressed database size, so its quality is +2. This doesn't mean
> the Russian Wikipedia is better than the English one, only that it
> is better than (two of) its peers of similar size.
Not only does the Russian Wikipedia have a high average article size
(about 5500 bytes, compared with, for example, English Wikipedia at
around 4100 bytes) but its graph, which has multiple peaks, seems to
show that, unlike many other projects, it has more mature, medium-size
articles than it does stubs.
> The Volapük Wikipedia was the 15th by article count, but worse
> than the 30th by compressed database size (the table is
> incomplete), so its quality is worse than -15.
The Volapük Wikipedia has an unusual distribution, with two peaks. One
is in the usual place, just below the average size (which is low, at
just over 1000 bytes), while the other is at around 2 to 2.5 kB, which
corresponds to the size of all the geography stubs created by
SmeiraBot.
--
Stephen Bain
stephen.bain(a)gmail.com