[Foundation-l] Tragical dynamics: that run for the number of articles

Lars Aronsson lars at aronsson.se
Sun Jun 29 00:03:38 UTC 2008


Tomasz Ganicz wrote:

> And if there is no clear definition of what is "real" article 
> and what is not,

Apparently it was the 500K article event that caused Ziko to bring 
the topic up this time.  He's frustrated (and so am I) that 500K 
articles is reported as an achievement when the quality of those 
articles is doubtful.  Still, I think he exaggerates the problem.

Earlier this year, when the topic came up on meta, the issue was 
which languages were featured as the top 10 on 
www.wikipedia.org, 
http://meta.wikimedia.org/wiki/Top_Ten_Wikipedias

Since then, the Russian Wikipedia has gained the 10th position and 
Swedish ("the one with all the stubs") is down to 11th, so there 
is one less problem to worry about.  During that discussion, I 
proposed to use the size of the compressed database dump 
(pages-articles.xml.bz2) as the official metric, since it both 
counts the total database size (one long article counts the same 
as two short ones) and it completely removes the impact of 
bot-generated articles.  The compressed size of the Volapük 
Wikipedia is very small, because the same patterns appear in many 
of its numerous articles.
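
To illustrate why compressed size discounts templated articles, 
here is a minimal sketch using Python's standard bz2 module.  The 
stub template and the random "varied" text are made up for the 
demonstration, they are not taken from any real dump, and random 
letters are only a crude stand-in for hand-written prose.

```python
import bz2
import random


def compressed_ratio(text: str) -> float:
    """Return compressed size divided by raw size (smaller = more redundant)."""
    raw = text.encode("utf-8")
    return len(bz2.compress(raw)) / len(raw)


# Hypothetical bot-style stubs: the same sentence pattern repeated
# 2000 times with only a few fields changing.
stubs = "\n".join(
    f"Place{i} is a village in Region{i % 7}. It has {100 + i} inhabitants."
    for i in range(2000)
)

# Stand-in for non-templated text: pseudo-random letters of the same
# length, generated with a fixed seed so the run is repeatable.
rng = random.Random(0)
alphabet = "abcdefghijklmnopqrstuvwxyz "
varied = "".join(rng.choice(alphabet) for _ in range(len(stubs)))

stub_ratio = compressed_ratio(stubs)
varied_ratio = compressed_ratio(varied)

print(f"templated stubs: {stub_ratio:.2f} of raw size")
print(f"varied text:     {varied_ratio:.2f} of raw size")
```

The two inputs have the same raw length, but the templated one 
shrinks far more, so 2000 near-identical stubs add little to the 
compressed total.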

On the talk page, there is a table where this is shown, and you 
can sort by column by clicking the little boxes,
http://meta.wikimedia.org/wiki/Talk:Top_Ten_Wikipedias#What_problem_do_we_want_to_solve

I'd like to propose a quality metric: The difference in rank 
between the article count and the compressed database size.

The English Wikipedia is the biggest (rank 1), whether you count 
articles or compressed database size.  So its quality is 0.

The Polish Wikipedia was the 4th by article count, but the 7th by 
compressed database size, for a quality of 4 - 7 = -3.

The Swedish Wikipedia was (when this table was compiled) the 10th 
biggest by article count, but the 12th biggest by compressed 
database size, so its quality is 10 - 12 = -2.

The Russian Wikipedia was the 11th by article count, but 9th by 
compressed database size, so its quality is +2. This doesn't mean 
the Russian Wikipedia is better than the English one, only that it 
is better than (two of) its peers of similar size.

The Volapük Wikipedia was the 15th by article count, but worse 
than 30th by compressed database size (the table is incomplete), 
so its quality is worse than -15.
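
The arithmetic above can be sketched in a few lines of Python.  
The ranks are the ones quoted in this message; the function and 
variable names are only illustrative.  Volapük is left out because 
its rank by compressed size is only known to be worse than 30th.

```python
def quality(article_rank: int, dump_rank: int) -> int:
    """Rank by article count minus rank by compressed dump size."""
    return article_rank - dump_rank


# (rank by article count, rank by compressed database size),
# as quoted in the message above.
ranks = {
    "English": (1, 1),
    "Polish": (4, 7),
    "Swedish": (10, 12),
    "Russian": (11, 9),
}

scores = {name: quality(a, d) for name, (a, d) in ranks.items()}

# List the wikis from highest to lowest score.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:8} {score:+d}")
```

A positive score means the wiki ranks higher by compressed size 
than by article count; a negative score means the opposite.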



-- 
  Lars Aronsson (lars at aronsson.se)
  Aronsson Datateknik - http://aronsson.se


