[Wikipedia-l] Mediametry survey (was Re: Language versions' popularity vs. number of articles...)

Neil Harris usenet at tonal.clara.co.uk
Fri Mar 31 14:14:43 UTC 2006


Brion Vibber wrote:
> Miguel Chaves wrote:
>   
>> Hi, I wonder whether Wikipedia relies only on this sort of external
>> statistics (like Alexa) to gather information about visits to its sites.
>> Aren't there statistics collected on the Wikipedia servers themselves?
>> That would be more useful and reliable.
>>     
>
> Not at this time. At our traffic level, web server logs are too large to handle
> comfortably without a dedicated infrastructure, and we've been forced to simply
> disable them until something easier to handle gets set up.
>
> (If we were an ad-supported site, such statistics would be much much more
> important and we'd have put in the time and money for it a lot sooner.)
>
>   
>> BTW, if we want to know the popularity of a specific article (not a
>> specific Wikipedia), is there a tool for that?
>>     
>
> Not really, sorry.
>
> -- brion vibber (brion @ pobox.com)
>
>   
Since the traffic is so vast, why not use random sampling? At each page 
hit, call a random-number generator (e.g. read four bytes from 
/dev/urandom, or call a seeded pseudo-random number routine), and make a 
log entry only if the result is 0 mod 1000. That way, the logs will be 
statistically representative of the overall traffic, but will require 
only a tiny fraction of the disk I/O, compute time, and disk space.
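Just to illustrate the idea, a rough sketch in Python of the per-hit 
sampling decision (the handle_hit() hook and names are made up; the real 
servers run PHP/Squid, so treat this purely as pseudocode that happens 
to run):

import os
import struct

SAMPLE_RATE = 1000  # keep roughly 1 hit in 1000

def should_log():
    # Four bytes from the kernel RNG, kept only when 0 mod SAMPLE_RATE.
    (r,) = struct.unpack("<I", os.urandom(4))
    return r % SAMPLE_RATE == 0

def handle_hit(url, logfile):
    # Called once per page hit; writes a log line for ~0.1% of hits.
    if should_log():
        logfile.write(url + "\n")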

Alternatively, you could log using UDP syslog, and have a listener that 
threw away 999 out of 1000 packets.
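Again only a sketch (Python, with a made-up port and filename), but the 
listener could look something like:

import random
import socket

SAMPLE_RATE = 1000

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 5140))  # standard syslog is 514, which needs root

with open("sampled.log", "a") as out:
    while True:
        data, addr = sock.recvfrom(65535)
        # Keep roughly 1 packet in 1000; drop the rest.
        if random.randrange(SAMPLE_RATE) == 0:
            out.write(data.decode("utf-8", "replace") + "\n")

This moves the sampling off the web servers entirely, at the cost of 
the UDP traffic itself; lossy UDP is fine here, since we are sampling 
anyway.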

-- Neil



