[Foundation-l] Thank you for discussing my Top 25 Nonprofit LIst

Gregory Maxwell gmaxwell at wikimedia.org
Wed Oct 3 22:53:40 UTC 2007


On 10/3/07, Matthew Britton <matthew.britton at btinternet.com> wrote:
> I'll try to clarify what geni is saying. The Wikimedia Foundation relies
> exclusively on donations and has a very tight budget. It can only buy as
> much hardware as it can afford, and can only just afford enough to keep
> the sites running. (The toolserver had to be donated separately).

It's a bit misleading to characterize this as a poor-wikimedia issue.
Handling this amount of data is hard for anyone.  It's just that while
other sites need more detailed data for things like selling
themselves to advertisers, we don't... so effort has been spent
elsewhere instead.

Historically we've only collected the information that we need for
capacity planning.  I linked to that stuff upthread.

> The resources just aren't available to completely log all site traffic -
> it would require scripts to process the mess of data generated at a fast
> enough pace to keep up without using up precious CPU time, and a whole
> load of extra disk space to store this data.

As of ~January, we send records of every access to an analysis system.
Before then, technical issues prevented us from collecting that kind
of data.

On that system we log (to disk) 1:100 and 1:1000 samples of the
traffic.  Logging all accesses to disk would result in, as I said
before, about 0.6 TB of log data per day. We'd run out of disk rather
quickly. ;)
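
To put rough numbers on that, here is a back-of-the-envelope sketch of the
sampling arithmetic. Only the 0.6 TB/day figure and the 1:100 and 1:1000
rates come from this message; the function names, the decimal units
(1 TB = 1000 GB), and the disk size in the example are assumptions made
purely for illustration.

```python
# Back-of-the-envelope log volume at different sample rates.
# Assumption: decimal units (1 TB = 1000 GB). Only the 0.6 TB/day
# figure comes from the message; everything else is illustrative.

FULL_LOG_TB_PER_DAY = 0.6


def sampled_gb_per_day(sample_every_n: int,
                       full_tb_per_day: float = FULL_LOG_TB_PER_DAY) -> float:
    """Approximate daily log volume in GB when keeping 1 access in every N."""
    if sample_every_n < 1:
        raise ValueError("sample_every_n must be >= 1")
    return full_tb_per_day * 1000 / sample_every_n


def days_until_full(disk_tb: float, sample_every_n: int = 1) -> float:
    """Days until a disk of the given size (hypothetical) fills up."""
    return disk_tb * 1000 / sampled_gb_per_day(sample_every_n)


if __name__ == "__main__":
    for n in (1, 100, 1000):
        print(f"1:{n} sample -> about {sampled_gb_per_day(n):.1f} GB/day")
```

Under these assumptions a 1:100 sample is on the order of 6 GB per day and
a 1:1000 sample well under 1 GB per day, which is why the sampled logs fit
on disk while the full stream does not.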

We can send the data (at a configurable sample rate) to other hosts,
or to programs for analysis.
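
As an illustration of what a configurable sample rate means in practice,
here is a minimal sketch of a 1-in-N sampler. This is not how the
Wikimedia analysis system is implemented (the message does not say); the
function name and the choice of counter-based rather than random sampling
are assumptions.

```python
# Minimal 1-in-N sampler over a stream of log records.
# Hypothetical sketch: counter-based sampling is an assumption, not a
# description of the real system.
from typing import Iterable, Iterator


def sample_one_in_n(records: Iterable[str], n: int) -> Iterator[str]:
    """Yield every n-th record, starting with the first one."""
    if n < 1:
        raise ValueError("n must be >= 1")
    for index, record in enumerate(records):
        if index % n == 0:
            yield record


if __name__ == "__main__":
    stream = (f"request-{i}" for i in range(1000))
    kept = list(sample_one_in_n(stream, 100))
    print(len(kept), "records kept out of 1000")
```

Because the sampler is a generator, it never holds the full stream in
memory, which matters when the unsampled data is too large to store.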

We have at least some resources to run some analysis programs but they
must be very efficient unless they are to be run only on infrequently
sampled data.

We just don't have the analysis programs.

I checked a simple aggregator for pageview stats into SVN last night.
http://svn.wikimedia.org/viewvc/mediawiki/trunk/tools/counter/fast_counter.c?view=markup
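
For readers who want a feel for what such an aggregator does, here is a
small sketch in Python. It is not a port of fast_counter.c and makes no
claim about that program's input format: the log line layout (project
name as the first whitespace-separated field) and the function name are
assumptions. It counts sampled lines per project and scales by the sample
rate to estimate total pageviews.

```python
# Sketch of a per-project pageview aggregator over sampled log lines.
# Assumptions (not from fast_counter.c): each line starts with a project
# name followed by whitespace, and the input is a 1-in-N sample.
from collections import Counter
from typing import Iterable


def estimate_pageviews(sampled_lines: Iterable[str],
                       sample_every_n: int) -> Counter:
    """Estimate total pageviews per project from a 1-in-N sample."""
    counts: Counter = Counter()
    for line in sampled_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # Each sampled line stands in for roughly N real accesses.
        counts[fields[0]] += sample_every_n
    return counts


if __name__ == "__main__":
    sample = [
        "en.wikipedia /wiki/Main_Page",
        "en.wikipedia /wiki/Example",
        "de.wikipedia /wiki/Beispiel",
    ]
    for project, views in estimate_pageviews(sample, 100).most_common():
        print(project, views)
```

A single pass with constant work per line is the property that matters
here, since any analysis program has to keep up with the incoming stream.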

I've got a unique-viewers-by-project/country one almost done; I'll
probably check it in tonight.

> It's not possible to just "release all log data", because it doesn't exist.

That's not correct anymore.

Complete data is not stored, but it is now collected and can be
transmitted. ... It's not possible to "release all log data" because
there are ethical, legal, and procedural obligations to avoid
endangering the privacy of readers/editors with sloppy disclosures.


