Tim Starling wrote:
Reid Priedhorsky wrote:
Dear Wikitechnicians,
My name is Reid Priedhorsky, and I'm a Ph.D. student at GroupLens
Research, which is the human-computer interaction group at the
University of Minnesota.
We are currently working on research investigating Wikipedia
contribution and vandalism. To this end, statistics on the
view rate of different articles would be extremely helpful to us --
something along the lines of Leon Weber's WikiCharts tool, but with a
larger limit (ideally all 1.7 million articles).
Producing such statistics will be a Google Summer of Code project this
summer. If you can't wait that long, then we can give you a sampled,
anonymised log stream to analyse.
Yes, summer would be too late: anonymised logs would be excellent for
our purposes. Does "stream" mean that we would need to write a program
to listen to the real-time log stream, or could you give us files?
Gregory Maxwell wrote:
Greetings, describe for me what your ideal data
would look like.
Ideal data would be log files that just looked like:
Main Page\t1169499304.066
i.e., article titles as they appear in the XML dumps and request time.
A close second choice would be simply-anonymized logs, e.g.:
sq18.wikimedia.org 1715898 1169499304.066 0 - TCP_MEM_HIT/200 13208 GET http://en.wikipedia.org/wiki/Main_Page NONE/- text/html - - -
If the logs still contain duplicates due to requests being forwarded
between squids, we'd need pointers on how to resolve those.
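For what it's worth, extracting (title, timestamp) pairs from lines in that
second format is straightforward. A minimal sketch, assuming the
whitespace-separated field layout of the sample line above (timestamp in
field 3, URL in field 9) and that URL paths use underscores and
percent-encoding where the XML dumps use literal characters:

```python
import urllib.parse

PREFIX = "http://en.wikipedia.org/wiki/"

def parse_squid_line(line):
    """Parse one anonymised squid log line into (title, timestamp).

    Assumes the field layout of the sample line above; returns None
    for lines that are not enwiki article views.
    """
    fields = line.split()
    timestamp = float(fields[2])   # request time, epoch seconds
    url = fields[8]                # requested URL
    if not url.startswith(PREFIX):
        return None
    # Titles in URLs are percent-encoded with underscores for spaces,
    # while the XML dumps use literal spaces.
    title = urllib.parse.unquote(url[len(PREFIX):]).replace("_", " ")
    return title, timestamp

line = ("sq18.wikimedia.org 1715898 1169499304.066 0 - TCP_MEM_HIT/200 "
        "13208 GET http://en.wikipedia.org/wiki/Main_Page NONE/- "
        "text/html - - -")
print(parse_squid_line(line))  # → ('Main Page', 1169499304.066)
```

This doesn't address squid-to-squid forwarding; if duplicates show up as
repeated (title, timestamp) pairs, keeping a seen-set over those pairs
would be one crude way to drop them, but guidance from someone who knows
the logging setup would be better.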
Please let me know what the next step is. Thanks for your help!
Reid
Just a small aside: please keep us up-to-date on the outcome of the
research over on the Wiki-research-l mailing list. It's always
interesting (and potentially useful) to see how Wikipedia is used.
--
Oldak Quill (oldakquill(a)gmail.com)