We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia:
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
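The uses listed above can be sketched in a few lines of Python. This is a minimal illustration, not the release's own tooling: it assumes the (referer, article) counts have already been loaded into memory as (referer, article, count) tuples, and the sample rows, the referer label "external_search" and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows: (referer, article, count). All values are invented.
SAMPLE_PAIRS = [
    ("London", "England", 300),
    ("London", "River_Thames", 100),
    ("external_search", "London", 1000),
    ("England", "London", 250),
    ("England", "United_Kingdom", 50),
]


def top_links_from(pairs, article, n=10):
    """Most frequently clicked links on `article` (rows where it is the referer)."""
    rows = [(dst, count) for src, dst, count in pairs if src == article]
    return sorted(rows, key=lambda row: row[1], reverse=True)[:n]


def top_referers_to(pairs, article, n=10):
    """Most common referers that led to `article`."""
    rows = [(src, count) for src, dst, count in pairs if dst == article]
    return sorted(rows, key=lambda row: row[1], reverse=True)[:n]


def click_through_fraction(pairs, article):
    """Requests that left `article` via a link, divided by requests for `article`."""
    views = sum(count for _, dst, count in pairs if dst == article)
    clicks = sum(count for src, _, count in pairs if src == article)
    return clicks / views if views else 0.0


def transition_probabilities(pairs):
    """First-order Markov chain: P(next article | current page), from raw counts."""
    counts = defaultdict(lambda: defaultdict(int))
    for src, dst, count in pairs:
        counts[src][dst] += count
    chain = {}
    for src, targets in counts.items():
        total = sum(targets.values())
        chain[src] = {dst: count / total for dst, count in targets.items()}
    return chain
```

With the sample rows, `top_links_from(SAMPLE_PAIRS, "London")` ranks "England" ahead of "River_Thames". Note that the chain built here only normalizes over observed clicks, so it does not model readers who stop browsing after a page.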
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Hi Everyone,
The next Research Showcase will be live-streamed this Wednesday, January
17, 2018 at 11:30 AM PST (19:30 UTC).
YouTube stream: https://www.youtube.com/watch?v=L-1uzYYneUo
As usual, you can join the conversation on IRC at #wikimedia-research. And,
you can watch our past research showcases here.
This month's presentation:
*What motivates experts to contribute to public information goods? A field
experiment at Wikipedia*
By Yan Chen, University of Michigan
Wikipedia is among the most important information sources for the general
public. Motivating domain experts to contribute to Wikipedia can improve
the accuracy and completeness of its content. In a field experiment, we
examine the incentives which might motivate scholars to contribute their
expertise to Wikipedia. We vary the mentioning of likely citation, public
acknowledgement and the number of views an article receives. We find that
experts are significantly more interested in contributing when citation
benefit is mentioned. Furthermore, cosine similarity between a Wikipedia
article and the expert's paper abstract is the most significant factor
leading to more and higher-quality contributions, indicating that better
matching is a crucial factor in motivating contributions to public
information goods. Other factors correlated with contribution include
social distance and researcher reputation.
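For readers unfamiliar with the cosine similarity measure mentioned in this abstract, here is a minimal sketch. The abstract does not say how the texts were turned into vectors, so the plain term-count representation and the function names below are assumptions made purely for illustration.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercased word counts; a deliberately simple, assumed representation."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-count vectors of two texts."""
    a, b = term_counts(text_a), term_counts(text_b)
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)
```

The score is 0 when the two texts share no words and approaches 1 as their word distributions align, which is why it can serve as a rough measure of how well an article matches an expert's abstract.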
*Wikihounding on Wikipedia*
By Caroline Sinders, WMF
Wikihounding (a form of digital stalking on Wikipedia) is both highly
qualitative and quantitative. What makes wikihounding different from
mentoring? It's the context of the action or the intention. However, all
interactions inside a digital space have a quantitative aspect: every
comment, revert, etc. is a data point. By comparing data points
across wikihounding cases and reading some of the cases,
we can create a baseline of the similarities that actually overlap
across cases, and so study what makes up wikihounding. Wikihounding
currently has a fairly loose definition. Wikihounding, as defined by the
Harassment policy on en:wp, is: “the singling out of one or more editors,
joining discussions on multiple pages or topics they may edit or multiple
debates where they contribute, to repeatedly confront or inhibit their
work. This is with an apparent aim of creating irritation, annoyance or
distress to the other editor. Wikihounding usually involves following the
target from place to place on Wikipedia.” This definition doesn't outline
parameters around cases such as frequency of interaction, duration, or
minimum reverts, nor is there a lot known about what a standard or
canonical case of wikihounding looks like. What is the average wikihounding
case? This talk will cover the approaches that I and members of the
research team (Diego Saez-Trumper, Aaron Halfaker and Jonathan Morgan) are
taking to start this research project.
--
Lani Goto
Project Assistant, Engineering Admin
Hi everybody,
I am about to reboot eventlog1001 for kernel upgrades. This host runs all
the Eventlogging daemons that pull data from Kafka, process it and then
push it to MySQL. The maintenance is needed to deploy the new Linux kernel
that fixes the Meltdown vulnerability.
If you see a dip in Eventlogging schema metrics (
https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?orgId=1) it
will be my fault :)
Thanks!
Luca (on behalf of the Analytics team)
Hi everybody,
as you already know we are deploying the new Linux kernel to fix the
Meltdown vulnerability across the production fleet. This means that I need
to reboot all the stat boxes (stat100[456]) and also analytics1003 (running
Oozie, Camus, Hive, etc.), which will probably interfere with the work that
you are doing in tmux/screen sessions :)
The maintenance is planned for Wed 17th at around 10 AM CET.
These will be the side effects:
1) All the tmux/screen sessions on stat100[456] will be killed by the
reboots, so please reach out to me or the Analytics team if you have
important work that cannot be stopped/resumed; we'll do our best to
reschedule the maintenance accordingly.
2) A lot of Hadoop cluster activities (like regular Oozie jobs, Hive
queries, etc.) will need to be stopped to allow a smooth reboot of
analytics1003. This might cause some of your queries to fail; again, please
reach out to me if this is a problem for your work.
Thanks a lot and sorry for the trouble!
Luca (on behalf of the Analytics team)
Hi everybody,
dbstore1002 (also known as analytics-store.eqiad.wmnet) needs to be
shut down for maintenance tomorrow, Jan 09, at around 15:00 UTC for
https://phabricator.wikimedia.org/T183771. We don't expect the downtime to
last more than a couple of hours, but there are some outstanding issues
that might require more time, so we are not completely sure.
I will follow up in this email thread if any problem arises. As always,
please follow up with me or anybody in the Analytics team if this
maintenance affects any important work that you are doing; we'll do
our best to reschedule it accordingly.
Thanks and sorry for the trouble!
Luca (on behalf of the Analytics team)
Hi everybody,
as part of https://phabricator.wikimedia.org/T168414 the Analytics team
needs to run many ALTER TABLE statements on the log database in order to
complete the data purging/sanitization work. The plan is to stop the
Eventlogging MySQL Consumer on eventlog1001 tomorrow, Jan 03, during the EU
morning, and keep it stopped until all the work is done. We estimate that
this will take 2-3 days, which means that no new data will be replicated
to the analytics-slave (db1108) during this timeframe. Event Logging will
keep working as expected; the only thing that will be stopped is the
insertion of new data on db1107.
Please follow up with me (elukey on IRC) or with the Analytics team if this
maintenance affects any important work that you are doing, so that we can
agree on a better date.
Thanks a lot!
Luca (on behalf of the Analytics team)