Analytics January 2018

analytics@lists.wikimedia.org

16 participants
16 discussions

Wikipedia aggregate clickstream data released

by Dario Taraborelli

We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia:

http://dx.doi.org/10.6084/m9.figshare.1305770

This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.

This data can be used for various purposes:

• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia

We created a page on Meta for feedback and discussion about this release:

https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream

Ellery and Dario
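As a rough illustration of the first and last uses listed in the announcement, the sketch below ranks outgoing links for an article and turns (referer, article) counts into Markov transition probabilities. It is only a minimal sketch: the (referer, article, count) tuple layout, the function names and the sample rows are assumptions made for this example, not the dataset's actual schema or real counts.

```python
from collections import defaultdict


def transition_probabilities(rows):
    """Turn (referer, article, count) rows into per-referer transition probabilities.

    The three-field row layout is an assumption made for this sketch.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for referer, article, n in rows:
        counts[referer][article] += n
    probabilities = {}
    for referer, targets in counts.items():
        total = sum(targets.values())
        # Each outgoing count is normalised by the referer's total outgoing clicks.
        probabilities[referer] = {a: n / total for a, n in targets.items()}
    return probabilities


def top_links(rows, referer, k=3):
    """Return the k most frequently followed (article, count) pairs from a referer."""
    targets = defaultdict(int)
    for ref, article, n in rows:
        if ref == referer:
            targets[article] += n
    # Sort by descending count, then by title so ties are deterministic.
    return sorted(targets.items(), key=lambda item: (-item[1], item[0]))[:k]


# Hypothetical sample rows, invented for illustration only.
sample_rows = [
    ("London", "England", 60),
    ("London", "River_Thames", 30),
    ("London", "Big_Ben", 10),
    ("England", "London", 50),
]

if __name__ == "__main__":
    print(top_links(sample_rows, "London", k=2))
    print(transition_probabilities(sample_rows)["London"])
```

Note that probabilities computed this way only describe clicks that appear in the data; requests that ended on an article without a further click are not represented in this sketch.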

6 years, 4 months

Research Showcase Wednesday, January 17, 2018

by Lani Goto

Hi Everyone,

The next Research Showcase will be live-streamed this Wednesday, January 17, 2018 at 11:30 AM PST (19:30 UTC).

YouTube stream: https://www.youtube.com/watch?v=L-1uzYYneUo

As usual, you can join the conversation on IRC at #wikimedia-research. And, you can watch our past research showcases here.

This month's presentations:

*What motivates experts to contribute to public information goods? A field experiment at Wikipedia*
By Yan Chen, University of Michigan

Wikipedia is among the most important information sources for the general public. Motivating domain experts to contribute to Wikipedia can improve the accuracy and completeness of its content. In a field experiment, we examine the incentives which might motivate scholars to contribute their expertise to Wikipedia. We vary the mentioning of likely citation, public acknowledgement and the number of views an article receives. We find that experts are significantly more interested in contributing when citation benefit is mentioned. Furthermore, cosine similarity between a Wikipedia article and the expert's paper abstract is the most significant factor leading to more and higher-quality contributions, indicating that better matching is a crucial factor in motivating contributions to public information goods. Other factors correlated with contribution include social distance and researcher reputation.

*Wikihounding on Wikipedia*
By Caroline Sinders, WMF

Wikihounding (a form of digital stalking on Wikipedia) is incredibly qualitative and quantitative. What makes wikihounding different from mentoring? It's the context of the action or the intention. However, all interactions inside a digital space have a quantitative aspect: every comment, revert, etc. is a data point. By comparing data points across wikihounding cases and reading some of the cases, we can create a baseline of the similarities that actually overlap between cases, and so study what makes up wikihounding.

Wikihounding currently has a fairly loose definition. Wikihounding, as defined by the Harassment policy on en:wp, is: “the singling out of one or more editors, joining discussions on multiple pages or topics they may edit or multiple debates where they contribute, to repeatedly confront or inhibit their work. This is with an apparent aim of creating irritation, annoyance or distress to the other editor. Wikihounding usually involves following the target from place to place on Wikipedia.” This definition doesn't outline parameters around cases such as frequency of interaction, duration, or minimum reverts, nor is much known about what a standard or canonical case of wikihounding looks like. What is the average wikihounding case? This talk will cover the approaches that I and members of the research team (Diego Saez-Trumper, Aaron Halfaker and Jonathan Morgan) are taking in starting this research project.

--
Lani Goto
Project Assistant, Engineering Admin

6 years, 4 months

Reboot of eventlog1001 for kernel upgrades

by Luca Toscano

Hi everybody,

I am about to reboot eventlog1001 for kernel upgrades. This host runs all the Eventlogging daemons that pull data from Kafka, process it and then push it to MySQL. The maintenance is needed to deploy the new Linux kernel that fixes the Meltdown vulnerability.

If you see a dip in Eventlogging schema metrics (https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?orgId=1) it will be my fault :)

Thanks!

Luca (on behalf of the Analytics team)

6 years, 4 months

Analytics hosts reboot announcement - Wed 17th 10 AM CET

by Luca Toscano

Hi everybody,

as you already know, we are deploying the new Linux kernel to fix the Meltdown vulnerability across the production fleet. This means that I need to reboot all the stat boxes (stat100[456]) and also analytics1003 (running Oozie, Camus, Hive, etc.), probably interfering with the work that you are doing in tmux/screen sessions :)

The maintenance is planned for Wed 17th at around 10 AM CET. These will be the side effects:

1) All the tmux/screen sessions on stat100[456] will be killed by the reboots, so please reach out to me or the Analytics team if you have important work that cannot be stopped/resumed. We'll do our best to reschedule the maintenance accordingly.

2) A lot of Hadoop cluster activities (like regular Oozie jobs, Hive queries, etc.) will need to be stopped to allow a smooth reboot of analytics1003. This might cause some of your queries to fail. Again, please reach out to me if this is a problem for your work.

Thanks a lot and sorry for the trouble!

Luca (on behalf of the Analytics team)

6 years, 4 months

dbstore1002 / analytics-store.eqiad.wmnet downtime announcement for Jan 09 15:00 UTC

by Luca Toscano

Hi everybody,

dbstore1002 (also known as analytics-store.eqiad.wmnet) needs to be shut down for maintenance tomorrow, Jan 09, at around 15:00 UTC for https://phabricator.wikimedia.org/T183771. We don't expect the downtime to last more than a couple of hours, but there are some outstanding issues that might require more time, so we are not completely sure. I will follow up in this email thread if any problem arises.

As always, please follow up with me or anybody in the Analytics team if this maintenance affects any important work that you are doing. We'll do our best to reschedule it accordingly.

Thanks and sorry for the trouble!

Luca (on behalf of the Analytics team)

6 years, 4 months

Maintenance window for db1107 (Event Logging log database)

by Luca Toscano

Hi everybody,

as part of https://phabricator.wikimedia.org/T168414 the Analytics team needs to run a large number of ALTER TABLE statements on the log database to complete the data purging/sanitization work. The plan is to stop the Eventlogging MySQL Consumer on eventlog1001 tomorrow, Jan 03, during the EU morning, and keep it stopped until all the work is done. We estimate that this will take two to three days, which means that data will not be replicated to the analytics-slave (db1108) during this timeframe.

Event Logging will keep working as expected; the only thing that will be stopped is the insertion of new data into db1107.

Please follow up with me (elukey on IRC) or with the Analytics team if this maintenance affects important work that you are doing, so that we can decide on a better date together.

Thanks a lot!

Luca (on behalf of the Analytics team)

6 years, 4 months
