Hi all,
For Hive users on stat1002/1004: you might have seen a deprecation
warning when launching the hive client, saying that it is being replaced
by Beeline. The Beeline shell has always been available to use, but it
required supplying a database connection string every time, which was
pretty annoying. We now have a wrapper script
<https://github.com/wikimedia/operations-puppet/blob/production/modules/role…>
set up to make this easier. The old Hive CLI will continue to exist, but we
encourage moving over to Beeline. You can use it by logging into the
stat1002/1004 boxes as usual and launching `beeline`.
There is some documentation on this here:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline.
If you run into any issues using this interface, please ping us on the
Analytics list or #wikimedia-analytics or file a bug on Phabricator
<http://phabricator.wikimedia.org/tag/analytics>.
(If you are wondering stat1004 whaaat - there should be an announcement
coming up about it soon!)
Best,
--Madhu :)
Hello!
The Analytics team would like to announce that we have migrated the
reportcard to a new domain:
https://analytics.wikimedia.org/dashboards/reportcard/#pageviews-july-2015-…
The migrated reportcard includes legacy and current pageview data,
daily unique devices, and new editor data. Pageview and devices data are
updated daily, but editor data is still updated ad hoc.
The team is currently revamping the way we compute edit data, and we hope
to be able to provide monthly updates for the main edit metrics
this quarter. Some of those will be visible in the reportcard, but the new
wikistats will have more detailed reports.
You can follow the new wikistats project here:
https://phabricator.wikimedia.org/T130256
Thanks,
Nuria
We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining what share of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
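As a rough illustration, the second and fourth of these use cases can be sketched over a few toy (referer, article, count) rows; the titles and counts below are invented for illustration, and the real dataset's column layout may differ:

```python
from collections import defaultdict

# Toy (referer, article, count) rows in the spirit of the clickstream
# dataset; the titles and counts are invented for illustration.
rows = [
    ("other-google", "London", 1000),
    ("Main_Page", "London", 200),
    ("London", "River_Thames", 300),
    ("London", "England", 500),
    ("Paris", "England", 100),
]

def top_referers(rows, article):
    """Most common referers leading to a given article."""
    counts = [(n, ref) for ref, art, n in rows if art == article]
    return [ref for n, ref in sorted(counts, reverse=True)]

def transitions(rows):
    """Row-normalized transition probabilities, i.e. the edge weights
    of a Markov chain over pages."""
    totals = defaultdict(int)
    for ref, art, n in rows:
        totals[ref] += n
    return {(ref, art): n / totals[ref] for ref, art, n in rows}

print(top_referers(rows, "London"))              # ['other-google', 'Main_Page']
print(transitions(rows)[("London", "England")])  # 0.625
```

The same aggregations scale to the full TSV by streaming it row by row instead of holding a list in memory.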
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Hi all!
tl;dr: Stop using stat100[23] by September 1st.
We’re finally replacing stat1002 and stat1003. These boxes are out of
warranty, and are running Ubuntu Trusty, while most of the production fleet
is already on Debian Jessie or even Debian Stretch.
stat1005 is the new stat1002 replacement. If you have access to stat1002,
you also have access to stat1005. I’ve copied over home directories from
stat1002.
stat1006 is the new stat1003 replacement. If you have access to stat1003,
you also have access to stat1006. I’ve copied over home directories from
stat1003.
I have not migrated any personal cron jobs running on stat1002 or
stat1003. I need your help with this!
Both of these boxes are running Debian Stretch. As such, packages that
your work depends on may have been upgraded. Please log into the new boxes and
try stuff out! If you find anything that doesn’t work, please let me know
by commenting on https://phabricator.wikimedia.org/T152712.
Please be fully migrated to the new nodes by September 1st. This will give
us enough time to fully decommission stat1002 and stat1003 by the end of
this quarter.
I’ve only done a single rsync of home directories. If there is new data on
stat1002 or stat1003 that you want rsynced over, let me know on the ticket.
A few notes:
- stat1002 used to have /a. This has been removed in favor of /srv. /a no
longer exists.
- Home directories are now much larger. You no longer need to create
personal directories in /srv.
- /tmp is still small, so please be careful. If you are running long jobs
that generate temporary data, please have those jobs write into your home
directory, rather than /tmp.
- We might implement user home directory quotas in the future.
Thanks all! I’ll send another email in about a month's time to remind you
of the impending Sept 1 deadline.
-Andrew Otto
Hi all,
EventStreams just experienced a 24 hour ‘outage’. There were no dropped
messages, but for about 24 hours no messages were sent to connected
EventStreams clients.
I’ve written up the Incident Report here:
https://wikitech.wikimedia.org/wiki/Incident_documentation/20170829-EventSt…
The worst part about this is that we didn’t know that there was a problem
until a user notified me on IRC. We monitor and alert on pieces of
EventStreams infrastructure, but don’t monitor topic volume, as it varies
and is hard to get right. However, it shouldn’t have taken 24 hours and a
user report for us (me) to notice, so I’ve created
https://phabricator.wikimedia.org/T174493 to help us catch something like
this in the future.
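As a rough illustration of the kind of volume check that ticket describes (this is a sketch, not the actual implementation; the window count is invented):

```python
def should_alert(message_counts, max_quiet_windows=3):
    """message_counts holds per-window message counts for one stream,
    most recent window last. Alert only when the last few windows were
    all completely silent, which tolerates normal volume fluctuation
    while still catching a stream that has stopped sending entirely."""
    recent = message_counts[-max_quiet_windows:]
    return len(recent) == max_quiet_windows and all(c == 0 for c in recent)

print(should_alert([120, 98, 0, 0, 0]))  # True: three silent windows in a row
print(should_alert([120, 0, 0, 101]))    # False: traffic resumed
```

Requiring several consecutive empty windows is what makes a check like this usable on streams whose volume naturally varies.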
Apologies if this caused any inconvenience.
-Andrew Otto
Systems Engineer, Wikimedia Foundation
Hi Analytics Fellows,
*TL;DR:*
Yesterday we broke and then fixed the Hive wmf.webrequest table.
Jobs not monitored by the Analytics team might have failed - check your
logs :)
*Long story:*
Yesterday at 9am UTC we deployed a change to the hive wmf.webrequest table
that broke some of its functionality. More precisely, queries to the table
that needed to read parquet columns in detail would fail with a hive
internal error.
The problem went unnoticed for a few hours, since most of our complex
computation jobs run only at night; we only became aware of it around
18:00 UTC (kudos @bearloga!).
We quickly fixed the issue and restarted the needed jobs over the
problematic period.
Given the way the affected jobs failed, we are sure that there has been
no data corruption: jobs would fail before even starting to
compute anything. For the production jobs we monitor, we know which ones
failed and we've taken care of them; however, for jobs that are not
monitored (report-updater, manual scripts, etc.), some silent failures might
have occurred. Please check your logs :)
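To make the log check concrete, here is a minimal sketch; the "HH:MM message" log-line format and the error text are invented for illustration, so adapt the parsing to whatever your jobs actually write:

```python
def suspect_failures(log_lines, start="09:00", end="18:00"):
    """Return log lines containing errors inside the incident window.
    Lexicographic comparison works for zero-padded HH:MM prefixes."""
    return [line for line in log_lines
            if start <= line[:5] <= end and "ERROR" in line]

# Invented example log lines for one affected day.
logs = [
    "08:55 OK webrequest load finished",
    "10:12 ERROR Hive internal error reading parquet column",
    "19:30 OK nightly job done",
]
print(suspect_failures(logs))  # only the 10:12 ERROR line
```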
Cheers
--
*Joseph Allemandou*
Data Engineer @ Wikimedia Foundation
IRC: joal
Hi Everyone,
The next Research Showcase will be live-streamed this Wednesday, August 23,
2017 at 11:30 AM PDT (18:30 UTC).
YouTube stream: https://www.youtube.com/watch?v=Fa0Ztv2iF4w
As usual, you can join the conversation on IRC at #wikimedia-research. And,
you can watch our past research showcases here
<https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#August_2017>.
This month's presentation:
Sneha Narayan (Northwestern University)
*The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for
New Users*
Integrating new users into a community with complex norms presents a
challenge for peer production projects like Wikipedia. We present The
Wikipedia Adventure (TWA): an interactive tutorial that offers a structured
and gamified introduction to Wikipedia. In addition to describing the
design of the system, we present two empirical evaluations. First, we
report on a survey of users, who responded very positively to the tutorial.
Second, we report results from a large-scale invitation-based field
experiment that tests whether using TWA increased newcomers' subsequent
contributions to Wikipedia. We find no effect of either using the tutorial
or of being invited to do so over a period of 180 days. We conclude that
TWA produces a positive socialization experience for those who choose to
use it, but that it does not alter patterns of newcomer activity. We
reflect on the implications of these mixed results for the evaluation of
similar social computing systems.
Andrew Su (Scripps Research Institute)
*The Gene Wiki: Using Wikipedia and Wikidata to organize biomedical
knowledge*
The Gene Wiki project began in 2007 with the goal of creating a
collaboratively-written, community-reviewed, and continuously-updated
review article for every human gene within Wikipedia. In 2013, shortly
after the creation of the Wikidata project, the project expanded to include
the organization and integration of structured biomedical data. This talk
will focus on our current and future work, including efforts to encourage
contributions from biomedical domain experts, to build custom applications
that use Wikidata as the back-end knowledge base, and to promote
CC0-licensing among biomedical knowledge resources. Comments, feedback and
contributions are welcome at https://github.com/SuLab/genewikicentral and
https://www.wikidata.org/wiki/WD:MB.
Kindly,
Sarah R. Rodlund
Senior Project Coordinator-Product & Technology, Wikimedia Foundation
srodlund(a)wikimedia.org
stats.grok.se (a source of pageview stats for the time before the Wikimedia
API became available) has been down for about a week. I tried emailing
Henrik Abelsson, whom I've previously contacted when the site had issues,
but haven't received a response this time.
Any ideas on why it's down, and whom to contact to help resolve the
issue?
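(As an aside for anyone who only needs post-2015 numbers: the Wikimedia Pageview REST API mentioned above can be queried directly. A minimal sketch that just builds the request URL; the article title and date range below are placeholders:)

```python
from urllib.parse import quote

# Base endpoint of the Wikimedia Pageview REST API, which covers data
# from roughly mid-2015 onward.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def pageview_url(project, article, start, end,
                 access="all-access", agent="all-agents"):
    """Build a daily per-article pageview request URL; the caller still
    performs the HTTP GET itself."""
    return "/".join([BASE, project, access, agent,
                     quote(article, safe=""), "daily", start, end])

print(pageview_url("en.wikipedia", "Main Page", "20150701", "20150731"))
```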
Vipul
Hello everyone,
I wanted to ask about resource allocation on stat1005. Our project
<https://meta.wikimedia.org/wiki/Research:Understanding_Wikidata_Queries>
needs quite a bit, since we process every entry in wdqs_extract, and I was
wondering how many cores and how much memory we can use without
conflicting with anyone else.
Greetings,
Adrian
Hello everyone,
I'm currently working on gathering data for the Autoconfirmed article creation
trial project[1]. One of the measures we're interested in is the number of
new articles, both surviving and deleted, that are created per day. I know
that recent data is logged through EventBus, but if possible I would also
like to have historic stats on this (e.g. going back a handful of years).
Would there happen to be a dataset of that available somewhere?
References:
1:
https://meta.wikimedia.org/wiki/Research:Autoconfirmed_article_creation_tri…
Cheers,
Morten