We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia:
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
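For illustration, the first two use cases can be sketched in a few lines of Python. The column names (prev, curr, n) and the sample rows below are assumptions for the sketch; the actual release's schema may differ.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in a (referer, article, count) shape; real column
# names and values may differ from the published dataset.
SAMPLE_TSV = (
    "prev\tcurr\tn\n"
    "Dilbert\tScott_Adams\t120\n"
    "Dilbert\tComic_strip\t80\n"
    "other-google\tDilbert\t500\n"
    "Scott_Adams\tDilbert\t60\n"
)

def load_pairs(tsv_text):
    """Parse (referer, article, count) rows from a TSV string."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [(r["prev"], r["curr"], int(r["n"])) for r in reader]

def top_outgoing(pairs, article, k=5):
    """Most frequently clicked links *from* the given article."""
    counts = defaultdict(int)
    for prev, curr, n in pairs:
        if prev == article:
            counts[curr] += n
    return sorted(counts.items(), key=lambda kv: -kv[1])[:k]

def top_incoming(pairs, article, k=5):
    """Most common referers *to* the given article."""
    counts = defaultdict(int)
    for prev, curr, n in pairs:
        if curr == article:
            counts[prev] += n
    return sorted(counts.items(), key=lambda kv: -kv[1])[:k]
```

The same aggregation, normalized per source article, would give the transition probabilities of the Markov chain mentioned above.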
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Hello,
I work for a consulting firm called Strategy&. We have been engaged by Facebook, on behalf of Internet.org, to conduct a study assessing the state of connectivity globally. One key area of focus is the availability of relevant online content. We are using the availability of encyclopedic knowledge in one's primary language as a proxy for relevant content, which we define as 100K+ Wikipedia articles in one's primary language. We have a few questions related to this analysis prior to publishing it:
* We are currently using the article count by language from the Wikimedia Foundation's public list: http://meta.wikimedia.org/wiki/List_of_Wikipedias. Is this a reliable source for article counts, and does it include stubs?
* Is it possible to get historical data for article counts? It would be great to monitor the evolution of the metric we have defined over time.
* What are the biggest drivers you've seen for step changes in the number of articles (e.g., number of active admins, machine translation, etc.)?
* We had to map Wikipedia language codes to ISO 639-3 language codes in Ethnologue (the source we are using for primary-language data). The two-letter code for a Wikipedia language in the "List of Wikipedias" sometimes, but not always, matches the ISO 639-1 code. Is there an easy way to do the mapping?
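To make the mapping concrete, here is a sketch of the kind of lookup we have in mind. The code-table slice and the override list below are deliberately tiny and illustrative, not our full mapping:

```python
# Tiny illustrative slice of an ISO 639-1 -> ISO 639-3 table; a full
# mapping would come from the ISO 639-3 code tables.
ISO1_TO_ISO3 = {"en": "eng", "de": "deu", "fr": "fra", "ar": "ara"}

# Hand-curated overrides for Wikipedia subdomains that are not plain
# ISO 639-1 codes (examples only; the real exception list is longer).
WIKI_OVERRIDES = {"simple": "eng", "als": "gsw"}

def wiki_to_iso3(wiki_code):
    """Map a Wikipedia language subdomain to an ISO 639-3 code, or None."""
    if wiki_code in WIKI_OVERRIDES:
        return WIKI_OVERRIDES[wiki_code]
    return ISO1_TO_ISO3.get(wiki_code)
```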
Many Thanks,
Rawia
Formerly Booz & Company
Rawia Abdel Samad
Direct: +9611985655 | Mobile: +97455153807
Email: Rawia.AbdelSamad@strategyand.pwc.com
www.strategyand.com
This discussion is about needed updates to the definition and Analytics
implementation of the mobile apps page view metrics. There is also an
associated Phab task[4]. Please add the proper Analytics project there.
Background / Changes
As you probably remember, the Android app splits a page view into two
requests: one for the lead section and metadata, plus another one for the
remainder.
The mobile apps are going to change the way they load pages in two
different ways:
1. We'll add a link preview when someone clicks on a link from a page.
2. We're planning on switching over to using RESTBase for loading pages
and also for the link preview (initially just the Android beta, later more)
This will have implications for the pageviews definition and how we count
user engagement.
The big question is
Should we count link previews as a page view since it's an indication of
user engagement? Or should there be a separate metric for link previews?
Counting page views
IIRC we currently count api.php requests with action=mobileview&sections=0
query parameters as a page view. When we publish link previews for all
Android app users then we would either want to also count the calls to
action=query&prop=extracts as a page view or add them to another metric.
Once the apps use RESTBase the HTTPS requests will be very different:
- Page view: Instead of action=mobileview&sections=0 the app would call
the RESTBase endpoint for the lead request[1] instead of the PHP API
mentioned above. Then it would call [2].
- Link preview: Instead of action=query&prop=extracts it would call the
lead request[1], too, since there is a lot of overlap. At least that's our
current plan. The advantage of that is that the client doesn't need to
execute the lead request a second time if the user clicks on the link
preview (either through caching or app logic).
So, in the RESTBase case we either want to count the
mobile-html-sections-lead requests or the
mobile-html-sections-remaining requests
depending on what our definition for page views actually is. We could also
add a query parameter or extra HTTP header to one of the
mobile-html-sections-lead requests if we need to distinguish between
previews and page views.
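To make the counting rules concrete, here is a sketch of how such a classifier might look. The "X-Preview: true" header is a hypothetical stand-in for the distinguishing query parameter or header mentioned above; the real pageview definition may differ.

```python
import re
from urllib.parse import urlparse, parse_qs

# Endpoint patterns taken from this proposal; the exact rules in the
# production pageview definition may differ.
LEAD_RE = re.compile(r"/api/rest_v1/page/mobile-html-sections-lead/")
REMAINING_RE = re.compile(r"/api/rest_v1/page/mobile-html-sections-remaining/")

def classify(url, headers=None):
    """Return 'pageview', 'preview', or None for a single request.

    Assumes a hypothetical 'X-Preview: true' header marks link previews
    on the shared lead endpoint.
    """
    headers = headers or {}
    parsed = urlparse(url)
    qs = parse_qs(parsed.query)
    if parsed.path.endswith("api.php"):
        # Legacy PHP API page view: action=mobileview&sections=0
        if qs.get("action") == ["mobileview"] and qs.get("sections") == ["0"]:
            return "pageview"
        # Legacy link preview: action=query&prop=extracts
        if qs.get("action") == ["query"] and "extracts" in qs.get("prop", [""])[0]:
            return "preview"
        return None
    if LEAD_RE.search(parsed.path):
        return "preview" if headers.get("X-Preview") == "true" else "pageview"
    # In this sketch, remaining-sections calls are not counted separately,
    # since the lead request already was.
    return None
```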
Both the current PHP API and the RESTBase-based metrics would need to be
compatible and collected in parallel, since we cannot control when users
update their apps.
[1]
https://en.wikipedia.org/api/rest_v1/page/mobile-html-sections-lead/Dilbert
[2]
https://en.wikipedia.org/api/rest_v1/page/mobile-html-sections-remaining/Di…
[3]
https://www.mediawiki.org/wiki/Wikimedia_Apps/Team/RESTBase_services_for_ap…
[4] https://phabricator.wikimedia.org/T109383
Cheers,
Bernd
A number of us are discussing the year-to-date editor population stats.
When can we anticipate seeing the August stats? It would be helpful to have
them published at least a week before the publication of the monthly
Recent Research report for September.
Thanks,
Pine
Hi,
I've been asked in a private email why WMF forked ua-parser [1]
(a library used to extract information from User-Agent headers).
There is no need to discuss this in private, hence I am replying to
the mailing list.
TL;DR: It was not a real fork. We just worked around issues with
upstream's release management.
-----------------------------------------------------
What follows is a bit detailed, but given the context I decided it was
better to err on the side of being over-verbose.
Back in October 2014, WMF pushed towards analyzing User-Agent headers
in the logs, for example to allow more accurate estimates of how many
requests WMF sees from Android vs. iPhone devices, which browsers get
used in which versions, etc.
Extracting information from User-Agent strings is a bit tricky as there
are quite a few corner cases, so it was decided to use a third-party
library for it. ua-parser [1] was chosen for this purpose.
ua-parser comes with a Java build, so it naturally matched the log
processing's Java eco-system. However, (at least) back then ua-parser
did not offer usable prebuilt jars, and the versioning and release
cycle of its Java part was broken.
The latest release was about a year old, and no proper release was in
sight. So all upstream gave us was a jar versioned as
ua-parser-1.3.0-SNAPSHOT.jar
Deploying such a jar to the cluster is a bad idea, as its name gives no
clue which commit it is based on. In this concrete setting, there were
about 250 commits in ua-parser that would have produced the same
version number. That would make debugging hard and nix
reproducibility.
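To illustrate the problem, a deploy-time check could simply reject ambiguous artifact names. The sketch below is purely illustrative, not an actual WMF tool:

```python
import re

# Reject -SNAPSHOT artifact names, which can correspond to many different
# commits, and accept tagged release names like ua-parser-1.3.0-wmf1.jar.
RELEASE_RE = re.compile(r"^[\w.-]+-\d+(\.\d+)*(-wmf\d+)?\.jar$")

def is_deployable(jar_name):
    """True only for jars whose name pins a unique release."""
    if "SNAPSHOT" in jar_name:
        return False
    return bool(RELEASE_RE.match(jar_name))
```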
Since WMF cannot do a proper release for ua-parser, the typical
workaround for WMF in such cases is to produce a “wmf” branch in
Gerrit and do “wmf” releases at known commits. And that's what the
ua-parser “fork” in Gerrit does.
Comparing upstream with the “fork” in Gerrit, the only difference is:
https://gerrit.wikimedia.org/r/#/c/169204/
That commit allows for a wmf release, is tagged 1.3.0-wmf1 and results
in an artifact name of
ua-parser-1.3.0-wmf1.jar
which (due to the 1.3.0-wmf1 tag) is good for releasing [2].
As one of the questions in the private email was whether WMF could
switch back to upstream ... I hope you see that WMF never switched
away from upstream and WMF never “forked” upstream. WMF only rolled
their own release.
If upstream now provides proper releases, sure, just use them :-)
Have fun,
Christian
P.S.:
* How can I find out who actually created a repository?
Look at the first commit to the meta/config branch. Like here:
https://git.wikimedia.org/commit/analytics%2Fua-parser/2fd5dc00ac9e087b307f…
* How can I see the difference between branches?
Use `git cherry` (Yes, really. Just “cherry”, no trailing “-pick”)
An example session is at [3].
* How could one have found out about the wmf1 thing?
For example from the IRC logs of the day from the commit [4]:
[20:23:08] <ottomata> we can just make wmf1 be our release of the current master?
[20:23:13] <qchris> k
----------------------------------------------------------------------
[1] Back then at
https://github.com/tobie/ua-parser
now the relevant repos for WMF seem to be at
https://github.com/ua-parser/uap-core
https://github.com/ua-parser/uap-java
[2] It made it into archiva:
https://archiva.wikimedia.org/#artifact/ua_parser/ua-parser/1.3.0-wmf1
into the refinery-hive jars:
https://gerrit.wikimedia.org/r/#/c/166142/11..14/refinery-hive/pom.xml
and also to the cluster:
https://gerrit.wikimedia.org/r/#/c/170373/1/refinery-hive/pom.xml
https://gerrit.wikimedia.org/r/#/c/170375/
[3]
_________________________________________________________________
christian@spencer // jobs: 0 // time: 21:40:28 // exit code: 0
cwd: ~/tmp
git clone https://github.com/tobie/ua-parser
Cloning into 'ua-parser'...
remote: Counting objects: 4507, done.
remote: Total 4507 (delta 0), reused 0 (delta 0), pack-reused 4507
Receiving objects: 100% (4507/4507), 4.31 MiB | 923 KiB/s, done.
Resolving deltas: 100% (2301/2301), done.
_________________________________________________________________
christian@spencer // jobs: 0 // time: 21:41:10 // exit code: 0
cwd: ~/tmp
cd ua-parser
_________________________________________________________________
christian@spencer // jobs: 0 // time: 21:41:14 // exit code: 0
cwd: ~/tmp/ua-parser
git remote add gerrit https://gerrit.wikimedia.org/r/analytics/ua-parser
_________________________________________________________________
christian@spencer // jobs: 0 // time: 21:41:33 // exit code: 0
cwd: ~/tmp/ua-parser
git fetch gerrit
remote: Finding sources: 100% (4/4)
remote: Total 4 (delta 3), reused 4 (delta 3)
Unpacking objects: 100% (4/4), done.
From https://gerrit.wikimedia.org/r/analytics/ua-parser
* [new branch] master -> gerrit/master
* [new branch] wmf -> gerrit/wmf
* [new tag] v1.3.0-wmf1 -> v1.3.0-wmf1
_________________________________________________________________
christian@spencer // jobs: 0 // time: 21:41:38 // exit code: 0
cwd: ~/tmp/ua-parser
git cherry origin/master gerrit/master
_________________________________________________________________
christian@spencer // jobs: 0 // time: 21:42:10 // exit code: 0
cwd: ~/tmp/ua-parser
git cherry origin/master gerrit/wmf
+ 2a44875355b558d9f880a63c86630af229044a63
_________________________________________________________________
christian@spencer // jobs: 0 // time: 21:42:17 // exit code: 0
cwd: ~/tmp/ua-parser
git cherry origin/master v1.3.0-wmf1
+ 2a44875355b558d9f880a63c86630af229044a63
[4] http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-analytics/20141027.txt
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3 Email: christian(a)quelltextlich.at
4293 Gutau, Austria Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
---------------------------------------------------------------
Hello all!
Almost all of the graphs on the Vital Signs dashboard
<https://vital-signs.wmflabs.org/> display no data (the only exception is
legacy pageviews). Could someone explain to me why that is, and whether
there's a plan to fix it?
I ask because Vital Signs includes several metrics from the editor model
<https://meta.wikimedia.org/wiki/Research:Editor_model>, which the Editing
department really wants to track on an ongoing basis. I need to find out
whether we need to pursue other ways of doing so.
Thanks!
--
Neil P. Quinn <https://meta.wikimedia.org/wiki/User:Neil_P._Quinn-WMF>,
product analyst
Wikimedia Foundation
Is the pageviews_hourly table meant to contain pageviews according to
the new or old definition? If old, where can I find aggregates for the
new one?
--
Oliver Keyes
Count Logula
Wikimedia Foundation
Heyo, Discovery team!
(Analytics CCd)
This is just a quick writeup of the Scalable Event Systems meeting
that Erik, Dan, Stas and I went to (although just from my
perspective).
For people not in the initial thread, this is a proposal to replace
the internal architecture of EventLogging and similar services with
Apache Kafka brokers
(http://www.confluent.io/blog/stream-data-platform-1/). What that
means in practice is that the current 1-2k events/second limit on
EventLogging will disappear and we can stop worrying about sampling
and accidentally bringing down the system. We can be a lot less
cautious about our schemas and a lot less cautious about our sampling
rate!
It also offers up a lot of opportunities around streaming data and
making it available in a layered fashion. While I don't think we want
to explore that right now, it's nice to have as an option once we
better understand our search data and how we can safely distribute it.
I'd like to thank the Analytics team, particularly Andrew, for putting
this together; it was a super-helpful discussion to be in and this
sort of product is precisely what I, at least, have been hoping for
out of the AnEng brain trust. Full speed ahead!
--
Oliver Keyes
Count Logula
Wikimedia Foundation