Hi all,
If you use Hive on stat1002/1004, you might have seen a deprecation
warning when you launch the hive client, claiming it is being replaced
by Beeline. The Beeline shell has always been available to use, but it
required supplying a database connection string every time, which was
pretty annoying. We now have a wrapper script
<https://github.com/wikimedia/operations-puppet/blob/production/modules/role…>
set up to make this easier. The old Hive CLI will continue to exist, but we
encourage moving over to Beeline. You can use it by logging into the
stat1002/1004 boxes as usual and launching `beeline`.
There is some documentation on this here:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline.
If you run into any issues using this interface, please ping us on the
Analytics list or #wikimedia-analytics or file a bug on Phabricator
<http://phabricator.wikimedia.org/tag/analytics>.
(If you are wondering "stat1004 whaaat?" - there should be an announcement
about it coming up soon!)
Best,
--Madhu :)
We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia:
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
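As a quick illustration of the first two uses, here is a minimal sketch. It assumes a simplified TSV of (referer, article, count) rows; the actual column layout of the figshare release differs slightly, so check the dataset documentation before running this against the real file:

```python
import csv
from collections import defaultdict

def top_clicked_links(tsv_path, article, k=10):
    """Return the k most frequent links clicked *from* a given article,
    assuming rows of the form: referer <TAB> target <TAB> count."""
    counts = defaultdict(int)
    with open(tsv_path, encoding="utf-8") as f:
        for referer, target, n in csv.reader(f, delimiter="\t"):
            if referer == article:
                counts[target] += int(n)
    # Sort by descending count and keep the top k pairs.
    return sorted(counts.items(), key=lambda kv: -kv[1])[:k]
```

Swapping the roles of referer and target in the comparison gives the second use case (most common links followed *to* an article).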
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Hi everyone,
I'm a PhD student studying mathematical models to improve the hit ratio
of web caches. In my research community, we lack realistic data
sets and frequently rely on outdated modelling assumptions.
Previously (~2007), a trace containing 10% of user requests issued to
Wikipedia was publicly released [1]. This data set has been used
widely for performance evaluations of new caching algorithms, e.g., for
the new Caffeine caching framework for Java [2].
I would like to ask for your comments about compiling a similar
(updated) data set and making it public.
In my understanding, the necessary logs are readily available, e.g., in
the Analytics/Data/Mobile requests stream [3] on stat1002, with a
sampling rate of 1:100. As this request stream contains sensitive data
(e.g., client IPs), it would need anonymization before making it public.
I would be glad to help with that.
The previously released data set [1] contains no client information. It
contains 1) a counter, 2) a timestamp, 3) the URL, and 4) an update
flag. I would additionally suggest including 5) the cache's hostname,
6) the cache_status, and 7) the response size (from the Wikimedia cache
log format).
I believe this format would preserve anonymity, and would be interesting
for many researchers.
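To make the proposal concrete, here is a sketch of how one such anonymized record could be derived from a parsed log entry. The field names are hypothetical illustrations, not the actual Wikimedia cache log schema:

```python
def anonymize(entry, counter):
    """Reduce a parsed cache-log entry (a dict; field names assumed)
    to the proposed 7-field record, dropping client-identifying data
    such as the IP address and user agent."""
    return (
        counter,                    # 1) monotonically increasing counter
        entry["timestamp"],         # 2) request timestamp
        entry["url"],               # 3) requested URL
        entry["method"] != "GET",   # 4) update flag (non-GET = update)
        entry["cache_host"],        # 5) cache server hostname
        entry["cache_status"],      # 6) e.g. hit / miss / pass
        entry["response_size"],     # 7) bytes sent to the client
    )
```

The point of the sketch is that the record is built by whitelisting fields rather than blacklisting sensitive ones, so nothing client-identifying can slip through by omission.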
Let me know your thoughts.
Thanks,
Daniel Berger
http://disco.cs.uni-kl.de/index.php/people/daniel-s-berger
[1] http://www.wikibench.eu/?page_id=60
[2] https://github.com/ben-manes/caffeine/wiki/Efficiency
[3]
https://wikitech.wikimedia.org/wiki/Analytics/Data/Mobile_requests_stream
Hi all,
I hacked up a very quick count of the 2015 video viewing aggregate
figures, using the data that Bartosz put together last year - with the
caveat that the data only goes up to 10 December, but it's probably
indicative of whole-year trends. I haven't yet tried to merge in the
11-31/12 data. Nothing very insightful but I don't recall seeing it
done before, so it might be of interest!
http://www.generalist.org.uk/blog/2016/most-popular-videos-on-wikipedia/
The headline figure is that we had about three billion (!!)
video/audio plays during the year, and that some of the most popular
items are insanely popular - the most popular was viewed an average of
42,000 times a day, every day.
Pine: the video you asked about in the other thread was viewed 187,899
times from 31/10/15 to 10/12/15. So there's half your answer :-)
--
- Andrew Gray
andrew.gray(a)dunelm.org.uk
Would it be crazy to ask for statistics of user agents per page or per
namespace?
I'd hypothesize, for example, that IE is used much less outside of the
article and portal namespaces.
In case you're wondering what it is useful for: when I have a patch that
requires browser-compatibility trickery, I may want to invest less time in
IE compatibility on a page that is unlikely to be viewed in IE much.
--
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
http://aharoni.wordpress.com
“We're living in pieces,
I want to live in peace.” – T. Moore
Hi all,
as a new subscriber to this mailing list I would like to introduce myself.
My name is Sander; I'm a student currently working on a project
using Wikimedia APIs. A Dutch cultural institution has asked a team of
students to analyze how its uploaded material is being used. Some
examples of what they're interested in: knowing where their material is
reused, how many visitors view the pages, whether visitors open the media
on the page, etc. Which APIs would you suggest we look at that could have
valuable information? Also, is there any general documentation about the
various APIs? Any advice would be greatly appreciated.
Regards,
Sander
Hi, is there anybody in this list planning to attend OSCON (Austin, May
16-19)?
http://conferences.oreilly.com/oscon/open-source-us
Alongside that event there will be a workshop about Software Development
Analytics and the new Grimoire toolkit platform, on May 16th (10am-1pm CDT):
https://www.eventbrite.com/e/software-development-analytics-workshop-ticket…
The registration is not free, but the organizers (Bitergia, the developers
of http://korma.wmflabs.org/ ) are offering us a couple of invitations.
Andre and I are not attending OSCON (we have participated in the FOSDEM
edition of this workshop). If you or someone you know is interested,
contact me.
--
Quim Gil
Engineering Community Manager @ Wikimedia Foundation
http://www.mediawiki.org/wiki/User:Qgil
Gergo Tisza makes a valid point.
The nl-wiki has 300.000-350.000 hits on the main page per day. The rest of
the top 10 drops quickly to about 5000 hits per day, a reasonable amount.
But the 1.500.000 unique visitors per day then seems overstated. When I do
a rough estimate, it looks like 1.500.000 is the total number of page
views, so the number of unique devices must be a lot smaller.
See
https://wikimedia.org/api/rest_v1/metrics/unique-devices/nl.wikipedia.org/a…
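For anyone who wants to repeat this rough estimate, here is a small sketch of building such a query URL and summing the returned counts. The path layout and the "items"/"devices" field names are my reading of the API docs, so verify them against an actual response:

```python
def unique_devices_url(project, access_site, granularity, start, end):
    """Build a Wikimedia REST API unique-devices query URL
    (path segments assumed from the published docs)."""
    return ("https://wikimedia.org/api/rest_v1/metrics/unique-devices/"
            f"{project}/{access_site}/{granularity}/{start}/{end}")

def total_devices(response_json):
    """Sum the per-interval device counts in a decoded JSON response,
    assuming items of the form {"devices": <int>, ...}."""
    return sum(item["devices"] for item in response_json["items"])
```

Fetching the URL with any HTTP client and running the monthly total through `total_devices` makes it easy to compare against the daily page view numbers above.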
Edo de Roo
nl-wiki, wikidata
On Tue, Apr 19, 2016 at 11:50 PM, <analytics-request(a)lists.wikimedia.org>
wrote:
> Today's Topics:
>
> 1. Unique Devices data available on API (Nuria Ruiz)
> 2. Hive & Oozie downtime tomorrow (Andrew Otto)
> 3. Re: Unique Devices data available on API (Gergo Tisza)
> 4. Re: Unique Devices data available on API (Kevin Leduc)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 19 Apr 2016 12:17:12 -0700
> From: Nuria Ruiz <nuria(a)wikimedia.org>
> To: "A mailing list for the Analytics Team at WMF and everybody who
> has an interest in Wikipedia and analytics."
> <analytics(a)lists.wikimedia.org>, Wikimedia developers
> <wikitech-l(a)lists.wikimedia.org>,
> wiki-research-l(a)lists.wikimedia.org
> Subject: [Analytics] Unique Devices data available on API
> Message-ID:
> <CAMpYYkGngUfQOu-f6sL1TVTDbDUX_=
> MG9FXEqUV5xmZVZL3nqQ(a)mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hello!
>
> The analytics team is happy to announce that the Unique Devices data is now
> available to be queried programmatically via an API.
>
> This means that getting the daily number of unique devices [1] for English
> Wikipedia for the month of February 2016, for all sites (desktop and
> mobile) is as easy as launching this query:
>
>
> https://wikimedia.org/api/rest_v1/metrics/unique-devices/en.wikipedia.org/a…
>
> You can get started by taking a look at our docs:
> https://wikitech.wikimedia.org/wiki/Analytics/Unique_Devices#Quick_Start
>
> If you are not familiar with the Unique Devices data, the main thing you
> need to know is that it is a good proxy metric for measuring Unique Users;
> more info below.
>
> Since 2009, the Wikimedia Foundation has used comScore to report data about
> unique web visitors. In January 2016, however, we decided to stop
> reporting comScore numbers [2] because of certain limitations in the
> methodology; these limitations translated into misreported mobile usage. We
> are now ready to replace comScore numbers with the Unique Devices dataset.
> While unique devices do not equal unique visitors, they are a good proxy for
> that metric, meaning that a major increase in the number of unique devices
> is likely to come from an increase in distinct users. We understand that
> counting uniques raises fairly big privacy concerns, so we use a very
> privacy-conscious way to count unique devices: it does not involve any
> cookie by which your browsing history can be tracked [3].
>
>
> [1] https://meta.wikimedia.org/wiki/Research:Unique_Devices
> [2] https://meta.wikimedia.org/wiki/ComScore/Announcement
> [3]
>
> https://meta.wikimedia.org/wiki/Research:Unique_Devices#How_do_we_count_uni…
> devices.3F
>