We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
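For illustration, here is a minimal Python sketch of the last two uses. It assumes a tab-separated dump whose first three columns are (referer, article, count); check the figshare page for the actual schema:

    from collections import Counter, defaultdict

    def top_referers(path, article, k=10):
        """Most common referers leading to a given article."""
        counts = Counter()
        with open(path, encoding="utf-8") as f:
            for line in f:
                referer, curr, n = line.rstrip("\n").split("\t")[:3]
                if curr == article:
                    counts[referer] += int(n)
        return counts.most_common(k)

    def transition_matrix(path):
        """Row-normalize the counts into a Markov chain over articles."""
        counts = defaultdict(Counter)
        with open(path, encoding="utf-8") as f:
            for line in f:
                referer, curr, n = line.rstrip("\n").split("\t")[:3]
                counts[referer][curr] += int(n)
        probs = {}
        for referer, row in counts.items():
            total = sum(row.values())
            probs[referer] = {article: c / total for article, c in row.items()}
        return probs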
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Hello,
I work for a consulting firm called Strategy&. We have been engaged by Facebook on behalf of Internet.org to conduct a study assessing the state of connectivity globally. One key area of focus is the availability of relevant online content. We are using the availability of encyclopedic knowledge in one's primary language as a proxy for relevant content. We define this as 100K+ Wikipedia articles in one's primary language. We have a few questions related to this analysis prior to publishing it:
* We are currently using the article count by language from the Wikimedia Foundation's public page: http://meta.wikimedia.org/wiki/List_of_Wikipedias. Is this a reliable source for article counts - does it include stubs?
* Is it possible to get historical data for article counts? It would be great to monitor the evolution of the metric we have defined over time.
* What are the biggest drivers you've seen for step changes in the number of articles (e.g., number of active admins, machine translation, etc.)?
* We had to map Wikipedia language codes to ISO 639-3 language codes in Ethnologue (the source we are using for primary language data). The two-letter code for a Wikipedia in the "List of Wikipedias" sometimes matches the ISO 639-1 code, but not always. Is there an easy way to do the mapping?
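For illustration, this is the kind of two-step lookup we have been doing by hand (a minimal Python sketch; the tables are abridged placeholders, and a complete ISO 639-1 to 639-3 table would come from the ISO registry):

    ISO_639_1_TO_3 = {  # abridged placeholder; fill from the ISO 639 registry
        "en": "eng",
        "de": "deu",
        "ar": "ara",
    }
    WIKI_EXCEPTIONS = {  # Wikipedia codes that are not plain ISO 639-1
        "simple": "eng",  # Simple English Wikipedia
    }

    def wiki_to_iso639_3(code):
        """Map a Wikipedia language code to ISO 639-3, where known."""
        if code in WIKI_EXCEPTIONS:
            return WIKI_EXCEPTIONS[code]
        if len(code) == 3:  # often already an ISO 639-3 code, but verify
            return code
        return ISO_639_1_TO_3.get(code)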
Many Thanks,
Rawia
Formerly Booz & Company
Rawia Abdel Samad
Direct: +9611985655 | Mobile: +97455153807
Email: Rawia.AbdelSamad(a)strategyand.pwc.com
www.strategyand.com
Hello,
I'm inquiring about the delay in publishing the January compressed Wikistats files that are maintained by Erik Zachte. I'm guessing those processes are given a low priority compared to the content backups that need to run. More generally, I'm interested in finding new ways that I can help out. I'm an ex-Microsoftie who is now on the fraud analytics team at TD Bank. I've been involved with the Wikimedia group in Atlanta: I organize the picnic each summer, and helped get the rest of the historic buildings photographed. I've dabbled in reverting vandalism, and I contribute to articles when I actually have something to contribute. I don't feel like I've settled into a contributor role that really fits me yet, though.
I enjoy using a variety of the traffic data sets that Wikimedia publishes. It seems the traffic servers get bogged down sometimes though. Can I help? Should I try to get the Atlanta group to pool our donations this year for an extra computer?
Thanks,
Michael
Hello,
My username is rbaasland and I would like to contribute to the analytics
project. Could you give me access to the project, or tell me how to go
about contributing?
Thank you very much,
Ron Baasland
I’m sharing a proposal that Reid Priedhorsky and his collaborators at Los Alamos National Laboratory recently submitted to the Wikimedia Analytics Team aimed at producing privacy-preserving geo-aggregates of Wikipedia pageview data dumps and making them available to the public and the research community. [1]
Reid and his team spearheaded the use of the public Wikipedia pageview dumps to monitor and forecast the spread of influenza and other diseases, using language as a proxy for location [2]. This proposal describes an aggregation strategy adding a geographical dimension to the existing dumps.
Feedback on the proposal is welcome on the lists or on the project talk page on Meta [3].
Dario
[1] https://meta.wikimedia.org/wiki/Research:Geo-aggregation_of_Wikipedia_pagev…
[2] http://dx.doi.org/10.1371/journal.pcbi.1003892
[3] https://meta.wikimedia.org/wiki/Research_talk:Geo-aggregation_of_Wikipedia_…
Thank you! Would you mind posting a note on Analytics(a)lists.wikimedia.org
when it is working normally again?
On Wed, Feb 11, 2015 at 1:36 PM, Henrik Abelsson <henrik(a)abelsson.com>
wrote:
> Hi Kevin,
>
> Looking into it!
>
> -henrik
>
>
> On 11/02/15 16:36, Kevin Leduc wrote:
>
> Hi Henrik,
>
> stats.grok.se has been missing data for the last week. Can you restart the
> service to see if that helps?
>
> Thanks!
> Kevin Leduc
> Analytics Product Manager
>
>
>
Hi all -
I'm checking with people in ops, but we're planning to add a well-defined
parameter to the end of URLs to measure clickthroughs on shared
links. For example:
https://en.wikipedia.org/wiki/Epirus?analytics=ios_share_a_fact_v1
(If there are existing params on the URL - not an issue so far that I know
of for the apps, as they canonicalize the title and URL - then the param
would be last in the ampersand-separated query string.)
And then we'd use Varnish to remove the parameter to reduce the risk of
cache fragmentation.
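(To illustrate what the Varnish rule would do - the production rule itself would live in VCL, not Python - here is a rough sketch of the normalization:)

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    def strip_analytics(url, param="analytics"):
        """Drop the analytics parameter so cached URLs stay canonical."""
        parts = urlsplit(url)
        kept = [(k, v) for k, v in parse_qsl(parts.query) if k != param]
        return urlunsplit(parts._replace(query=urlencode(kept)))

    # strip_analytics("https://en.wikipedia.org/wiki/Epirus?analytics=ios_share_a_fact_v1")
    # -> "https://en.wikipedia.org/wiki/Epirus"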
We "know" this is probably only a short-term solution, and as a follow-up
from the meeting with the people on the CC line, I'm emailing to open the
discussion on options for a more generic approach.
So far I think there are a few options from what we've discussed, if we're
to support additional bucketing.
(1) More parameters (e.g., ?analytics=ios_share_a_fact&version=1)
Downside: potentially harder to standardize and remove things from the URL
(2) More conventional provenance (e.g.,
https://en.wikipedia.org/w/index.php?title=Castle&oldid=645632619/ref=_wref…<...more
provenance info as desired>/).
Downside: technically speaking, may break the schema of well-formed titles
(3) Rely upon (1) or (2), or perhaps an even more RESTful shortlinker (it
could have features like target - web, or a w:// or wiki:// protocol, or
whatever - versioning, etc.).
Downside: maybe a little more work to stand up the service. As we recalled,
there's an extension out there that may, perhaps with some tweaks, fit the
bill.
-Adam
Gabriel:
I have run through the data and have a rough estimate of how many of our
pageviews are requested from browsers w/o strong javascript support. It is
a preliminary rough estimate, but I think it is pretty useful.
TL;DR
According to our new pageview definition (
https://meta.wikimedia.org/wiki/Research:Page_view) about 10% of pageviews
come from clients w/o much javascript support. But - BIG CAVEAT - this
includes bot requests. If you remove the easy-to-spot big bots, the
percentage is <3%.
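(To give a flavor of what "easy-to-spot big bots" means here - the real heuristics are part of the pageview definition work, and the substrings below are illustrative assumptions:)

    def is_big_bot(user_agent):
        """Crude check for crawlers that identify themselves in the UA."""
        ua = user_agent.lower()
        return any(s in ua for s in ("bot", "crawler", "spider"))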
Details here (still some homework to do regarding IE6 and IE7)
https://www.mediawiki.org/wiki/Analytics/Reports/ClientsWithoutJavascript
Thanks,
Nuria
Hi all,
Erik Zachte is working on file view stats and is looking for a way to track
Media Viewer image views (for which there is no 1:1 relation between server
hits and actual image views); after some back and forth in
https://phabricator.wikimedia.org/T86914 I proposed the following hack:
whenever the javascript code in MediaViewer determines that an image view
happened (e.g. an image has been displayed for a certain amount of time),
it makes a request to a certain fake image, say
upload.wikimedia.org/wikipedia/commons/thumb/0/00/Virtual-imageview-<real
image name>/<size>px-thumbnail.<ext>. These hits can then be easily
filtered from the Varnish request logs and added to the normal requests. We
would add a rule to Varnish to make sure it does not try to look up such
requests in Swift but returns a 404 immediately.
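(A rough sketch of that filtering step, assuming one request URL per log line and the fixed 0/00 fake hash path from the example above; the real log format will differ:)

    import re
    from collections import Counter

    VIRTUAL = re.compile(r"/thumb/0/00/Virtual-imageview-([^/]+)/")

    def count_virtual_views(log_lines):
        """Tally MediaViewer virtual image views from a request log."""
        views = Counter()
        for line in log_lines:
            match = VIRTUAL.search(line)
            if match:
                views[match.group(1)] += 1
        return views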
This would be a temporary workaround until there is a proper way to log
virtual image views, such as EventLogging with a non-SQL backend.
Do you see any fundamental problem with this?
I think we should split up EventLogging and the other m2 clients (OTRS and
some minor players). Several reasons:
- Backfilling causes replication lag. Using faster out-of-band replication
for EL is easy because it is all simple bulk-INSERT statements, but the
same does not apply for the other clients. They need different approaches.
- Master disk space. Even with the data purging discussed at the MW Summit,
I would feel better if EL had more headroom than it does currently, and
zero possibility of unexpected spikes in disk activity and usage affecting
other services.
- EL is the service most sensitive to connection dropouts. Recently Ori and
Nuria have been tweaking SQLAlchemy, but future connection problems like
those seen last week would be easier to debug without having to risk
affecting other services.
I am therefore arranging to promote the current m2 slave db1046 to master
of an m4 cluster tuned for EL, including backfilling. Analytics-store,
s1-analytics-slave, and the new CODFW server will simply switch to
replicate from the new master.
For switchover of writes, we'll need to coordinate an EL consumer restart
to use a new CNAME of m4-master.eqiad.wmnet and allow vanadium the relevant
network access, and then presumably do a little backfilling. When would be
a reasonable time within the next fortnight or so?
Sean