The Community Tech team is trying to find out stats about edit conflicts.
It looks like there was a patch merged back in January to collect stats on
this (https://gerrit.wikimedia.org/r/#/c/266760/2/includes/EditPage.php)
but I can't figure out where it actually collects the stats. It
looks like it's using the BufferingStatsdDataFactory class, but I couldn't
find any documentation on what this class is or where the stats actually
end up, and I don't have time at the moment to investigate more deeply.
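For what it's worth, my (unverified) understanding is that
BufferingStatsdDataFactory buffers counter/timing updates in memory during
the request and flushes them at the end to a statsd daemon, which speaks a
simple "name:value|type" text protocol over UDP. A minimal Python sketch of
what such a flush amounts to (host, port and metric name below are
placeholders, not the real configuration):

import socket

STATSD_HOST = "statsd.example.org"  # placeholder, not the actual WMF statsd host
STATSD_PORT = 8125                  # conventional statsd UDP port

def increment(metric, value=1):
    # A statsd counter increment is a single datagram like "name:1|c".
    packet = "{}:{}|c".format(metric, value).encode("utf-8")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(packet, (STATSD_HOST, STATSD_PORT))
    sock.close()

# The metric name here is just a guess at what the EditPage patch might record.
increment("MediaWiki.edit.failures.conflict")

If that mental model is wrong, corrections welcome.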
Question: Does anyone know where those stats are actually collected?
Request: Could someone create some documentation on MediaWiki.org for stats
collection and retrieval via BufferingStatsdDataFactory?
Hello,
I have a question about how page titles are escaped in the pagecounts
dumps as found at
http://dumps.wikimedia.org/other/pagecounts-all-sites/ and
http://dumps.wikimedia.org/other/pagecounts-raw/.
I'm wondering, for a particular page title, what set of escaped page
titles I should look for in the dumps. I searched for all
possible combinations of unescaped and escaped characters in the page
title and found the ones with non-zero counts.
The examples below are from the 2015-01-01T02 dump.
On the "ru" domain, the page "Путин, Владимир Владимирович" for has an
encoded page title:
"%D0%9F%D1%83%D1%82%D0%B8%D0%BD,_%D0%92%D0%BB%D0%B0%D0%B4%D0%B8%D0%BC%D0%B8%D1%80_%D0%92%D0%BB%D0%B0%D0%B4%D0%B8%D0%BC%D0%B8%D1%80%D0%BE%D0%B2%D0%B8%D1%87"
(underscores replace spaces and every character but the comma is
escaped).
On the "ru" domain, the page "Мстители (фильм, 2012)" has escaped page titles:
"%D0%9C%D1%81%D1%82%D0%B8%D1%82%D0%B5%D0%BB%D0%B8_%28%D1%84%D0%B8%D0%BB%D1%8C%D0%BC,_2012%29"
(everything except comma escaped)
"%D0%9C%D1%81%D1%82%D0%B8%D1%82%D0%B5%D0%BB%D0%B8_(%D1%84%D0%B8%D0%BB%D1%8C%D0%BC,_2012)"
(everything except comma and parens escaped)
"Мстители_(фильм,_2012) (nothing escaped)".
On the "en" domain, the page "Spider-Man (2002 film)" has escaped page titles:
"Spider-Man_%282002_film%29" (everything except parens escaped)
"Spider-Man_(2002_film)" (nothing escaped)
Is the logic for the escaping available somewhere?
Thanks,
Bo
Forwarding.
Pine
---------- Forwarded message ----------
From: Rachel Farrand <rfarrand(a)wikimedia.org>
Date: Fri, Feb 5, 2016 at 4:59 PM
Subject: [Wikitech-l] Tech Talk: A Hands-on Estimation Exercise, With
Discussion: Feb 8th
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Please join for the following tech talk:
*Tech Talk:* A Hands-on Estimation Exercise, With Discussion
*Presenter:* Joel Aufrecht
*Date:* February 8th, 2016
*Time:* 18:30 UTC
<http://www.timeanddate.com/worldclock/fixedtime.html?msg=Tech+Talk%3A+A+Han…>
Link to live YouTube stream <http://www.youtube.com/watch?v=b-zLwTez46M>
*IRC channel for questions/discussion:* #wikimedia-office
*Summary:* Estimation is an unnatural activity for human brains, which tend
to hide our own ignorance from us. This brown-bag begins with an exercise,
adapted from Steve McConnell's software estimation training, in balancing
accuracy with precision. The exercise is fully available to remotees.
Facilitated discussion follows, on what we can learn from the exercise and
on general estimation and forecasting topics as raised.
> Date: Thu, 4 Feb 2016 08:22:01 +0100
> From: "Federico Leva (Nemo)" <nemowiki(a)gmail.com>
> To: A mailing list for the Analytics Team at WMF and everybody who has
> an interest in Wikipedia and "analytics."
> <analytics(a)lists.wikimedia.org>
> Subject: Re: [Analytics] Pagecounts dumps page title UTF-8 escaping
>
> Bo Han, 04/02/2016 00:40:
>> Is the logic for the escaping available somewhere?
>
> MediaWiki API does https://phabricator.wikimedia.org/T29849
> For the new pageviews API I got this reply on Unicode normalisation:
> https://phabricator.wikimedia.org/T44259#1351880
>
> (Phabricator is down right now; wait a couple hours or check
> web.archive.org.)
>
> Nemo
Thanks for the reply, Nemo. I read over the two links but am still a
little confused about the case of "Мстители (фильм, 2012)" on the "ru"
domain, which is escaped as:
"%D0%9C%D1%81%D1%82%D0%B8%D1%82%D0%B5%D0%BB%D0%B8_%28%D1%84%D0%B8%D0%BB%D1%8C%D0%BC,_2012%29"
(everything but comma escaped)
"%D0%9C%D1%81%D1%82%D0%B8%D1%82%D0%B5%D0%BB%D0%B8_(%D1%84%D0%B8%D0%BB%D1%8C%D0%BC,_2012)"
(everything but comma+parens escaped)
"Мстители_(фильм,_2012)" (nothing escaped)
Shouldn't the comma and parens be escaped as well, or is there a
special case for reserved characters? If so, why are parens sometimes
escaped and sometimes not? Maybe some of the variation has to do with
how browsers encode/send the request?
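In case it is useful to others: for now I'm collapsing the variants when
aggregating counts by fully unescaping and then re-escaping with one fixed
rule. A Python sketch; the "safe" set here is my own choice, not
necessarily what MediaWiki or the dump pipeline actually uses:

from urllib.parse import quote, unquote

def canonical(title):
    # Fully unescape, then re-escape with a single fixed rule so that all
    # variants of the same title collapse to one key.
    return quote(unquote(title), safe=",()_")

assert canonical("Spider-Man_%282002_film%29") == canonical("Spider-Man_(2002_film)")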
Bo
Hi analytics list,
Over the past months, the WikimediaBot convention has been mentioned in a
couple of threads, but we (the Analytics team) never finished establishing
and advertising it. In this email we explain what the convention is today
and what purpose it serves, and we also ask for feedback to make sure we
can continue with the next steps.
What is the WikimediaBot convention?
It is a way of better identifying Wikimedia traffic that originates from bots.
Today we know that a significant share of Wikimedia traffic comes from
bots. We can recognize a part of that traffic with regular expressions[1],
but we cannot recognize all of it, because some bots do not identify
themselves as such. If we could identify a greater part of the bot traffic,
we could also better isolate the human traffic and permit more accurate
analyses.
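To give a rough idea of what that recognition looks like (this is an
illustrative pattern only, not the actual regular expression used in the
refinery source linked as [1] below):

import re

# Illustrative only; the real pattern lives in analytics-refinery-source.
BOT_UA_PATTERN = re.compile(r"bot|crawler|spider", re.IGNORECASE)

print(bool(BOT_UA_PATTERN.search("ExampleCrawler/2.1 (+http://example.org/bot)")))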
Who should follow the convention?
Computer programs that access Wikimedia sites or the Wikimedia API for
reading purposes* in a periodic, scheduled or automatically triggered way.
Who should NOT follow the convention?
Computer programs that follow the ad-hoc, on-site commands of a human,
like browsers, and well-known spiders that are already recognizable by
their well-known user-agent strings.
How to follow the convention?
The client's user-agent string should contain the word "WikimediaBot". The
word can be anywhere within the user-agent string and is case-sensitive.
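For example, a periodic script using Python's requests library could do
something like the following. The exact shape of the string around the
word "WikimediaBot" (tool name, version, contact address) is only a
suggestion here; the convention as described requires just the word itself:

import requests

# Only the case-sensitive word "WikimediaBot" is required by the convention;
# the tool name, version and contact address are placeholders.
HEADERS = {"User-Agent": "WikimediaBot MyScheduledTool/1.0 (ops@example.org)"}

resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={"action": "query", "meta": "siteinfo", "format": "json"},
    headers=HEADERS,
    timeout=30,
)
print(resp.json()["query"]["general"]["sitename"])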
So, please, feel free to post your comments/feedback on this thread. In the
course of this discussion we can adjust the convention's definition and, if
no major concerns are raised, in two weeks we'll create a documentation page
on Wikitech, send an email to the relevant mailing lists, and maybe write a
blog post about it.
Thanks a lot!
(*) There is already another convention[2] for bots that EDIT Wikimedia
content.
[1]
https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery…
[2] https://www.mediawiki.org/wiki/Manual:Bots
--
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
Hi Analytics folks,
My understanding is that the new pageview definition, which excludes
automata to a certain extent, is now published. I have a few questions:
1. Has stats.grok.se already transitioned to the new definition, or will it?
2. Is there a replacement for stats.grok.se planned or already available? A
reliable substitute would be great, and it would be nice if we could either
replace the existing on-wiki "page view statistics" link or add a
supplemental link to the new resource.
Apologies if this information was already published and I missed it.
Thanks,
Pine
Hi Analytics fellows,
We are experiencing issues loading data into the Hadoop cluster, which is
blocking the full job pipeline.
Once this is fixed, the cluster will be heavily loaded while it catches up,
so please be nice to it and don't run heavy jobs in the next few hours.
We'll keep you posted on the resolution.
Many thanks, and sorry for the inconvenience.
Joseph