The Community Tech team is trying to find out stats about edit conflicts.
It looks like there was a patch merged back in January to collect stats on
this (https://gerrit.wikimedia.org/r/#/c/266760/2/includes/EditPage.php)
but I can't figure out where it actually collects the stats. It
looks like it's using the BufferingStatsdDataFactory class, but I couldn't
find any documentation on what this class is or where the stats actually
end up, and I don't have time at the moment to investigate more deeply.
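For what it's worth, my (unverified) understanding is that
BufferingStatsdDataFactory buffers counter/timing updates in memory during
the request and flushes them at the end to a statsd daemon, which speaks a
simple "name:value|type" text protocol over UDP. A minimal Python sketch of
what such a flush amounts to (host, port and metric name below are
placeholders, not the real configuration):

import socket

STATSD_HOST = "statsd.example.org"  # placeholder, not the actual WMF statsd host
STATSD_PORT = 8125                  # conventional statsd UDP port

def increment(metric, value=1):
    # A statsd counter increment is a single datagram like "name:1|c".
    packet = "{}:{}|c".format(metric, value).encode("utf-8")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(packet, (STATSD_HOST, STATSD_PORT))
    sock.close()

# The metric name here is just a guess at what the EditPage patch might record.
increment("MediaWiki.edit.failures.conflict")

If that mental model is wrong, corrections welcome.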
Question: Does anyone know where those stats are actually collected?
Request: Could someone create some documentation on MediaWiki.org for stats
collection and retrieval via BufferingStatsdDataFactory?
Hello,
I have a question about how page titles are escaped in the pagecounts
dumps as found at
http://dumps.wikimedia.org/other/pagecounts-all-sites/ and
http://dumps.wikimedia.org/other/pagecounts-raw/.
I'm wondering, for a particular page title, what set of escaped page
titles I should look for in the dumps. I searched for all
possible combinations of unescaped and escaped characters in the page
title and found the ones with non-zero counts.
The examples below are from the 2015-01-01T02 dump.
On the "ru" domain, the page "Путин, Владимир Владимирович" for has an
encoded page title:
"%D0%9F%D1%83%D1%82%D0%B8%D0%BD,_%D0%92%D0%BB%D0%B0%D0%B4%D0%B8%D0%BC%D0%B8%D1%80_%D0%92%D0%BB%D0%B0%D0%B4%D0%B8%D0%BC%D0%B8%D1%80%D0%BE%D0%B2%D0%B8%D1%87"
(underscores replace spaces and every character but the comma is
escaped).
On the "ru" domain, the page "Мстители (фильм, 2012)" has escaped page titles:
"%D0%9C%D1%81%D1%82%D0%B8%D1%82%D0%B5%D0%BB%D0%B8_%28%D1%84%D0%B8%D0%BB%D1%8C%D0%BC,_2012%29"
(everything except comma escaped)
"%D0%9C%D1%81%D1%82%D0%B8%D1%82%D0%B5%D0%BB%D0%B8_(%D1%84%D0%B8%D0%BB%D1%8C%D0%BC,_2012)"
(everything except comma and parens escaped)
"Мстители_(фильм,_2012) (nothing escaped)".
On the "en" domain, the page "Spider-Man (2002 film)" has escaped page titles:
"Spider-Man_%282002_film%29" (everything except parens escaped)
"Spider-Man_(2002_film)" (nothing escaped)
Is the logic for the escaping available somewhere?
Thanks,
Bo
Forwarding.
Pine
---------- Forwarded message ----------
From: Rachel Farrand <rfarrand(a)wikimedia.org>
Date: Fri, Feb 5, 2016 at 4:59 PM
Subject: [Wikitech-l] Tech Talk: A Hands-on Estimation Exercise, With
Discussion: Feb 8th
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Please join for the following tech talk:
*Tech Talk:* A Hands-on Estimation Exercise, With Discussion
*Presenter:* Joel Aufrecht
*Date:* February 8th, 2016
*Time:* 18:30 UTC
<http://www.timeanddate.com/worldclock/fixedtime.html?msg=Tech+Talk%3A+A+Han…>
Link to live YouTube stream <http://www.youtube.com/watch?v=b-zLwTez46M>
*IRC channel for questions/discussion:* #wikimedia-office
*Summary:* Estimation is an unnatural activity for human brains, which tend
to hide our own ignorance from us. This brown-bag begins with an exercise,
adapted from Steve McConnell's software estimation training, in balancing
accuracy with precision. The exercise is fully available to remotees.
Facilitated discussion follows, on what we can learn from the exercise and
on general estimation and forecasting topics as raised.
> Date: Thu, 4 Feb 2016 08:22:01 +0100
> From: "Federico Leva (Nemo)" <nemowiki(a)gmail.com>
> To: A mailing list for the Analytics Team at WMF and everybody who has
> an interest in Wikipedia and "analytics."
> <analytics(a)lists.wikimedia.org>
> Subject: Re: [Analytics] Pagecounts dumps page title UTF-8 escaping
>
> Bo Han, 04/02/2016 00:40:
>> Is the logic for the escaping available somewhere?
>
> MediaWiki API does https://phabricator.wikimedia.org/T29849
> For the new pageviews API I got this reply on Unicode normalisation:
> https://phabricator.wikimedia.org/T44259#1351880
>
> (Phabricator is down right now; wait a couple hours or check
> web.archive.org.)
>
> Nemo
Thanks for the reply, Nemo. I read over the two links but am still a
little confused about the case of "Мстители (фильм, 2012)" on the "ru"
domain, which is escaped as:
"%D0%9C%D1%81%D1%82%D0%B8%D1%82%D0%B5%D0%BB%D0%B8_%28%D1%84%D0%B8%D0%BB%D1%8C%D0%BC,_2012%29"
(everything but comma escaped)
"%D0%9C%D1%81%D1%82%D0%B8%D1%82%D0%B5%D0%BB%D0%B8_(%D1%84%D0%B8%D0%BB%D1%8C%D0%BC,_2012)"
(everything but comma+parens escaped)
"Мстители_(фильм,_2012)" (nothing escaped)
Shouldn't the comma and parens be escaped as well, or is there a
special case for reserved characters? If so, why are parens sometimes
escaped and sometimes not? Maybe some of the variation has to do with
how browsers encode/send the request?
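In case it is useful to others: for now I'm collapsing the variants when
aggregating counts by fully unescaping and then re-escaping with one fixed
rule. A Python sketch; the "safe" set here is my own choice, not
necessarily what MediaWiki or the dump pipeline actually uses:

from urllib.parse import quote, unquote

def canonical(title):
    # Fully unescape, then re-escape with a single fixed rule so that all
    # variants of the same title collapse to one key.
    return quote(unquote(title), safe=",()_")

assert canonical("Spider-Man_%282002_film%29") == canonical("Spider-Man_(2002_film)")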
Bo
Hi analytics list,
Over the past months, the WikimediaBot convention has been mentioned in a
couple of threads, but we (the Analytics team) never finished establishing
and advertising it. In this email we explain what the convention is today
and what purpose it serves, and we also ask for feedback to make sure we
can continue with the next steps.
What is the WikimediaBot convention?
It is a way of better identifying Wikimedia traffic that originates from bots.
Today we know that a significant share of Wikimedia traffic comes from
bots. We can recognize a part of that traffic with regular expressions[1],
but we cannot recognize all of it, because some bots do not identify
themselves as such. If we could identify a greater part of the bot traffic,
we could also better isolate the human traffic and permit more accurate
analyses.
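To give a rough idea of what that recognition looks like (this is an
illustrative pattern only, not the actual regular expression used in the
refinery source linked as [1] below):

import re

# Illustrative only; the real pattern lives in analytics-refinery-source.
BOT_UA_PATTERN = re.compile(r"bot|crawler|spider", re.IGNORECASE)

print(bool(BOT_UA_PATTERN.search("ExampleCrawler/2.1 (+http://example.org/bot)")))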
Who should follow the convention?
Computer programs that access Wikimedia sites or the Wikimedia API for
reading purposes* in a periodic, scheduled or automatically triggered way.
Who should NOT follow the convention?
Computer programs that follow the ad-hoc, on-site commands of a human,
like browsers, and well-known spiders that are already recognizable by
their well-known user-agent strings.
How to follow the convention?
The client's user-agent string should contain the word "WikimediaBot". The
word can be anywhere within the user-agent string and is case-sensitive.
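For example, a periodic script using Python's requests library could do
something like the following. The exact shape of the string around the
word "WikimediaBot" (tool name, version, contact address) is only a
suggestion here; the convention as described requires just the word itself:

import requests

# Only the case-sensitive word "WikimediaBot" is required by the convention;
# the tool name, version and contact address are placeholders.
HEADERS = {"User-Agent": "WikimediaBot MyScheduledTool/1.0 (ops@example.org)"}

resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={"action": "query", "meta": "siteinfo", "format": "json"},
    headers=HEADERS,
    timeout=30,
)
print(resp.json()["query"]["general"]["sitename"])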
So, please, feel free to post your comments/feedback on this thread. In the
course of this discussion we can adjust the convention's definition and, if
no major concerns are raised, in two weeks we'll create a documentation page
on Wikitech, send an email to the relevant mailing lists, and maybe write a
blog post about it.
Thanks a lot!
(*) There is already another convention[2] for bots that EDIT Wikimedia
content.
[1]
https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery…
[2] https://www.mediawiki.org/wiki/Manual:Bots
--
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
Hi Analytics folks,
My understanding is that the new pageview definition, which excludes
automata to a certain extent, is now published. I have a few questions:
1. Has stats.grok.se already transitioned to the new definition, or will it?
2. Is there a replacement for stats.grok.se planned or already available? A
reliable substitute would be great, and it would be nice if we could either
replace the existing on-wiki "page view statistics" link or add a
supplemental link to the new resource.
Apologies if this information was already published and I missed it.
Thanks,
Pine
Hi Analytics fellows,
We are experiencing issues loading data into the Hadoop cluster, which is
blocking the full job pipeline.
Once this is fixed, the cluster will be heavily loaded while it catches up,
so please be nice to it and don't run heavy jobs in the next few hours.
We'll keep you posted on the resolution.
Many thanks, and sorry for the inconvenience.
Joseph