Hi folks,
As you've probably heard, last week we deployed ulsfo in production,
reducing latency for Oceania, East/Southeast Asia & US/Canada
Pacific/West Coast states. My estimate of the user base affected by
this is 360 million users (as in, Internet users, not Wikipedia users).
I was wondering if you have an easy way to measure and plot the impact
in page load time, perhaps using Navigation Timing data?
The operations team has spent a considerable amount of time and money to
deploy ulsfo and I believe it'd be useful for us and the organization at
large to be able to quantify this effort.
The exact dates of the rollout by country/region codes can be found in
operations/dns' git history:
https://git.wikimedia.org/summary/?r=operations/dns.git
(the commits should be self-explanatory, but I'd be happy to clarify if
needed)
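One rough way to frame the comparison is to take the median page load time for a region before and after its cutover. The sketch below assumes the Navigation Timing samples have already been reduced to (timestamp, load time in ms) pairs for a single country; the function name and the data shape are hypothetical, and the real cutover dates would come from the DNS commits mentioned above.

```python
from datetime import datetime
from statistics import median
from typing import Iterable, Optional, Tuple


def median_load_before_after(
    samples: Iterable[Tuple[datetime, float]],
    cutover: datetime,
) -> Tuple[Optional[float], Optional[float]]:
    """Return (median before cutover, median from cutover onward).

    `samples` is a hypothetical shape: (timestamp, load_time_ms) pairs
    for one country/region. A side with no samples yields None.
    """
    samples = list(samples)
    before = [ms for ts, ms in samples if ts < cutover]
    after = [ms for ts, ms in samples if ts >= cutover]
    return (
        median(before) if before else None,
        median(after) if after else None,
    )
```

The median is used rather than the mean because page load times tend to have a long tail; plotting the two medians per country would give a first-order view of the change.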
Thanks!
Faidon
At about 2014-03-18 00:04 UTC, db1047 stopped accepting incoming
connections. At some point during the subsequent hour, MariaDB had either
crashed or been manually restarted. Sean noticed that the database was
choking on some queries from the researchers and notified the wmfresearch
list.
During the time that the database server was out or rejecting connections,
the EventLogging writer that writes to db1047 was repeatedly failing to
connect to it:
sqlalchemy.exc.OperationalError: (OperationalError) (2003, "Can't connect
to MySQL server on 'db1047.eqiad.wmnet' (111)")
The Upstart job for EventLogging is configured to re-spawn the writer, up
to a certain threshold of failures. Because the writer repeatedly failed to
connect, it hit the threshold, and was not re-spawned.
The stopped writer triggered an Icinga alert:
[00:04:24] <icinga-wm> PROBLEM - Check status of defined EventLogging jobs
on vanadium is CRITICAL: CRITICAL: Stopped EventLogging jobs:
consumer/mysql-db1047
This alert was not responded to. I finally got pinged by Tillman, who
noticed the blog visitor stats report was blank, and by Gilles, who noticed
image loading performance data was missing.
We have to fix this. The level of maintenance that EventLogging gets is not
proportional to its usage across the organization. Analytics, I really need
you to step up your involvement.
It was not long ago that EventLogging was running reliably for months at a
time. What has changed is not system load, but the owner seat becoming
vacant, leading to a gradual deterioration of the quality of monitoring and
auditing practices.
Sean proposed moving the EventLogging database to m2, so that it runs on
separate hardware from the research databases. I think he's right. I filed <
https://rt.wikimedia.org/Ticket/Display.html?id=7081> to request the
migration.
There is some code rot around the Ganglia and Graphite monitoring code for
EventLogging. I don't think it would take much to fix. Could the Analytics
team take this on?
The Puppet code is well-documented. <
https://wikitech.wikimedia.org/wiki/EventLogging> could use some updating,
but it is mostly current.
Finally, I think EventLogging Icinga alerts should have a higher profile,
and possibly page someone. Issues can usually be debugged using the
eventloggingctl tool on vanadium and by inspecting the log files on
vanadium:/var/log/upstart/eventlogging-*.
---
Ori Livneh
ori(a)wikimedia.org
Add Analytics to cc, as I think they'll be interested as well :)
On Thu, Mar 27, 2014 at 3:20 AM, Yuvi Panda <yuvipanda(a)gmail.com> wrote:
> Hello!
>
> We are getting closer to a general release of the Wikipedia Android
> and iOS apps, and I think we should standardize on a User-Agent
> format. The old app just appended an identifier in front of the
> phone's default UA[1] but I think we can do better, to avoid privacy
> concerns[2].
>
> How about:
>
> WikipediaApp/<version> <OS>/<form-factor>/<version>
>
> This gives us all the info we need (App version, OS, Form Factor
> (Tablet / Phone) and OS version) without giving away too much. It is
> also fairly simple to construct and parse.
>
> For the latest alpha, my Nexus 4 would generate
>
> WikipediaApp/32 Android/Phone/4.4
>
> While an iOS device might generate
>
> WikipediaApp/2.0 iOS/Phone/7.1
>
> form-factor would just be Phone|Tablet for now, and can be expanded
> later if necessary.
>
> Thoughts?
>
> [1]: https://www.mediawiki.org/wiki/Mobile/User_agents#Apps
> [2]: https://www.mediawiki.org/wiki/EventLogging/UserAgentSanitization
> --
> Yuvi Panda T
> http://yuvi.in/blog
--
Yuvi Panda T
http://yuvi.in/blog
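
For illustration, a minimal sketch of how the proposed format could be constructed and parsed. The names `AppUserAgent`, `build_user_agent` and `parse_user_agent` are hypothetical, and the regular expression simply encodes the format as written in the proposal (with form-factor limited to Phone or Tablet); it is not taken from either app's code.

```python
import re
from typing import NamedTuple, Optional


class AppUserAgent(NamedTuple):
    app_version: str
    os_name: str
    form_factor: str  # "Phone" or "Tablet" for now
    os_version: str


# WikipediaApp/<version> <OS>/<form-factor>/<version>
_UA_RE = re.compile(
    r"^WikipediaApp/(?P<app_version>\S+) "
    r"(?P<os_name>[^/\s]+)/(?P<form_factor>Phone|Tablet)/(?P<os_version>\S+)$"
)


def build_user_agent(ua: AppUserAgent) -> str:
    """Render the four fields in the proposed order."""
    return (
        f"WikipediaApp/{ua.app_version} "
        f"{ua.os_name}/{ua.form_factor}/{ua.os_version}"
    )


def parse_user_agent(value: str) -> Optional[AppUserAgent]:
    """Return the parsed fields, or None if the string is not in this format."""
    match = _UA_RE.match(value)
    if match is None:
        return None
    return AppUserAgent(**match.groupdict())
```

A browser user agent such as the Safari example in this thread does not match the pattern, so `parse_user_agent` returns `None` for it.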
I got the impression from this discussion that the mobile apps aren't
currently in use, so the CUs have had no experience working with them. It sounds like I was mistaken.
Toby, what UA data do CUs currently see from edits made through the mobile apps?
CUs, does the information that you're currently getting from edits made through the new mobile apps strike a good balance between the concerns raised previously in this discussion?
Pine
> Date: Thu, 27 Mar 2014 15:36:16 -0700
> From: Toby Negrin <tnegrin(a)wikimedia.org>
> To: "A mailing list for the Analytics Team at WMF and everybody who
> has an interest in Wikipedia and analytics."
> <analytics(a)lists.wikimedia.org>
> Cc: mobile-l <mobile-l(a)lists.wikimedia.org>, Oliver Keyes
> <ironholds(a)gmail.com>, Yuvi Panda <yuvipanda(a)gmail.com>
> Subject: Re: [Analytics] [WikimediaMobile] Mobile and CheckUser (Was
> [mobile-l] Wikipedia App User Agents)
> Message-ID:
> <CAAjh0EyqpFDke3P6Q5FxecOnE5yc7uSs_VS9HcB5khoxZ6-Yng(a)mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Hey folks -- we aren't considering changing any of the data that goes into
> checkuser. That tool will be unchanged.
>
> This discussion only concerns backend logging: EventLogging and page view
> analytics.
>
> thanks,
>
> -Toby
>
Forking since I think there are two conversations - one about the
format of UA for the mobile apps and one about CheckUser requirements
for anything that does edits. Having them separate would be useful.
For those who do not know what CheckUser means, I recommend reading
https://en.wikipedia.org/wiki/Wikipedia:CheckUser.
IP address and UA are two of the most important pieces of info
CUs have in helping prevent abuse. IP is already sort of useless with
mobile networks - a lot of providers do NAT and similar things, which
means that we cannot come remotely close to reasonably assuming 1 IP = 1
user, or anything similar to that. UA provides more
fingerprinting ability, but CU isn't the only thing that consumes UA -
other parts of the infrastructure do as well.
So what we need is a way to preserve the ability to fingerprint only
users making edits (no read actions!) for CU. I am sure that can be
implemented without having to have a very fingerprintable UA, with
simple hooks on both the App's side and on Extension:CheckUser.
We could generate a simple fingerprint that's unique per device (and
disconnected completely from every other device identifier) that we
send only with edits (and other 'POST' actions) as a separate header.
This can be processed by CU (perhaps with a hook that
Extension:MobileApp can hook into) and then used by CheckUsers. This
data will be treated with the same data retention / privacy policy
that applies to CUs now, and regular UA data can be consumed by other
consumers without too many fingerprinting concerns.
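
A rough sketch of the app side of this idea, assuming a random per-install token and a hypothetical header name (`X-App-Install-Id`). Neither is an agreed design; the point is only that the token is random, unrelated to any other device identifier, and attached to POST requests alone.

```python
import secrets
from typing import Dict

# Hypothetical header name, used here for illustration only.
EDIT_TOKEN_HEADER = "X-App-Install-Id"


def new_install_id() -> str:
    """Generate a random token, unrelated to any hardware or account ID.

    The app would generate this once and store it locally.
    """
    return secrets.token_hex(16)


def headers_for_request(method: str, user_agent: str, install_id: str) -> Dict[str, str]:
    """Build request headers; the token is sent only with POST actions."""
    headers = {"User-Agent": user_agent}
    if method.upper() == "POST":
        headers[EDIT_TOKEN_HEADER] = install_id
    return headers
```

Read requests carry only the coarse User-Agent, so other consumers of request logs never see the token.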
I talked to hoo and he said the CU hook shouldn't be too much of a
problem, and the app side of the issue is rather simple too. Deskana
(speaking solely as a volunteer CU) says that this solution is
acceptable to him. Thoughts from other people?
On Thu, Mar 27, 2014 at 10:43 PM, Oliver Keyes <okeyes(a)wikimedia.org> wrote:
> Repost, because filtering; there might be a point of confusion here that's
> causing the problem. As I understand it, the user agent sanitisation is
> expected to apply to EventLogging data, and data in the Analytics pipeline,
> but not data streaming into MediaWiki proper - namely, the cu_changes table.
> Nuria, is that the case?
>
>
> On 27 March 2014 08:16, Nuria Ruiz <nuria(a)wikimedia.org> wrote:
>>
>> >Rather than having an ethical debate over it, we could always test the
>> > actual usefulness with Science. That way we'd be able to see how much
>> > granularity each additional component adds to the data.
>> I kind of feel we are going backwards, as we thoroughly discussed this
>> point. Technical info and references regarding entropy, user agents, and
>> fingerprinting can be found here:
>> https://www.mediawiki.org/wiki/EventLogging/UserAgentSanitization
>>
>>
>>
>> On Thu, Mar 27, 2014 at 3:49 PM, Oliver Keyes <okeyes(a)wikimedia.org>
>> wrote:
>>>
>>> +1. I'm totally down for keeping less information around, but if it gets
>>> in the way of people doing their job?
>>>
>>> Rather than having an ethical debate over it, we could always test the
>>> actual usefulness with Science. That way we'd be able to see how much
>>> granularity each additional component adds to the data.
>>>
>>>
>>> On 27 March 2014 07:15, Aaron Halfaker <ahalfaker(a)wikimedia.org> wrote:
>>>>>
>>>>> Including more information on the UA, while being covered by legal
>>>>> under the new privacy policy, really goes against the wishes of the community
>>>>> as they do not wish to be fingerprinted.
>>>>
>>>>
>>>> I don't think that "the wishes of the community" have been established
>>>> and the whole point of checkuser is that it allows for fingerprinting.
>>>>
>>>>
>>>> On Thu, Mar 27, 2014 at 4:20 AM, Nuria Ruiz <nuria(a)wikimedia.org> wrote:
>>>>>
>>>>>
>>>>> >As a checkuser, user agents are an important part of my workflow for
>>>>> > identifying that multiple accounts are owned by the same person.
>>>>> > So I'm going to have to argue for including more information in the
>>>>> > user agent.
>>>>>
>>>>> Including more information on the UA, while being covered by legal
>>>>> under the new privacy policy, really goes against the wishes of the community
>>>>> as they do not wish to be fingerprinted.
>>>>> See:
>>>>> https://www.mediawiki.org/wiki/Talk:EventLogging/UserAgentSanitization or
>>>>> https://meta.wikimedia.org/wiki/Talk:Privacy_policy
>>>>> There have been plenty more discussions about this on the analytics e-mail
>>>>> list.
>>>>>
>>>>>
>>>>> >Your proposed user agent would basically mean that every single person
>>>>> > using the most up-to-date version of the app on a particular platform would
>>>>> > be indistinguishable from each other. This would, unfortunately, lead to
>>>>> > lots of innocent users getting blocked as sockpuppets.
>>>>>
>>>>> However, note that the UA " WikipediaApp/<version>
>>>>> <OS>/<form-factor>/<version>" clearly satisfies the use case of the mobile
>>>>> team. It provides as much information as they need from their user without
>>>>> sending any private data.
>>>>>
>>>>> Can you please describe your use case? Namely, how are you
>>>>> identifying "false" accounts? Perhaps relying on the user agent to do so is
>>>>> not the best strategy going forward. Keep in mind that under the old privacy
>>>>> policy UA data needed to be discarded after 90 days. With the new policy
>>>>> there is more legal room, but given community feedback the analytics team is
>>>>> planning on aggregating all UA information in the future. This means that UA
>>>>> data will not be stored (or reported) per user or request but rather
>>>>> aggregated (as in "4% of users use iPhone").
>>>>>
>>>>> We recently gathered information from all teams about use cases
>>>>> pertaining to UA data collection:
>>>>>
>>>>> https://office.wikimedia.org/wiki/Analytics/Internal/EventLogging/PrivateDa….
>>>>>
>>>>> Let's talk about your use case and add it to the document that already
>>>>> exists describing usages of user agent data. This document was sent out to
>>>>> all teams a couple of months ago, but there is no description of your use case
>>>>> there:
>>>>>
>>>>> https://docs.google.com/a/wikimedia.org/document/d/1bp6qrvYi0Mh7l0s1psGnXEE…
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Mar 26, 2014 at 11:20 PM, Dan Garry <dgarry(a)wikimedia.org>
>>>>> wrote:
>>>>>>
>>>>>> Hey Yuvi,
>>>>>>
>>>>>> As a checkuser, user agents are an important part of my workflow for
>>>>>> identifying that multiple accounts are owned by the same person. So I'm
>>>>>> going to have to argue for including more information in the user agent.
>>>>>> Your proposed user agent would basically mean that every single person using
>>>>>> the most up-to-date version of the app on a particular platform would be
>>>>>> indistinguishable from each other. This would, unfortunately, lead to lots
>>>>>> of innocent users getting blocked as sockpuppets.
>>>>>>
>>>>>> Here's an example of a user agent from an iPhone using Safari:
>>>>>> Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_1 like Mac OS X; zh-tw)
>>>>>> AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8G4
>>>>>> Safari/6533.18.5
>>>>>>
>>>>>> Look at all of that wonderful information! ;-) In general, the more
>>>>>> information you can include without breaching the user's privacy, the
>>>>>> better.
>>>>>>
>>>>>> I'd be happy to work with you on this.
>>>>>>
>>>>>> Thanks,
>>>>>> Dan
>>>>>>
>>>>>> P.S. You may also want to consult with the legal team, to ensure that
>>>>>> unacceptable levels of private information are not given out. They would
>>>>>> also make a good complement to me; I would likely be pulling in the direction of
>>>>>> "MOAR INFORMATION!", whereas they would likely be pulling in the direction
>>>>>> of "LESS INFORMATION!". :-)
>>>>>>
>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Analytics mailing list
>>>>>>> Analytics(a)lists.wikimedia.org
>>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Dan Garry
>>>>>> Associate Product Manager for Platform
>>>>>> Wikimedia Foundation
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Oliver Keyes
>>> Research Analyst
>>> Wikimedia Foundation
>>>
>>>
>>
>>
>>
>
>
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
--
Yuvi Panda T
http://yuvi.in/blog
I think it would be good to discuss this with the Ombudsman Commission and some functionary groups to get a broad range of opinions. An RFC on Meta seems appropriate to me, and you can send invitations to functionary groups to request comment. A similar RFC was done relating to the scope of the Ombudsman Commission. Any other suggestions for global Checkuser policy updates would be good to bundle with this one. It may be good to review the CU policy in light of recent privacy policy changes.
It's challenging to have a discussion about CU policies and practices without revealing more to socks than we might want them to know. Please keep that in mind when discussing this in public.
Pine
Hi Everyone,
It's with great pleasure I'd like to introduce Kevin LeDuc, our new
Analytics Product Manager. We are really excited to have someone with
Kevin's background and experience on the team. He'll be based in San
Francisco.
I'm hopeful that this will give us more bandwidth to work with the
community on Analytics efforts.
In his own words:
My first job was writing a web browser for cell phones in Java. I then
moved on to making games in Java for cell phones. After a while I stopped
coding and took on the roles of game designer, product manager, project
manager and office manager simultaneously. A few years ago I met my wife
who convinced me to move to the bay area where I took on product/project
managing analytics and big data.
A few random facts:
- I am Canadian and born in Venezuela (my dad was a diplomat)
- My mother tongue is French and I learned to speak English while living
in LA
- I lived a couple of years in Cote d'Ivoire (10-12 years old)
- I summited the highest mountain in South America
- iPhone games I produced: Spades King, Cribbage King and The New York
Times Crosswords
Welcome Kevin!
-Toby
Dear Toby,
I recently saw your comment on a blog
post <http://magnusmanske.de/wordpress/?p=173> by Magnus Manske
regarding the lack of Wikipedia page view data besides the
oft-overloaded http://stats.grok.se/. I was wondering if there's been any
progress at WMF on building a more stable, central, and complete source for
this data?
I ask because I'm a data scientist at a small research non-profit
called Harmony
Institute <http://harmony-institute.org/>, where we study the social impact
of media (primarily television and film). I'm currently building an
interactive web app <http://harmony-institute.org/work/impactspace/> that
visualizes social impact on a variety of issues by many documentary films.
One indicator of interest is "information-seeking behavior," i.e. are
audiences seeking out information about a film or issue. Besides Google
search trends, an excellent proxy for this is Wikipedia page views for both
film pages, e.g. Escape Fire
<http://en.wikipedia.org/wiki/Escape_Fire:_The_Fight_to_Rescue_American_Heal…>,
and issue-related pages, e.g. Health care reform
<http://en.wikipedia.org/wiki/Health_care_reform>.
I'm currently trying to use stats.grok.se to grab raw data in JSON form;
unfortunately, the site almost always responds with "Server overloaded,
please throttle your requests," and no amount of throttling seems to
suffice. I'm aware that there are many TBs of raw data for the downloading,
but I don't have the resources to handle that much data, nor do I need more
than the tiniest fraction of it.
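One way to get at a tiny slice of the raw dumps without holding them in memory is to stream each hourly file and keep only the titles of interest. A minimal sketch, assuming each line has the form `project title views bytes` separated by spaces (an assumption about the dump format, not something stated in this thread); `count_views` is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable, Set


def count_views(lines: Iterable[str], project: str, titles: Set[str]) -> Counter:
    """Sum view counts for the wanted titles, one line at a time.

    Assumes each line looks like: "<project> <title> <views> <bytes>".
    Lines that do not fit that shape are skipped.
    """
    totals: Counter = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue
        line_project, title, views, _size = parts
        if line_project != project or title not in titles:
            continue
        try:
            totals[title] += int(views)
        except ValueError:
            continue  # malformed count; ignore the line
    return totals
```

Because `lines` can be any iterable, a compressed file can be passed in directly, e.g. `with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f: count_views(f, "en", {"Health_care_reform"})`, so only one line is in memory at a time.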
I would *love* to show Wikipedia page view statistics for film pages in our
app. If you have any updates on progress or suggestions on how I might do
this, I would be very appreciative.
Thanks very much for your and all of WMF's hard work -- I'm a proud donor to
the cause. :)
Best,
Burton DeWilde
--
Burton DeWilde
Data Scientist
Harmony Institute
harmony-institute.org
blog <http://harmony-institute.org/therippleeffect/> |
twitter <https://twitter.com/hinstitute> |
facebook <https://www.facebook.com/harmonyinstitute>