Thanks for looking into these issues, folks, really appreciated!
On Fri, Dec 16, 2016 at 11:35 AM, Nuria Ruiz <nuria(a)wikimedia.org> wrote:
I was about to send similar e-mail, unless you tag us with Analytics we will
not see the issues.
Indeed, the tasks were not tagged with "Analytics" - only with
"Analytics-EventLogging." Thanks to you and Dan for explaining the process
and the Phabricator issue that prevents the team from seeing new tasks on
"Analytics-EventLogging" - I will try to remember to add everything to
"Analytics" from now on. While we are waiting for Phacility to fix this
upstream, how about documenting this somewhere, for example by adding a
warning to the Analytics-EventLogging board
<https://phabricator.wikimedia.org/project/profile/589/> that it is not
being watched?
Thus, if issues are important enough and we have not responded (for the most
part we respond quite fast to operational issues) DO ping us on irc. That is
what the channel is for.
But Nuria, Jon had pinged you directly at
https://phabricator.wikimedia.org/T142667#2555765 four months ago. Are you
saying that personal Phabricator pings are also not a reliable way to get the
Analytics team's attention?
In any case, I'm a regular IRC user and know how to find you folks there,
so yes, I can go that extra step too (in this case, I chose another public
venue, this list, as do other people to get the team's attention). I'll
note though that these tasks were not "operational issues" like say an
analytics server outage that needs immediate attention and justifies
interrupting the team's work - but also not something that should have been
left unattended for many months.
I will also note that the wikibugs bot in #wikimedia-analytics does
actually capture tasks from Analytics-EventLogging too, i.e. over the
course of this year numerous updates about these tasks have scrolled by in
this channel with a subject line indicating serious data quality issues.
Sure, not a formal notification, but one would have hoped it increased the
chances of these tasks catching some attention (as they did from several
people outside the team, including one DBA).
It is very unlikely that 3 months after the fact we would know what happened
on EL on September 9th. We retain neither operational logs nor data logs that
long, so that ticket would probably be closed w/o resolution because we did
not learn about it promptly enough.
Thanks,
Nuria
On Fri, Dec 16, 2016 at 11:16 AM, Dan Andreescu <dandreescu(a)wikimedia.org>
wrote:
>
> Thanks for the email, Tilman. I will read it in depth and look closely at
> the issues, but I want to point out something majorly important:
>
> *** We are NOT certain to see tasks unless they're tagged with
> "ANALYTICS". We have an outstanding ask from the phab team and upstream to
> solve issues that will help us get around this limitation. But for the
> meantime, if you want us to see a task you MUST tag it with Analytics. ***
>
> As a result, I personally didn't see these tasks until your email just
> now. I hope my instant response and reaction will help prove that I take
> them seriously. I have tagged those tasks with Analytics and also put them
> in our working board to give them immediate priority.
>
> p.s. the "ERROR 2013 (HY000): Lost connection to MySQL server during
> query" errors are, as far as I understand, just time-outs that help the DBA
> teams manage performance on the analytics servers. I have never seen them
> affect results, and wikimetrics has a way of actively waking up connections
> that die in this way.
>
> On Fri, Dec 16, 2016 at 1:59 PM, Tilman Bayer <tbayer(a)wikimedia.org>
> wrote:
>>
>> (This is a little note I have meant to write for a while. Sending it both
>> as a heads-up for other people who work with this data - many may have
>> encountered some of these issues, but not everybody may be aware of all of
>> them - and as a contribution to the discussion about the Analytics team's
>> "operational excellence" quarterly goal for Q3.)
>>
>> So, EventLogging has been a highly useful part of our analytics
>> infrastructure for years now, critical for the work of many teams. However,
>> over the course of this year there have been several longstanding issues
>> that make me wonder if we are giving it enough attention
>> infrastructure-wise.
>>
>> 1. https://phabricator.wikimedia.org/T146840 Major loss of events in many
>> different schemas, apparently differing by browser family. This affected
>> e.g. one of the main metrics we've been using to evaluate hovercards (page
>> previews) in the reading Web team and was the reason we had to restrict the
>> analysis of recent A/B tests there to Firefox only. It also created
>> confusion for users of the Discovery department's mobile search dashboard
>> and affected the Edit schema as well. No reaction on the task from Analytics
>> since September 28.
>>
>> 2. https://phabricator.wikimedia.org/T142667 Duplicate (spurious)
>> EventLogging rows, a longterm issue first observed, independently, by people
>> from the Reading web team and myself around April/May. The effect on query
>> results is small in most cases, but significant in some, and in any case
>> does not raise confidence in the quality of the data - we would at least
>> like to know what the most likely explanations are. No reaction from
>> Analytics since August, despite four "The World Burns" tokens by other data
>> analysts and a reminder from Reading management.
>>
>> 3. "ERROR 2013 (HY000): Lost connection to MySQL server during query" and
>> "ERROR 2006 (HY000): MySQL server has gone away" when trying to query EL
>> data from stat1003. Happening infrequently but often enough to be a major
>> nuisance at times. (I haven't filed a Phabricator task for this yet, but
>> brought it up on IRC various times. Arguably a more database/service quality
>> issue, but I'm not certain it can't affect query results as well.)
>>
>> There are various other EL issues I have been encountering more
>> sporadically (and in some cases still need to file Phabricator tasks for),
>> but these are some of the most important.
>>
>> I am wondering whether this list may be a better venue for raising
>> awareness when things get stale on Phabricator.
--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB