I will also note that the wikibugs bot in #wikimedia-analytics does capture tasks from Analytics-EventLogging too: over the course of this year, numerous updates about these tasks have scrolled by in that channel with subject lines indicating serious data quality issues. Sure, that is not a formal notification, but one would have hoped it increased the chances of these tasks catching some attention (as they did from several people outside the team, including one DBA).
>
> It is very unlikely that 3 months after the fact we would know what happened
> on EL on September 9th; we retain neither operational logs nor data
> logs that long, so that ticket would probably be closed without resolution
> because we did not learn about it promptly enough.
>
> Thanks,
>
> Nuria
>
> On Fri, Dec 16, 2016 at 11:16 AM, Dan Andreescu <dandreescu@wikimedia.org>
> wrote:
>>
>> Thanks for the email, Tilman. I will read it in depth and look closely at
>> the issues, but I want to point out something majorly important:
>>
>> *** We are NOT certain to see tasks unless they're tagged with
>> "ANALYTICS". We have an outstanding ask from the phab team and upstream to
>> solve issues that will help us get around this limitation. But for the
>> meantime, if you want us to see a task you MUST tag it with Analytics. ***
>>
>> As a result, I personally didn't see these tasks until your email just
>> now. I hope my instant response and reaction will help prove that I take
>> them seriously. I have tagged those tasks with Analytics and also put them
>> in our working board to give them immediate priority.
>>
>> p.s. the "ERROR 2013 (HY000): Lost connection to MySQL server during
>> query" errors are, as far as I understand, just time-outs that help the DBA
>> teams manage performance on the analytics servers. I have never seen them
>> affect results, and wikimetrics has a way of actively waking up connections
>> that die in this way.
>>
>> On Fri, Dec 16, 2016 at 1:59 PM, Tilman Bayer <tbayer@wikimedia.org>
>> wrote:
>>>
>>> (This is a little note I have meant to write for a while. Sending it both
>>> as a heads-up for other people who work with this data - many may have
>>> encountered some of these issues, but not everybody may be aware of all of
>>> them - and as a contribution to the discussion about the Analytics team's
>>> "operational excellence" quarterly goal for Q3.)
>>>
>>> So, EventLogging has been a highly useful part of our analytics
>>> infrastructure for years now, critical for the work of many teams. However,
>>> over the course of this year there have been several longstanding issues
>>> that make me wonder if we are giving it enough attention
>>> infrastructure-wise.
>>>
>>> 1. https://phabricator.wikimedia.org/T146840 Major loss of events in many
>>> different schemas, apparently differing by browser family. This affected
>>> e.g. one of the main metrics we've been using to evaluate hovercards (page
>>> previews) in the reading Web team and was the reason we had to restrict the
>>> analysis of recent A/B tests there to Firefox only. It also created
>>> confusion for users of the Discovery department's mobile search dashboard
>>> and affected the Edit schema as well. No reaction on the task from Analytics
>>> since September 28.
>>>
>>> 2. https://phabricator.wikimedia.org/T142667 Duplicate (spurious)
>>> EventLogging rows, a longterm issue first observed, independently, by people
>>> from the Reading web team and myself around April/May. The effect on query
>>> results is small in most cases, but significant in some, and in any case
>>> does not raise confidence in the quality of the data - we would at least
>>> like to know what the most likely explanations are. No reaction from
>>> Analytics since August, despite four "The World Burns" tokens by other data
>>> analysts and a reminder from Reading management.
>>>
>>> 3. "ERROR 2013 (HY000): Lost connection to MySQL server during query" and
>>> "ERROR 2006 (HY000): MySQL server has gone away" when trying to query EL
>>> data from stat1003. Happening infrequently but often enough to be a major
>>> nuisance at times. (I haven't filed a Phabricator task for this yet, but
>>> brought it up on IRC various times. Arguably a more database/service quality
>>> issue, but I'm not certain it can't affect query results as well.)
>>>
>>> There are various other EL issues I have been encountering more
>>> sporadically (and in some cases still need to file Phabricator tasks for),
>>> but these are some of the most important.
>>>
>>> I am wondering whether this list may be a better venue for raising
>>> awareness when things get stale on Phabricator.
>>> --
>>> Tilman Bayer
>>> Senior Analyst
>>> Wikimedia Foundation
>>> IRC (Freenode): HaeB
>>>
>>> _______________________________________________
>>> Analytics mailing list
>>> Analytics@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>
>
--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB