I was about to send a similar e-mail: unless you tag tasks with Analytics, we will not see the issues. So if an issue is important enough and we have not responded (for the most part we respond quite fast to operational issues), DO ping us on IRC. That is what the channel is for.

It is very unlikely that, 3 months after the fact, we would know what happened on EL on September 9th. We retain neither operational logs nor data logs that long, so that ticket would probably be closed without resolution because we did not learn about it promptly enough.

Thanks,

Nuria

On Fri, Dec 16, 2016 at 11:16 AM, Dan Andreescu <dandreescu@wikimedia.org> wrote:
Thanks for the email, Tilman.  I will read it in depth and look closely at the issues, but I want to point out something majorly important:

*** We are NOT certain to see tasks unless they're tagged with "ANALYTICS".  We have an outstanding ask from the phab team and upstream to solve issues that will help us get around this limitation.  But in the meantime, if you want us to see a task you MUST tag it with Analytics. ***

As a result, I personally didn't see these tasks until your email just now.  I hope my instant response and reaction will help prove that I take them seriously.  I have tagged those tasks with Analytics and also put them in our working board to give them immediate priority.

p.s. the "ERROR 2013 (HY000): Lost connection to MySQL server during query" errors are, as far as I understand, just time-outs that help the DBA teams manage performance on the analytics servers.  I have never seen them affect results, and wikimetrics has a way of actively waking up connections that die in this way.
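
For anyone querying from their own scripts, the reconnect idea can be sketched roughly as below. This is only an illustration and not wikimetrics' actual code: `LostConnection`, `run_with_reconnect`, and the `connect` / `run_query` callables are hypothetical names, and the exception class stands in for whatever operational error your MySQL driver raises. The only things taken from the errors above are the client error codes 2006 and 2013.

```python
import time

# MySQL client error codes quoted in this thread.
CR_SERVER_GONE_ERROR = 2006  # "MySQL server has gone away"
CR_SERVER_LOST = 2013        # "Lost connection to MySQL server during query"
RETRYABLE_CODES = {CR_SERVER_GONE_ERROR, CR_SERVER_LOST}


class LostConnection(Exception):
    """Hypothetical stand-in for a driver's operational error."""

    def __init__(self, code, message):
        super().__init__(code, message)
        self.code = code


def run_with_reconnect(connect, run_query, max_attempts=3,
                       base_delay=1.0, sleep=time.sleep):
    """Run a query, opening a fresh connection and retrying on 2006/2013.

    connect   -- callable returning a new connection (assumed, driver-specific)
    run_query -- callable taking that connection and returning the result
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        conn = connect()  # always start from a fresh connection
        try:
            return run_query(conn)
        except LostConnection as err:
            if err.code not in RETRYABLE_CODES:
                raise  # any other error is not a time-out; do not retry
            last_error = err
            if attempt < max_attempts:
                # back off exponentially before reconnecting
                sleep(base_delay * 2 ** (attempt - 1))
    raise last_error
```

Retrying only helps when the query itself can finish within the time-out; a query that is simply too slow would fail on every attempt and should be narrowed instead.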

On Fri, Dec 16, 2016 at 1:59 PM, Tilman Bayer <tbayer@wikimedia.org> wrote:
(This is a little note I have meant to write for a while. Sending it both as a heads-up for other people who work with this data - many may have encountered some of these issues, but not everybody may be aware of all of them - and as a contribution to the discussion about the Analytics team's "operational excellence" quarterly goal for Q3.)

So, EventLogging has been a highly useful part of our analytics infrastructure for years now, critical for the work of many teams. However, over the course of this year there have been several longstanding issues that make me wonder if we are giving it enough attention infrastructure-wise.

1. https://phabricator.wikimedia.org/T146840 Major loss of events in many different schemas, apparently differing by browser family. This affected e.g. one of the main metrics we've been using to evaluate hovercards (page previews) in the Reading Web team and was the reason we had to restrict the analysis of recent A/B tests there to Firefox only. It also created confusion for users of the Discovery department's mobile search dashboard and affected the Edit schema as well. No reaction on the task from Analytics since September 28.

2. https://phabricator.wikimedia.org/T142667 Duplicate (spurious) EventLogging rows, a long-term issue first observed, independently, by people from the Reading Web team and myself around April/May. The effect on query results is small in most cases, but significant in some, and in any case does not raise confidence in the quality of the data - we would at least like to know what the most likely explanations are. No reaction from Analytics since August, despite four "The World Burns" tokens by other data analysts and a reminder from Reading management.

3. "ERROR 2013 (HY000): Lost connection to MySQL server during query" and "ERROR 2006 (HY000): MySQL server has gone away" when trying to query EL data from stat1003. Happening infrequently but often enough to be a major nuisance at times. (I haven't filed a Phabricator task for this yet, but brought it up on IRC various times. Arguably more of a database/service quality issue, but I'm not certain it can't affect query results as well.)

There are various other EL issues I have been encountering more sporadically (and in some cases still need to file Phabricator tasks for), but these are some of the most important.
 
I am wondering whether this list may be a better venue for raising awareness when things get stale on Phabricator.
--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


