(This is a little note I have meant to write for a while. I am sending it
both as a heads-up for other people who work with this data - many may have
encountered some of these issues, but not everybody may be aware of all of
them - and as a contribution to the discussion about the Analytics team's
"operational excellence" quarterly goal for Q3
<https://www.mediawiki.org/wiki/Wikimedia_Engineering/2016-17_Q3_Goals#Analytics>.)
So, EventLogging has been a highly useful part of our analytics
infrastructure for years now, critical for the work of many teams. However,
over the course of this year there have been several longstanding issues
that make me wonder if we are giving it enough attention
infrastructure-wise.
1. https://phabricator.wikimedia.org/T146840 Major loss of events in many
different schemas, apparently differing by browser family. This affected
e.g. one of the main metrics we've been using to evaluate hovercards (page
previews) in the Reading Web team and was the reason we had to restrict the
analysis of recent A/B tests there to Firefox only. It also created
confusion for users of the Discovery department's mobile search dashboard
and affected the Edit schema as well. No reaction on the task from
Analytics since September 28.
2. https://phabricator.wikimedia.org/T142667 Duplicate (spurious)
EventLogging rows, a long-term issue first observed, independently, by
people from the Reading Web team and myself around April/May. The effect on
query results is small in most cases, but significant in some, and in any
case does not raise confidence in the quality of the data - we would at
least like to know what the most likely explanations are. No reaction from
Analytics since August, despite four "The World Burns" tokens by other data
analysts and a reminder from Reading management.
3. "ERROR 2013 (HY000): Lost connection to MySQL server during query" and
"ERROR 2006 (HY000): MySQL server has gone away" when trying to query EL
data from stat1003. Happening infrequently but often enough to be a major
nuisance at times. (I haven't filed a Phabricator task for this yet, but
have brought it up on IRC various times. Arguably more of a database/service
quality issue, but I'm not certain it can't affect query results as well.)
There are various other EL issues I have been encountering more
sporadically (and in some cases still need to file Phabricator tasks for),
but these are some of the most important.
I am wondering whether this list may be a better venue for raising
awareness when things get stale on Phabricator.
--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB