[Engineering] Fwd: [Analytics] EventLogging MySQL Schema Whitelist

Adam Baso abaso at wikimedia.org
Tue Sep 11 14:39:53 UTC 2018


Cross-post.

---------- Forwarded message ---------
From: Andrew Otto <otto at wikimedia.org>
Date: Tue, Sep 11, 2018 at 9:31 AM
Subject: [Analytics] EventLogging MySQL Schema Whitelist
To: A mailing list for the Analytics Team at WMF and everybody who has an
interest in Wikipedia and analytics. <analytics at lists.wikimedia.org>,
Internal discussion of WMF Research Team <
research-internal at lists.wikimedia.org>, Product Analytics <
product-analytics at wikimedia.org>


Hi all you EventLogging users out there!

tl;dr
We will switch EventLogging MySQL ingestion to be based on a schema
whitelist rather than blacklist.

As you know, we currently import EventLogging events into two locations for
analysis: The MySQL ‘log’ database, and the Hive ‘event’ database.  MySQL
is not able to handle high volume events.  We currently blacklist
<https://github.com/wikimedia/puppet/blob/production/hieradata/common.yaml#L362>
any schemas that we know have high volumes from being ingested into MySQL.
This can cause problems when a new high volume schema is deployed, as it
requires knowledge and communication from the schema owners to the
Analytics team, and it requires an Analytics Operations engineer to make a
Puppet commit to blacklist the schema.  To address this problem, we will
switch the EventLogging MySQL schema blacklist to a whitelist.  All schemas
that are actively being ingested into MySQL today will be whitelisted.  In
the future, if you want an event schema to be ingested into MySQL, you’ll
need to ask the Analytics team to whitelist it.
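
To illustrate the difference in default behaviour, here is a minimal Python
sketch. It is not the actual EventLogging or Puppet configuration; the schema
names and function names below are hypothetical and only show how a new,
unlisted schema is treated under each policy.

```python
from typing import Set


def ingested_with_blacklist(schema: str, blacklist: Set[str]) -> bool:
    """Blacklist policy: ingest everything except schemas explicitly listed."""
    return schema not in blacklist


def ingested_with_whitelist(schema: str, whitelist: Set[str]) -> bool:
    """Whitelist policy: ingest only schemas explicitly listed."""
    return schema in whitelist


# Hypothetical schema names, for illustration only.
blacklist = {"KnownHighVolumeSchema"}
whitelist = {"ExistingLowVolumeSchema"}
new_schema = "NewlyDeployedSchema"

# Under a blacklist, a new schema reaches MySQL until someone blocks it.
new_schema_under_blacklist = ingested_with_blacklist(new_schema, blacklist)

# Under a whitelist, a new schema stays out of MySQL until someone allows it.
new_schema_under_whitelist = ingested_with_whitelist(new_schema, whitelist)

print(new_schema_under_blacklist, new_schema_under_whitelist)
```

With a whitelist, forgetting to tell the Analytics team about a new schema
means it simply is not ingested into MySQL, rather than risking a high volume
of events reaching it.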

Hive has been working for EventLogging analysis for a while now.  It has
almost all of the schemas that MySQL does, plus the high volume ones.  One
day in the (distant?) future, we’d like to decommission MySQL storage of
events. (Don’t worry yet: MySQL decommissioning has a lot of blockers, and
this work is not planned.)  By not ingesting events into MySQL by default,
we hope to encourage more users to switch to Hive.

This switch to a whitelist will happen this week.  If you are deploying a
new schema and expect it to show up in MySQL, let us know so we can
whitelist it!

Thanks!
-Andrew + the Analytics Engineering team

https://phabricator.wikimedia.org/T203596

_______________________________________________
Analytics mailing list
Analytics at lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics