[Labs-l] update on labsdb replica sync issues

Sean Pringle springle at wikimedia.org
Mon Nov 3 22:27:46 UTC 2014


Recently we've seen labsdb replicas falling out of sync with upstream,
which has been tricky to debug. At different times there have been:

- Duplicate key errors
- Missing key errors
- The above errors sometimes proving irreparable by ALTER or resync
- A few out-of-memory (OOM) events killing mysqld outright

The OOM issues occur when slow, poorly optimized queries touching large
amounts of data are allowed to run for too long (usually days or weeks).
Previously we've tried to be hands-off and only kill these when absolutely
necessary, but it is clear we'll have to impose some automated time and
memory limits.

The observant reader will note that even OOM should not affect an RDBMS
properly using ACID and transactions, and that is entirely correct. I
include it because it isn't clear how OOM and subsequent replication
restarts would have interacted with MDEV-6589 (below).

As for the replication problems, the following have all been relevant at
one point or another, interacting together to produce some weird results:

https://mariadb.atlassian.net/browse/MDEV-6551
(affected us prior to MariaDB 10.0.14)

https://mariadb.atlassian.net/browse/MDEV-6589
(related to 6551; potentially affected us prior to a configuration change,
but only in theory)

https://tokutek.atlassian.net/browse/DB-739
(still affects us on MariaDB 10.0.14 and TokuDB 7.5.0)

So, stuff to do:

1. We need some memory and time limits for user queries. Memory usage is
easy to track server-side on a per-client basis, but users may find it
difficult to predict or understand why specific queries trip some arbitrary
memory limit. So, just time based? Thoughts?

2. The TokuDB bug DB-739 appears only on specific types of upstream
transaction, so some replica tables (including but not necessarily limited
to *links, user, recentchanges, and geo_tags) are being switched back to
InnoDB until further notice.

3. After #2 we resync across the board, yet again.
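
For the time-based option in #1, here is a rough sketch of what an automated
limit could look like. It is illustrative only and not what currently runs on
labsdb: the ProcessRow type, the function names, the exempt user list, and the
one-hour limit in the usage note are all assumptions made up for this example.
It takes rows shaped like SHOW FULL PROCESSLIST output and produces KILL QUERY
statements for user queries that have run past the limit, leaving replication
threads and idle connections alone.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass
class ProcessRow:
    """Hypothetical subset of one SHOW FULL PROCESSLIST row."""
    id: int        # connection/thread id
    user: str      # account that owns the connection
    command: str   # e.g. "Query", "Sleep", "Connect"
    time: int      # seconds spent in the current state


# Assumed exemptions: replication threads appear as "system user";
# "root" stands in for whatever admin accounts should never be killed.
EXEMPT_USERS: FrozenSet[str] = frozenset({"system user", "root"})


def overdue_queries(rows: Iterable[ProcessRow],
                    limit_seconds: int,
                    exempt_users: FrozenSet[str] = EXEMPT_USERS) -> List[ProcessRow]:
    """Return rows that are actively running a query for longer than the limit."""
    return [
        row for row in rows
        if row.command == "Query"          # skip sleeping/idle connections
        and row.time > limit_seconds       # strictly over the limit
        and row.user not in exempt_users   # never touch replication or admin
    ]


def kill_statements(rows: Iterable[ProcessRow], limit_seconds: int) -> List[str]:
    """Build KILL QUERY statements; the connection itself is left open."""
    return [f"KILL QUERY {row.id}" for row in overdue_queries(rows, limit_seconds)]
```

With a one-hour limit, kill_statements(rows, 3600) would list only user
threads whose Command is "Query" and whose Time exceeds 3600. KILL QUERY is
used rather than KILL so that the client gets an error on that statement
instead of losing its connection.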