[Labs-l] [Analytics] Lag reporting on lab db replicas

Fri Nov 27 05:16:17 UTC 2015

On Wed, Nov 25, 2015 at 1:21 PM, Jaime Crespo <jcrespo at wikimedia.org> wrote:
> Always fearing doing queries on a lagged replica on labs? Not anymore!
>
> While Betacommand's tool [0] was very useful, it was also very inaccurate,
> as it tried to check the lag by looking at the last rows updated, which can
> be a long time ago on the least popular wikis.
>
> What I offer now is sub-second accurate lag measuring, by writing on the
> production masters the current time, in microseconds, every 0.5 seconds and
> making that available on all hosts (using this tool [1]). So, it is more
> accurate than SHOW SLAVE STATUS, because it compares the difference with the
> original master, and it will work even if replication is broken.
>
> To read it, just do SELECT * FROM heartbeat_p.heartbeat;
> And you will get:
> +-------+----------------------------+------+
> | shard | last_updated               | lag  |
> +-------+----------------------------+------+
> | s6    | 2015-11-25T20:20:32.000980 |    0 |
> | s2    | 2015-11-25T20:20:32.001030 |    0 |
> | s7    | 2015-11-25T20:20:32.001070 |    0 |
> | s3    | 2015-11-25T20:20:32.001000 |    0 |
> | s4    | 2015-11-25T20:20:32.000920 |    0 |
> | s1    | 2015-11-25T20:20:32.000740 |    0 |
> | s5    | 2015-11-25T20:20:32.000830 |    0 |
> +-------+----------------------------+------+
>
> Read the detailed documentation at: [2]
>
> Use it, create a web page if you want to make it public! File a ticket if
> it gets too high! File a ticket if you need more info (a record per
> wiki?). But I wanted to give you the essentials, and you can build
> on top of that yourselves.
>
> Only 2 known bugs:
> - There is microsecond accuracy, but it cannot be used until a bug in
> MariaDB is fixed [3]
> - enwiki will only report s1 lag until that server is restarted due to some
> existing filters. We will schedule that at some time in the future.
>
> [0]<http://tools.wmflabs.org/betacommand-dev/cgi-bin/replag>
> [1]<https://www.percona.com/doc/percona-toolkit/2.2/pt-heartbeat.html>
> [2]<https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Database#Identifying_lag>
> [3]<https://mariadb.atlassian.net/browse/MDEV-9175>

I made a tool [4] that reads the heartbeat_p database from the
server that hosts each shard and matches it with the shard for each
wiki. The tool gets all (dbname, slice) pairs from meta_p.wiki and the
slice replag from heartbeat_p.heartbeat on the server hosting each
slice, and then matches them up in the table. I think I got the logic
here right, but you can view the source [5] to see if you agree.

[4]: https://tools.wmflabs.org/replag/
[5]: https://tools.wmflabs.org/replag/?source
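
For anyone who wants the matching step without reading the whole
source, here is a minimal Python sketch of the idea. It is not the
tool's actual code. It assumes the rows have already been fetched, and
that slice values in meta_p.wiki look like "s1.labsdb" while
heartbeat_p.heartbeat uses the bare shard name "s1". The function
names, the sample rows, the lag values and the "xxwiki" entry are all
made up for illustration.

```python
def shard_of(slice_name):
    """Reduce a slice such as 's1.labsdb' to the shard name 's1'.

    The 's1.labsdb' form is an assumption; a bare 's1' passes through unchanged.
    """
    return slice_name.split(".", 1)[0]


def match_lag(wikis, shard_lag):
    """Map each dbname to the lag of its shard, or None if the shard is unknown.

    wikis:     iterable of (dbname, slice) pairs, as read from meta_p.wiki
    shard_lag: dict of shard -> lag in seconds, as read from heartbeat_p.heartbeat
    """
    return {
        dbname: shard_lag.get(shard_of(slice_name))
        for dbname, slice_name in wikis
    }


# Illustrative rows only, not real query results.
wikis = [
    ("enwiki", "s1.labsdb"),
    ("frwiki", "s6.labsdb"),
    ("xxwiki", "s9.labsdb"),  # hypothetical wiki on a shard with no heartbeat row
]
shard_lag = {"s1": 0, "s6": 3}

lag_by_wiki = match_lag(wikis, shard_lag)
for dbname, lag in sorted(lag_by_wiki.items()):
    print(dbname, "unknown" if lag is None else lag)
```

A wiki whose shard has no heartbeat row comes back as None rather than
0, so a missing measurement is not mistaken for "no lag".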

Bryan
-- 
Bryan Davis              Wikimedia Foundation    <bd808 at wikimedia.org>
[[m:User:BDavis_(WMF)]]  Sr Software Engineer            Boise, ID USA
irc: bd808                                        v:415.839.6885 x6855