[Labs-l] Lag reporting on lab db replicas

Jaime Crespo jcrespo at wikimedia.org
Thu Nov 26 08:19:36 UTC 2015


> So even if the replicas don't get updated the heartbeat will report them
> as up to date?

I am not sure exactly what you mean by that. The masters are updated
continuously, every 0.5 seconds (all slaves are read-only; no writes are
done there). If replication works and the slaves get updated, they will
receive the heartbeat through the same replication channel as the rest of
the updates. If replication doesn't work and the replicas do not get
updated, they will not receive the heartbeat either, as it arrives through
replication, in order. If replication stops or fails, heartbeat updates
will stop (from the slave's perspective), and lag will start to increase
from your perspective (the difference between the last timestamp written
and the current time).
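
As a rough illustration of that arithmetic (not the actual pt-heartbeat
code), the lag seen from a replica is the current time minus the last
heartbeat timestamp it has received. The function name `lag_seconds`, the
sample clock values, and the assumption that timestamps are in UTC are
mine, chosen only for this sketch.

```python
from datetime import datetime, timezone

def lag_seconds(last_heartbeat: str, now: datetime) -> float:
    """Seconds elapsed between the last replicated heartbeat and `now`."""
    # Timestamp format as shown in heartbeat_p.heartbeat; UTC is an assumption.
    ts = datetime.strptime(last_heartbeat, "%Y-%m-%dT%H:%M:%S.%f")
    ts = ts.replace(tzinfo=timezone.utc)
    return max(0.0, (now - ts).total_seconds())

last_seen = "2015-11-25T20:20:32.000980"

# Replication healthy: the heartbeat row is less than half a second old.
now = datetime(2015, 11, 25, 20, 20, 32, 400000, tzinfo=timezone.utc)
healthy = lag_seconds(last_seen, now)

# Replication stopped: the row is no longer refreshed, so lag keeps growing.
later = datetime(2015, 11, 25, 20, 25, 32, 980, tzinfo=timezone.utc)
stalled = lag_seconds(last_seen, later)

print(healthy, stalled)
```

With a stalled replica the stored timestamp stays the same, so the computed
value simply grows with the wall clock.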

This measures the replication lag (that is, the difference with the master),
not the last time an edit was done by a user, which is what the first link I
sent measured. In other words, if jaimewiki only receives a user edit every
hour, heartbeat will still write to its master every half second, thus
proving that it is up to date with that resolution. You can still check the
last user edit by checking recentchanges.
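
To make the distinction concrete, here is a small sketch that compares the
two measurements. It uses an in-memory SQLite database as a stand-in for a
replica, with simplified tables; the shard name, the sample rows, and the
clock value are invented for the example. Only the `rc_timestamp` column of
`recentchanges` and the timestamp format of the heartbeat table come from
the real schema.

```python
import sqlite3
from datetime import datetime

# In-memory stand-in for a replica; table contents are made up for the sketch.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE heartbeat (shard TEXT, last_updated TEXT)")
db.execute("CREATE TABLE recentchanges (rc_timestamp TEXT)")
db.execute("INSERT INTO heartbeat VALUES ('s3', '2015-11-25T20:20:32.001000')")
# Last user edit happened about an hour before "now".
db.execute("INSERT INTO recentchanges VALUES ('20151125192000')")

now = datetime(2015, 11, 25, 20, 20, 32, 501000)

(hb,) = db.execute(
    "SELECT last_updated FROM heartbeat WHERE shard = 's3'"
).fetchone()
(rc,) = db.execute("SELECT MAX(rc_timestamp) FROM recentchanges").fetchone()

# Age of the last heartbeat: this is the replication lag.
replication_lag = (
    now - datetime.strptime(hb, "%Y-%m-%dT%H:%M:%S.%f")
).total_seconds()
# Age of the last user edit: says nothing about replication health.
last_edit_age = (now - datetime.strptime(rc, "%Y%m%d%H%M%S")).total_seconds()

print(replication_lag, last_edit_age)
```

Here the replica is half a second behind the master even though nobody has
edited the wiki for an hour.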

The only way this could fail (heartbeat updated but wiki not) is if
there were a specific filter denying replication but allowing heartbeat,
which is only done for specific tables and private wikis. The production
master could also have a problem, but that would affect the wikis
themselves, not only labs.

To give you an idea of the accuracy of this method, we (will) use it in
production to decide whether a slave is usable to return up-to-date data.
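
A minimal sketch of what such a decision could look like, assuming a
simple threshold on the reported lag. This is not the production logic;
the function name, the replica names, and the 1-second default are all
hypothetical.

```python
from typing import Dict, List, Optional

def usable_replicas(
    lags: Dict[str, Optional[float]], max_lag: float = 1.0
) -> List[str]:
    """Return the replicas whose reported lag is known and within max_lag seconds."""
    return sorted(
        name for name, lag in lags.items()
        if lag is not None and lag <= max_lag
    )

# Hypothetical readings: None stands for a replica with no heartbeat row.
reported = {"replica-a": 0.0, "replica-b": 45.0, "replica-c": None}
chosen = usable_replicas(reported)
print(chosen)
```

A tool on labs could apply the same idea before running a query whose
result needs to be current.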

For more information on how this works, check
<https://www.percona.com/doc/percona-toolkit/2.1/pt-heartbeat.html#description>

On Wed, Nov 25, 2015 at 9:51 PM, Ricordisamoa <ricordisamoa at openmailbox.org>
wrote:

> On 25/11/2015 21:21, Jaime Crespo wrote:
>
> Always fearing doing queries on a lagged replica on labs? Not anymore!
>
> While Betacommand's tool [0] was very useful, it was also very inaccurate,
> as it tried to check the lag by looking at the last rows updated, which can
> be a long time ago on the least popular wikis.
>
> What I offer now is sub-second accurate lag measuring, by writing on the
> production masters the current time, in microseconds, every 0.5 seconds and
> making that available on all hosts (using this tool [1]). So, it is more
> accurate than SHOW SLAVE STATUS, because it compares the difference with
> the original master, and it will work even if replication is broken.
>
>
> So even if the replicas don't get updated the heartbeat will report them
> as up to date?
>
>
> To read it, just do SELECT * FROM heartbeat_p.heartbeat;
> And you will get:
> +-------+----------------------------+------+
> | shard | last_updated               | lag  |
> +-------+----------------------------+------+
> | s6    | 2015-11-25T20:20:32.000980 |    0 |
> | s2    | 2015-11-25T20:20:32.001030 |    0 |
> | s7    | 2015-11-25T20:20:32.001070 |    0 |
> | s3    | 2015-11-25T20:20:32.001000 |    0 |
> | s4    | 2015-11-25T20:20:32.000920 |    0 |
> | s1    | 2015-11-25T20:20:32.000740 |    0 |
> | s5    | 2015-11-25T20:20:32.000830 |    0 |
> +-------+----------------------------+------+
>
> Read the detailed documentation on: [2]
>
> Use it, create a web page if you want to make it public! Report a ticket
> if it gets too high! Report a ticket if you need more info (a record per
> wiki?). But I wanted to give you the essentials, and you can build
> yourselves on top of that.
>
> Only 2 known bugs:
> - There is microsecond accuracy, but it cannot be used until a bug in
> MariaDB is fixed [3]
> - enwiki will only report s1 lag until that server is restarted due to
> some existing filters. We will schedule that at some time in the future.
>
> [0]<http://tools.wmflabs.org/betacommand-dev/cgi-bin/replag>
> [1]<https://www.percona.com/doc/percona-toolkit/2.2/pt-heartbeat.html>
> [2]<https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Database#Identifying_lag>
> [3]<https://mariadb.atlassian.net/browse/MDEV-9175>
> --
> Jaime Crespo
> <http://wikimedia.org>
>
>
> _______________________________________________
> Labs-l mailing list
> Labs-l at lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/labs-l
>
>


-- 
Jaime Crespo
<http://wikimedia.org>