[Labs-l] A tale of three databases

Denny Vrandečić vrandecic at gmail.com
Tue Sep 23 16:14:19 UTC 2014


Thank you for the postmortem! Such write-ups often contain very valuable
lessons, and I am glad you chose to write it down and share it so openly.
This deserves kudos!



On Tue, Sep 23, 2014 at 6:46 AM, Marc A. Pelletier <marc at uberbox.org> wrote:

> [Or: an outage report in three acts]
>
> So, what happened over the last couple of days that has caused so many
> small issues with the replica databases?  In order to make that clear,
> I'll explain a bit how the replicas are structured.
>
> At the dawn of time, the production replicas were set up as a
> small-scale copy of how production itself is set up, with the various
> project DBs split in seven "slices" to spread load.  Those seven slices
> ran on three (physical) servers, and each held a replica of its
> production equivalent.  (This is what everyone saw as "s1" - "s7").
>
> Now, in order to allow tools that don't understand that more than one
> database can live on the same physical server to work without needing
> adaptation (to ease transition from the toolserver), I set up a set of
> ugly networking rules[1] that made those three servers appear to be
> seven different ones - allowing code to pretend that just changing the
> address gets you to a different server.
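>
> To make that concrete -- the addresses, ports and slice layout below are
> invented for illustration, not the real configuration -- the rules
> amounted to a handful of DNAT mappings from faux per-slice IPs onto the
> three real servers, the kind of thing a few lines of Python can spit out:
>
>     # Sketch only: generate iptables DNAT rules that make one physical
>     # server answer on several per-slice addresses.  Traffic sent to a
>     # faux slice IP on the standard mysql port gets rewritten to the real
>     # server's IP and to the port of the instance holding that slice.
>     slices = {
>         # name   faux IP        real server IP   real port
>         "s1": ("192.0.2.1", "198.51.100.1", 3306),
>         "s2": ("192.0.2.2", "198.51.100.1", 3307),
>         "s3": ("192.0.2.3", "198.51.100.2", 3306),
>         "s4": ("192.0.2.4", "198.51.100.2", 3307),
>         "s5": ("192.0.2.5", "198.51.100.3", 3306),
>         "s6": ("192.0.2.6", "198.51.100.3", 3307),
>         "s7": ("192.0.2.7", "198.51.100.3", 3308),
>     }
>
>     for name, (faux_ip, real_ip, real_port) in sorted(slices.items()):
>         print("# %s" % name)
>         print("iptables -t nat -A PREROUTING -p tcp -d %s --dport 3306 "
>               "-j DNAT --to-destination %s:%d"
>               % (faux_ip, real_ip, real_port))
>
> Seven addresses for tools to point at, all quietly ending up on just
> three machines.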
>
> Enter MariaDB.
>
> Now, MariaDB is a very nice improvement for everyone: not only does it
> allow us to mirror /every/ slice on all three servers (allowing easy
> joins between databases), but it does so faster and more reliably than
> vanilla mysql could thanks to a new database engine (TokuDB).  What this
> meant is that we no longer needed to run seven mysql instances but just
> one per server, each having a copy of every production database.
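>
> As an illustration -- the host name, credentials file and choice of wikis
> here are assumptions for the example, not a prescription -- a cross-wiki
> query that used to need two separate connections now works as a single
> join against one instance:
>
>     # Sketch: join two project databases that now live in the same
>     # MariaDB instance.  Uses pymysql and a replica.my.cnf credentials
>     # file; both are illustrative choices.
>     import os
>     import pymysql
>
>     conn = pymysql.connect(
>         host="enwiki.labsdb",
>         read_default_file=os.path.expanduser("~/replica.my.cnf"))
>     with conn.cursor() as cur:
>         # Titles that exist in the main namespace of both the English
>         # and the German Wikipedia replicas -- one query, two databases.
>         cur.execute("""
>             SELECT en.page_title
>             FROM enwiki_p.page AS en
>             JOIN dewiki_p.page AS de
>               ON de.page_title = en.page_title
>             WHERE en.page_namespace = 0
>               AND de.page_namespace = 0
>             LIMIT 10""")
>         for (title,) in cur.fetchall():
>             print(title)
>     conn.close()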
>
> So, Sean (our DBA) set about converting our setup to MariaDB and
> merging the databases that used to live on each server into a single
> instance.  This worked well, with only minor problems caused by some
> slight behaviour differences between mysql and mariadb, or between
> innodb (the previous database engine) and tokudb.  Two of the servers
> were completed that way, with the third soon to be done once the kinks
> were worked out[2].
>
> Fast forward several weeks, and a second, unrelated issue was on the
> plate to fix.  You see, of the three database servers one had been set
> up in the wrong place in the datacenter[3]; it worked, but because it
> was there it kept needing special exceptions in the firewall rules,
> which was not only a maintenance burden but also error prone and less
> secure.
>
> Fixing /that/ would be a simple thing; it only needs a short downtime
> while someone actually physically hauls the hardware from one place in
> the datacenter to another and changes its IP address.
>
> That went well, and in less than an hour the database was sitting
> happily in its new rack with its new IP address.
>
> Now, at that point, the networking configuration needed to be changed
> anyway, and since the databases had been merged[4], it was obvious that
> this was the right time to rip out the ugly networking rules that had
> become noops and by now just added a layer of needless complexity.
>
> That also went well, except for one niggling detail[5]: the databases on
> the third server never /did/ get merged like on the other two.  Removing
> the networking rules had no effect on the first two (as expected), but
> now only the first of the three databases on the third server was
> accessible.
>
> Worse: it *looks* like the other two databases are still happily working
> since you apparently can still connect to them (but end up connected to
> the wrong one).
>
> So, the change is made, accompanied by some tests, and all seems fine
> because, out of the dozen or so project databases I tested, I didn't
> happen to test connecting to a database that used to be on the two of
> the seven slices that are no longer visible.
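>
> In hindsight, the check that would have caught it is tiny.  The alias
> names and which database is expected on each slice below are made up for
> the example, but the shape is what matters: don't just connect, confirm
> you reached the slice you think you reached.
>
>     # Sketch: for every slice alias, connect and confirm that a database
>     # known to live on that slice is actually visible there.  The point
>     # is to catch "connects fine, but it's the wrong server".
>     import os
>     import pymysql
>
>     expected = {
>         "s1.labsdb": "enwiki_p",
>         "s2.labsdb": "dewiki_p",
>         "s3.labsdb": "frwiki_p",
>         # ... one entry per slice, including the "obviously fine" ones
>     }
>
>     for host, db in sorted(expected.items()):
>         try:
>             conn = pymysql.connect(
>                 host=host,
>                 read_default_file=os.path.expanduser("~/replica.my.cnf"))
>         except pymysql.err.OperationalError as exc:
>             print("%s: connection failed (%s)" % (host, exc))
>             continue
>         with conn.cursor() as cur:
>             cur.execute("SHOW DATABASES LIKE %s", (db,))
>             if cur.fetchone() is None:
>                 print("%s: connected, but %s is not there!" % (host, db))
>             else:
>                 print("%s: %s ok" % (host, db))
>         conn.close()
>
> A dozen hand-picked test connections can all pass; a loop over every
> slice would not have.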
>
> Monday comes, panic ensues.  In the end, we decided to merge the
> databases on the third server as the fix (that took around a day), and
> we're back to working status with everything done.
>
> Like all good tales, this one has a moral[6].  No change is so obvious
> that it doesn't require careful planning.  The disruption over the
> weekend was due only to the fact that I didn't take the time to double
> check my assumptions because the change was "trivial".
>
> Or, as I learned while wiping the egg from my face, would have *been*
> trivial if my assumptions matched reality.
>
> Exit sysadmin stage left, head hung low in shame at his hubris exposed.
>
> -- Marc
>
> [1] The "iptable rules" you may have heard mentionned on occasions.
> Basically, just a set of NAT rules to redirect faux IPs standing in for
> the servers to the right IP and port.
>
> [2] Pay attention here, that's some skillful foreshadowing right there.
>
> [3] Moved from one row of eqiad to another, for those keeping score.
>
> [4] If you've been following at home, you already see where this is
> heading.
>
> [5] Also, the change was done on a Friday.  "But it's just a trivial
> change!"
>
> [6] Well, two morals if you count the "Don't do a change before you
> leave for the weekend!" beating I also gave myself.
>
>
> _______________________________________________
> Labs-l mailing list
> Labs-l at lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/labs-l
>

