[Labs-l] A tale of three databases

Marc A. Pelletier marc at uberbox.org
Tue Sep 23 13:46:47 UTC 2014


[Or: an outage report in three acts]

So, what happened over the last couple of days that caused so many
small issues with the replica databases?  In order to make that clear,
I'll explain a bit about how the replicas are structured.

At the dawn of time, the production replicas were set up as a
small-scale copy of how production itself is set up, with the various
project DBs split into seven "slices" to spread load.  Those seven
slices ran on three (physical) servers, and each slice held a replica
of its production equivalent.  (This is what everyone saw as "s1" -
"s7").

Now, in order to let tools that don't understand that more than one
database can live on the same physical server work without needing
adaptation (to ease the transition from the Toolserver), I set up a set
of ugly networking rules[1] that made those three servers appear to be
seven different ones - allowing code to pretend that just changing the
address gets you to a different server.

Enter MariaDB.

Now, MariaDB is a very nice improvement for everyone: not only does it
allow us to mirror /every/ slice on all three servers (allowing easy
joins between databases), but it does so faster and more reliably than
vanilla MySQL could, thanks to a new storage engine (TokuDB).  What
this meant is that we no longer needed to run seven MySQL instances,
but just one per server, each holding a copy of every production
database.
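
For instance, with every slice on the same server, a tool can join two
project databases over a single connection.  A made-up example (the
"_p" names and the page table follow the usual replica conventions;
the host name here is purely illustrative):

    # count main-namespace titles that exist on both enwiki and frwiki
    mysql -h s1.example.labsdb -e "
        SELECT COUNT(*)
          FROM enwiki_p.page AS en
          JOIN frwiki_p.page AS fr
            ON en.page_title = fr.page_title
         WHERE en.page_namespace = 0
           AND fr.page_namespace = 0;"

If those two databases lived on different slices, that kind of query
used to mean two connections and stitching the results together
client-side.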

So, Sean (our DBA) set about converting our setup to MariaDB and
merging the databases that used to live on each server into a single
instance.  This worked well, with only minor problems caused by some
slight behaviour differences between MySQL and MariaDB, or between
InnoDB (the previous storage engine) and TokuDB.  Two of the servers
were converted that way, with the third soon to be done once the kinks
were worked out[2].

Fast forward several weeks, and a second, unrelated issue was on the
plate to fix.  You see, of the three database servers, one had been
set up in the wrong place in the datacenter[3]; it worked, but because
of where it was it kept needing special exceptions in the firewall
rules, which was not only a maintenance burden but also error-prone
and less secure.

Fixing /that/ would be a simple thing; it only needed a short downtime
while someone physically hauled the hardware from one place in the
datacenter to another and changed its IP address.

That went well, and in less than an hour the database was sitting
happily in its new rack with its new IP address.

Now, at that point, the networking configuration needed to be changed
anyway, and since the databases had been merged[4], it was obvious
that this was the right time to rip out the ugly networking rules that
had become no-ops and by now just added a layer of needless complexity.

That also went well, except for one niggling detail[5]: the databases
on the third server never /did/ get merged like the other two.
Removing the networking rules had no effect on the first two servers
(as expected), but now only the first of the three databases on the
third was accessible.

Worse: it *looks* like the other two databases are still happily
working, since you can apparently still connect to them (but end up
connected to the wrong one).

So, the change is made, accompanied by some tests, and all seems fine
because, out of the dozen or so project databases I tested, none
happened to be on the two out of seven slices that are no longer
visible.

Monday comes, panic ensues.  In the end, we decided to merge the
databases on the third server as the fix (that took around a day), and
we're back to working status with everything done.

Like all good tales, this one has a moral[6].  No change is so obvious
that it doesn't require careful planning.  The disruption over the
weekend was due only to the fact that I didn't take the time to
double-check my assumptions because the change was "trivial".

Or, as I learned while wiping the egg from my face, it would have
*been* trivial if my assumptions had matched reality.

Exit sysadmin stage left, head hung low in shame at his hubris exposed.

-- Marc

[1] The "iptable rules" you may have heard mentionned on occasions.
Basically, just a set of NAT rules to redirect faux IPs standing in for
the servers to the right IP and port.
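
For illustration, each of those rules was something along these lines
(the addresses and ports here are made up, not the real ones):

    # hypothetical example: traffic aimed at the faux "s2" address is
    # NATed to the real server that hosted that slice, on its own port
    iptables -t nat -A PREROUTING -d 192.0.2.2 -p tcp --dport 3306 \
        -j DNAT --to-destination 10.68.16.2:3307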

[2] Pay attention here, that's some skillful foreshadowing right there.

[3] Moved from one row of eqiad to another, for those keeping score.

[4] If you've been following at home, you already see where this is heading.

[5] Also, the change was done on a Friday.  "But it's just a trivial
change!"

[6] Well, two morals if you count the "Don't do a change before you
leave for the weekend!" beating I also gave myself.



