[Engineering] Fwd: eqiad->codfw datacenter switchover, weeks of Apr 17th/May 1st

Faidon Liambotis faidon at wikimedia.org
Fri Apr 7 14:05:05 UTC 2017


Hi all,

Below is my email to wikitech-l about the upcoming datacenter
switchover. A lot of different teams across Product & Tech had to
collaborate for this to happen, so I'm sure it won't come as a total
surprise to most of you.

As usual, we would like to ask everyone to stay on high alert
during these days (especially April 19th & May 3rd) and, if possible and
useful (e.g. if you're a component maintainer), to be around at the time
of the switch.

If we need to reach you urgently by phone, we'll typically do so using
the phone number provided on Office Wiki's Contact List[1]. You should
do the same if you notice any issues outside working hours or if other
real-time communication channels (like IRC) fail.

Thank you all for the help. I'll keep you updated.

Best,
Faidon


1: https://office.wikimedia.org/wiki/Contact_list

----- Forwarded message from Faidon Liambotis <faidon at wikimedia.org> -----

Date: Fri, 7 Apr 2017 16:58:09 +0300
From: Faidon Liambotis <faidon at wikimedia.org>
To: Wikimedia developers <wikitech-l at lists.wikimedia.org>
Subject: eqiad->codfw datacenter switchover, weeks of Apr 17th/May 1st

Hi all,

You may have heard already that, like last year, we are planning to
switch our active datacenter from eqiad to codfw in the week of April
17th and back to eqiad two weeks later, in the week of May 1st. We do
this periodically in order to exercise our ability to run from the
backup site in case of a disaster, as well as our ability to switch
seamlessly to it with little user impact.

Switching will be a gradual, multi-step process, the most visible step
of which will be the switch of MediaWiki application servers and
associated data stores. This will happen on April 19th (eqiad->codfw)
and May 3rd (codfw->eqiad), both at 14:00 UTC. During those windows, the
sites will be placed into read-only mode, for a period that we estimate
to last approximately 20 to 30 minutes.

Furthermore, the deployment train will freeze for the weeks of April
17th and May 1st[1], but operate normally in the week of April 24th, in
order to exercise our ability to deploy code while operating from the
backup datacenter.

1: https://wikitech.wikimedia.org/wiki/Deployments

Compared to last year we have improved our processes considerably[2], in
particular by making more services operate in an active/active manner,
as well as by working on an automation and orchestration framework[3] to
perform parallel executions across the fleet. The core of the MediaWiki
switchover will be performed semi-automatically using new software[4]
that will execute all the necessary commands in sequence with little
human involvement, thus lowering the risk of introducing errors and
delays.

2: https://wikitech.wikimedia.org/wiki/Switch_Datacenter
3: https://github.com/wikimedia/cumin
4: https://github.com/wikimedia/operations-switchdc

Improving and automating our processes means that we're not going to be
following the exact same steps as last year. Because of that, and
because of other changes introduced in our environment over the course
of the year, there is a possibility of errors creeping into the process.
We'll certainly try to fix any issues that arise during those weeks, and
we'd like to ask everyone to stay on high alert.

To report any issues, please use one of the following channels:

1. File a Phabricator issue with project #codfw-rollout
2. Report issues on IRC: Freenode channel #wikimedia-tech (if urgent, or
during the migration)
3. Send an e-mail to the Operations list: ops at lists.wikimedia.org (any time)

Thanks,
Faidon
--
Faidon Liambotis
Principal Operations Engineer
Acting Director of Technical Operations
Wikimedia Foundation

----- End forwarded message -----


