[Engineering] [Ops] Data center switch-over moving ahead next week: please stay available :)

Toby Negrin tnegrin at wikimedia.org
Thu Apr 21 15:44:00 UTC 2016


Congrats Mark and everyone else involved. This is a big step for
reliability and performance of the sites and a difficult technical task to
say the least.

Well done!

-Toby

On Thu, Apr 21, 2016 at 8:37 AM, Mark Bergsma <mark at wikimedia.org> wrote:

> We've just completed the switch back, and all services are running from
> our main data center eqiad (Ashburn) again.
>
> The process went very smoothly this time around. In the past two days
> leading up to this, we've been able to either fix or work around the most
> important issues we encountered on Tuesday. This meant that we had no real
> setbacks or unanticipated delays today, and therefore were able to complete
> the most time-sensitive and user-impacting part (during which MediaWiki is
> read-only) in 20 minutes, down from ~45 minutes two days ago.
>
> However, we'll be doing this again in the future, and until then we'll
> work on improving and further automating this process to get it down to
> hopefully much lower levels of impact and duration.
>
> Please let us know if you see any issues which may be caused by the
> switch-over(s).
>
> Thanks much to everyone involved!
>
> Mark
>
>
>
> On Thu, Apr 21, 2016 at 3:53 PM, Mark Bergsma <mark at wikimedia.org> wrote:
>
>> Hi everyone,
>>
>> After successfully serving our sites from our backup
>> data-center codfw (Dallas) for the past two days, we're now starting our
>> switch back to eqiad (Ashburn) as planned[1].
>>
>> We've already moved cache traffic back to eqiad, and within the next
>> few minutes, we'll disable editing by going read-only for approximately 30
>> minutes - hopefully a bit faster than 2 days ago.
>>
>> [1] http://blog.wikimedia.org/2016/04/11/wikimedia-failover-test/
>>
>> On Tue, Apr 19, 2016 at 6:00 PM, Mark Bergsma <mark at wikimedia.org> wrote:
>>
>>> Hi all,
>>>
>>> Today the data center switch-over commenced as planned and has just
>>> completed successfully. We are now serving our sites from codfw
>>> (Dallas, Texas), and will do so for the next 2 days if all goes well.
>>>
>>> We switched the wikis to read-only (editing disabled) at 14:02 UTC, and
>>> went back to read-write at 14:48 UTC - a little longer than planned.
>>> Although editing was possible from then on, Special:RecentChanges (and
>>> related change feeds) did not work until 15:10 UTC, when we found and
>>> fixed an unexpected configuration problem with our Redis servers. The
>>> site stayed up and available for readers throughout the entire migration.
>>>
>>> Overall the procedure was a success with few problems along the way.
>>> However, we've also carefully kept track of the issues and delays we
>>> encountered, so that we can evaluate them to improve and speed up the
>>> procedure and reduce the impact on our users - some of these
>>> improvements will already be implemented for our switch back on Thursday.
>>>
>>> We're still expecting to find (possibly subtle) issues today, and would
>>> like everyone who notices anything to use the following channels to report
>>> them:
>>>
>>> 1. File a Phabricator issue with project #codfw-rollout
>>> 2. Report issues on IRC: Freenode channel #wikimedia-tech (if urgent)
>>> 3. Send an e-mail to the Operations list: ops at lists.wikimedia.org
>>>
>>> We're not done yet, but thanks to all who have helped so far. :-)
>>>
>>> Mark
>>>
>>
>> --
>> Mark Bergsma <mark at wikimedia.org>
>> Lead Operations Architect
>> Director of Technical Operations
>> Wikimedia Foundation
>>
>