[Engineering] [Ops] Data center switch-over moving ahead next week: please stay available :)

Arthur Richards arichards at wikimedia.org
Thu Apr 21 15:55:23 UTC 2016


This is so rad - congratulations indeed to everyone who's been working on
this!

On Thu, Apr 21, 2016 at 8:44 AM, Toby Negrin <tnegrin at wikimedia.org> wrote:

> Congrats Mark and everyone else involved. This is a big step for
> reliability and performance of the sites and a difficult technical task to
> say the least.
>
> Well done!
>
> -Toby
>
> On Thu, Apr 21, 2016 at 8:37 AM, Mark Bergsma <mark at wikimedia.org> wrote:
>
>> We've just completed the switch back, and all services are running from
>> our main data center eqiad (Ashburn) again.
>>
>> The process went very smoothly this time around. In the past two days
>> leading up to this, we've been able to either fix or work around the most
>> important issues we encountered on Tuesday. This meant that we had no real
>> setbacks or unanticipated delays today, and were therefore able to complete
>> the most time-critical and user-impacting part (during which MediaWiki is
>> read-only) in 20 minutes, down from ~45 minutes two days ago.
>>
>> However, we'll be doing this again in the future, and until then we'll
>> work on improving and further automating this process, hopefully bringing
>> its impact and duration down much further.
>>
>> Please let us know if you see any issues which may be caused by the
>> switch-over(s).
>>
>> Thanks much to everyone involved!
>>
>> Mark
>>
>>
>>
>> On Thu, Apr 21, 2016 at 3:53 PM, Mark Bergsma <mark at wikimedia.org> wrote:
>>
>>> Hi everyone,
>>>
>>> After successfully serving our sites from our backup data center codfw
>>> (Dallas) for the past two days, we're now starting our switch back to
>>> eqiad (Ashburn) as planned[1].
>>>
>>> We've already moved cache traffic back to eqiad, and within the next few
>>> minutes, we'll disable editing by going read-only for approximately 30
>>> minutes - hopefully a bit faster than 2 days ago.
>>>
>>> [1] http://blog.wikimedia.org/2016/04/11/wikimedia-failover-test/
>>>
>>> On Tue, Apr 19, 2016 at 6:00 PM, Mark Bergsma <mark at wikimedia.org>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Today the data center switch-over commenced as planned, and has just
>>>> completed successfully. We are now serving our sites from codfw
>>>> (Dallas, Texas) for the next 2 days if all goes well.
>>>>
>>>> We switched the wikis to read-only (editing disabled) at 14:02 UTC, and
>>>> went back to read-write at 14:48 UTC - a little longer than planned.
>>>> Although editing was possible from then on, Special:RecentChanges (and
>>>> related change feeds) did not work until 15:10 UTC, when we found and
>>>> fixed an unexpected configuration problem with our Redis servers. The
>>>> site has stayed up and available for readers throughout the entire
>>>> migration.
>>>>
>>>> Overall, the procedure was a success with few problems along the way.
>>>> However, we've also carefully kept track of any issues and delays we
>>>> encountered, so that we can evaluate them to improve and speed up the
>>>> procedure and reduce the impact on our users. Some of these improvements
>>>> will already be implemented for our switch back on Thursday.
>>>>
>>>> We're still expecting to find (possibly subtle) issues today, and would
>>>> like everyone who notices anything to use the following channels to report
>>>> them:
>>>>
>>>> 1. File a Phabricator issue with project #codfw-rollout
>>>> 2. Report issues on IRC: Freenode channel #wikimedia-tech (if urgent)
>>>> 3. Send an e-mail to the Operations list: ops at lists.wikimedia.org
>>>>
>>>> We're not done yet, but thanks to all who have helped so far. :-)
>>>>
>>>> Mark
>>>>
>>>
>>> --
>>> Mark Bergsma <mark at wikimedia.org>
>>> Lead Operations Architect
>>> Director of Technical Operations
>>> Wikimedia Foundation
>>>
>>
>>
>>
>> _______________________________________________
>> Ops mailing list
>> Ops at lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/ops
>>
>>
>


-- 
Arthur Richards
Team Practices Manager
[[User:Awjrichards]]
IRC: awjr
+1-415-839-6885 x6687