[Labs-l] Preliminary Partial Outage report, 22 Sep, 6pm UTC-7

Mon Sep 23 01:41:36 UTC 2013

I figured that labs-l may be interested in my initial incident report, as
the incident caused a partial outage for labs. Anything starting with "cr"
is a core router. I have removed vendor names from this report.

Root cause: the cr1-sdtpa to cr2-eqiad (preferred) link flapped and then
received light but passed no packets. This caused OSPF and BGP to partially
go down between the datacenters, resulting in an outage for production for
those coming in via Tampa and an outage for labs for those coming in via
Eqiad.

Timeline (all times UTC-7):
17:52 link between cr1-sdtpa and cr2-eqiad starts flapping
18:04 first icinga reports of downtime
18:05 link stops flapping but no traffic will pass through it
18:10 Leslie: raised the OSPF metric on the cr2-eqiad to cr1-sdtpa link to
try to make traffic route via the other, hopefully working, link
18:10 services start reporting back online, outage for labs coming in via
eqiad is alleviated, outage for most transit coming in via Tampa is
alleviated.
18:18 Folks coming in via $MAJOR_TRANSIT_PROVIDER, via $OTHER_PROVIDER
transit can finally reach the site again (possibly route dampening from the
flapping? Unrelated outage? Some sort of physical cut in the area?)
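
For a quick sense of how long each phase lasted, the timestamps above can be turned into offsets from the first flap. This is only an illustrative sketch: the event labels are paraphrased from the timeline, and the helper name minutes_since_start is made up for this example.

```python
from datetime import datetime

# Timestamps copied from the timeline above; labels are paraphrased.
EVENTS = [
    ("17:52", "link starts flapping"),
    ("18:04", "first icinga reports of downtime"),
    ("18:05", "link stops flapping, passes no traffic"),
    ("18:10", "OSPF metric changed, most services recover"),
    ("18:18", "remaining transit users can reach the site"),
]


def minutes_since_start(events):
    """Return (minutes after the first event, label) for each event."""
    start = datetime.strptime(events[0][0], "%H:%M")
    offsets = []
    for stamp, label in events:
        delta = datetime.strptime(stamp, "%H:%M") - start
        offsets.append((int(delta.total_seconds() // 60), label))
    return offsets


for minutes, label in minutes_since_start(EVENTS):
    print(f"+{minutes:2d} min  {label}")
```

By this arithmetic, the first alert came 12 minutes after the flapping began, the metric change landed at 18 minutes, and the last affected users recovered at 26 minutes.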

I have called $VENDOR and have ticket X. They say they are fine between
Tampa and Orlando, and are calling $OTHERVENDOR about the wave now. The
link is currently passing traffic, but is still de-preferred via OSPF with
a metric of 40,000.
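
To illustrate why raising the OSPF metric moves traffic off the bad link, here is a rough sketch of a shortest-path calculation over a toy topology. Everything except the cr1-sdtpa and cr2-eqiad names and the 40,000 metric is an assumption for illustration: the other-tpa and other-eqiad routers, the second inter-datacenter link, and all the other costs are hypothetical, and real OSPF costs are set per interface and per direction rather than per link as modeled here.

```python
import heapq


def shortest_path(links, src, dst):
    """Dijkstra over an undirected graph; links maps (a, b) -> cost."""
    graph = {}
    for (a, b), cost in links.items():
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))

    queue = [(0, src, [src])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for neighbor, link_cost in graph.get(node, []):
            if neighbor not in seen:
                heapq.heappush(queue, (cost + link_cost, neighbor, path + [neighbor]))
    return None  # dst is unreachable from src


# Hypothetical topology and costs, except where noted.
links = {
    ("cr1-sdtpa", "cr2-eqiad"): 100,     # preferred inter-DC link (cost assumed)
    ("cr1-sdtpa", "other-tpa"): 10,      # hypothetical intra-Tampa link
    ("other-tpa", "other-eqiad"): 200,   # hypothetical second inter-DC link
    ("other-eqiad", "cr2-eqiad"): 10,    # hypothetical intra-Eqiad link
}

before = shortest_path(links, "cr1-sdtpa", "cr2-eqiad")

# De-prefer the faulty link, using the metric mentioned in the report.
links[("cr1-sdtpa", "cr2-eqiad")] = 40000
after = shortest_path(links, "cr1-sdtpa", "cr2-eqiad")

print("before:", before)
print("after: ", after)
```

In this toy model the direct link wins at its normal cost, while at 40,000 the longer path through the second inter-datacenter link becomes cheaper. The bad link stays in the topology as a last resort, which is the point of de-preferring it instead of shutting it down.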

Semi-related but important note: Leslie set log-updown for all BGP peers
to make future outage investigations easier.

-- 
Leslie Carr
Wikimedia Foundation
AS 14907, 43821
http://as14907.peeringdb.com/