Re: [Wikitech-l] Wikipedia is down

27 Oct 2015

On 27 October 2015 at 09:57, Brad Jorsch (Anomie) <bjorsch(a)wikimedia.org>
wrote:

...
  On Tue, Oct 27, 2015 at 8:02 AM, Risker <risker.wp(a)gmail.com> wrote:

  The incident report does not go far enough back into the history of the
  incident.  It does not explain how this code managed to get into the
  deployment chain with a fatal error in it.

 Actually, it does. Erik writes "This occured because the patch for the
 CirrusSearch repository that removed the schema should have been deployed
 before the change that adds it to the WikimediaEvents repository."

 In other words, there was nothing wrong with the code itself. The problem
 was that the multiple pieces of the change needed to be done in a
 particular order during the manual backporting process, but they were not
 done in that order.

 If this had waited for the train deployment, both pieces would have been
 done simultaneously and it wouldn't have been an issue, just as it wasn't
 an issue when these changes were done in master and automatically deployed
 to Beta Labs.

 That's a start, Brad.  But even as someone who has limited experience with
 software deployment, I can think of at least half a dozen questions that
I'd be asking here:

   - Why wasn't it part of the deployment train?
   - As a higher-level question: what are the thresholds for using a SWAT
   deployment as opposed to the regular deployment train, are these standards
   being followed, and are they the right standards? (Even I notice that most
   of the big problems seem to come with deployments outside of the deployment
   train.)
   - How was the code reviewed and tested before deployment?
   - Why did it appear to work in some contexts (indicated in your response
   as master and Beta Labs) but not in the production context?
   - How are we ensuring that deployments that require multiple sequential
   steps are (a) identified and (b) implemented in a way that those steps are
   followed in the correct order?

Notice how none of the questions are "what was wrong with the code" or "who
screwed up".  They're all systems questions. This is a systems problem.
Even in situations where there *is* a problem with the code or someone
*did* screw up, the root cause usually comes back to having single points
of failure (e.g. one person having the ability to [unintentionally] get
problem code deployed, or weaknesses in the code review and testing
process).

Risker/Anne
