[QA] CI response time investigation and improvements

Greg Grossmeier greg at wikimedia.org
Thu Mar 23 22:10:36 UTC 2017


Hello,

Lately the performance of our Continuous Integration service has been
less than ideal during our busy hours. This has been going on for at
least two weeks and is especially noticeable during the hours from 6pm
to midnight UTC.

The typical symptom is that jobs take longer than usual to execute, on
some occasions three times the usual duration. As a result, jobs pile
up in the queue and the reporting of test results is delayed.

Jobs receiving Code-Review +2 are prioritized over all other jobs. When
a lot of changes are merged (+2'd), they consume all of the
slots/executors, thus delaying all other jobs.
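
To illustrate the effect, here is a minimal sketch in Python. It is not
how our CI scheduler is implemented; the pipeline names "gate" and
"test", the priority values, and the schedule() helper are all
hypothetical and only model the behaviour described above: +2'd jobs
always start before other jobs, regardless of arrival order.

```python
import heapq

# Hypothetical priorities (assumption): a lower number starts first.
# "gate" stands for +2'd changes, "test" for every other job.
PRIORITY = {"gate": 0, "test": 1}


def schedule(jobs, executors):
    """Group jobs into rounds of at most `executors` jobs each.

    `jobs` is a list of (name, pipeline) tuples in arrival order.
    Jobs with higher priority start first; ties keep arrival order.
    """
    queue = []
    for order, (name, pipeline) in enumerate(jobs):
        heapq.heappush(queue, (PRIORITY[pipeline], order, name))

    rounds = []
    while queue:
        size = min(executors, len(queue))
        rounds.append([heapq.heappop(queue)[2] for _ in range(size)])
    return rounds


jobs = [
    ("change-A", "test"),
    ("change-B", "gate"),
    ("change-C", "gate"),
    ("change-D", "test"),
    ("change-E", "gate"),
]
for number, batch in enumerate(schedule(jobs, executors=2), start=1):
    print(number, batch)
```

With two executors, change-A arrives first but only starts in the
second round, after change-B and change-C; change-D waits until the
third. The more +2'd changes there are, the longer everything else
waits.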

In addition, some heavy jobs (Wikibase, for example) might reach the
default 30-minute timeout and thus fail the tests for that change.

In response we are working on a sprint[0] to address these issues. In
short the sprint aims to:
* prioritize a few more builds that we care about (notably
  operations/puppet.git changes) before the default priority
* make the tests we do run more efficient in a few ways, notably in
  the number of executors/VMs needed
* stop running tests that have little utility


Thanks for your patience and understanding,

Greg

[0] https://phabricator.wikimedia.org/project/view/2676/

-- 
| Greg Grossmeier            GPG: B2FA 27B1 F7EB D327 6B8E |
| Release Team Manager            A18D 1138 8E47 FAC8 1C7D |


