How’d we do in our pursuit of operational excellence last month? Read on to find out!
There were 4 documented incidents last month. This is about average compared to the past five years (per Incident graphs).
2021-10-08 network provider; Impact: For up to an hour, some regions experienced a partial
connectivity outage. This primarily affected the US East Coast for 13
minutes, and Russia for 1 hour. It was caused by a routing problem with
one of several network providers.
2021-10-29 graphite; Impact: During a server upgrade, historical data was lost for a subset
of Graphite metrics. Some were recovered via the redundant server, but
others were lost because the redundant server had also been upgraded by
then and lost some data in a similar fashion.
Remember to review and schedule Incident Follow-up work in Phabricator. These tasks are preventive measures and tech debt mitigations
written down after an incident is concluded. Read about past incidents
at Incident status on Wikitech.
Trends
Previously, the production error workboard held an accumulated total
of 298 still-open error reports. We resolved 20 of those. Together with
the 23 new errors carried over from October, this brings us to 301
unresolved errors on the board.
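As a quick check on the tally above, the board total can be recomputed from the three figures in that paragraph. This is only an illustrative sketch: the function and parameter names are made up for this example and are not part of any Wikimedia tooling.

```python
def unresolved_total(previous_open: int, resolved: int, carried_over: int) -> int:
    """Return the number of unresolved reports after one monthly update."""
    # Start from last month's open count, subtract what was closed,
    # then add the new reports that were carried forward.
    return previous_open - resolved + carried_over


# Figures taken from this month's report.
total = unresolved_total(previous_open=298, resolved=20, carried_over=23)
print(total)  # 301
```

Since more reports were carried over (23) than resolved (20), the board grew by a net 3 this month.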
Figure 1: Unresolved error reports by month.
Apr 2021: 9 of 42 issues left.
May 2021: 16 of 54 issues left.
Jun 2021: 9 of 26 issues left.
Jul 2021: 12 of 31 issues left.
Aug 2021: 12 of 46 issues left.
Sep 2021: 11 of 24 issues left.
Oct 2021: 23 of 49 new issues are carried forward.
Until next time,
– Timo Tijhof