How’d we do in our striving for operational excellence last month? Read on to find out!

Incidents

There were 4 documented incidents last month. This is about average compared to the past five years (per Incident graphs).

2021-10-08 network provider; Impact: For up to an hour, some regions experienced a partial connectivity outage. This primarily affected the US East Coast for 13 minutes, and Russia for 1 hour. It was caused by a routing problem with one of several network providers.

2021-10-22 eqiad networking; Impact: For 40 minutes, clients that are normally geographically routed to Eqiad experienced connection or timeout errors. We lost about 7K req/s during this time. After initial recovery, Eqiad was ready and repooled in 10 minutes.

2021-10-25 s3 db replica; Impact: For 30 minutes, MediaWiki backends were slower than usual. For 12 hours, many wiki replicas were stale for Wikimedia Cloud Services such as Toolforge.

2021-10-29 graphite; Impact: During a server upgrade, historical data was lost for a subset of Graphite metrics. Some were recovered via the redundant server, but others were lost because the redundant server had also been upgraded by then and lost some data in a similar fashion.

Remember to review and schedule Incident Follow-up work in Phabricator. These tasks are preventive measures and tech debt mitigations written down after an incident is concluded. Read about past incidents at Incident status on Wikitech.


Trends


Norwegian blue 🐦

298 bugs were up on the board.
We solved 20 of those over the past thirty days.

How many might now be left unexplored?
We also added new bugs to our database.

Half those bugs are pining for their fjord.
The other 23 carry on, with their dossiers.

All in all, 301 bugs up on the board.


In October, 49 new tasks were reported as production errors. Of these, we resolved 26, and 23 remain unresolved and carry forward to the next month.

Previously, the production error workboard held an accumulated total of 298 still-open error reports. We resolved 20 of those. Together with the 23 new errors carried over from October, this brings us to 301 unresolved errors on the board.
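The arithmetic behind these totals can be checked with a few lines of Python. This is only a sketch using the figures quoted above; the variable names are made up for illustration and do not come from the workboard or any Phabricator API.

```python
# Figures quoted in the two paragraphs above (variable names are illustrative).
new_in_october = 49          # tasks reported as production errors in October
resolved_from_october = 26   # of those, resolved within the month
open_before = 298            # unresolved reports previously on the board
resolved_from_backlog = 20   # older reports resolved during the month

# New errors that remain unresolved and carry forward to next month.
carried_over = new_in_october - resolved_from_october

# Board total: previous backlog, minus what was resolved, plus the carry-over.
open_now = open_before - resolved_from_backlog + carried_over

print(carried_over)  # 23
print(open_now)      # 301
```

In other words, the board grew by 3 over the month: 23 new unresolved errors against 20 older ones resolved.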

Figure 1: Unresolved error reports by month.

For the month-over-month numbers, refer to the spreadsheet data.

Outstanding errors

Take a look at the workboard and look for tasks that could use your help:

Issues carried over from recent months:
Apr 2021: 9 of 42 issues left.
May 2021: 16 of 54 issues left.
Jun 2021: 9 of 26 issues left.
Jul 2021: 12 of 31 issues left.
Aug 2021: 12 of 46 issues left.
Sep 2021: 11 of 24 issues left.
Oct 2021: 23 of 49 new issues are carried forward.
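
To compare months of different sizes, it can help to look at the share of each month's issues that is still open. The sketch below does this with the numbers from the list above; the data structure and names are assumptions made for illustration, not an export from the workboard.

```python
# (issues left, issues reported) per month, copied from the list above.
carried = {
    "Apr 2021": (9, 42),
    "May 2021": (16, 54),
    "Jun 2021": (9, 26),
    "Jul 2021": (12, 31),
    "Aug 2021": (12, 46),
    "Sep 2021": (11, 24),
    "Oct 2021": (23, 49),
}

# Print the share of each month's issues that remains unresolved.
for month, (left, reported) in carried.items():
    share = 100 * left / reported
    print(f"{month}: {left} of {reported} left ({share:.0f}%)")

# Totals across the listed months only.
total_left = sum(left for left, _ in carried.values())
total_reported = sum(reported for _, reported in carried.values())
print(f"Total: {total_left} of {total_reported} left")
```

The total here covers only the months listed, so it is smaller than the 301 unresolved errors on the whole board; the remainder belongs to earlier months.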

Thanks

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof



🔗  Share or read later via https://phabricator.wikimedia.org/phame/post/view/260/