I just want to say thank you so much for these emails. They're great on
their own, but together they paint a clear picture at a level usually
inaccessible to those of us outside everyday MW development. Thank you!
On Sat, Dec 11, 2021 at 20:39 Krinkle <krinkle(a)fastmail.com> wrote:
How’d we do in our striving for operational excellence last month? Read on
to find out!
Incidents
6 documented incidents last month. That's above the two-year and five-year
median of 4 per month (per Incident graphs
<https://codepen.io/Krinkle/full/wbYMZK>).
2021-11-04 large file upload timeouts
<https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-04_large_file_upload_timeouts>
Impact: For 9 months, editors were unable to upload large files (e.g. to
Commons). Editors would receive generic error messages, typically after a
timeout. In retrospect, a dozen distinct production errors had been
reported and regularly observed that were related and provided different
clues; however, most of these remained untriaged and uninvestigated for
months. This may be related to the affected components having no active
code steward.
2021-11-05 TOC language converter
<https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-05_TOC_language_converter>
Impact: For 6 hours, wikis experienced a blank or missing table of contents
on many pages. For up to 3 days prior, wikis that have multiple language
variants (such as Chinese Wikipedia) displayed the table of contents in an
incorrect or inconsistent language variant (which is not understandable to
some readers).
2021-11-10 cirrussearch commonsfile outage
<https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-10_cirrussearch_commonsfile_outage>
Impact: For ~2.5 hours, the Search results page was unavailable on many
wikis (except English Wikipedia). On Wikimedia Commons the search
suggestions feature was unresponsive as well.
2021-11-18 codfw ipv6 network
<https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-18_codfw_ipv6_network>
Impact: For 8 minutes, the Codfw cluster experienced partial loss of IPv6
connectivity for upload.wikimedia.org. This did not affect availability of
the service because the "Happy Eyeballs
<https://en.wikipedia.org/wiki/Happy_Eyeballs>" algorithm ensures that
browsers (and other clients) automatically fall back to IPv4. The Codfw
cluster generally serves Mexico and parts of the US and Canada. The
upload.wikimedia.org service serves photos and other media/document
files, such as those displayed in Wikipedia articles.
2021-11-23 core network routing
<https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-23_Core_Network_Routing>
Impact: For about 12 minutes, Eqiad was unable to reach hosts in other data
centers via public IP addresses. This was due to a BGP routing error. There
was no impact on end-user traffic, and impact on internal traffic was
limited (only Icinga alerts themselves) because internal traffic generally
uses local IP subnets which we currently route with OSPF instead of BGP.
2021-11-25 eventgate-main outage
<https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-25_eventgate-main_outage>
Impact: For about 3 minutes, eventgate-main was down. This resulted in
25,000 MediaWiki backend errors due to inability to queue new jobs. About
1000 user-facing web requests failed (HTTP 500 Error). Event production
briefly dropped from ~3000 per second to 0 per second.
Incident follow-up
Remember to review and schedule Incident Follow-up work
<https://phabricator.wikimedia.org/project/view/4758/> in Phabricator,
which are preventive measures and tech debt mitigations written down after
an incident is concluded. Read more about past incidents at Incident
status <https://wikitech.wikimedia.org/wiki/Incident_status> on Wikitech.
Recently resolved incident follow-up:
Disable DPL on wikis that aren't using it
<https://phabricator.wikimedia.org/T287916>
Filed after a July 2021 incident, done by Amir (Ladsgroup) and Kunal
(Legoktm).
Create easy access to MySQL ports for faster incident response and
maintenance <https://phabricator.wikimedia.org/T291352>
Filed in Sep 2021, and carried out by Stevie (Kormat).
Create paging alert for primary DB hosts
<https://phabricator.wikimedia.org/T233684>
Filed after a Sep 2019 incident, done by Stevie (Kormat).
Trends
November saw 27 new production error reports, of which 14 were resolved
and 13 remain open and carry over to the next month.
Of the 301 errors still open from previous months, 16 were resolved.
Together with the 13 carried over from November, that brings the workboard
to 298 unresolved tasks.
Figure 1: Unresolved error reports by month
<https://phabricator.wikimedia.org/phame/post/view/261/production_excellence_38_november_2021/#trends>.
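As a quick sanity check, the workboard arithmetic in the Trends paragraph can be reproduced with a few lines of Python. This is only an illustrative sketch: the variable names are made up for this example and are not part of any Wikimedia tooling; the numbers are the ones quoted above.

```python
# Figures quoted in the Trends paragraph (November 2021).
new_reports = 27      # new production error reports filed in November
new_resolved = 14     # of those, resolved within the same month
open_before = 301     # reports still open from previous months
old_resolved = 16     # of those older reports, resolved in November

# New reports that were not resolved carry over to the next month.
carried_over = new_reports - new_resolved

# Older reports that remain open after November's resolutions.
remaining_old = open_before - old_resolved

# Total unresolved tasks on the workboard at the end of the month.
workboard_total = remaining_old + carried_over

# Net change compared with the backlog from previous months.
net_change = workboard_total - open_before

print(carried_over, remaining_old, workboard_total, net_change)
```

With these inputs the total comes out at 298, a net decrease of 3 compared with the 301 that were open from previous months.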
Outstanding errors
Take a look at the workboard and look for tasks that could use your help.
→ https://phabricator.wikimedia.org/tag/wikimedia-production-error/
💡 Did you know: To find your team's error reports, use the appropriate
"Filter" link in the sidebar of the workboard.
Issues carried over from recent months:
Apr 2021: 9 of 42 issues left.
May 2021: 16 of 54 issues left.
Jun 2021: 9 of 26 issues left.
Jul 2021: 11 of 31 issues left.
Aug 2021: 10 of 46 issues left.
Sep 2021: 10 of 24 issues left.
Oct 2021: 20 of 49 issues left.
Nov 2021: 13 of 27 new issues
<https://phabricator.wikimedia.org/maniphest/query/0W0Nuk9umBDc/#R> are
carried forward.
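For readers who want to compare months, the share of issues still open can be derived from the list above. The following is a rough Python sketch using only the figures quoted in this email; the dictionary name and layout are assumptions made for illustration.

```python
# (issues left, issues reported) per month, as listed above.
carried = {
    "Apr 2021": (9, 42),
    "May 2021": (16, 54),
    "Jun 2021": (9, 26),
    "Jul 2021": (11, 31),
    "Aug 2021": (10, 46),
    "Sep 2021": (10, 24),
    "Oct 2021": (20, 49),
    "Nov 2021": (13, 27),
}

# Share of each month's reports that is still unresolved.
for month, (left, reported) in carried.items():
    share = left / reported
    print(f"{month}: {left} of {reported} left ({share:.0%})")

# Totals across the listed months only.
total_left = sum(left for left, _ in carried.values())
total_reported = sum(reported for _, reported in carried.values())
print(f"Listed months: {total_left} of {total_reported} left")
```

The listed months account for 98 open issues in total, so the rest of the workboard's unresolved tasks presumably date from months not shown in this list.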
Thanks!
Thank you to everyone who helped by reporting, investigating, or resolving
problems in Wikimedia production. Thanks!
Until next time,
– Timo Tijhof
🔗 Share or read later via
https://phabricator.wikimedia.org/phame/post/view/261/
_______________________________________________
Wikitech-l mailing list -- wikitech-l(a)lists.wikimedia.org