How are we doing in our pursuit of operational excellence? Read on to find out!
Incidents
7 documented incidents in July, and 4 in August (Incident graphs). Read more about past
incidents at Incident status on Wikitech.
Impact: For 16 minutes, edits and previews for pages with Score musical notes were slow or unavailable.
Impact: For several days, Thumbor p75 service response times gradually regressed by several seconds.
Impact: For 5 minutes, the MediaWiki API cluster in eqiad responded with higher latencies or errors.
Impact: For 13 minutes, the mobileapps service was serving HTTP 503 errors to clients.
Impact: No observed public-facing impact. Internal cleanup took some work, e.g. for Ganeti VMs.
Impact: For 20 minutes, there was a small increase in error responses for thumbnails served from the Eqsin data center (Singapore).
Impact: For 10-15 minutes, a portion of wiki traffic from Eqiad-served regions was lost (about 1M uncached requests). For ~30 minutes, Phabricator was unable to access its database.
Impact: During planned downtime, other hosts ran out of space due to accumulating logs. No external impact.
Impact: No external impact.
Impact: For 7 hours, all Beta Cluster sites were unavailable.
Impact: For 36 minutes, errors were noticeable for some editors. Saving edits was unaffected.