Hello!
TL;DR: Our recent Elasticsearch cluster restart did not go as planned.
Most important lesson learned: we did not correctly understand the
recovery settings.
Yesterday, we did a cold restart of the Elasticsearch / cirrus eqiad
cluster. This restart did not go as planned. It had no user-facing
impact, since we had moved all the traffic to codfw before the
restart. It did impact logstash (more on that in a separate report).
Incident documentation:
https://wikitech.wikimedia.org/wiki/Incident_documentation/20170920-Elastic…
Have fun!
Guillaume
--
Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation
UTC+2 / CEST