We ran out of capacity on the Toolforge Kubernetes cluster yesterday,
seemingly due to a large number of tools migrating from the grid engine to
Kubernetes and a temporary decrease in capacity during a cluster-wide
reboot to recover from an NFS blip. I've provisioned some extra nodes to fix
the immediate issue, but the total CPU requests are still around 90% of the
total cluster capacity. (Note that this does not mean we're actually using
90% of the CPU power available there; I'll come back to this in a bit.)
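To make the requests-versus-usage distinction concrete: CPU requests are
scheduler reservations, so the cluster can be "full" of requests while the
CPUs sit mostly idle. Here's a rough sketch of the arithmetic with entirely
made-up per-node numbers (real figures come from the "Allocated resources"
section of `kubectl sudo describe nodes`; the node names and millicore
values below are hypothetical):

```shell
# Hypothetical per-node data: node name, requested CPU (millicores),
# allocatable CPU (millicores). Not real Toolforge numbers.
cat <<'EOF' > /tmp/cpu-requests.txt
tools-k8s-worker-1 7200 8000
tools-k8s-worker-2 7600 8000
tools-k8s-worker-3 6800 8000
EOF

# Sum requests and capacity across nodes, print the cluster-wide
# request percentage.
awk '{req += $2; cap += $3}
     END {printf "%.0f%% of capacity requested\n", 100 * req / cap}' \
    /tmp/cpu-requests.txt
# prints: 90% of capacity requested
```

A cluster in this state refuses to schedule new pods even though most of the
reserved CPU time is never actually consumed.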
*In case the cluster starts acting up again*: follow
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolfor…
to provision more capacity. That runbook also has a link to the Grafana
dashboard for cluster capacity and instructions on what specific metrics to
worry about there, given that there are no alerts for it yet
<https://phabricator.wikimedia.org/T352581>.
As I said, we seem to be overprovisioning CPUs by a lot compared to actual
usage: `kubectl sudo top node` shows a majority of nodes below 10% actual
CPU utilization. So in the near term we should look at tweaking the
resource allocation logic, especially for web services.
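For anyone wanting to quantify the "majority of nodes below 10%" claim, a
quick way is to count nodes under the threshold from `kubectl sudo top node`
output. The sample data below is invented (node names and percentages are
placeholders, and the real output has more columns), but the awk one-liner
works the same way on the real thing:

```shell
# Hypothetical, simplified `kubectl sudo top node` output:
# node name, CPU utilization percentage.
cat <<'EOF' > /tmp/top-node.txt
tools-k8s-worker-1 6%
tools-k8s-worker-2 4%
tools-k8s-worker-3 9%
tools-k8s-worker-4 35%
EOF

# Count nodes whose actual CPU utilization is below 10%.
awk '{gsub("%", "", $2); if ($2 + 0 < 10) n++}
     END {print n " nodes under 10% CPU"}' /tmp/top-node.txt
# prints: 3 nodes under 10% CPU
```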
Taavi
--
Taavi Väänänen (he/him)
Site Reliability Engineer, Cloud Services
Wikimedia Foundation