On Wed, Jan 4, 2023 at 12:41 PM Taavi Väänänen <hi(a)taavi.wtf> wrote:
Hi all,
There are a couple of major changes to our Cloud VPS o11y stack that I'm
planning to make in the near term. Most of this should be visible on
Phabricator as well, but I wanted to make everyone aware here regardless
since following activity on Phabricator is hard and I don't want to
cause any major surprises here.
Thank you for thinking about it this way. You are totally correct that
I have seen things happening in Phab and Gerrit, but I didn't know
where things were at on a more holistic level.
I'm sure some of this will have an effect on our users and I/we need to
communicate it beforehand, but I'm not at that stage quite yet.
== 1. New Grafana instance: grafana.wmcloud.org ==
The first and hopefully least impactful change is replacing the current
grafana-cloud.wikimedia.org (aka grafana-labs.wikimedia.org) Grafana
instance with a new one. The reason is that the current one runs
directly on hardware (cloudmetrics*.eqiad.wmnet), and due to upstream
Grafana changes it soon won't be able to reach out to Prometheus
instances living on cloud-vps VMs.
This work is tracked as T307465, and has patches up for review starting
from https://gerrit.wikimedia.org/r/c/operations/puppet/+/869210/.
Can we leave an HTTP redirect running somewhere so that old links at
least end up on grafana.wmcloud.org, even if we can't guarantee that
the dashboard they led to is still around? My fingers and Firefox
awesomebar completion will thank you! :)
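
As a rough illustration of what such a redirect would do (not how it would
actually be deployed, which would presumably be web server or Puppet
config), here is a minimal Python sketch of the host mapping. The function
name and the choice to preserve the path and query string are assumptions
made for the example; only the three hostnames come from this thread.

```python
from urllib.parse import urlsplit, urlunsplit

# Old hostnames mentioned in this thread, and the new instance.
OLD_HOSTS = {"grafana-cloud.wikimedia.org", "grafana-labs.wikimedia.org"}
NEW_HOST = "grafana.wmcloud.org"


def redirect_target(url):
    """Return the URL an old Grafana link should redirect to, or None.

    Hypothetical helper: keeps path, query and fragment unchanged so that a
    dashboard link still resolves if the same dashboard exists on the new
    instance, and otherwise lands on a Grafana "not found" page there.
    """
    parts = urlsplit(url)
    if parts.hostname not in OLD_HOSTS:
        return None  # not one of the old hosts; nothing to redirect
    return urlunsplit(
        ("https", NEW_HOST, parts.path or "/", parts.query, parts.fragment)
    )
```

Whether dashboard paths would carry over unchanged depends on how the
dashboards are migrated, so treat the path-preserving behaviour as a guess.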
== 2. Diamond removal ==
The Prometheus instance in metricsinfra now scrapes all Cloud VPS VMs.
This was the primary blocker for getting rid of Diamond (a Python 2
program that collected node metrics and pushed them to Graphite). I hope
that this transition will be mostly invisible to users if we migrate the
most used Grafana dashboard (cloud-vps-project-board) to pull the
metrics from Prometheus instead.
This is tracked as T317032.
The main thing I know that Diamond removal has affected is Timo's
https://nagf.toolforge.org/ tool. I think the cloud-vps-project-board
Grafana dashboard is a reasonable replacement for that tool that Timo
built long ago to do pretty much the same job, but there are some
places where we should replace links to NAGF with links to the new
Grafana instead. Openstack-browser and [[wikitech:Template:Nova
Resource]] are the ones I'm remembering at the moment.
== 3. Statsd/Graphite removal (once Diamond is gone) ==
My understanding is that the statsd/Graphite service was originally not
intended as a generic service for cloud-vps users (although it certainly
is used like one today). Either way, we don't really have a good
replacement for it, except in some limited cases that could use
node-exporter text files instead. I'm not sure how big of a deal that is
if we never claimed to support it anyway?
This is tracked as T326266.
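
For readers unfamiliar with the node-exporter text file approach mentioned
above: node_exporter's textfile collector reads `*.prom` files from the
directory given by `--collector.textfile.directory` and exposes their
contents alongside the node metrics. A minimal Python sketch of a job
publishing one gauge that way follows. The function name, metric name and
directory are placeholders for illustration, not anything configured on
Cloud VPS.

```python
import os
import tempfile


def write_textfile_metric(directory, name, value, help_text):
    """Write a single gauge in Prometheus text format to <name>.prom.

    Hypothetical helper. The file is written to a temporary name first and
    then renamed, so node_exporter never reads a half-written file.
    """
    body = (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name} {value}\n"
    )
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(body)
    # mkstemp creates the file as 0600; make it readable by node_exporter.
    os.chmod(tmp_path, 0o644)
    final_path = os.path.join(directory, f"{name}.prom")
    os.replace(tmp_path, final_path)  # atomic rename within one filesystem
    return final_path
```

This only covers values that a cron job or similar can write out
periodically. It is not a drop-in replacement for pushing arbitrary statsd
counters and timers, which is the gap described above.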
Deployment-prep's https://phabricator.wikimedia.org/T241285 is a
thing, but maybe we don't actually have a strong use case for a
replacement as Taavi has noted there in the past. Jean-Fred also uses
it from Toolforge (<https://phabricator.wikimedia.org/T325936>). ORES
used to use it too, but that may be all dead tech at this point as
well. It is also probably time to close
https://phabricator.wikimedia.org/T241284 as WONTFIX.
Bryan
--
Bryan Davis Technical Engagement Wikimedia Foundation
Principal Software Engineer Boise, ID USA
[[m:User:BDavis_(WMF)]] irc: bd808