Hi there!
On 2022-11-28 and 2022-11-29, some misleading emails were sent: you may
have received one (or more) emails about Puppet failures on your Cloud
VPS virtual machine.
Moreover, such emails were somewhat contradictory, containing messages
like "No failed resources" and "No exceptions happened".
There was a problem in the way the Puppet errors were calculated, which
has now been fixed [0].
This does not affect Toolforge.
Sorry for the noise,
regards.
[0] https://gerrit.wikimedia.org/r/c/operations/puppet/+/861805/
--
Arturo Borrero Gonzalez
Senior Site Reliability Engineer
Wikimedia Cloud Services
Wikimedia Foundation
Hi there,
Today, 2022-11-22, at about 12:25 UTC, as part of a routine operation I
reimaged/reformatted a cloudvirt hypervisor without first relocating all
the virtual machines.
The data survived the reimage, but the 32 (!) affected virtual machines
were briefly unavailable and then hard-rebooted.
All virtual machines are now ACTIVE (up and running) from the OpenStack
point of view, but please let me know if you need assistance recovering
them in any way.
As of this writing we don't have any automation to ensure we only
reimage empty hypervisors, but we're working on it, to prevent this kind
of human error in the future.
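For the curious, the kind of pre-reimage guard we have in mind could look roughly like this. This is a minimal sketch, not the actual cookbook code: the function name and the idea of feeding it the parsed output of something like `openstack server list --host <hypervisor>` are assumptions for illustration.

```python
def assert_hypervisor_empty(servers):
    """Refuse to proceed when the hypervisor still hosts VMs.

    `servers` is a list of instance names, e.g. obtained by parsing
    the output of `openstack server list --host <hypervisor>`
    (hypothetical input for this sketch).
    """
    if servers:
        raise RuntimeError(
            f"hypervisor not empty: {len(servers)} instance(s) still "
            f"present, e.g. {', '.join(sorted(servers)[:3])}"
        )


# An empty hypervisor passes the check; a populated one aborts the reimage.
assert_hypervisor_empty([])
try:
    assert_hypervisor_empty(["mx-wiki", "synapse01"])
except RuntimeError as err:
    print(err)
```

The real automation would of course query OpenStack directly rather than trust a hand-supplied list; the point is simply that the reimage cookbook should fail closed.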
regards. (and sorry!)
(!) Affected virtual machines are:
- ID: 78782628-4f9f-4263-84fc-06e767b3bfe1
Name: mx-wiki
- ID: 1fa9f0d9-42e8-4273-bdb1-a7d49998c13f
Name: synapse01
- ID: 2382fda0-e683-4d0c-95b6-bbbf323904d9
Name: canary1048-04
- ID: 4b570277-e51f-459d-bea2-394c5ad7bc92
Name: tools-sgeexec-10-16
- ID: 66529c1b-f3a3-4ff8-b30d-785f4f274965
Name: feature-store-test
- ID: e153f69a-46a0-458a-ab50-de3d86aa861b
Name: toolsbeta-test-k8s-worker-7
- ID: c3a2d1a9-f811-4da9-afba-3a113c8ff729
Name: wbregistry-02
- ID: 2b56c575-08a5-4def-87cb-bee5bd43e4f9
Name: prod
- ID: 141ac13c-f0fa-46d3-9d2a-cede8bc854c6
Name: devtools-puppetdb1001
- ID: fdb15c24-0b41-42d6-9c4a-82afd1d9dcb9
Name: tools-sgeweblight-10-31
- ID: 56e55a31-8d32-455e-b650-b7194e71d2fd
Name: runner-1023
- ID: cb4a87e4-264e-4c8f-8197-3efff54346de
Name: runner-1022
- ID: 5b6b5733-565d-456e-a4fc-85ce669d3fd2
Name: deployment-mdb02
- ID: 75dce76d-36ad-4f9e-85e9-8a11ff6744db
Name: wikibase-product-testing-2022
- ID: 868d3dca-3e5c-4089-89a9-2c7e756c3e31
Name: toolsbeta-cumin-1
- ID: 42ac6d8a-453a-4620-b4b7-9c97994c98fb
Name: integration-agent-docker-1030
- ID: 084da652-503d-49a7-9ffa-98a0cd5335fd
Name: toolsbeta-sgeexec-10-5
- ID: f098fe82-18b6-49a9-962d-9b8f1f989b14
Name: pcc-worker1001
- ID: 8eb272dc-8006-4e93-a966-5035809324d9
Name: deployment-mx03
- ID: e67d0e4c-e07c-4d9a-8ddb-cb0bc8efa388
Name: deployment-docker-api-gateway01
- ID: b958511a-10cb-4e62-bdbb-6da5013dd62f
Name: soweego
- ID: 62045cf9-59ed-44b9-a268-1c9f171b5aae
Name: tools-package-builder-04
- ID: 0127e905-f52e-4ed4-b60d-260102a8e625
Name: pontoon-lb-02
- ID: 827bf744-262a-458b-951d-f2e9a377e075
Name: toolsbeta-test-k8s-ingress-3
- ID: 3e6c31d7-b4db-4a5f-a610-a74d0013f631
Name: pki-test01
- ID: 8893ba32-fb5c-4567-a242-b6c676978b7d
Name: deployment-urldownloader03
- ID: f72e5b18-6376-4ccd-9e59-64447759e53f
Name: deployment-deploy03
- ID: 006dea0a-a1eb-4de3-bf45-1a071ad87152
Name: kafka-test-cloud-2
- ID: e05220d7-8ca1-4d9f-a933-01a843286ea8
Name: toolsbeta-docker-imagebuilder-01
- ID: 416f445a-cad4-45c2-b32e-f17100f93eac
Name: cloud-puppetmaster-05
- ID: 4e492051-25a3-4442-b8b9-1959f42917fe
Name: tools-k8s-worker-76
- ID: df18863a-2da7-4951-aa69-936b3d889592
Name: deployment-docker-cpjobqueue01
--
Arturo Borrero Gonzalez
Senior Site Reliability Engineer
Wikimedia Cloud Services
Wikimedia Foundation
I think we could start monitoring prometheus-node-exporter on all Cloud
VPS VMs on all projects via the Prometheus instance in metricsinfra. The
required firewall rules are now in place (thanks to Andrew in T288108),
and I've written the required patches to
cloud/metricsinfra/prometheus-manager and to the Puppet repo:
https://gerrit.wikimedia.org/r/c/cloud/metricsinfra/prometheus-manager/+/85…
https://gerrit.wikimedia.org/r/c/operations/puppet/+/856917/
The main effect this will have is that we (and project admins, of
course) will have basic metrics (think CPU, disk, RAM, and so on) for
all instances in all projects. Initially these metrics won't trigger any
alerts unless manually configured by a metricsinfra admin.
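To make "basic metrics" concrete: prometheus-node-exporter serves a plain-text metrics page on each VM (by default on port 9100 at `/metrics`), which the metricsinfra Prometheus scrapes. Here is a minimal sketch of what that payload looks like and how it parses; the sample values are made up, and real scrapes use Prometheus itself rather than ad-hoc parsing like this.

```python
# A tiny, hand-written excerpt of the Prometheus text exposition format,
# with real node-exporter metric names but invented values.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
# HELP node_memory_MemAvailable_bytes Memory information field MemAvailable_bytes.
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 2.147483648e+09
"""


def parse_metrics(text):
    """Return {metric_name: float_value}, ignoring comments and labels."""
    out = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        # Drop any {label="..."} part so labelled series share a key.
        out[name.split("{", 1)[0]] = float(value)
    return out


metrics = parse_metrics(SAMPLE)
print(metrics["node_load1"])  # 0.42
```

Project admins will be able to query these series (e.g. `node_load1`, `node_memory_MemAvailable_bytes`) from the shared Prometheus once the patches land.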
Please let me know if you have any questions or concerns, otherwise I'd
like to move forward in the next few days.
Taavi
Hi there,
Toolforge is a complex service. There are many moving parts and there
are always several people working on different pieces of it.
We have been holding informal Toolforge-specific meetings from time to
time, to unblock some decisions or to get everyone on the same page.
The proposal is to create a monthly 1h Toolforge engineering-focused
meeting called "Toolforge council".
This meeting would be open in nature, including:
* The WMCS/TE team
* Toolforge community root group members [0]
* Other interested parties, invited as required
The notes and results of the meeting will be published somewhere on
Wikitech, and perhaps on this very mailing list.
The next two meetings of this kind will be:
* 2022-11-08 at 15:00 UTC
* 2022-12-13 at 15:00 UTC
For these next two, I will facilitate/moderate them, as well as
collect/share some agenda points beforehand.
I would like to avoid formalizing any other protocols regarding the
meeting beyond what is contained in this email. It is already an
evolution of the informal approach we have been using. Let's see how it
evolves organically.
Comments welcome (including on the name, hehe).
[0]
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin#What_makes_a_roo…
--
Arturo Borrero Gonzalez
Senior Site Reliability Engineer
Wikimedia Cloud Services
Wikimedia Foundation
Hi all!
I'm trying to find out which Python versions need to be supported for
running cookbooks. I would appreciate it if you could reply to this
email telling me which version you would be running cookbooks with
(replying directly to me is fine, to avoid spamming others ;) ).
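If you're not sure which version your cookbook environment provides, a snippet like this prints it. This is just a convenience; running `python3 --version` in a shell works equally well.

```python
import sys

# Print the version of the interpreter you'd run cookbooks with,
# in major.minor.micro form.
version = "{}.{}.{}".format(
    sys.version_info.major,
    sys.version_info.minor,
    sys.version_info.micro,
)
print(version)  # prints something like 3.11.4
```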
Thanks!
--
David Caro
SRE - Cloud Services
Wikimedia Foundation <https://wikimediafoundation.org/>
PGP Signature: 7180 83A2 AC8B 314F B4CE 1171 4071 C7E1 D262 69C3
"Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment."