Hi,
We will be upgrading the Toolforge Kubernetes cluster next Monday
(2023-04-03) starting at around 10:00 UTC.
The expected impact is that tools running on the Kubernetes cluster will
get restarted a couple of times over the course of the few hours it
takes for us to upgrade the entire cluster. The ability to manage tools
will remain operational.
Since the version we're upgrading to (1.22) removes a bunch of
deprecated Kubernetes APIs, tools that use kubectl and raw Kubernetes
resources directly may want to check that they're on the latest
available versions. The vast majority of tools that are only using the
Jobs framework and/or the webservice command are not affected by these
changes.
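If your tool talks to the Kubernetes API directly, here is a minimal
sketch (using the official kubernetes Python client; the removed-API
list below comes from the upstream 1.22 deprecation notes, not from
this announcement) for checking which of those group/versions your
cluster still serves, so you can compare against your manifests:

    # Minimal sketch, assuming the official `kubernetes` Python client
    # and a working kubeconfig for your tool. The removed-API list is
    # taken from the upstream Kubernetes 1.22 deprecation notes.
    from kubernetes import client, config

    config.load_kube_config()

    served = set()
    for group in client.ApisApi().get_api_versions().groups:
        for v in group.versions:
            served.add(v.group_version)

    # APIs removed in 1.22; if your manifests reference one of these,
    # update them before the upgrade.
    removed_in_1_22 = {
        "extensions/v1beta1",
        "networking.k8s.io/v1beta1",
        "rbac.authorization.k8s.io/v1beta1",
        "admissionregistration.k8s.io/v1beta1",
    }
    for api in sorted(removed_in_1_22):
        print(api, "-> still served" if api in served else "-> not served")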
Taavi
There will be two major Toolforge outages this coming week. Each outage
will cause tool downtime and may require manual restarts afterwards.
The first outage is an NFS migration [0] and will take place on Monday,
beginning at around 0:00 UTC and lasting well into the day, possibly as
late as 19:00 UTC. During this long period, Toolforge NFS will be
read-only. This will cause most tools (for example, anything that writes
a log file) to fail.
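If your tool needs to keep running through the read-only window, one
option is to guard its writes; a minimal sketch (the fall-back-to-stderr
behaviour is an illustrative assumption, not official guidance):

    # Minimal sketch: tolerate a temporarily read-only NFS home by
    # sending log lines to stderr instead. Illustrative only.
    import errno
    import sys

    def append_log(path, line):
        try:
            with open(path, "a") as f:
                f.write(line + "\n")
        except OSError as e:
            if e.errno != errno.EROFS:  # re-raise anything that is not
                raise                   # "read-only filesystem"
            print(line, file=sys.stderr)

    append_log("tool.log", "heartbeat")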
The second outage will be a database migration [1] and will take place
on Thursday at 17:00 UTC. During this window ToolsDB will be read-only.
This migration should take about an hour but unexpected side-effects may
extend the downtime.
We try very hard to avoid outages of this magnitude, but at this point
we need to choose downtime over the increasing risk of data loss.
More details can be found below.
[0] NFS outage and system reboots Monday: The existing Toolforge NFS
server is running on aging hardware and lacks a straightforward path for
maintenance or upgrading. To improve this we are moving NFS to a
cinder+VM platform which should support easier upgrades, migrations, and
expansions in the future. In order to maintain data integrity during the
migration, the old server will need to be made read-only while the last
set of file changes is synchronized with the new server. Because the NFS
service is actively used, it will take many hours to complete the final
sync.
To ensure stable mounts of the new server, every node in Toolforge will
be rebooted as part of this migration. That means that even tools which
do not use NFS will be affected, although most tools should restart
gracefully.
This task is documented at https://phabricator.wikimedia.org/T333477.
[1] DB outage Thursday: As part of the ongoing effort to upgrade
user-created Toolforge databases, we will migrate ToolsDB to a new VM
that will have a more recent version of Debian and MariaDB and will use
a more resilient storage solution.
The new VM is ready, and we plan to point all tools to use it on *Apr 6,
2023 at 17:00 UTC*.
This will involve about *1 hour of read-only time* for the database. Any
existing database connection will be terminated, and if your tool does
not reconnect automatically you might have to restart it manually.
An email will be sent shortly before starting the migration, and when
it's finished.
Please also make sure your tool is connecting to the database using the
canonical hostname *tools.db.svc.wikimedia.cloud* and not any other
hostname or IP address.
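If your tool holds a long-lived connection, a simple reconnect-on-error
wrapper is usually enough to ride out the cutover. A minimal sketch,
assuming pymysql and the per-tool ~/replica.my.cnf credentials file
(adjust for your tool's actual setup):

    # Minimal sketch: reconnect to ToolsDB if the connection is dropped
    # during the migration. pymysql and ~/replica.my.cnf are assumptions
    # about your tool's setup.
    import os
    import time
    import pymysql

    def query_with_retry(sql, retries=5, delay=10):
        for attempt in range(retries):
            try:
                conn = pymysql.connect(
                    host="tools.db.svc.wikimedia.cloud",  # canonical hostname
                    read_default_file=os.path.expanduser("~/replica.my.cnf"),
                )
                try:
                    with conn.cursor() as cur:
                        cur.execute(sql)
                        return cur.fetchall()
                finally:
                    conn.close()
            except pymysql.err.OperationalError:
                if attempt == retries - 1:
                    raise
                time.sleep(delay)  # wait for the migration to finish

    print(query_with_retry("SELECT 1"))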
For more details, and to report any issues, you can read or leave a
comment at https://phabricator.wikimedia.org/T333471
For more context you can also check out the parent task
https://phabricator.wikimedia.org/T301949
On 3/30/23 8:24 AM, Roy Smith wrote:
> Just to make sure I'm clear, the downtime announced yesterday
> <https://lists.wikimedia.org/hyperkitty/list/cloud-announce@lists.wikimedia.…> is
> still happening?
That's correct, the upcoming downtimes are still happening. These three
projects are largely unrelated, so we're trying not to do them all at the
same time.
>
>> On Mar 30, 2023, at 6:42 AM, Arturo Borrero Gonzalez
>> <aborrero@wikimedia.org> wrote:
>>
>> On 3/28/23 00:13, Taavi Väänänen wrote:
>>> Hi,
>>> We will be upgrading the Toolforge Kubernetes cluster next Monday
>>> (2023-04-03) starting at around 10:00 UTC.
>>> The expected impact is that tools running on the Kubernetes cluster
>>> will get restarted a couple of times over the course of the few
>>> hours it takes for us to upgrade the entire cluster. The ability to
>>> manage tools will remain operational.
>>> Since the version we're upgrading to (1.22) removes a bunch of
>>> deprecated Kubernetes APIs, tools that use kubectl and raw
>>> Kubernetes resources directly may want to check that they're on the
>>> latest available versions. The vast majority of tools that are only
>>> using the Jobs framework and/or the webservice command are not
>>> affected by these changes.
>>
>> This has been rescheduled to Monday 2023-04-10 to leave room for the
>> other operations we have.
>>
>> regards.
>>
>> --
>> Arturo Borrero Gonzalez
>> Senior SRE / Wikimedia Cloud Services
>> Wikimedia Foundation
Due to unavoidable network switch maintenance[0], some WMCS services
will be offline briefly tomorrow. The downtime will last for 20-30
minutes and take place sometime between 14:00 and 16:00 UTC.
Here is what to expect during the downtime:
* *ToolsDB will be unavailable and all queries will fail*
* Some of the wiki replica databases may be unavailable
* Some DNS servers will be offline; some services may fail to resolve
hosts, depending on their fallback logic
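Tools that want to tolerate the brief DNS blip can retry resolution
instead of failing on the first error; a minimal sketch (the retry
policy is an illustrative assumption):

    # Minimal sketch: retry hostname resolution during a short DNS
    # outage. The retry count and delay are illustrative assumptions.
    import socket
    import time

    def resolve_with_retry(host, retries=6, delay=5):
        for attempt in range(retries):
            try:
                return socket.gethostbyname(host)
            except socket.gaierror:
                if attempt == retries - 1:
                    raise
                time.sleep(delay)

    print(resolve_with_retry("tools.db.svc.wikimedia.cloud"))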
We anticipate a graceful recovery from this outage, but NFS is fickle so
we may need to reboot some or all VMs after the outage.
Sorry in advance for any inconvenience or upset emails that result from
this maintenance.
- Andrew + the WMCS team
[0] https://phabricator.wikimedia.org/T330165
PAWS will be switching k8s clusters to get to the latest k8s that OpenStack
currently supports (1.23). This should occur on 2023-03-20 around 13:00
UTC. Anything running on the current (old) cluster at the time will need
to be restarted.
https://phabricator.wikimedia.org/T328489
--
*Vivian Rook (They/Them)*
Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
From 12:00 to 15:00 UTC on 2023-03-14, PAWS is cutting over its NFS storage.
As a result anything saved during this time frame will likely be lost.
Please do not save any files (or have bots save any files) during this time
as they likely will not make it through the cutover.
We'll send out an email to cloud-announce noting when the cutover is done
and it is safe to save files again.
More information can be found in the following tickets:
https://phabricator.wikimedia.org/T331056
https://phabricator.wikimedia.org/T303663
https://phabricator.wikimedia.org/T301280
Thank you,
--
*Vivian Rook (They/Them)*
Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
I am in the process of standardizing[0] the role names in WMCS cloud-vps
to conform with upstream conventions[1]. That requires me to rename two
existing user roles, 'user' and 'projectadmin':
- The role previously called 'user' will now be called 'reader'
- The role previously called 'projectadmin' will now be called 'member'
Despite the (IMO) less obvious names, a 'reader' can still log into
project VMs, and a 'member' can still create and delete VMs. Taavi has
thoughtfully updated the documentation about which roles can do what;
the complete docs can be found at
https://wikitech.wikimedia.org/wiki/Help:Cloud_services_user_roles_and_righ…
This renaming is phase one; phase two will involve switching to the
default upstream access rules for these two new roles.
Right now the old and new roles are co-existing in our system, but soon
I will entirely delete the old 'user' and 'projectadmin' roles. In the
meantime, please let me know if you find stray references to the old
role names, or if you find yourself unable to perform Horizon actions[2]
that you were previously able to do. Or, more seriously, able to do
things that you were not previously able to do!
Sorry for any inconvenience caused!
-Andrew
[0] Our OpenStack deployment has a very long history; it is older than
most deployments. That means that many conventions established in our
cloud now differ from the consensus standards created among newer
clouds. Periodically I try to update our cloud to conform to these new
standards; it reduces tech debt and also increases the chances that
official OpenStack documentation will be useful to our users.
[1] https://phabricator.wikimedia.org/T330759
[2] There is one edge case in Horizon that may require you to switch
projects in order to refresh the role permissions.
Hi there!
Today, 2023-03-06, in a few minutes, we will restart the Toolforge internal
network. A brief interruption of network communications is expected during
the maintenance.
This is because we're re-deploying Calico to the Kubernetes cluster [0].
No action required on your side.
regards.
[0] https://phabricator.wikimedia.org/T328539
--
Arturo Borrero Gonzalez
Senior SRE / Wikimedia Cloud Services
Wikimedia Foundation
As part of the ToolsDB migration work [1], about one hour from now I
will stop ToolsDB for a very short time (I expect the downtime to last
approximately 2 minutes).
You can follow along and report any issues in the #wikimedia-cloud IRC channel.
Thanks,
Francesco
[1] https://phabricator.wikimedia.org/T301949
--
Francesco Negri (he/him) -- IRC: dhinus
Site Reliability Engineer, Cloud Services team
Wikimedia Foundation