Thanks Bryan - sorry for not answering faster, but looks like you only
replied to cloud-admin and I am not there :-)
Today in our 1:1 this subject came up and Jaime forwarded me the mail, as
he is in cloud-admin hehe.
Answers inline!
On Mon, Mar 30, 2020 at 1:02 PM Jaime Crespo <jcrespo(a)wikimedia.org> wrote:
> ---------- Forwarded message ---------
> From: Bryan Davis <bd808(a)wikimedia.org>
> Date: Tue, Mar 24, 2020 at 3:56 PM
> Subject: Re: [Cloud-admin] Update 1 labsdb host to buster and 10.4
> To: Cloud Services administration and infrastructure discussion
> <cloud-admin(a)lists.wikimedia.org>
>
>
> On Tue, Mar 24, 2020 at 2:36 AM Manuel Arostegui
> <marostegui(a)wikimedia.org> wrote:
> >
> > So far we have upgraded normal single-instance hosts and multi-instance
> > hosts (2 mysqld processes), and we still need to upgrade a multisource
> > (labsdb) host, to make sure 10.4 works fine there or to find out what
> > needs extra work (e.g. mysqld-exporter,
> > https://phabricator.wikimedia.org/T247290); better to know in advance.
> >
> > 10.4 also fixes some bugs that are hitting labsdb hosts specifically:
> > - Grants race condition: https://jira.mariadb.org/browse/MDEV-14732
> > - GTID works on multisource: https://jira.mariadb.org/browse/MDEV-12012
> > this is one of the early bugs we filed with MariaDB almost 3 years ago,
> > and it looks like it is now working, although it requires some work on
> > the master's side. My latest tests look good, and if we can enable GTID
> > on labsdb hosts that'd be a BIG improvement towards avoiding corruption
> > during a crash.
>
> These all sound like good things. And thank you very much, seriously,
> for the effort you have been putting into thinking about and caring
> for the wiki replicas.
>
You are welcome! :-)
>
> > So, any objections to reimaging labsdb1011 as Buster and 10.4? (/srv
> > won't be formatted, so we don't have to rebuild that host.)
>
> Any idea what the roll back plan would look like if it turns out that
> something about 10.4 and multisource does not work well together? Would
> it be less risky to do labsdb1012 first and see how it works there?
>
The rollback plan is basically: reimage back to Stretch and reclone from
labsdb1012.
The idea behind using labsdb1011 is to test 10.4 in this very unique
environment (lots of heavy queries).
labsdb1012 is barely used, and only for a few days each month, so it
wouldn't be representative. The rollback is also easier because we can use
labsdb1012 as the recloning source: it is normally fine to stop it for 24h
(as long as that is not during the few days it is used at the start of each
month), so there is no user impact there.
Stopping a wiki replica, on the other hand, means putting more pressure on
the other 2 hosts for as long as it is stopped.
Does this make sense and answer your question?
Thank you!
Manuel.
Hello!
As I have commented on some tickets, MariaDB 10.1 will reach its
end-of-life on 17th October 2020 (
https://mariadb.com/wp-content/uploads/2019/07/mariadb-engineering-policies…
).
Debian Buster doesn't ship 10.1, but 10.4. We have been testing 10.4 in
production for quite some time already (
https://phabricator.wikimedia.org/T242702
https://phabricator.wikimedia.org/T246604 ).
We have filed some bugs, but nothing super worrying has been observed in
production (bugs:
https://jira.mariadb.org/browse/MDEV-21794
https://jira.mariadb.org/browse/MDEV-21813
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=953040 ).
A few weeks ago, during some Wikidata overloads, we observed that the host
with Buster and 10.4 performed better, CPU-wise, than the ones running 10.1
and Stretch with InnoDB compression, on the same hardware. We haven't fully
figured out why, but we believe 10.4 might include some optimizations.
There have also been some Quarry reports about queries being slow (
https://phabricator.wikimedia.org/T246970
https://phabricator.wikimedia.org/T247978 ).
Since Quarry points to labsdb1011, and in order to rule out an issue
specific to that host, we thought about replacing labsdb1011 with another
host (https://phabricator.wikimedia.org/T247978#5980018), but maybe we can
just try upgrading labsdb1011 to Buster and 10.4 instead.
So far we have upgraded normal single-instance hosts and multi-instance
hosts (2 mysqld processes), and we still need to upgrade a multisource
(labsdb) host, to make sure 10.4 works fine there or to find out what needs
extra work (e.g. mysqld-exporter,
https://phabricator.wikimedia.org/T247290); better to know in advance.
10.4 also fixes some bugs that are hitting labsdb hosts specifically:
- Grants race condition: https://jira.mariadb.org/browse/MDEV-14732
- GTID works on multisource: https://jira.mariadb.org/browse/MDEV-12012
this is one of the early bugs we filed with MariaDB almost 3 years ago, and
it looks like it is now working, although it requires some work on the
master's side. My latest tests look good, and if we can enable GTID on
labsdb hosts that'd be a BIG improvement towards avoiding corruption during
a crash.
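For background on what "GTID on multisource" means here: in MariaDB, each
replication connection normally gets its own gtid_domain_id, and the
replica's combined position (gtid_slave_pos) is a comma-separated list of
domain-server_id-sequence triples, one per domain. A small Python sketch of
that format (the GTID values below are made up, not from our hosts):

```python
# Parse a MariaDB gtid_slave_pos value into per-domain positions.
# In multisource setups each replication connection typically uses its
# own gtid_domain_id, so the combined position is a comma-separated list
# of "domain-server_id-sequence" triples. Values below are made up.

def parse_gtid_pos(gtid_pos: str) -> dict:
    """Map gtid_domain_id -> (server_id, sequence_number)."""
    positions = {}
    for triple in gtid_pos.split(","):
        domain, server_id, seq = (int(part) for part in triple.strip().split("-"))
        positions[domain] = (server_id, seq)
    return positions

# Example: a replica consuming from three masters (domains 1, 2 and 3).
pos = parse_gtid_pos("1-171966669-1045,2-171970594-980,3-171974884-5012")
print(pos[2])  # (171970594, 980)
```

In practice the per-connection state comes from SHOW ALL SLAVES STATUS and
@@gtid_slave_pos on the replica; the parser above is only illustrative.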
So, any objections to reimaging labsdb1011 as Buster and 10.4? (/srv won't
be formatted, so we don't have to rebuild that host.)
Cheers
Manuel.
Hello!
While working on m5 master I realised that the nova database has not had
any writes for a year or so:
root@db1133:/srv/sqldata/nova# ls -lhrt *.ibd | tail -n5
-rw-rw---- 1 mysql mysql 48M Apr 16 2019 instance_metadata.ibd
-rw-rw---- 1 mysql mysql 2.9G Apr 16 2019 instance_system_metadata.ibd
-rw-rw---- 1 mysql mysql 484M Apr 16 2019 block_device_mapping.ibd
-rw-rw---- 1 mysql mysql 304K Apr 17 2019 compute_nodes.ibd
-rw-rw---- 1 mysql mysql 128K Apr 26 2019 services.ibd
Same for nova_api:
root@db1133:/srv/sqldata# ls -lhrt nova_api/*.ibd | tail -n5
-rw-rw---- 1 mysql mysql 128K May 23 2018 nova_api/flavor_extra_specs.ibd
-rw-rw---- 1 mysql mysql 128K May 23 2018 nova_api/cell_mappings.ibd
-rw-rw---- 1 mysql mysql 96K May 23 2018 nova_api/migrate_version.ibd
-rw-rw---- 1 mysql mysql 176K Apr 16 2019 nova_api/build_requests.ibd
-rw-rw---- 1 mysql mysql 364M Apr 16 2019 nova_api/request_specs.ibd
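The same mtime check can be scripted; here is a sketch (the function name
and one-year threshold are mine, and note that file mtimes only show the
datafiles haven't been written, so binlogs should be checked before
actually dropping anything):

```python
# Flag InnoDB tablespaces whose files have not been touched in ~a year,
# mirroring the manual `ls -lhrt *.ibd` check above. The directory path
# is illustrative; point it at the schema directory being inspected.
import os
import time

def stale_tablespaces(datadir: str, max_age_days: int = 365) -> list:
    """Return names of .ibd files not modified within max_age_days."""
    cutoff = time.time() - max_age_days * 86400
    stale = []
    for name in sorted(os.listdir(datadir)):
        path = os.path.join(datadir, name)
        if name.endswith(".ibd") and os.path.getmtime(path) < cutoff:
            stale.append(name)
    return stale

# e.g. stale_tablespaces("/srv/sqldata/nova")
```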
Can those be removed?
Thanks
Manuel
Hi Arzhel,
now that our BGP thing is mostly working, I think we can move forward with IPv6
for codfw1dev, which was my original intention before doing anything related to
BGP :-P
We would need to:
* allocate a small IPv6 prefix for the transport network (between cr-codfw and
the neutron virtual router). I'm open to suggestions on the size of this subnet.
* allocate a big IPv6 prefix that we could use internally inside the cloud
environment (for VMs). This should probably be a /48? I'm not sure yet
whether I would subnet this big prefix and allocate smaller ones (/64) per
project, or whether we would use the big one directly without
sub-delegation. (Something to research anyway.)
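To make the /48 question concrete, a quick sketch with Python's ipaddress
module of what per-project /64 delegation out of a /48 would look like
(2001:db8::/48 is the RFC 3849 documentation prefix, standing in for
whatever prefix would actually be allocated):

```python
# Sketch of carving per-project /64s out of a cloud-wide /48.
# 2001:db8::/48 is the documentation prefix, used here only as a
# stand-in for a real allocation.
import ipaddress

cloud_prefix = ipaddress.ip_network("2001:db8::/48")

# A /48 holds 2**16 = 65536 /64s, i.e. plenty for one per project.
per_project = cloud_prefix.subnets(new_prefix=64)
first_three = [str(net) for _, net in zip(range(3), per_project)]
print(first_three)
# ['2001:db8::/64', '2001:db8:0:1::/64', '2001:db8:0:2::/64']
```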
I will follow up next week. For the record, doing this research and design is
one of my goals/OKRs for this quarter (3 weeks left!). My plan is to use this
wikitech page as my deliverable:
https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/Enhanceme…
regards.
--
Arturo Borrero Gonzalez
SRE / Wikimedia Cloud Services
Wikimedia Foundation
The sudden arrival of the wdqs cloudvirts (T221631) has provided a very
straightforward use case for nova host aggregates. We'd already been
planning to adopt them at some point (T226731) so I've gone ahead and
set some up today.
Starting sometime soon (maybe tomorrow!) a host aggregate named
'standard' will replace the existing 'scheduler pool,' and the
profile::openstack::eqiad1::nova::scheduler_pool: hiera key will
vanish. That knowledge will instead live inside the nova database, and
can be queried in a few ways, most simply with '# openstack aggregate
show standard'
I've done my best to document all this[0] but want to call out a few points:
- We will no longer have git history explaining why a given cloudvirt is
pooled or depooled. For that reason it is more important than ever to
!log any change to aggregate membership. I propose we standardize on
the !log admin SAL in -cloud rather than the production -operations SAL
for this.
- In order to reduce the chances of losing track of a hypervisor
entirely, I've created some tracking aggregates. If you remove a
cloudvirt from the 'standard' aggregate, please re-assign it to
'maintenance', 'spare', or 'toobusy' as appropriate.
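The rule above (every cloudvirt belongs to exactly one of 'standard',
'maintenance', 'spare', or 'toobusy') can be sanity-checked mechanically.
A sketch, using hypothetical membership data rather than real
`openstack aggregate show` output:

```python
# Check that every cloudvirt sits in exactly one of the aggregates
# described above. The membership mapping is hypothetical example data;
# in practice it would be built from `openstack aggregate show <name>`.
AGGREGATES = {
    "standard": {"cloudvirt1001", "cloudvirt1002"},
    "maintenance": {"cloudvirt1003"},
    "spare": set(),
    "toobusy": set(),
}

def misplaced_hosts(aggregates: dict) -> set:
    """Hosts appearing in zero or more than one aggregate."""
    all_hosts = set().union(*aggregates.values())
    return {
        host
        for host in all_hosts
        if sum(host in members for members in aggregates.values()) != 1
    }

print(misplaced_hosts(AGGREGATES))  # set() -> everything accounted for
```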
That's it! If anyone really hates this please let me know and I can
roll things back. I'm already regretting the name 'standard', but at
least it's not as badly overloaded as my first choice, 'public'.
[0]
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Host_aggregates