As of 2024-03-14T11:02 UTC the Toolforge Grid Engine service has been
shut down.[0][1]
This shutdown is the culmination of a final migration process from
Grid Engine to Kubernetes that started in late 2022.[2] Arturo
wrote a blog post in 2022 that gives a detailed explanation of why we
chose to take on the final shutdown project at that time.[3] The roots
of this change go back much further, however, to at least August 2015,
when Yuvi Panda posted to the labs-l list about looking for more
modern alternatives to the Grid Engine platform.[4]
Some tools have been lost, and a few technical volunteers have been
upset, as many of us have striven to meet a vision of a more secure,
performant, and maintainable platform for running the many critical
tools hosted by the Toolforge project. I am deeply sorry to each of
you who have been frustrated by this change, but today I stand to
celebrate the collective work and accomplishment of the many humans
who have helped imagine, design, implement, test, document, maintain,
and use the Kubernetes deployment and support systems in Toolforge.
Thank you to the past and present members of the Wikimedia Cloud
Services team. Thank you to the past and present technical volunteers
acting as Toolforge admins. Thank you to the many, many Toolforge tool
maintainers who use the platform, ask for new capabilities, and help
each other make ever better software for the Wikimedia movement. Thank
you to the folks who will keep moving the Toolforge project and
other technical spaces in the Wikimedia movement forward for many,
many years to come.
[0]: https://sal.toolforge.org/log/DrOgPI4BGiVuUzOd9I1b
[1]: https://wikitech.wikimedia.org/wiki/Obsolete:Toolforge/Grid
[2]: https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation#…
[3]: https://techblog.wikimedia.org/2022/03/14/toolforge-and-grid-engine/
[4]: https://lists.wikimedia.org/pipermail/labs-l/2015-August/003955.html
Bryan, on behalf of the Toolforge administrators
--
Bryan Davis
Principal Software Engineer
Wikimedia Foundation
Boise, ID USA
[[m:User:BDavis_(WMF)]]
irc: bd808
Hello all,
We are on the last stretch of the grid engine deprecation process[0],
which means that the grid will be shutting down on Thursday, the 14th of
March.
You can find a reminder of the full timeline here[1].
There are about 30 tools still running on the grid. If yours is among the
few left to migrate, kindly ensure it is migrated before the 14th, or
reach out[2] to the team if you are facing any challenges or need some
assistance.
We have also reached out on Phabricator and via email to the remaining
maintainers who still have tools running on the grid, to see if we
can help ease the migration or if there are any blocking issues.
If you have a tool that is still on the grid and you cannot meet the above
deadline, kindly reach out via the tool migration Phabricator ticket or our
support channels[2]. Note that this is a hard deadline and no extensions
will be granted, but we might be able to help you with the transition.
We really appreciate all the effort and feedback given on the new platform;
it will help us improve our service and reduce the long-term maintenance
burden for tool maintainers and Toolforge admins alike.
[0]:
https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation
[1]:
https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation#…
[2]:
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/About_Toolforge#Commun…
--
Seyram Komla Sapaty
Developer Advocate
Wikimedia Cloud Services
Hi all!
Good news: we have enabled health checks for all the webservices running on
Toolforge.
No action is required on your part; the next time you restart or
stop/start your webservice, it will have a TCP health check by default (just
making sure something is listening).
The most interesting feature, though, is the ability to pass a URL to use as
an HTTP health check. To do so, pass `--health-check-url /path/to/health` to
your `toolforge webservice start` command, and Toolforge will automatically
restart your webservice if it stops responding at that path (you can set the
path to whatever you want, e.g. `/`).
Note that this URL will be queried quite often, so try to avoid pointing it
at a page that uses many resources.
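For example, a minimal sketch assuming your tool serves a lightweight status
page at /healthz (the path is illustrative; check `toolforge webservice
--help` for the exact flag placement):

  # stop, then start with an HTTP health check on /healthz
  toolforge webservice stop
  toolforge webservice start --health-check-url /healthz

Toolforge will then probe /healthz periodically and restart the webservice
if it stops responding there.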
Also a reminder that you can find this and smaller user-facing updates about
the Toolforge platform features here:
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Changelog
Original task: https://phabricator.wikimedia.org/T341919
Cheers!
--
David Caro
SRE - Cloud Services
Wikimedia Foundation <https://wikimediafoundation.org/>
PGP Signature: 7180 83A2 AC8B 314F B4CE 1171 4071 C7E1 D262 69C3
"Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment."
Hello,
Starting on Wednesday (14th February), selected tools will stop running on
Grid Engine.
The tools will be stopped from running but the code and data will not be
deleted.
If you want your tool to be re-enabled, please reach out to the cloud
admins on the mailing list or on the tool's migration ticket.
Those who have reached out to ask for an extension are not affected by this.
Here's a reminder of the timeline we are following[0].
[0]:
https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation#…
--
Seyram Komla Sapaty
Developer Advocate
Wikimedia Cloud Services
Hello all!
We have been hard at work on our Graph Split experiment [1], and we now
have a working graph split that is loaded onto 3 test servers. We are
running tests on a selection of queries from our logs to help understand
the impact of the split. We need your help to validate the impact of
various use cases and workflows around Wikidata Query Service.
**What is the WDQS Graph Split experiment?**
We want to address the growing size of the Wikidata graph by splitting it
into 2 subgraphs of roughly half the size of the full graph, which should
support the growth of Wikidata for the next 5 years. This experiment is
about splitting the full Wikidata graph into a scholarly articles subgraph
and a “main” graph that contains everything else.
See our previous update for more details [2].
**Who should care?**
Anyone who uses WDQS through the UI or programmatically should check the
impact on their use cases, scripts, bots, code, etc.
**What are those test endpoints?**
We expose 3 test endpoints, for the full, main and scholarly articles
graphs. Those graphs are all created from the same dump and are not live
updated. This allows us to compare queries between the different endpoints,
with stable / non changing data (the data are from the middle of October
2023).
The endpoints are:
* https://query-full-experimental.wikidata.org/
* https://query-main-experimental.wikidata.org/
* https://query-scholarly-experimental.wikidata.org/
Each of the endpoints is backed by a single dedicated server of performance
similar to the production WDQS servers. We don’t expect performance to be
representative of production due to the different load and to the lack of
updates on the test servers.
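As a rough sketch of how you might compare the endpoints programmatically
(this assumes the test servers accept standard SPARQL-over-HTTP GET requests
at a /sparql path, as the production WDQS endpoint does; the exact path on
the test servers is an assumption):

  # run the same simple query against the full and main endpoints,
  # then compare the results
  QUERY='SELECT ?class WHERE { wd:Q42 wdt:P31 ?class }'
  curl -sG -H 'Accept: application/sparql-results+json' \
    --data-urlencode "query=$QUERY" \
    https://query-full-experimental.wikidata.org/sparql
  curl -sG -H 'Accept: application/sparql-results+json' \
    --data-urlencode "query=$QUERY" \
    https://query-main-experimental.wikidata.org/sparql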
**What kind of feedback is useful?**
We expect queries that don’t require scholarly articles to work
transparently on the “main” subgraph. We expect that queries requiring
scholarly articles will need to be rewritten with SPARQL federation between
the “main” and scholarly subgraphs (federation is already supported for some
external SPARQL servers [3]; this just happens to be internal
server-to-server communication). We are doing tests and analysis based on a
sample of query logs.
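As an illustrative sketch of such a rewrite (the /sparql path on the
scholarly test endpoint is an assumption), a query run against the “main”
endpoint could fetch scholarly-article data through a SERVICE clause:

  # run against https://query-main-experimental.wikidata.org/
  SELECT ?article ?author WHERE {
    ?author wdt:P27 wd:Q39 .            # "main" graph: Swiss citizens
    SERVICE <https://query-scholarly-experimental.wikidata.org/sparql> {
      ?article wdt:P31 wd:Q13442814 .   # instance of: scholarly article
      ?article wdt:P50 ?author .        # articles by those authors
    }
  }
  LIMIT 10

Queries that touch only the “main” data should run unchanged.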
**We want to hear about:**
* General use cases or classes of queries which break under federation
* Bots or applications that need significant rewrite of queries to work
with federation
* And also about use cases that work just fine!
Examples of queries and pointers to code will be helpful in your feedback.
**Where should feedback be sent?**
You can reach out to us using the project’s talk page [1], the Phabricator
ticket for community feedback [4] or by pinging directly Sannita (WMF) [5].
**Will feedback be taken into account?**
Yes! We will review feedback and it will influence our path forward. That
being said, there are limits to what is possible. The size of the Wikidata
graph is a threat to the stability of WDQS and thus a threat to the whole
Wikidata project. The scholarly articles split is the only one we know of
that would reduce the graph size sufficiently. We can work together on
providing support for a migration and on reviewing the rules used for the
graph split, but we can’t just ignore the problem and continue with a WDQS
that provides transparent access to the full Wikidata graph.
Have fun!
Guillaume
[1]
https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_graph_split
[2]
https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_up…
[3]
https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#Federation
[4] https://phabricator.wikimedia.org/T356773
[5] https://www.wikidata.org/wiki/User:Sannita_(WMF)
--
Guillaume Lederrey (he/him)
Engineering Manager
Wikimedia Foundation
Hello all,
As we continue work towards the grid engine deprecation[0], we are
following through on the timeline[1] we shared.
We have reached out to each maintainer via email and talk pages with
reminders.
On the 14th of this month (February), we will begin to stop tools that are
still running on the grid.
Tools whose maintainers have reached out and are actively migrating can ask
for extensions and will not be stopped.
Once a tool is stopped, if the maintainer has a clear plan for migrating,
they can request in the tool-specific migration task for the tool to be
re-enabled (although they will be shut down again if they miss the March
deadline).
If you have a tool that is still on the grid and you cannot meet the above
deadline, kindly reach out via the tool migration Phabricator ticket or our
support channels[2].
[0]:
https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation
[1]:
https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation#…
[2]:
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/About_Toolforge#Commun…
--
Seyram Komla Sapaty
Developer Advocate
Wikimedia Cloud Services
Toolforge has just suffered a partial grid engine outage. All grid
services should be back to normal as of this email; some k8s services
may misbehave for the next hour or two.
NFS misbehavior resulted in grid control mechanisms timing out, which
meant that no new jobs could be scheduled for the last 90 minutes or so.
We've rebooted the NFS server, which has resolved the primary issues;
however, rebooting NFS is itself disruptive and may have caused other
jobs (both on the grid and in k8s) to fail.
We're currently rebooting all k8s worker nodes, which will take a couple
of hours to complete. During those reboots some jobs may fail or
experience surprise rescheduling.
Sorry for the outage! If your grid job was disrupted by this outage,
please take this as a sign to migrate your service off the grid! Details
about the grid shutdown can be found here:
https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation#…
-Andrew (+ Taavi who did most of the actual recovery work)
Hello!
After our initial announcement of the Grid Engine shutdown timeline[0],
some of you raised concerns about losing your tools.
We want to address those apprehensions while hopefully providing
reassurance. No tools will be deleted until the grid engine shutdown date
on 14 February 2024. However, for tools with unreachable maintainers, an
outage will happen starting on 14 December 2023[1]. This is intended to
raise awareness for users or maintainers who have not otherwise been
reached. A list of these tools can be found here[2]. If you are a
maintainer or a user of a tool in this list, comment on the associated
Phabricator ticket with migration plans or a request for more support. The
goal is to have a plan for all tools running on the grid. We want all
actively used tools to be migrated, and will help support users of critical
tools without a maintainer. Thanks for your help in identifying and
migrating those tools you maintain and depend on.
We acknowledge that the timeline might seem tight, and we want to clarify
that our approach is to make this process as seamless as possible. We have
been actively engaging with tool maintainers over the past year, and we
genuinely appreciate the efforts many of you have already made to migrate
your tools to Kubernetes.
We will continue to work closely with maintainers who might need additional
time or assistance.
If for any reason you have not received a Phabricator ticket for your tool,
please reach out.
The Phabricator ticket is a good place to communicate your needs and plans
for any remaining tools or jobs.
This will help us further organize and plan this process.
Our primary goal is to support you through this transition. If you have
further concerns about the deadline or if you need assistance with the
migration process, please don't hesitate to reach out to us. We are
available on IRC, Telegram, Phabricator[3], and through our other support
channels[4].
Do you still have concerns or questions? Please let us know. We want to do
this together with you, in a way which makes sense to everyone. We’re very
grateful for all the hard work you do, and our only goal here is to secure
the future of tools in the Wikimedia sphere, not to make your lives more
difficult.
Thank you!
[0]:
https://lists.wikimedia.org/hyperkitty/list/cloud-announce@lists.wikimedia.…
[1]:
https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation#…
[2]:
https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation/…
[3]: https://phabricator.wikimedia.org/project/board/6135/
[4]:
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/About_Toolforge#Commun…
--
Seyram Komla Sapaty
Developer Advocate
Wikimedia Cloud Services
We are experiencing networking issues on Cloud VPS, which means that no
traffic is currently getting in or out of Cloud VPS. Toolforge is
also down.
We are working on it and progress is tracked at
https://phabricator.wikimedia.org/T352539
We will send an update when things are working again, thanks for your patience.
--
Francesco Negri (he/him) -- IRC: dhinus
Site Reliability Engineer, Cloud Services team
Wikimedia Foundation
Later today, I am upgrading our OpenStack deployment from version Zed to
Antelope.[1]
Expect Cloud VPS to be partially unstable: horizon.wikimedia.org will show
a maintenance message and API calls might fail.
You can follow the upgrade details at
https://phabricator.wikimedia.org/T348843 and on IRC
(#wikimedia-cloud-admin).
[1] https://releases.openstack.org/antelope/
--
Francesco Negri (he/him) -- IRC: dhinus
Site Reliability Engineer, Cloud Services team
Wikimedia Foundation