As of 2024-03-14T11:02 UTC the Toolforge Grid Engine service has been
shut down.[0][1]
This shutdown is the culmination of a final migration process from
Grid Engine to Kubernetes that started in late 2022.[2] Arturo
wrote a blog post in 2022 that gives a detailed explanation of why we
chose to take on the final shutdown project at that time.[3] The roots
of this change go back much further, however, to at least August 2015,
when Yuvi Panda posted to the labs-l list about looking for more
modern alternatives to the Grid Engine platform.[4]
Some tools have been lost, and a few technical volunteers have been
upset, as many of us have striven to meet a vision of a more secure,
performant, and maintainable platform for running the many critical
tools hosted by the Toolforge project. I am deeply sorry to each of
you who have been frustrated by this change, but today I stand to
celebrate the collective work and accomplishment of the many humans
who have helped imagine, design, implement, test, document, maintain,
and use the Kubernetes deployment and support systems in Toolforge.
Thank you to the past and present members of the Wikimedia Cloud
Services team. Thank you to the past and present technical volunteers
acting as Toolforge admins. Thank you to the many, many Toolforge tool
maintainers who use the platform, ask for new capabilities, and help
each other make ever better software for the Wikimedia movement. Thank
you to the folks who will keep moving the Toolforge project and
other technical spaces in the Wikimedia movement forward for many,
many years to come.
[0]: https://sal.toolforge.org/log/DrOgPI4BGiVuUzOd9I1b
[1]: https://wikitech.wikimedia.org/wiki/Obsolete:Toolforge/Grid
[2]: https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation#…
[3]: https://techblog.wikimedia.org/2022/03/14/toolforge-and-grid-engine/
[4]: https://lists.wikimedia.org/pipermail/labs-l/2015-August/003955.html
Bryan, on behalf of the Toolforge administrators
--
Bryan Davis
Principal Software Engineer
Wikimedia Foundation
Boise, ID USA
[[m:User:BDavis_(WMF)]]
irc: bd808
Hello all,
We are on the last stretch of the grid engine deprecation process[0],
which means that the grid will be shutting down on Thursday, the 14th of
March.
You can find a reminder of the full timeline here[1].
There are about 30 tools still running on the grid. If yours is among the
few left to migrate, kindly ensure it is migrated before the 14th, or
reach out[2] to the team if you are facing any challenges or need some
assistance.
We have also reached out on Phabricator and via email to the remaining
maintainers who still have tools running on the grid, to see if we
can help ease the migration or if there are any blocking issues.
If you have a tool that is still on the grid and you cannot meet the above
deadline, kindly reach out via the tool migration Phabricator ticket or our
support channels[2]. Note that this is a hard deadline and no extensions
will be granted, but we might be able to help you with the transition.
We really appreciate all the effort and feedback given on the new platform;
it will help us improve our service and reduce the long-term maintenance
burden for tool maintainers and Toolforge admins alike.
[0]:
https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation
[1]:
https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation#…
[2]:
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/About_Toolforge#Commun…
--
Seyram Komla Sapaty
Developer Advocate
Wikimedia Cloud Services
Hi all!
Good news: we have enabled health checks for all the webservices running on
Toolforge.
No action is required on your part; the next time you restart or
stop/start your webservice, it will have a TCP health check by default (just
making sure something is listening).
The most interesting feature, though, is the ability to pass a URL to use as
an HTTP health check. To do so, pass `--health-check-url /path/to/health` to
your `toolforge webservice start` command, and Toolforge will automatically
restart your webservice if it stops responding at that path (you can set the
path to whatever you want, e.g. `/`).
Note that this URL will be queried quite often, so try to avoid pointing it
at a page that uses many resources.
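For example, a minimal sketch assuming your tool serves a lightweight status
page at /healthz (the path is illustrative; check `toolforge webservice
--help` for the exact flag placement):

  # stop, then start with an HTTP health check on /healthz
  toolforge webservice stop
  toolforge webservice start --health-check-url /healthz

Toolforge will then probe /healthz periodically and restart the webservice
if it stops responding there.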
Also a reminder that you can find this and smaller user-facing updates about
the Toolforge platform features here:
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Changelog
Original task: https://phabricator.wikimedia.org/T341919
Cheers!
--
David Caro
SRE - Cloud Services
Wikimedia Foundation <https://wikimediafoundation.org/>
PGP Signature: 7180 83A2 AC8B 314F B4CE 1171 4071 C7E1 D262 69C3
"Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment."
Hello,
Starting on Wednesday (14th February), selected tools will stop running on
Grid Engine.
The tools will be stopped from running but the code and data will not be
deleted.
If you want your tool to be re-enabled, please reach out to the cloud
admins on the mailing list or on the tool's migration ticket.
Those who have reached out to ask for an extension are not affected by this.
Here's a reminder of the timeline we are following[0].
[0]:
https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation#…
--
Seyram Komla Sapaty
Developer Advocate
Wikimedia Cloud Services
Hello all!
We have been hard at work on our Graph Split experiment [1], and we now
have a working graph split that is loaded onto 3 test servers. We are
running tests on a selection of queries from our logs to help understand
the impact of the split. We need your help to validate the impact of
various use cases and workflows around Wikidata Query Service.
**What is the WDQS Graph Split experiment?**
We want to address the growing size of the Wikidata graph by splitting it
into 2 subgraphs of roughly half the size of the full graph, which should
support the growth of Wikidata for the next 5 years. This experiment is
about splitting the full Wikidata graph into a scholarly articles subgraph
and a “main” graph that contains everything else.
See our previous update for more details [2].
**Who should care?**
Anyone who uses WDQS through the UI or programmatically should check the
impact on their use cases, scripts, bots, code, etc.
**What are those test endpoints?**
We expose 3 test endpoints, for the full, main and scholarly articles
graphs. Those graphs are all created from the same dump and are not live
updated. This allows us to compare queries between the different endpoints,
with stable / non changing data (the data are from the middle of October
2023).
The endpoints are:
* https://query-full-experimental.wikidata.org/
* https://query-main-experimental.wikidata.org/
* https://query-scholarly-experimental.wikidata.org/
Each of the endpoints is backed by a single dedicated server of performance
similar to the production WDQS servers. We don’t expect performance to be
representative of production due to the different load and to the lack of
updates on the test servers.
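As a rough sketch of how you might compare the endpoints programmatically
(this assumes the test servers accept standard SPARQL-over-HTTP GET requests
at a /sparql path, as the production WDQS endpoint does; the exact path on
the test servers is an assumption):

  # run the same simple query against the full and main endpoints,
  # then compare the results
  QUERY='SELECT ?class WHERE { wd:Q42 wdt:P31 ?class }'
  curl -sG -H 'Accept: application/sparql-results+json' \
    --data-urlencode "query=$QUERY" \
    https://query-full-experimental.wikidata.org/sparql
  curl -sG -H 'Accept: application/sparql-results+json' \
    --data-urlencode "query=$QUERY" \
    https://query-main-experimental.wikidata.org/sparql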
**What kind of feedback is useful?**
We expect queries that don’t require scholarly articles to work
transparently on the “main” subgraph. We expect that queries requiring
scholarly articles will need to be rewritten with SPARQL federation between
the “main” and scholarly subgraphs (federation is already supported for some
external SPARQL servers [3]; this just happens to be internal
server-to-server communication). We are doing tests and analysis based on a
sample of query logs.
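As an illustrative sketch of such a rewrite (the /sparql path on the
scholarly test endpoint is an assumption), a query run against the “main”
endpoint could fetch scholarly-article data through a SERVICE clause:

  # run against https://query-main-experimental.wikidata.org/
  SELECT ?article ?author WHERE {
    ?author wdt:P27 wd:Q39 .            # "main" graph: Swiss citizens
    SERVICE <https://query-scholarly-experimental.wikidata.org/sparql> {
      ?article wdt:P31 wd:Q13442814 .   # instance of: scholarly article
      ?article wdt:P50 ?author .        # articles by those authors
    }
  }
  LIMIT 10

Queries that touch only the “main” data should run unchanged.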
**We want to hear about:**
* General use cases or classes of queries which break under federation
* Bots or applications that need significant rewrite of queries to work
with federation
* And also about use cases that work just fine!
Examples of queries and pointers to code will be helpful in your feedback.
**Where should feedback be sent?**
You can reach out to us using the project’s talk page [1], the Phabricator
ticket for community feedback [4] or by pinging directly Sannita (WMF) [5].
**Will feedback be taken into account?**
Yes! We will review feedback and it will influence our path forward. That
being said, there are limits to what is possible. The size of the Wikidata
graph is a threat to the stability of WDQS and thus a threat to the whole
Wikidata project. The scholarly articles split is the only one we know of
that would reduce the graph size sufficiently. We can work together on
providing support for a migration and on reviewing the rules used for the
graph split, but we can’t just ignore the problem and continue with a WDQS
that provides transparent access to the full Wikidata graph.
Have fun!
Guillaume
[1]
https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_graph_split
[2]
https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_up…
[3]
https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#Federation
[4] https://phabricator.wikimedia.org/T356773
[5] https://www.wikidata.org/wiki/User:Sannita_(WMF)
--
Guillaume Lederrey (he/him)
Engineering Manager
Wikimedia Foundation
Hello all,
As we continue work towards the grid engine deprecation[0], we are
following through on the timeline[1] we shared.
We have reached out to each maintainer via email and talk pages with
reminders.
On the 14th of this month (February), we will begin to stop tools that are
still running on the grid.
Tools whose maintainers have reached out and are actively migrating can ask
for extensions and will not be stopped.
Once a tool is stopped, if the maintainer has a clear plan for migrating,
they can request in the tool-specific migration task for the tool to be
re-enabled (although they will be shut down again if they miss the March
deadline).
If you have a tool that is still on the grid and you cannot meet the above
deadline, kindly reach out via the tool migration Phabricator ticket or our
support channels[2].
[0]:
https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation
[1]:
https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation#…
[2]:
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/About_Toolforge#Commun…
--
Seyram Komla Sapaty
Developer Advocate
Wikimedia Cloud Services
Toolforge has just suffered a partial grid engine outage. All grid
services should be back to normal as of this email; some k8s services
may misbehave for the next hour or two.
NFS misbehavior resulted in grid control mechanisms timing out, which
meant that no new jobs could be scheduled for the last 90 minutes or so.
We've rebooted the NFS server, which has resolved the primary issues;
however, rebooting NFS is itself disruptive and may have caused other
jobs (both on the grid and in k8s) to fail.
We're currently rebooting all k8s worker nodes, which will take a couple
of hours to complete. During those reboots some jobs may fail or
experience surprise rescheduling.
Sorry for the outage! If your grid job was disrupted by this outage,
please take this as a sign to migrate your service off the grid! Details
about the grid shutdown can be found here:
https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation#…
-Andrew (+ Taavi who did most of the actual recovery work)
Hello!
After our initial announcement of the Grid Engine shutdown timeline[0],
some of you raised concerns about losing your tools.
We want to address those apprehensions while hopefully providing
reassurance. No tools will be deleted until the grid engine shutdown date
on 14 February 2024. However, for tools with unreachable maintainers, an
outage will happen starting on 14 December 2023[1]. This is intended to
raise awareness for users or maintainers who have not otherwise been
reached. A list of these tools can be found here[2]. If you are a
maintainer or a user of a tool in this list, comment on the associated
Phabricator ticket with migration plans or a request for more support. The
goal is to have a plan for all tools running on the grid. We want all
actively used tools to be migrated, and will help support users of critical
tools without a maintainer. Thanks for your help in identifying and
migrating those tools you maintain and depend on.
We acknowledge that the timeline might seem tight, and we want to clarify
that our approach is to make this process as seamless as possible. We have
been actively engaging with tool maintainers over the past year, and we
genuinely appreciate the efforts many of you have already made to migrate
your tools to Kubernetes.
We will continue to work closely with maintainers who might need additional
time or assistance.
If for any reason you have not received a Phabricator ticket for your tool,
please reach out.
The Phabricator ticket is a good place to communicate your needs and plans
for any remaining tools or jobs.
This will help us further organize and plan this process.
Our primary goal is to support you through this transition. If you have
further concerns about the deadline or if you need assistance with the
migration process, please don't hesitate to reach out to us. We are
available on IRC, Telegram, Phabricator[3], and through our other support
channels[4].
Do you still have concerns or questions? Please let us know. We want to do
this together with you, in a way which makes sense to everyone. We’re very
grateful for all the hard work you do, and our only goal here is to secure
the future of tools in the Wikimedia sphere, not to make your lives more
difficult.
Thank you!
[0]:
https://lists.wikimedia.org/hyperkitty/list/cloud-announce@lists.wikimedia.…
[1]:
https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation#…
[2]:
https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation/…
[3]: https://phabricator.wikimedia.org/project/board/6135/
[4]:
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/About_Toolforge#Commun…
--
Seyram Komla Sapaty
Developer Advocate
Wikimedia Cloud Services
We are experiencing networking issues on Cloud VPS, which means that no
traffic is currently getting in or out of Cloud VPS. Toolforge is
also down.
We are working on it and progress is tracked at
https://phabricator.wikimedia.org/T352539
We will send an update when things are working again, thanks for your patience.
--
Francesco Negri (he/him) -- IRC: dhinus
Site Reliability Engineer, Cloud Services team
Wikimedia Foundation
Later today, I am upgrading our OpenStack deployment from version Zed to
Antelope.[1]
Expect Cloud VPS to be partially unstable: horizon.wikimedia.org will show
a maintenance message and API calls might fail.
You can follow the upgrade details at
https://phabricator.wikimedia.org/T348843 and on IRC
(#wikimedia-cloud-admin).
[1] https://releases.openstack.org/antelope/
--
Francesco Negri (he/him) -- IRC: dhinus
Site Reliability Engineer, Cloud Services team
Wikimedia Foundation