Hello,
we have had an issue with Jira authentication since 25 August.
It seems the synchronisation with the Crowd server is broken, but I don't know why; I have filed a bug with Atlassian.
Cheers
Marlen
Hi,
from about 3:00Z to about 3:20Z, no login was possible on
nightshade and yarrow, willow asked for (non-existent)
passwords, and the webserver returned 404s. MZMcBride had an
open session on willow, and the load on the accessible servers
was within limits
(cf. http://p.defau.lt/?e_zsJIW_rAbfR3Cvlvx9Uw), but reverse
lookup of user names was broken
(cf. http://p.defau.lt/?asmBijtXnvzQacz1e8JXOQ) and
ldapsearch timed out as well
(cf. http://p.defau.lt/?P47PCC3_1d3mnoLyVqFUqQ). This looks
like a failure of the LDAP server.
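The two symptoms above can be reproduced from the command line. A rough sketch, with a placeholder base DN and uid (the mail does not give the actual Toolserver LDAP settings):

```shell
#!/bin/sh
# Sketch of the checks described above. BASE and UID_NUM are
# placeholders, not the real Toolserver LDAP configuration.
BASE='dc=toolserver,dc=org'   # hypothetical base DN
UID_NUM=1234                  # hypothetical numeric uid

# Reverse lookup of a numeric uid to a user name (broken during the outage):
getent passwd "$UID_NUM" || echo "reverse uid lookup failed"

# Direct LDAP query with a 5-second time limit (this timed out as well):
if command -v ldapsearch >/dev/null 2>&1; then
    ldapsearch -x -l 5 -b "$BASE" '(uid=*)' uid || echo "ldapsearch failed or timed out"
else
    echo "ldapsearch not installed"
fi
```

If `getent` fails while local files still resolve, that points at the NSS/LDAP backend rather than the boxes themselves, which matches the diagnosis of an LDAP server failure.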
Two other issues surfaced at that time:
- http://nagios.toolserver.org/ gave 500s during the outage.
  I asked Coren to check with WMF whether it would be
  possible to outsource (or integrate :-)) this monitoring
  into their existing infrastructure
  (http://icinga.wikimedia.org/).
- The listed mail address for the Toolserver admins is
  ts-admins(a)toolserver.org. While this may work during such
  an outage (I didn't try), and personal mail addresses for
  admins can be found in the toolserver-announce archives,
  we should prefer an address that is routed externally, and,
  trying not to be too imaginative, I propose:
  ts-admins(a)wikimedia.de.
Tim
Hello all,
as you may have noticed, I was not online yesterday or today. The reason is
that I have much more to do in real life at the moment, and the flu is
visiting my family. For these reasons I will not be online as much as usual
this week (maybe it will get better at the weekend). If something VERY urgent
happens, please send me a mail and I will look at it when I find time.
As you may also have noticed, sql-s5-user is slower than normal. The reason
is simple: I am importing commons in parallel threads to have it available as
soon as possible. If you need a fast and not-much-behind copy of s5 for
READING, use sql-s5-rr (you should ALWAYS use that, or
dewiki-p.rrdb.toolserver.org, for reading).
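For anyone unsure what "use sql-s5-rr" means in practice, a minimal sketch (the query is a placeholder, and dewiki_p is the usual Toolserver name for the German Wikipedia database; running the real query requires Toolserver access):

```shell
#!/bin/sh
# Sketch: read-only queries go to the replica host, not sql-s5-user.
HOST='sql-s5-rr'                     # or dewiki-p.rrdb.toolserver.org
DB='dewiki_p'
QUERY='SELECT COUNT(*) FROM page;'   # any read-only query

# Shown here instead of executed, since this only works on the Toolserver:
echo "mysql -h $HOST $DB -e \"$QUERY\""
# mysql -h "$HOST" "$DB" -e "$QUERY"
```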
Hope to see you soon.
Sincerely,
DaB.
--
Userpage: [[:w:de:User:DaB.]] — PGP: 0x2d3ee2d42b255885
<futurama>Good news, everyone!</futurama>
As many of you know, I officially started my duties today as the WMF
Operations Engineer attached to the Tool Labs. I intend to make a point
of informing all of you of recent news, what I'm working on, and where
I'm headed at regular intervals (probably weekly).
First, a bit of news: I have had confirmation this weekend that the DB
replication made available to Tool Labs users will, in fact, allow the
creation of databases alongside the project ones. This means that one
of the use cases that seemed the most troublesome in the transition
(joins between the WMF databases and tool-specific ones) will be fully
supported.
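To illustrate the use case this enables, here is a hedged sketch of a tool-owned database living next to a replicated project database, joined in one query. All names (u1234__mytool, the ratings table) are made up for the example, and the real Tool Labs naming scheme may differ:

```shell
#!/bin/sh
# Sketch of the "join project DB with tool DB" use case. Every identifier
# below is a placeholder for illustration only.
SQL=$(cat <<'EOF'
CREATE DATABASE IF NOT EXISTS u1234__mytool;
CREATE TABLE IF NOT EXISTS u1234__mytool.ratings (
    page_id INT UNSIGNED NOT NULL PRIMARY KEY,
    rating  TINYINT      NOT NULL
);
-- The join that was hard to support before: replicated data on one side,
-- tool-specific data on the other, in a single query.
SELECT p.page_title, r.rating
FROM enwiki_p.page AS p
JOIN u1234__mytool.ratings AS r ON r.page_id = p.page_id
LIMIT 10;
EOF
)
echo "$SQL"
```

The key point is that both databases sit on the same server, so the join happens inside MySQL instead of in application code.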
We are making good strides in documenting the *impressive* inventory of
tools that run on toolserver and their requirements (thanks, Silke!).
The list-in-progress can be found at [1]. If you see missing or
incorrect information, please feel free to adjust it -- the more
precisely we know the requirements, the faster we can see about meeting
them.
I've started documenting my preliminary design for the shiny new Tool
Labs infrastructure at [2]. This is a living document, and will see a
great deal of revision before it's over (and will serve as the seed for
the documentation). I will shortly create a new Labs project where that
architecture is deployed in preproduction so we can shake out the
kinks. The existing projects, "bots" and "webtools", will be left active
for the foreseeable future until (a) the new architecture has proven
itself and (b) every user has successfully moved their tools to it.
At the very least, I'm planning on having the new project fully
operational for new tools by the time the Amsterdam Hackathon takes
place at the end of May.
For the next week, I'll be mostly in information-gathering mode, as well
as refining the design and requirements of the Tool Labs. Feel free to
poke me for information (or /with/ information) by email or on IRC
(where I am user 'Coren' and idle on #wikimedia-labs and
#wikimedia-toolserver at the very least).
-- Marc A. Pelletier
[1] http://www.mediawiki.org/wiki/Toolserver/List_of_Tools
[2] http://www.mediawiki.org/wiki/Wikimedia_Labs/Tool_Labs/Design
Hello all,
for an important kernel update I need to reboot the Linux boxes today. The
reboot will happen
TODAY, 20:00 UTC.
The Linux userland boxes will be away for ~10 min and the database servers
(sql-s2 and sql-s2) for ~30 min (all values are estimates). Solaris boxes are
not affected. The reboots will happen sequentially, so SGE should re-schedule
tasks between the boxes and the downtime for each task should be short.
You can follow the progress at [1].
Sincerely,
DaB.
[1] https://jira.toolserver.org/browse/MNT-1297
--
Userpage: [[:w:de:User:DaB.]] — PGP: 0x2d3ee2d42b255885
Hello all,
I just started replication on the fresh dump of s5 and will import commons
later tonight. For tomorrow I plan a second try at moving the user databases
from the old s5 host to the new one. So I hereby announce a read-only time
for s5,
TOMORROW, 21:00 UTC,
of unknown length (it should not take that long, because the owner of the
biggest databases contacted me to say that no move is needed) – at minimum a
few hours.
I will also dump wikidata from this fresh dump and import it everywhere during
the next hours (so a correct wikidata copy should be available everywhere
again soon).
Sincerely,
DaB.
--
Userpage: [[:w:de:User:DaB.]] — PGP: 0x2d3ee2d42b255885
Hello all,
to (hopefully) speed up s2 I need to restart MySQL to bring some config
changes live. Because of this there will be a downtime for sql-s2 starting
TOMORROW, 22:00 UTC.
The downtime should be less than 1 h. You can follow the progress at [1].
Sincerely,
DaB.
[1] https://jira.toolserver.org/browse/MNT-1296
--
Userpage: [[:w:de:User:DaB.]] — PGP: 0x2d3ee2d42b255885
Hello all,
during the maintenance window yesterday evening, the whole cluster was down
for ~30 min starting at ~21:20 UTC. The problem was independent of the
maintenance work, but caused the window to extend.
The problem was an out-of-memory condition on one of our HA nodes.
Unfortunately the box did not restart itself, and its HA buddy did not detect
the problem either, so the services of the out-of-memory box were not switched
to the other box. This caused the whole cluster to stall until I manually
rebooted the host. I will see if I can find some kind of sensor for that; in
the worst case I will enable our old "reboot if low on memory" script again.
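For the curious, such a sensor can be as simple as the following sketch. The threshold is an assumption of mine (the actual old script is not shown here), and it reads /proc/meminfo, so Linux only:

```shell
#!/bin/sh
# Minimal sketch of a low-memory sensor, in the spirit of the old
# "reboot if low on memory" script. THRESHOLD_KB is a made-up value.
THRESHOLD_KB=65536   # hypothetical limit: 64 MiB free

# MemFree in kB, taken from /proc/meminfo (Linux only):
FREE_KB=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)

if [ "$FREE_KB" -lt "$THRESHOLD_KB" ]; then
    echo "low memory: ${FREE_KB} kB free -- would reboot here"
    # /sbin/reboot   # the drastic fallback; kept commented in this sketch
else
    echo "memory OK: ${FREE_KB} kB free"
fi
```

Run from cron every minute or so, this would at least have rebooted the stuck node without manual intervention, though a proper HA health check that the buddy node can see would be the nicer fix.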
Sincerely,
DaB.
--
Userpage: [[:w:de:User:DaB.]] — PGP: 0x2d3ee2d42b255885
There have been multiple reports of corruption, and at least 4 open
tickets in JIRA about issues with enwiki's database going back
about a year. The most notable corruption can be seen in the
user_editcount field. For some users I have seen it as much as 500 edits
higher than their actual count... Not sure how that is occurring.
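One way to measure the discrepancy is to compare the cached counter against the actual revision count. A sketch, using the MediaWiki schema of that era (revision.rev_user); 'Example' is a placeholder user name, and actually running it needs access to an enwiki replica:

```shell
#!/bin/sh
# Sketch: compare the cached user_editcount with the real revision count
# for one (placeholder) user. Shown, not executed, since it needs a replica.
SQL="SELECT user_name, user_editcount, COUNT(rev_id) AS real_count
FROM user JOIN revision ON rev_user = user_id
WHERE user_name = 'Example'
GROUP BY user_id, user_name, user_editcount;"

echo "$SQL"
# mysql -h enwiki-p.rrdb.toolserver.org enwiki_p -e "$SQL"   # actual run
```

Note that user_editcount is a denormalized counter, so deleted or imported revisions can make it diverge legitimately; a difference of 500 on an ordinary account, however, does suggest corruption.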