Hello,
we have had an issue with Jira authentication since 25 August.
It seems the synchronisation with the Crowd server is broken, but I don't know why; I have filed a bug with Atlassian.
Cheers
Marlen
Hi,
from about 3:00Z to about 3:20Z, no login was possible on
nightshade and yarrow, willow asked for (non-existent)
passwords, and the webserver returned 404s. MZMcBride had an
open session on willow, and the load on the accessible servers
was within limits
(cf. http://p.defau.lt/?e_zsJIW_rAbfR3Cvlvx9Uw), but reverse
lookup of user names was broken
(cf. http://p.defau.lt/?asmBijtXnvzQacz1e8JXOQ) and
ldapsearch timed out as well
(cf. http://p.defau.lt/?P47PCC3_1d3mnoLyVqFUqQ). This looks
like a failure of the LDAP server.
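The two symptoms above can be reproduced from the command line. A rough sketch, with a placeholder base DN and uid (the mail does not give the actual Toolserver LDAP settings):

```shell
#!/bin/sh
# Sketch of the checks described above. BASE and UID_NUM are
# placeholders, not the real Toolserver LDAP configuration.
BASE='dc=toolserver,dc=org'   # hypothetical base DN
UID_NUM=1234                  # hypothetical numeric uid

# Reverse lookup of a numeric uid to a user name (broken during the outage):
getent passwd "$UID_NUM" || echo "reverse uid lookup failed"

# Direct LDAP query with a 5-second time limit (this timed out as well):
if command -v ldapsearch >/dev/null 2>&1; then
    ldapsearch -x -l 5 -b "$BASE" '(uid=*)' uid || echo "ldapsearch failed or timed out"
else
    echo "ldapsearch not installed"
fi
```

If `getent` fails while local files still resolve, that points at the NSS/LDAP backend rather than the boxes themselves, which matches the diagnosis of an LDAP server failure.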
Two other issues surfaced at that time:
- http://nagios.toolserver.org/ gave 500s during the outage.
  I asked Coren to check with WMF whether it would be
  possible to outsource (or integrate :-)) this monitoring
  into their existing infrastructure
  (http://icinga.wikimedia.org/).
- The listed mail address for the Toolserver admins is
  ts-admins(a)toolserver.org. While this may work during such
  an outage (I didn't try), and personal mail addresses for
  admins can be found in the toolserver-announce archives,
  we should prefer an address that is routed externally, and,
  trying not to be too imaginative, I propose:
  ts-admins(a)wikimedia.de.
Tim
Hello all,
as you may have noticed, I was not online yesterday or today. The reason is
that I have much more to do in real life at the moment, and the flu is
visiting my family. For these reasons I will not be online as much as usual
this week (maybe it will get better at the weekend). If something VERY urgent
happens, please send me a mail and I will look at it when I find time.
As you may also have noticed, sql-s5-user is slower than normal. The reason
is simple: I am importing commons in parallel threads to have it available as
soon as possible. If you need a fast and not-much-behind copy of s5 for
READING, use sql-s5-rr (you should ALWAYS use that, or
dewiki-p.rrdb.toolserver.org, for reading).
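For anyone unsure what "use sql-s5-rr" means in practice, a minimal sketch (the query is a placeholder, and dewiki_p is the usual Toolserver name for the German Wikipedia database; running the real query requires Toolserver access):

```shell
#!/bin/sh
# Sketch: read-only queries go to the replica host, not sql-s5-user.
HOST='sql-s5-rr'                     # or dewiki-p.rrdb.toolserver.org
DB='dewiki_p'
QUERY='SELECT COUNT(*) FROM page;'   # any read-only query

# Shown here instead of executed, since this only works on the Toolserver:
echo "mysql -h $HOST $DB -e \"$QUERY\""
# mysql -h "$HOST" "$DB" -e "$QUERY"
```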
Hope to see you soon.
Sincerely,
DaB.
--
Userpage: [[:w:de:User:DaB.]] — PGP: 0x2d3ee2d42b255885
<futurama>Good news, everyone!</futurama>
As many of you know, I officially started my duties today as the WMF
Operations Engineer attached to the Tool Labs. I intend to make a point
of informing all of you of recent news, what I'm working on, and where
I'm headed at regular intervals (probably weekly).
First, a bit of news: I have had confirmation this weekend that the DB
replication made available to Tool Labs users will, in fact, allow the
creation of databases alongside the project ones. This means that one
of the use cases that seemed the most troublesome in the transition
(joins between the WMF databases and tool-specific ones) will be fully
supported.
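To illustrate the use case this enables, here is a hedged sketch of a tool-owned database living next to a replicated project database, joined in one query. All names (u1234__mytool, the ratings table) are made up for the example, and the real Tool Labs naming scheme may differ:

```shell
#!/bin/sh
# Sketch of the "join project DB with tool DB" use case. Every identifier
# below is a placeholder for illustration only.
SQL=$(cat <<'EOF'
CREATE DATABASE IF NOT EXISTS u1234__mytool;
CREATE TABLE IF NOT EXISTS u1234__mytool.ratings (
    page_id INT UNSIGNED NOT NULL PRIMARY KEY,
    rating  TINYINT      NOT NULL
);
-- The join that was hard to support before: replicated data on one side,
-- tool-specific data on the other, in a single query.
SELECT p.page_title, r.rating
FROM enwiki_p.page AS p
JOIN u1234__mytool.ratings AS r ON r.page_id = p.page_id
LIMIT 10;
EOF
)
echo "$SQL"
```

The key point is that both databases sit on the same server, so the join happens inside MySQL instead of in application code.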
We are making good strides in documenting the *impressive* inventory of
tools that run on toolserver and their requirements (thanks, Silke!).
The list-in-progress can be found at [1]. If you see missing or
incorrect information, please feel free to adjust it -- the more
precisely we know the requirements, the faster we can see about meeting
them.
I've started documenting my preliminary design for the shiny new Tool
Labs infrastructure at [2]. This is a living document, and will see a
great deal of revision before it's over (and will serve as the seed for
the documentation). I will shortly create a new Labs project where that
architecture is deployed in preproduction so we can shake out the
kinks. The existing projects, "bots" and "webtools", will be left active
for the foreseeable future until (a) the new architecture has proven
itself and (b) every user has successfully moved their tools to it.
At the very least, I'm planning on having the new project fully
operational for new tools by the time the Amsterdam Hackathon takes
place at the end of May.
For the next week, I'll be mostly in information-gathering mode, as well
as refining the design and requirements of the Tool Labs. Feel free to
poke me for information (or /with/ information) by email or on IRC
(where I am user 'Coren' and idle on #wikimedia-labs and
#wikimedia-toolserver at the very least).
-- Marc A. Pelletier
[1] http://www.mediawiki.org/wiki/Toolserver/List_of_Tools
[2] http://www.mediawiki.org/wiki/Wikimedia_Labs/Tool_Labs/Design
Hello all,
for an important kernel update I need to reboot the Linux boxes today. The
reboot will happen
TODAY, 20:00 UTC.
The Linux userland boxes will be away for ~10 min and the database servers
(sql-s2 and sql-s2) for ~30 min (all values are estimates). Solaris boxes are
not affected. The reboots will happen sequentially, so SGE should re-schedule
tasks between the boxes and the downtime for each task should be short.
You can follow the progress at [1].
Sincerely,
DaB.
[1] https://jira.toolserver.org/browse/MNT-1297
--
Userpage: [[:w:de:User:DaB.]] — PGP: 0x2d3ee2d42b255885
Hello all,
I just started replication on the fresh dump of s5 and will import commons
later tonight. For tomorrow I plan a second try at moving the user databases
from the old s5 host to the new one. So I hereby announce a read-only time
for s5,
TOMORROW, 21:00 UTC,
of unknown length (it should not take that long, because the owner of the
biggest databases contacted me to say that no move is needed) – at minimum a
few hours.
I will also dump wikidata from this fresh dump and import it everywhere during
the next hours (so a correct wikidata copy should be available everywhere
again soon).
Sincerely,
DaB.
--
Userpage: [[:w:de:User:DaB.]] — PGP: 0x2d3ee2d42b255885
Hello all,
to (hopefully) speed up s2 I need to restart MySQL to bring some config
changes live. Because of this there will be a downtime for sql-s2 starting
TOMORROW, 22:00 UTC.
The downtime should be less than 1 h. You can follow the progress at [1].
Sincerely,
DaB.
[1] https://jira.toolserver.org/browse/MNT-1296
--
Userpage: [[:w:de:User:DaB.]] — PGP: 0x2d3ee2d42b255885
Hello all,
during the maintenance window yesterday evening, the whole cluster was down
for ~30 min starting at ~21:20 UTC. The problem was independent of the
maintenance work, but caused the window to extend.
The problem was an out-of-memory condition on one of our HA nodes.
Unfortunately the box did not restart itself, and its HA buddy did not detect
the problem either, so the services of the out-of-memory box were not switched
to the other box. This caused the whole cluster to stall until I manually
rebooted the host. I will see if I can find some kind of sensor for that; in
the worst case I will enable our old "reboot if low on memory" script again.
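For the curious, such a sensor can be as simple as the following sketch. The threshold is an assumption of mine (the actual old script is not shown here), and it reads /proc/meminfo, so Linux only:

```shell
#!/bin/sh
# Minimal sketch of a low-memory sensor, in the spirit of the old
# "reboot if low on memory" script. THRESHOLD_KB is a made-up value.
THRESHOLD_KB=65536   # hypothetical limit: 64 MiB free

# MemFree in kB, taken from /proc/meminfo (Linux only):
FREE_KB=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)

if [ "$FREE_KB" -lt "$THRESHOLD_KB" ]; then
    echo "low memory: ${FREE_KB} kB free -- would reboot here"
    # /sbin/reboot   # the drastic fallback; kept commented in this sketch
else
    echo "memory OK: ${FREE_KB} kB free"
fi
```

Run from cron every minute or so, this would at least have rebooted the stuck node without manual intervention, though a proper HA health check that the buddy node can see would be the nicer fix.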
Sincerely,
DaB.
--
Userpage: [[:w:de:User:DaB.]] — PGP: 0x2d3ee2d42b255885
There have been multiple reports of corruption, and at least 4 open
tickets in JIRA about issues with enwiki's database going back
about a year. The most notable corruption can be seen in the
user_editcount field. For some users I have seen it as much as 500 edits
higher than their actual count... Not sure how that is occurring.
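One way to measure the discrepancy is to compare the cached counter against the actual revision count. A sketch, using the MediaWiki schema of that era (revision.rev_user); 'Example' is a placeholder user name, and actually running it needs access to an enwiki replica:

```shell
#!/bin/sh
# Sketch: compare the cached user_editcount with the real revision count
# for one (placeholder) user. Shown, not executed, since it needs a replica.
SQL="SELECT user_name, user_editcount, COUNT(rev_id) AS real_count
FROM user JOIN revision ON rev_user = user_id
WHERE user_name = 'Example'
GROUP BY user_id, user_name, user_editcount;"

echo "$SQL"
# mysql -h enwiki-p.rrdb.toolserver.org enwiki_p -e "$SQL"   # actual run
```

Note that user_editcount is a denormalized counter, so deleted or imported revisions can make it diverge legitimately; a difference of 500 on an ordinary account, however, does suggest corruption.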