Hello all,
As announced in July [1], the default arch of SGE will soon switch from
solaris to *. Originally this was supposed to happen on 1 October, but I forgot
to re-announce it, and I'm sure some of you forgot about it too. So the switch
is hereby announced for
9 October, 20:00 UTC.
Jobs that are executed after this timestamp and have no arch option will run on
any host (linux or solaris) instead of only on a solaris host.
There are 4 ways for you to prepare (sorted):
-Make sure that your program runs on linux AND solaris, and add "arch=*".
-Make sure that your program runs on linux, and add "arch=lx" to run it only there.
-Make sure that your program runs on solaris, and add "arch=sol" to run it only there.
-Do nothing, pray and see things break.
(Somehow I have the feeling most of you will choose 4… please prove me wrong.)
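The first three options can be sketched in Python as follows. This is only an illustration under assumptions: it assumes the arch option is passed to SGE's qsub as a resource request ("-l arch=..."), and the helper name build_qsub_command and the script name job.sh are hypothetical, not part of the announcement.

```python
# Hypothetical helper: build a qsub command line with an explicit arch.
# Assumption: the arch option is given as an SGE resource request ("-l").

VALID_ARCHS = ("*", "lx", "sol")  # the three values named in this mail


def build_qsub_command(script, arch):
    """Return the argument list for submitting `script` with `arch`."""
    if arch not in VALID_ARCHS:
        raise ValueError(f"unknown arch {arch!r}, expected one of {VALID_ARCHS}")
    return ["qsub", "-l", f"arch={arch}", script]


if __name__ == "__main__":
    # Print the command instead of running it, so this is safe to try.
    print(" ".join(build_qsub_command("job.sh", "lx")))
```

Keeping the command as a list means that, if you hand it to subprocess.run, the "*" in "arch=*" is not expanded by a shell.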
Sincerely,
DaB.
[1] http://lists.wikimedia.org/pipermail/toolserver-announce/2012-July/000506.html
--
Userpage: [[:w:de:User:DaB.]] — PGP: 2B255885
Hello all,
As announced last Sunday, I hereby announce a maintenance window for the
web servers:
Monday, 20:00-22:00 UTC.
I will reboot hemlock a few times to try to find out why the web servers stop
working when hemlock is away (and if I find the cause, I will fix it). All web
tools will fail while hemlock is (re-)booting; other sub-systems (like
SGE) should work normally.
Sincerely,
DaB.
--
Userpage: [[:w:de:User:DaB.]] — PGP: 2B255885
Hello,
As tomorrow is a maintenance window anyway, I will add more disk space to s2 and s5.
While the work is in progress, the databases s2 and s5 will not be available.
This will take about 1-1.5 hours, and I will do it while DaB checks the hemlock and web server interaction
at 20:00-22:00 UTC.
Cheers
nosy
Hello all,
Because of a kernel upgrade I have to reboot our linux boxes (nightshade,
yarrow and mayapple). This will happen tomorrow,
Monday, 19:05 UTC.
I will reboot the boxes one after the other; each reboot should not take more
than 10 minutes. If you use SGE (as you should), your task will either
migrate to another box or be restarted automatically. If you have files open (for
example in an editor), you should close them.
You can follow the progress at [1].
Sincerely,
DaB.
[1] https://jira.toolserver.org/browse/MNT-1268
--
Userpage: [[:w:de:User:DaB.]] — PGP: 2B255885
Hello all,
To post something non-meta for a change: I restarted mysql on z-dat-s3-a to de-swap
hyacinth. sql-s3 was unavailable for 1.5 h because the shutdown was very slow.
Sincerely,
DaB.
--
Userpage: [[:w:de:User:DaB.]] — PGP: 2B255885
Hello,
On Sunday 23 September 2012 21:01:27 DaB. wrote:
> Hello,
>
> At Sunday 23 September 2012 20:30:29 DaB. wrote:
> > Since about an hour the web servers appear to be unresponsive:
> >
> > * http://ortelius.toolserver.org/~cvn/index.html
> > * http://wolfsbane.toolserver.org/~cvn/index.html
> > * https://toolserver.org/~cvn/index.html
> >
> > All of them time out with no response.
> >
> > I can still SSH into wolfsbane and ortelius from willow, though.
>
> I will now investigate this. So far the only problem I have found is that
> hemlock is down.
I have restored web access now. As far as I can see, hemlock lost its external array
and ran out of memory around 2:30 UTC. I have no idea why this affects our
web servers. I rebooted hemlock to free the memory and restarted the web server
on ortelius and wolfsbane; the web pages are back AFAIS.
What is not working at the moment is the user-store and our backup, because
both are on the external array of hemlock. Munin, which is handled by hemlock,
is not working either. I will try to fix all this, but I guess I need nosy for
that (and in the worst case Mark in the colo).
>
> Sincerely,
> DaB.
Sincerely,
DaB.
--
Userpage: [[:w:de:User:DaB.]] — PGP: 2B255885
---------- Forwarded message ----------
Subject: Re: [Toolserver-l] z-dat-s4-a (s4-user) is down (was: Re: Reboot of
hyacinth s3/s6/s7)
Date: Wednesday 19 September 2012
From: Marlen Caemmerer <marlen.caemmerer(a)wikimedia.de>
To: Wikimedia Toolserver <toolserver-l(a)lists.wikimedia.org>
Hello,
I had a bad accident while resizing the volume for s4-user.
Unfortunately, I did not realize that s4-rr does not yet hold the s4-user
databases.
I restored the backup of the user databases into this instance, so s4-user
should be usable again.
Cheers
nosy
_______________________________________________
Toolserver-l mailing list (Toolserver-l(a)lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/toolserver-l
Posting guidelines for this list:
https://wiki.toolserver.org/view/Mailing_list_etiquette
-------------------------------------------------------------
--
Userpage: [[:w:de:User:DaB.]] — PGP: 2B255885
Hello all,
Nosy rebooted hyacinth this morning (see below). AFAIS something went wrong
with the sql partition of s4, but I have no details yet. I have to speak with
Nosy first; until then sql-s4-user is down.
sql-s4-rr is operating normally.
Sincerely,
DaB.
On Tuesday 18 September 2012 16:16:51 DaB. wrote:
> Hello,
>
> I will reboot the database server hyacinth which holds s3, s6 and s7,
> tomorrow at 6:30 UTC.
>
> Cheers
> nosy
--
Userpage: [[:w:de:User:DaB.]] — PGP: 2B255885
Hello all,
A few users contacted me about their cron tasks not running. A frequently found
problem is that the cron lines of these users look like the following:
0 0 * * * DoSomething
or
0 * * * * DoSomething
In an ideal world that would be no problem, but in the real world it CAN be a
problem. Why? Because many users have the same idea, and our submit hosts then
fail with
(CRON) CAN'T FORK (child_process): Not enough space.
Last night 41 tasks were successfully started at midnight; an unknown number
failed.
Of course we could just throw new hardware at the problem, but for most
of the day these hosts are idle.
So how to solve this problem? It's easy: spread the load. Most of the time a task
(like a bot) does not care if it is started a few minutes earlier or later. So
choose a minute that is not 0 and not a multiple of 5.
If it really does not matter to you when your task starts, then take the
position of the first letter of your user-name in the alphabet and add 2 ("dab" → "d" → 4 → 6).
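The rule of thumb above can be sketched in Python like this. It is only an illustration: the function names cron_minute, avoids_busy_minutes and cron_line are made up for this example, and the command DoSomething is the placeholder from the cron lines above. Note that applying the rule literally puts the letters c, h, m, r and w on a multiple of 5 (5, 10, 15, 20, 25), so the second helper can be used to check the result and pick a neighbouring minute by hand.

```python
import string


def cron_minute(username):
    """Alphabet position of the first letter (a=1) plus 2, e.g. "dab" -> 6."""
    first = username.strip().lower()[:1]
    if len(first) != 1 or first not in string.ascii_lowercase:
        raise ValueError("user-name must start with a letter a-z")
    return string.ascii_lowercase.index(first) + 1 + 2


def avoids_busy_minutes(minute):
    """True if the minute is neither 0 nor a multiple of 5."""
    return minute % 5 != 0


def cron_line(username, command, hour="*"):
    """Build a crontab line that runs `command` at the user's minute."""
    return f"{cron_minute(username)} {hour} * * * {command}"


if __name__ == "__main__":
    print(cron_line("dab", "DoSomething"))            # hourly
    print(cron_line("dab", "DoSomething", hour="0"))  # once a day, after midnight
```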
To avoid a misunderstanding: if your task REALLY needs to start at
minute 0 (or at midnight), do it. And of course cron tasks fail for
other reasons too, so contact me (jira bug preferred) if you have a problem.
Sincerely,
DaB.
--
Userpage: [[:w:de:User:DaB.]] — PGP: 2B255885
Hello all,
As most of you know, sql.toolserver.org (short: sql) is the place where you
should store your user database if you do not need to join it with wmf databases. sql
points to adenia at the moment. Adenia is quite busy from time to time, and
sooner or later it will become overloaded. My plan is then not to just buy a bigger
box, but to buy another box and split sql (so some databases will be on
the old box and some will be on the new box).
The problem is that this is not possible with our current setup, where we have
only sql.toolserver.org; we cannot configure it as round-robin, because 50%
of the time you would miss your database.
The solution is simple, but it needs a little help from your side. I created
new DNS names in the form of "sql-user-X", where X is a letter ("sql-user-a"
for example). The idea is that you no longer use "sql", but "sql-user-X",
where "X" is the first letter of your user-name (so the user "erik" uses
"sql-user-e" and the user "snowolf" uses "sql-user-s"). If you all do this, it will
be simple for the roots to move user databases away from adenia to another
server (for example, we could move the databases from "u_m*" to "u_z*" to the new
box and nothing would break).
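As a rough sketch of the naming scheme just described, a tool could derive its database host from the user-name instead of hard-coding "sql". The function name user_db_host is hypothetical, and the sketch returns only the short name given in this mail, without assuming a domain suffix.

```python
import string


def user_db_host(username):
    """Return the short host name "sql-user-X" for the given user-name."""
    first = username.strip().lower()[:1]
    if len(first) != 1 or first not in string.ascii_lowercase:
        raise ValueError("user-name must start with a letter a-z")
    return f"sql-user-{first}"


if __name__ == "__main__":
    for name in ("erik", "snowolf"):
        print(name, "->", user_db_host(name))
```

Deriving the name in one place means a tool keeps working unchanged if the roots later point some of the sql-user-X names at a different box.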
I know that many of you need some time to update your tools. That is the reason
I am announcing this now, while you have plenty of time to update your stuff. At the
moment sql-user-X points to adenia, so nothing will break if you update now.
Please send questions to the mailing list. I will update the wiki pages soon.
Sincerely,
DaB.
--
Userpage: [[:w:de:User:DaB.]] — PGP: 2B255885