[Labs-l] A (21) day in the Labs

Marc A. Pelletier marc at uberbox.org
Mon Apr 8 14:30:53 UTC 2013


<Tron>Greetings, Programs!</Tron>

So, not many updates for a while, as things have been progressing at a
fair clip on the "oh, my god boring gruntwork" front.

The biggest news is the addition of Petr Bena to the tools project
sysadmin team as its first volunteer.  Petr has been very involved in
the setup and administration of the Tool Labs' predecessor projects, and
will continue to steer the bots project, where the rules are a little
more relaxed to facilitate more experimental development.

He's also joining me on the tools project proper, to help provide
support to maintainers over a wider range of times, and to increase
availability of sysadmins.  You can find him hanging around
#wikimedia-labs, often at times when I am not available.

There is some documentation-in-progress that gives a lot of information
on how to set up your tools on the Labs architecture at:

https://www.mediawiki.org/wiki/Wikimedia_Labs/Tool_Labs/Help

Please don't hesitate to comment if you see missing information, or if
parts of it are less clear than ideal.

On other fronts, the wikitech management interface is now in place
for self-service creation of tool accounts by Labs users; this requires
moving already-existing tool accounts to the new scheme, and a brief
outage for that purpose later this week (see note below).

Experiments with a bulletproof replacement for gluster are well under
way, with NFS from a highly redundant server as the currently favored
option.  With a bit of luck, I'll use the opportunity given by the
outage for the tool account switchover to move the shared tools
filesystem to NFS as a trial run.

The database replication is also well on its way; you can find the
current roadmap at:

https://wikitech.wikimedia.org/wiki/Tool_Labs/Database_plan

=== Planned outage ===

In order to move the extant tool accounts to the new, final scheme, and
(progress permitting) move the shared filesystems to a new storage
server, there will be a brief outage of the Tool Labs infrastructure
this Thursday, April 11, starting at 16:00 UTC.  The outage is expected
to last 20 minutes, during which service will be intermittently
unavailable.

Announcements will be sent by email, on IRC and on the servers 30
minutes before the start of maintenance, at its start, and upon completion.
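
For anyone who wants to see how those times land in their own timezone,
here is a minimal sketch that works them out from the figures above
(16:00 UTC start, 30 minutes of notice, 20 minutes expected duration).
The function and constant names are made up for illustration, and the
completion time is only the expected one; the real announcement goes out
whenever the work is actually done.

```python
from datetime import datetime, timedelta, timezone

# Figures taken from the announcement; the names are illustrative only.
WINDOW_START = datetime(2013, 4, 11, 16, 0, tzinfo=timezone.utc)
EXPECTED_DURATION = timedelta(minutes=20)  # an estimate, not a guarantee
NOTICE_LEAD = timedelta(minutes=30)


def maintenance_schedule(start=WINDOW_START,
                         duration=EXPECTED_DURATION,
                         lead=NOTICE_LEAD):
    """Return (label, UTC datetime) pairs for the three announcements."""
    return [
        ("advance notice", start - lead),
        ("maintenance start", start),
        ("expected completion", start + duration),
    ]


if __name__ == "__main__":
    for label, when in maintenance_schedule():
        # astimezone() with no argument converts to the system's local zone.
        local = when.astimezone()
        print(f"{label:20} {when:%Y-%m-%d %H:%M} UTC  ({local:%H:%M %Z} local)")
```

If you have a job that should not be restarted automatically, the
"advance notice" time is the latest point at which you would want to
have stopped it by hand.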


Impact:

* Jobs running on the grid engine will be stopped, then restarted
automatically at the end of the maintenance window.  If you are running
a job that cannot or should not be restarted automatically without
intervention from its maintainers, please make certain that it has been
stopped before the start of the maintenance window;
* The login server will be restarted during the window, ending active
sessions;
* The web service will be intermittently unavailable; and
* Running processes not scheduled through the grid engine will be killed.

Recovery plan:

In case of unplanned failure during the maintenance window,
configuration will be rolled back to the current version and a new
window will be planned after postmortem.  Disruption of services will
take place as noted and an announcement will be sent.


-- Marc


