[Labs-l] Department of the reporting of statuses

Mon Apr 22 16:55:47 UTC 2013

<optimism>What's the worst that could happen?</optimism>

The big news this week is that NFS is replacing Gluster as the provider
of shared storage for the tools project.

In doing so, we're going to gain a great deal of extra stability, and
quite a bit of performance.  In addition, there is now an automatic
timetravel snapshot feature that lets users look at the filesystem as
it was at earlier points in time: hourly snapshots for the past three
hours, daily snapshots for the past three days, and two weekly
snapshots (taken on Sundays).
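
To make the retention schedule concrete, here is a minimal Python
sketch of which snapshots one might expect to find at a given moment.
The function name and the exact snapshot times (on the hour, and at
midnight UTC for the daily and weekly ones) are assumptions made for
illustration; only the counts and the "Sundays" detail come from the
description above.

```python
from datetime import datetime, timedelta, timezone


def expected_snapshots(now):
    """Return the snapshot times one would expect to exist at `now`.

    Hypothetical helper: the real snapshot times are not stated, so
    this assumes hourly snapshots on the hour and daily/weekly
    snapshots at midnight UTC.
    """
    # Three hourly snapshots (assumed to be taken on the hour).
    top_of_hour = now.replace(minute=0, second=0, microsecond=0)
    hourly = [top_of_hour - timedelta(hours=i) for i in range(3)]

    # Three daily snapshots (assumed to be taken at midnight UTC).
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    daily = [midnight - timedelta(days=i) for i in range(3)]

    # Two weekly snapshots, on Sundays (assumed midnight UTC).
    # datetime.weekday() returns 0 for Monday and 6 for Sunday.
    days_since_sunday = (midnight.weekday() + 1) % 7
    last_sunday = midnight - timedelta(days=days_since_sunday)
    weekly = [last_sunday - timedelta(weeks=i) for i in range(2)]

    return {"hourly": hourly, "daily": daily, "weekly": weekly}


if __name__ == "__main__":
    now = datetime(2013, 4, 22, 16, 55, 47, tzinfo=timezone.utc)
    for kind, times in expected_snapshots(now).items():
        print(kind, [t.isoformat() for t in times])
```

Under these assumptions, at most eight distinct points in time are
available at once, and the oldest one is at most two weeks back.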

That said, NFS is robust but does not scale as well as we would like in
the longer term; we will keep investigating clustered storage solutions
for the future, with an eye to returning to clustered storage once we
find a solution that is (a) no less robust than NFS, (b) no less
reliable than the current storage and (c) at least as good from a
performance standpoint.

Technically, this storage is already available to all projects, but the
configuration necessary to /replace/ gluster with it would generally
require a per-project outage.  (This involves copying the contents of
the previous store to the new one, and substituting one mount point for
another -- something running processes generally cannot cope with.)

In practice, the tools project will be the first transition "victim",
with an outage tomorrow to make the switchover.  Sadly, the process is
necessarily disruptive and currently running processes will be affected
(see below for details).

As secondary news, thanks to the unwavering efforts of Asher, the
database replication is now at a point where we are actually selecting
what, exactly, is going to be replicated.  Things are progressing there
at a fair clip and we're still sticking pretty close to schedule.  More
news on that topic next week.

=== Planned outage ===

When: Tuesday April 23 at 18:00 UTC
Duration: 1 hour

Impact:

* Jobs running on the grid engine will be stopped, and execution nodes
will be temporarily disabled;
* The login server will be restarted during the window, ending active
sessions;
* The web service will be unavailable during the maintenance window; and
* Running processes not scheduled through the grid engine will be killed.

Recovery plan:

In case of an unplanned failure during the maintenance window, the
configuration will be rolled back to the current version (that is, the
gluster-based project storage will remain in place) and a new
window will be planned after a postmortem.

-- Marc