[Labs-l] Disruptive Tools NFS maintenance on 11/2/2016

Madhumitha Viswanathan mviswanathan at wikimedia.org
Fri Oct 21 18:56:54 UTC 2016


Hi,

On Fri, Oct 21, 2016 at 11:29 AM, Martin Domdey <animalia at gmx.net> wrote:

> Why do you need 48 hours for that?
>
> I submit a great many cron jobs throughout the day to deliver content and
> services to a lot of users on dewiki and other wikis. An outage window of
> 48 hours (!) is simply not possible.
> Please suggest how I can keep working during the outage window, or at
> least a crontab that can handle the data and files on tools.taxonbot.
> Perhaps you could set up NFS redundancy for at least that time.
>
>
As mentioned, the data migration may take up to 48 hours to complete -
hopefully less, but we are dealing with a complex system with a nontrivial
amount of data. The transition *is* to a redundant NFS server setup, and we
need a long maintenance window to make that happen. A full copy of the
tools data to a new server takes many days (~4-20!) depending on various
factors, so we're doing successively smaller syncs to make the final
migration period as short as possible. However, it's still not something we
can entirely control - the maps project was migrated earlier this week, and
the final sync still took about a day (even though maps has less data). So
the 48h is a conservative estimate that allows us to do the migration in an
orderly fashion.
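
To illustrate why successively smaller syncs shorten the final read-only
period, here is a rough Python model. Every number in it (data size, copy
throughput, rate of change) is hypothetical and chosen only for
illustration; none are measurements from the tools NFS servers. The model
also ignores per-pass overhead such as scanning the file tree, which is one
reason a real final sync can take much longer than this suggests.

```python
def sync_passes(total_tb, throughput_tb_per_day, churn_tb_per_day, passes):
    """Return the duration in days of each successive sync pass.

    The first pass copies everything; each later pass copies only the data
    that changed while the previous pass was running.
    """
    durations = []
    to_copy = total_tb
    for _ in range(passes):
        days = to_copy / throughput_tb_per_day
        durations.append(days)
        # Data modified during this pass has to be copied by the next one.
        to_copy = churn_tb_per_day * days
    return durations


# Hypothetical figures, for illustration only.
example = sync_passes(total_tb=8.0, throughput_tb_per_day=1.0,
                      churn_tb_per_day=0.25, passes=4)
for number, days in enumerate(example, start=1):
    print(f"pass {number}: {days:.3f} days")
```

In this toy model each pass is shorter than the one before it, so only the
last and smallest pass needs to run while the share is read-only.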

To be more explicit, here is a (non-exhaustive) list of things we expect
not to work for the duration of the transition (which is up to 48h, but
hopefully less):

    1. Submitting new jobs to the grid
    2. Restarting failing jobs on the grid
    3. Deploying new code / writing anything on your tool / home directories
    4. Any bots / webservices that require write access to their home
directories to work (so tools that rely solely on the database / API
*should* be fine, as long as they don't write anything to their home
directories)
    5. New cron jobs (because of #1)
    6. New tool creation

Any previously submitted jobs that aren't writing to NFS will continue to
run (provided they don't die). Crons submit jobs to the grid, and without
read-write NFS, job scheduling will not work. We apologize for the service
interruption, but it is required for Tools to be stable and reliable in the
long term.
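
For tools that normally write to their home directory, one option is to
check whether the directory is writable before attempting any writes, and
to skip or defer them while the share is read-only. The following is only a
minimal Python sketch of that idea; the helper name `can_write` is
hypothetical, and how a given tool should behave while writes are
unavailable is up to its maintainer.

```python
import os
import tempfile


def can_write(directory):
    """Return True if a file can actually be created in `directory`."""
    try:
        # Creating (and automatically deleting) a temporary file is a direct
        # test; it raises OSError on a read-only mount or a missing directory.
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    home = os.path.expanduser("~")
    if can_write(home):
        print("home directory is writable")
    else:
        print("home directory is not writable; skipping writes")
```

Attempting a real write is used here instead of inspecting permission bits,
since permissions alone do not reveal that a filesystem is mounted
read-only.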

We're working on a detailed checklist for the transition, and will email it
to the list once we have it available.

> Thank you
> Martin ...
>
>
>
> *Sent:* Friday, 21 October 2016 at 20:00
> *From:* "Madhumitha Viswanathan" <mviswanathan at wikimedia.org>
> *To:* "Wikimedia Labs" <labs-l at lists.wikimedia.org>,
> labs-announce at lists.wikimedia.org
> *Subject:* [Labs-l] Disruptive Tools NFS maintenance on 11/2/2016
> As the next step in our storage redundancy and reliability efforts for
> Labs, we have a significant migration coming up on 11/2 starting 08:00
> PDT (15:00 UTC) involving the tools NFS share. The maintenance window can
> be up to 48h long, and will affect most running tools. At the end of the
> migration, everything (except transient jobs) should ideally be working
> the same way as it was before the migration, but better.
>
> Here's what to expect during the maintenance window:
>
> * The tools NFS share (/data/project and /home) will be read-only for the
> duration of the maintenance, so no new data or logs will get written to it.
> * New jobs cannot be submitted for the whole maintenance window - this
> means submitting jobs through cron or tools-mail will not function,
> although tools-mail can continue to send emails.
> * Current jobs might keep running, but won't get rescheduled if they die.
> If they do not die and aren't writing to NFS they should be fine.
> * All exec nodes will get depooled, rebooted and repooled and jobs that
> don't get rescheduled automatically will have died and need manual restarts.
>
> Do let us know if you have any questions or concerns on the lists or on
> #wikimedia-labs.
>
> --
> Madhumitha Viswanathan
> Operations Engineer, Wikimedia Labs
> _______________________________________________
> Labs-l mailing list
> Labs-l at lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/labs-l
>
>


-- 
--Madhu :)