[Labs-l] Disruptive Tools NFS maintenance on 11/2/2016

Madhumitha Viswanathan mviswanathan at wikimedia.org
Fri Oct 21 20:10:47 UTC 2016


On Fri, Oct 21, 2016 at 12:14 PM, Martin Domdey <animalia at gmx.net> wrote:

> "(so tools that rely solely on the database / API *should* be fine, if
> they aren't using their home directories for anything write) "
>
> Do you mean that I can use my own tool server instance and work with the
> API, the database, and /shared/tcl/bin/tclsh8.7?
> So can I get a tool server from you?
>

I'm not sure I understand, but any existing tools that talk only to the
database/APIs should be fine. If you would like to make a new tool for a
new project, you can request it as usual using the Tools Access request
page.
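
For tool authors wondering whether their tool falls into the "should be
fine" group: a tool can check at runtime whether its directory accepts
writes and skip file output when it does not. What follows is only a
minimal sketch, not something the Labs team provides. The names
`is_writable` and `write_log_line` are hypothetical, and it assumes a
read-only NFS share surfaces as an ordinary OSError (for example EROFS)
when a file is created.

```python
import errno
import os
import sys
import tempfile


def is_writable(directory):
    """Return True if a temporary file can be created in `directory`.

    Hypothetical helper: it probes by actually creating (and immediately
    discarding) a temporary file, because permission bits alone do not
    reveal a read-only mount.
    """
    try:
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError as exc:
        # EROFS: read-only file system; EACCES/EPERM: no permission;
        # ENOENT: the directory does not exist.
        if exc.errno in (errno.EROFS, errno.EACCES, errno.EPERM, errno.ENOENT):
            return False
        raise


def write_log_line(directory, filename, text):
    """Append `text` to a log file, or fall back to stderr if not writable.

    Returns True if the line was written to the file, False otherwise.
    """
    if is_writable(directory):
        with open(os.path.join(directory, filename), "a", encoding="utf-8") as fh:
            fh.write(text + "\n")
        return True
    # Assumption: losing file logs for the maintenance window is acceptable.
    print(text, file=sys.stderr)
    return False
```

A bot could call `write_log_line(os.path.expanduser("~"), "bot.log", "...")`
in place of a direct file write, so that it keeps talking to the
database/API while the share is read-only.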

>
> Martin ...
>
>
> *Sent:* Friday, 21 October 2016 at 20:56
> *From:* "Madhumitha Viswanathan" <mviswanathan at wikimedia.org>
> *To:* "Wikimedia Labs" <labs-l at lists.wikimedia.org>
> *Subject:* Re: [Labs-l] Disruptive Tools NFS maintenance on 11/2/2016
> Hi,
>
> On Fri, Oct 21, 2016 at 11:29 AM, Martin Domdey <animalia at gmx.net> wrote:
>>
>> Why do you need 48 hours for that?
>>
>> I submit a great many cron jobs every day to deliver content and
>> services to a lot of users on dewiki and other wikis. An outage window of
>> 48 hours (!) is simply not workable.
>> Please suggest how I can keep working during the outage window, or
>> at least a crontab that can handle the data and files on tools.taxonbot.
>> Perhaps you could set up NFS redundancy for at least that period.
>>
>>
> As mentioned, it may take up to 48 hours for the data migration to
> complete (hopefully less), but we are dealing with a complex system with
> a nontrivial amount of data. The transition *is* to a redundant NFS server
> setup - we need a long maintenance window to make that happen. A full copy
> of tools data to a new server takes many days (~4-20!) depending on various
> factors, and we're doing successively smaller syncs to make the final
> migration period as small as possible. However, it's still not something we
> can entirely control - the maps project was migrated earlier this week, and
> the final sync still took about a day (even though maps has less data). So
> the 48h is a conservative estimate that allows us to do the migration in an
> orderly fashion.
>
> To be more explicit, here is a (non-exhaustive) list of things we expect
> to not work for the duration of the transition (which is up to 48h, but
> hopefully less):
>
>     1. Submitting new jobs to the grid
>     2. Restarting failing jobs on the grid
>     3. Deploying new code / writing anything to your tool / home
>        directories
>     4. Any bots / webservices that require write access to their home
>        directories to work (so tools that rely solely on the database / API
>        *should* be fine, if they aren't writing anything to their home
>        directories)
>     5. New cron jobs (because of #1)
>     6. New tool creation
>
> Any previously submitted jobs that aren't writing to NFS (provided they
> don't die), will continue to run. Crons submit jobs to the grid, and
> without read-write NFS, job scheduling will not work. We apologize for the
> service interruption, but it is required for Tools to be stable and
> reliable in the long term.
>
> We're working on a detailed checklist for the transition, and will email
> it to the list once we have it available.
>
>
>> Thank you
>> Martin ...
>>
>>
>>
>> *Sent:* Friday, 21 October 2016 at 20:00
>> *From:* "Madhumitha Viswanathan" <mviswanathan at wikimedia.org>
>> *To:* "Wikimedia Labs" <labs-l at lists.wikimedia.org>,
>> labs-announce at lists.wikimedia.org
>> *Subject:* [Labs-l] Disruptive Tools NFS maintenance on 11/2/2016
>> As the next step in our storage redundancy and reliability efforts for
>> Labs, we have a significant migration coming up on 11/2 starting 08:00
>> PST (15:00 UTC) involving the tools NFS share. The maintenance window can be
>> up to 48h long, and will affect most running tools. At the end of the
>> migration, everything (except transient jobs) should ideally be working the
>> same way as it was before the migration, but better.
>>
>> Here's what to expect during the maintenance window:
>>
>> * The tools NFS share (/data/project and /home) will be read-only for the
>> duration of the maintenance, so no new data or logs will get written to it.
>> * New jobs cannot be submitted for the whole maintenance window - this
>> means submitting jobs through cron or tools-mail will not function,
>> although tools-mail can continue to send emails.
>> * Current jobs might keep running, but won't get rescheduled if they die.
>> If they do not die and aren't writing to NFS they should be fine.
>> * All exec nodes will be depooled, rebooted, and repooled; jobs that
>> are not rescheduled automatically will have died and will need manual restarts.
>>
>> Do let us know if you have any questions or concerns on the lists or on
>> #wikimedia-labs.
>>
>> --
>> Madhumitha Viswanathan
>> Operations Engineer, Wikimedia Labs
>> _______________________________________________ Labs-l mailing list
>> Labs-l at lists.wikimedia.org https://lists.wikimedia.org/
>> mailman/listinfo/labs-l
>>
>>
>
>
>
> --
> --Madhu :)
>
>


-- 
--Madhu :)