I am far from an expert as well, but here are my thoughts regarding the first point (just brainstorming).
*Due to networking problems, server A cannot communicate with server B. A has priority for running a task. Since they cannot communicate, B never learns that A completed the task. So B runs it too. => How much of a problem is it if a task runs multiple times?*
In order for it to be a problem that a task ends up running multiple times, there must be some sort of communication between the servers involved. Only once Server A says "I'm running the job" or "I'm done", and Server B acknowledges, do we have a known duplicate task. If Server B has not finished the job, it aborts. If Server B has finished the job, an "I'm done" message from Server A/B should result in the changes being propagated by *either* Server A or Server B, mutually exclusively.
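To make that concrete, here is a minimal sketch of the claim-and-acknowledge idea in Python. It assumes some shared store that both servers can reach while the network is healthy; the store, its API, and every name below are hypothetical, invented just for illustration:

    class ClaimStore:
        # Stand-in for a shared service reachable by every server. In
        # reality this is exactly the communication channel that a
        # network partition takes away.
        def __init__(self):
            self._claims = {}

        def try_claim(self, job_id, run_at, server):
            # Record "server owns this run"; refuse if another server
            # already claimed the same scheduled run.
            key = (job_id, run_at)
            if key in self._claims:
                return False
            self._claims[key] = server
            return True

    def maybe_run(store, job_id, run_at, server, task):
        # Run only if we claimed first; a failed claim is the "known
        # duplicate" case above, so we abort.
        if store.try_claim(job_id, run_at, server):
            task()
        else:
            print("%s: run of %s already claimed, aborting" % (server, job_id))

    store = ClaimStore()
    maybe_run(store, "nightly-report", "04:00", "A", lambda: print("A runs it"))
    maybe_run(store, "nightly-report", "04:00", "B", lambda: print("never printed"))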
The key is that the final set of changes brought about by a particular server should only be synced after completion, and syncing can only happen after successful network communication (otherwise, how could the changes propagate to anyone?). This seems to call for a third server acting as a sort of gatekeeper.
In the worst case, Server A and Server B have completed the identical task in isolation, and nothing needs to change. One of them will not propagate its effects (i.e., generation of a file, sending of an email, compilation of source code) past the gatekeeper server, which will subsequently release the effects to those who require them.
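As a sketch of what that gatekeeper might look like (again, every name here is invented for illustration, not a design we've agreed on):

    class Gatekeeper:
        # Hypothetical third server: releases the staged effects of each
        # run at most once, no matter how many servers finished the task
        # in isolation.
        def __init__(self):
            self._released = set()

        def release(self, job_id, run_at, effects):
            key = (job_id, run_at)
            if key in self._released:
                return False  # the identical work already went out
            self._released.add(key)
            effects()  # e.g. actually publish the file or send the email
            return True

    gk = Gatekeeper()
    gk.release("nightly-report", "04:00", lambda: print("A's effects released"))
    gk.release("nightly-report", "04:00", lambda: print("never printed"))

Note that a single gatekeeper reintroduces a single point of failure, which is part of what makes the question below interesting.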
This, however, raises interesting questions about how to determine and communicate which changes need to be propagated by any given cronjob.
Cheers,
John
On Mon, Feb 3, 2014 at 4:01 AM, <foa-cron-request@lists.wikimedia.org> wrote:
Date: Mon, 3 Feb 2014 02:15:30 -0500
From: Gregory Manis <glm79@cornell.edu>
To: foa-cron@lists.wikimedia.org
Subject: [Foa-cron] Fault Tolerant vs Live
A couple of days ago I (GLM) edited a few decisions to be made into our Wiki page, and JT put in some ideas. I went to edit in a response, but then realized that communicating via Wiki edits seems like a tremendously silly idea when we have a mailing list.
The current requirements of the project include distribution of execution and the guarantee that, as long as at least one server is up, a task will be run.
I'm far from an expert in terms of this (or anything, for that matter), but there are a few (perhaps naive) concerns that I have with the requirements. From my understanding of how networking works, there's no way to guarantee that a node is down. So I'm worried about the following scenarios:
Due to networking problems, server A cannot communicate with server B. A
has priority for running a task. Since they cannot communicate, B never
learns that A completed the task. So B runs it too. => How much of a
problem is it if a task runs multiple times?
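(One answer that occurs to me: if a task is idempotent, a duplicate run is harmless. A sketch of the pattern in Python, with made-up names, writing output atomically so a second run just rewrites the same file:)

    import os
    import tempfile

    def write_report(path, content):
        # Same input -> same bytes at the same path, so a second run by
        # server B leaves the system exactly as server A left it.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        with os.fdopen(fd, "w") as f:
            f.write(content)
        os.replace(tmp, path)  # atomic rename on POSIX

    write_report("report.txt", "totals for 2014-02-03\n")
    write_report("report.txt", "totals for 2014-02-03\n")  # duplicate, no harm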
A large number of servers are down, and the one that is up has low priority
for running a task. There's necessarily a delayed execution because the
running server has to wait for all the others to time out. This can be
somewhat mitigated by keeping a list of nodes that are up, but that list can go out of date, resulting in the previous problem. => How
delayed can running time be? The response on the Wiki mentioned
administrators adjusting times. Note that this likely involves making the
crontabs non-standard.
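(To put rough numbers on it: a server at priority p has to wait for every higher-priority server to time out first, so the worst-case delay grows linearly. The 30-second timeout below is a made-up figure, not anything we've decided.)

    TIMEOUT_S = 30  # assumed per-server timeout, purely illustrative

    def worst_case_delay(priority):
        # Priority 0 runs on time; priority p waits for p timeouts.
        return priority * TIMEOUT_S

    for p in range(5):
        print("priority %d: up to %d s late" % (p, worst_case_delay(p)))
    # With ten servers, the lowest-priority one runs up to 270 s late.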
An updated crontab is created and propagates through to the servers. One
server is completely disconnected though, so it doesn't receive the new
table and keeps running old commands. => How bad is it if a deleted job
runs? I'm actually not that worried about this; I think it's reasonable to
expect the sysadmin to make sure all servers get the new cron table.
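(One cheap aid, detection only, since a fully disconnected server can't be told anything: have servers exchange a fingerprint of their crontab and flag mismatches. A sketch, with made-up crontab contents:)

    import hashlib

    def crontab_fingerprint(crontab_text):
        # A short hash servers can gossip; a mismatch means someone is
        # still running an old table. This detects staleness but can't
        # fix it, so the sysadmin check above still applies.
        return hashlib.sha256(crontab_text.encode("utf-8")).hexdigest()[:12]

    old = "0 4 * * * /usr/local/bin/nightly-report\n"
    new = old + "30 4 * * * /usr/local/bin/cleanup\n"
    print(crontab_fingerprint(old) == crontab_fingerprint(new))  # False: stale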
I suppose the point I'm trying to make is that if you want it to be fault
tolerant and live down to a single server, it seems like you can run into
duplication or late tasks if the network isn't perfect. Is there anything
I'm missing? I'm certainly not discounting the possibility that there's a
solution (whether clever or simple) to these problems; I just don't see it.
Thanks!
-Greg
P.S. Sorry if this email is totally incoherent; it's 2:15 AM right now.