I (GLM) edited in a few decisions to be made a couple days ago on our Wiki
page, and JT put in some ideas. I went to edit in a response, but then
realized that communicating via Wiki edits seems like a tremendously silly
idea when we have a mailing list.
The current requirements of the project include distribution of execution
and the guarantee that if at least one server is up, a task will be run.
I'm far from an expert in terms of this (or anything for that matter), but
there are a few (perhaps naive) concerns that I have with the requirements.
From my understanding of how networking works, there's no way to reliably
tell that a node is down rather than merely unreachable. So I'm worried
about the following scenarios:
Due to networking problems, server A cannot communicate with server B. A
has priority for running a task. Since they cannot communicate, B never
learns that A completed the task. So B runs it too. => How much of a
problem is it if a task runs multiple times?
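For what it's worth, one common answer is to accept at-least-once execution and make tasks idempotent, e.g. by recording completed runs under a shared key. Here's a hypothetical sketch (the dict stands in for whatever shared state we'd actually use; during a real partition A and B would each see their own copy, which is exactly why duplicates can still slip through):

```python
# Sketch: at-least-once execution with an idempotency marker.
# `store` is a plain dict standing in for shared state.

def run_once(store, task_id, action):
    """Run `action` only if `task_id` hasn't already been marked done."""
    if store.get(task_id) == "done":
        return False  # another server already ran it
    action()
    store[task_id] = "done"
    return True

runs = []
store = {}
# Server A runs the task; server B then sees the marker and skips it.
assert run_once(store, "backup-2014-01-01", lambda: runs.append("A")) is True
assert run_once(store, "backup-2014-01-01", lambda: runs.append("B")) is False
assert runs == ["A"]
```

If the stores are partitioned (each server has its own dict), both calls return True — so this only reduces duplicates, it doesn't eliminate them.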
A large number of servers are down, and the one that is up has low priority
for running a task. There's necessarily a delayed execution because the
running server has to wait for all the others to time out. This can be
somewhat mitigated by keeping a list of nodes that are up, but that can
lead to an out of date list resulting in the previous problem. => How
delayed can running time be? The response on the Wiki mentioned
administrators adjusting times. Note that this likely involves making the
crontabs non-standard.
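To put a rough number on that delay: if each server waits for every higher-priority server to time out before claiming the task, the worst-case wait grows linearly with the server's rank. A back-of-the-envelope sketch (the timeout value and the sequential-waiting scheme are my assumptions, not project decisions):

```python
# Sketch: worst-case start delay under sequential priority failover.
# Assumes the server at rank r waits for each of the r higher-priority
# servers to time out, one after another, before claiming the task.

def worst_case_delay(rank, timeout_s):
    """Seconds the server at `rank` (0 = highest priority) may wait."""
    return rank * timeout_s

# With a 30 s timeout and nine servers down, the tenth server
# (rank 9) starts the task 270 s (4.5 minutes) late.
assert worst_case_delay(0, 30) == 0
assert worst_case_delay(9, 30) == 270
```

That's the scenario where the "list of nodes that are up" would help: a fresh list lets the survivor skip straight to rank 0, at the cost of the staleness problem above.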
An updated crontab is created and propagates through to the servers. One
server is completely disconnected though, so it doesn't receive the new
table and keeps running old commands. => How bad is it if a deleted job
runs? I'm actually not that worried about this; I think it's reasonable to
expect the sysadmin to make sure all servers get the new cron table.
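One cheap mitigation would be to version the crontab and have each server ignore any table older than the newest one it has ever seen. A hypothetical sketch (again, nothing here is an actual project decision — a fully disconnected server still can't know it's stale; this only limits the damage once any newer table arrives):

```python
# Sketch: a server keeps only the highest-versioned crontab it has
# seen, so deleted jobs stop running as soon as a newer table lands,
# and a stale table arriving late is ignored.

class CrontabView:
    def __init__(self):
        self.version = 0
        self.jobs = set()

    def update(self, version, jobs):
        # Accept only strictly newer tables.
        if version > self.version:
            self.version = version
            self.jobs = set(jobs)

    def should_run(self, job):
        return job in self.jobs

view = CrontabView()
view.update(1, ["old-job", "shared-job"])
view.update(2, ["shared-job", "new-job"])
assert not view.should_run("old-job")   # deleted job no longer runs
assert view.should_run("new-job")
view.update(1, ["old-job"])             # late, stale table is ignored
assert view.version == 2
```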
I suppose the point I'm trying to make is that if you want it to be fault
tolerant and live down to a single server, it seems like you can run into
duplication or late tasks if the network isn't perfect. Is there anything
I'm missing? I'm certainly not discounting the possibility that there's a
solution (whether clever or simple) to these problems; I just don't see it.
Thanks!
-Greg
P.S. Sorry if this email is totally incoherent; it's 2:15 AM right now
Hello everyone!
Since some of us need to get the ball rolling sooner rather than later,
I have written up a long list of tasks that need to be done at:
https://www.mediawiki.org/wiki/Facebook_Open_Academy/Cron
where those of you who have already begun your academic years can
contribute. Of particular importance at this stage of the project are
the research aspects.
We need to collectively research what already exists in the direction
of a distributed cron alternative; that means:
* Find what exists in distributed scheduling in general
* Find what libraries exist for Python for task distribution, peer
server management, etc., which can be leveraged for our project.
Everyone needs to start collecting that information and taking notes;
please survey the field of relevant software or papers, and enumerate
them on the wiki page noted above. If you find something promising,
note it down alongside your evaluation. When you start examining a
particular bit of software or paper, note it down on the wiki page so
that efforts are not duplicated needlessly.
Pay attention to research papers in particular; the problem of reliable
distributed computing is well known and heavily researched -- there is
lots of knowledge to tap into there!
Finding something that turns out to be useless is also valuable
information; your research notes should include /why/ it's unsuitable
and what lessons we can learn from it.
On that same wiki page is a list of (preliminary) architectural
decisions that need to be taken. Many will reveal themselves as the
design is being refined. Now is the /best/ time to start making those
decisions in light of the information being found in the research section.
-- Marc