[Labs-l] Fwd: [Ops] Evaluation of clustering solutions (continued)

Yuvi Panda yuvipanda at gmail.com
Tue Aug 25 11:11:24 UTC 2015


Just an FYI. Also remember, no changes to gridengine from all of this -
it'll merely be an alternative way to run webservices to begin with,
and any changes we make will be fully transparent to tool users.


---------- Forwarded message ----------
From: Giuseppe Lavagetto <glavagetto at wikimedia.org>
Date: Tue, Aug 25, 2015 at 1:05 PM
Subject: [Ops] Evaluation of clustering solutions (continued)
To: Operations Engineers <ops at lists.wikimedia.org>, Wikimedia
developers <wikitech-l at lists.wikimedia.org>, Development and
Operations Engineers <engineering at lists.wikimedia.org>


Hi all,

as previously announced, we've been evaluating a "clustering solution"
for use as an alternative to GridEngine for toollabs:

https://lists.wikimedia.org/pipermail/wikitech-l/2015-August/082853.html

Our goal is to find a suitable, modern, stable tool to run not only
toollabs webservices but also, in the longer term, our microservices
in production: a clustered environment that will let us improve the
availability of individual services and scale applications more
easily, further reducing friction and direct ops involvement in the
day-to-day setup and deployment of services.

Our evaluation of the available solutions is ongoing, and while we're
mostly done filling up an "evaluation spreadsheet"
(https://docs.google.com/spreadsheets/d/1YkVsd8Y5wBn9fvwVQmp9Sf8K9DZCqmyJ-ew-PAOb4R4/edit?usp=sharing),
we welcome and encourage further involvement and suggestions. You
can provide these easily on the tracking ticket for the evaluation,
https://phabricator.wikimedia.org/T106475

We have already received some interesting feedback, and we look forward
to incorporating more!

We are considering two solutions: Mesosphere's Marathon (which is
based on Mesos), https://mesosphere.github.io/marathon/, and Google's
Kubernetes, https://kubernetes.io.

Let us briefly summarize our findings so far:

MESOS/MARATHON

    Pros:
        - Mesos is stable and battle-tested, although Marathon is
quite young and mostly used in Mesosphere's commercial offering
        - Supports overcommitting resources (which is important in
toollabs, probably less so in production)
        - Has a nice, clean API and is fully distributed with no potential SPOFs
        - Chronos is another framework that can run on Mesos and is a
great distributed cron

    Cons:
        - Multitenancy story is non-existent: it was not designed to
be a public PaaS offering. This is an issue even in production if we
want to grant independence to single teams.
        - Container support seems experimental at best (but is getting
better in newer versions).
        - Adoption of Marathon seems limited and the community is not
very lively.
        - Discovery/scaling logic is somewhat limited

KUBERNETES

    Pros:
        - The design seems to be very well thought out, based off of
experiences running Google's internal Borg system (see
http://research.google.com/pubs/pub43438.html for details of Google's
Borg clustering system).
        - A pretty refined security model is already implemented, so
that single users/teams could be given access to individual namespaces
and act independently
        - The community is very lively, and adoption is gaining
momentum: Kubernetes is the default way to deploy apps on Google
Compute Engine, it's used by Red Hat for its own cloud solution (and
they contribute patches to it), and it has a clear roadmap to overcome
most of its limitations
        - Container support is native and it's technology-agnostic,
allowing (for now) Docker and Rkt containers to be used
        - The API is quite nice
        - Documentation is decently complete
        - Google engineers are actively supporting us in evaluating its usage

    Cons:
        - The master node is not highly available, although our
cluster survived a pretty serious outage in labs that froze the master
and wiped out one worker
        - No overcommitting allowed, although it will be possible to
mimic it with QoS (coming in the next version)
        - The ability to schedule one-off jobs is offered, but there
is no distributed cron facility
        - In general it's a younger project with some outstanding bugs
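
To make the overcommitting point in both lists a bit more concrete,
here is a minimal Python sketch of what "overcommitting" means for a
scheduler. It is only an illustration under our own assumptions: the
Node class, the overcommit_ratio parameter and all the numbers are
hypothetical and do not correspond to the actual API or behaviour of
Mesos, Marathon or Kubernetes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical worker node with a fixed amount of memory (MiB)."""
    capacity_mib: int
    # 1.0 means no overcommitting; 2.0 lets reservations add up to
    # twice the physical capacity (assumed knob, for illustration only).
    overcommit_ratio: float = 1.0
    reserved_mib: List[int] = field(default_factory=list)

    def can_schedule(self, request_mib: int) -> bool:
        # A request fits if total reservations stay within the
        # (possibly inflated) limit.
        limit = self.capacity_mib * self.overcommit_ratio
        return sum(self.reserved_mib) + request_mib <= limit

    def schedule(self, request_mib: int) -> bool:
        # Record the reservation only when it fits.
        if self.can_schedule(request_mib):
            self.reserved_mib.append(request_mib)
            return True
        return False


def count_placed(node: Node, requests: List[int]) -> int:
    """Try to place every request and return how many were accepted."""
    placed = 0
    for request in requests:
        if node.schedule(request):
            placed += 1
    return placed


# Six tools each reserving 1 GiB on a 4 GiB node (made-up numbers).
requests = [1024] * 6
placed_strict = count_placed(Node(capacity_mib=4096), requests)
placed_relaxed = count_placed(
    Node(capacity_mib=4096, overcommit_ratio=2.0), requests
)
print(placed_strict, placed_relaxed)
```

Without overcommitting only the reservations that fit in physical
memory are accepted; with a ratio above 1.0 more tools fit, on the bet
that they rarely use their full reservation at the same time. That bet
is reasonable for mostly idle toollabs webservices and riskier for
production services.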

As you can see, there are pretty big pros and cons for both
technologies, because they are still quite "not boring" -
although one could argue that Mesos and Chronos at least have entered
their "boring" stage. Our spreadsheet slightly favours Kubernetes at
the moment, but that might change drastically if we determine that
some limitations are absolute showstoppers for us.
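
For those who have not opened the spreadsheet, a comparison of this
kind can be thought of as a weighted score per candidate, with a
showstopper overriding everything else. The Python sketch below shows
only that general idea: the criteria, weights and scores are made-up
placeholders, not values from our spreadsheet, and the function name
is hypothetical.

```python
from typing import Dict


def weighted_score(
    scores: Dict[str, float],
    weights: Dict[str, float],
    showstopper: bool = False,
) -> float:
    """Weighted average of per-criterion scores; 0.0 on a showstopper."""
    if showstopper:
        # A single absolute blocker outweighs every other criterion.
        return 0.0
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


# Placeholder criteria, weights and 1-5 scores (NOT the real evaluation).
weights = {"stability": 3, "multitenancy": 2, "community": 1}
option_a = {"stability": 4, "multitenancy": 1, "community": 2}
option_b = {"stability": 3, "multitenancy": 4, "community": 4}

score_a = weighted_score(option_a, weights)
score_b = weighted_score(option_b, weights)
score_b_blocked = weighted_score(option_b, weights, showstopper=True)
print(round(score_a, 2), round(score_b, 2), score_b_blocked)
```

This is why a small lead in the spreadsheet should not be read as a
decision: flagging one limitation as a showstopper flips the ranking
regardless of the other scores.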

In the remainder of this week and over the next few weeks, we will keep
stress testing both our test installations to uncover "surprises" and
bugs.

Let us know what you think - or reach out to us if you want to help in
this evaluation process. We will keep you posted!

Cheers,

Giuseppe & Yuvi

_______________________________________________
Ops mailing list
Ops at lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/ops



-- 
Yuvi Panda T
http://yuvi.in/blog


