It's still all me on this thread!
This is a note for all of us, since we've talked about unattended upgrades
and the like lately. I feel like the folks on this list are on the same
page, but here is a real-world example of recent thinking.
We recently fought with an issue which involved rebooting workers. We had
been sitting on pending kernel updates for Debian instances in Tools
because WMF unattended upgrades pulled in new kernels. At the moment the
workers are sitting on 4.9.0-0.bpo.4-amd64 while all other Debian instances
in Tools are sitting on 4.4.0-3-amd64. Considering the historical virtio
issues and the nightmare of debugging them, I feel like this reinforces our
strategy of making update management explicit and ongoing for Toolforge
(and novaproxy or other WMCS-managed resources).
root@tools-worker-1016:~# uname -a
Linux tools-worker-1016 4.9.0-0.bpo.4-amd64 #1 SMP Debian 4.9.51-1~bpo8+1
(2017-10-17) x86_64 GNU/Linux
root@tools-puppetmaster-01:/var/lib/git/operations/puppet# uname -a
Linux tools-puppetmaster-01 4.4.0-3-amd64 #1 SMP Debian 4.4.2-3+wmf8
(2016-12-22) x86_64 GNU/Linux
I really dislike this kind of inconsistency.
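As a hedged sketch of how we could audit that inconsistency fleet-wide (the file name, host list, and collection method are hypothetical; in practice the "host kernel" pairs could come from cumin or ssh running `uname -r`):

```shell
# Hypothetical audit sketch: given "host kernel" pairs (sample data below
# mirrors the versions quoted in this thread), count hosts per running
# kernel version. More than one line of output means the fleet diverges.
cat > /tmp/kernels.txt <<'EOF'
tools-worker-1016 4.9.0-0.bpo.4-amd64
tools-puppetmaster-01 4.4.0-3-amd64
tools-worker-1001 4.4.0-3-amd64
EOF

# Group by kernel version (field 2) and print a count per version.
awk '{count[$2]++} END {for (k in count) print count[k], k}' /tmp/kernels.txt | sort -rn
```

Something like this could run periodically so we notice drift before it bites us during a reboot.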
On Mon, Dec 11, 2017 at 2:04 PM, Chase Pettet <cpettet(a)wikimedia.org> wrote:
Replying to myself to separate out review from new
thoughts :)
Two things I wanted to comment on. 1) traffic control and thoughts 2)
deployment naming
1) I think the conclusion was that nftables does not have a TC equivalent
and that current techniques cannot replace our TC usage. We could get more
savvy with targeting using nftables (or iptables), and potentially the
nftables project will look at a traffic-control-type mechanism.
I talked a little about where I stopped short in the existing
implementation (of bastion resource QoS): at building a wrapper for TC.
While retracing the avenues I went down a few years ago, I found
https://packages.debian.org/stretch/firehol which appears to be just that.
As sprawling bash it has the potential to be unwieldy, but it is super
interesting (https://github.com/firehol/firehol/blob/master/sbin/fireqos).
I /think/
https://wiki.nftables.org/wiki-nftables/index.php/Rate_limiting_matchings
from the meeting doc
(https://office.wikimedia.org/wiki/Wikimedia_Cloud_Services/Offsite_Notes/kubecon-2017)
would perform similar functions to the limit module in iptables, which
drops violators instead of managing a queue to keep both ends sane
consumers within the defined throughput limits. I think this doesn't
exactly fit our model. Need to talk to Arturo to confirm I grok this
entirely :) Super excited about the possibility of making our TC setup
more dynamic and sane, and also moving to something more modern.
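A rough sketch of the distinction, to make the "drop vs. queue" point concrete (syntax per the nftables wiki page above; the table name, interface, subnet, and rates are made-up examples and untested here):

```
# nftables-style rate limiting: packets over the rate are dropped,
# like iptables' limit module -- the sender sees loss, not shaping.
table inet qos {
  chain forward {
    type filter hook forward priority 0; policy accept;
    ip saddr 10.0.0.0/8 limit rate over 10 mbytes/second drop
  }
}
```

By contrast, TC shapes by queueing, e.g. `tc qdisc add dev eth0 root tbf rate 10mbit burst 32kbit latency 400ms` (a token bucket), so both ends stay within the limit cooperatively instead of seeing drops.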
2) I think we should sidestep the meaningful-names pitfall and go for
something distinct but not inherently descriptive. "Main" will be
problematic whenever it stops being "main" in the implicit sense (see me
naming something "secondary" that then became ...primary). Anything we try
to brand with a relationship to a use case has this issue. I have used
colors to this end before: "blue", "black", "orange" environments. That's
just an example; I actually think we should go wholly generic and use
numeric identifiers. Potentially depone, deptwo, depthree. Phonetic:
"dep-one", "dep-two". Contextual: "one", "two". We can move the server from
"one" to "three". IDK. I hate pure numeric tagging less than the other
approaches I can think of. Not in love with "dep" as a prefix. Ideas
needed.
On Mon, Dec 11, 2017 at 9:01 AM, Chase Pettet <cpettet(a)wikimedia.org>
wrote:
*Original:*
https://etherpad.wikimedia.org/p/kubecon-2017-offsite-agenda
*Archived on office wiki:*
https://office.wikimedia.org/wiki/Wikimedia_Cloud_Services/Offsite_Notes/kubecon-2017
*Decided:*
- Stick with Trusty through the Neutron migration (for now, as we think we
are making enough progress on this to ensure Trusty sunset by April 2019.
Xenial seems to have Mitaka, so if we have to, we can match Mitaka there
with Trusty for a migration of OpenStack releases across distro releases,
but that's work we don't want to do and we need to settle on a distro
(see: figure out deployment methodology))
- https://phabricator.wikimedia.org/T166845 to be done via cumin for now
(long term Prometheus?)
- draw.io is a canonical tool
- Dumps work is a carry over goal
- Neutron will be a carry over goal but hopefully not a literal one
*Open Near Term:*
- Neutron Plan: talk about the naming of deployments
- Need to do hard capacity thinking on storage scaling and budgeting
- icon templates for draw.io
*Open Long(er) Term:*
- Need to figure out openstack components deploy methodology (containers,
source, distro packaging...)
- Is SLURM viable?
- kubeadm for Kubernetes deploy?
- Tools Bastion and resource issues
- Is there an upstream alternative that is viable for Quarry?
- How much do we fold into cloud-init?
- Do we use puppet standalone and virtualize the main cloud masters?
- Hiera for instances is a mess and needs to be rethought.
- Trial of paging duty cycles (while still taking advantage of our time
spread)
- How much of labtest is ready for parity testing?
- Document undoing standalone puppetmaster setup
*Missed because of time:*
- puppet horizon interface future (https://phabricator.wikimedia.org/T181551 and co)
- FUTURE IDEAS: NOT EXISTING
- - new TLDs for public and internal addresses: when and how to deploy
- - new ingress for HTTPS in VPS and Toolforge?
- - monitoring sanely
- - thinking about ceph
- - metrics for end-users - who uses my tools and how?
(https://phabricator.wikimedia.org/T178834 and co)
I would like to talk about the missed items if we can find a few minutes
at all hands (over dinner?)
--
Chase Pettet
chasemp on phabricator <https://phabricator.wikimedia.org/p/chasemp/>
and IRC