---------- Forwarded message ----------
From: Liam Wyatt <liamwyatt(a)gmail.com>
Date: Sun, Feb 4, 2018 at 11:21 PM
Subject: [Wikimania-l] Wikimania 2018 Call for Submissions
To: "Wikimania general list (open subscription)"
<wikimania-l(a)lists.wikimedia.org>, Wikimedia Mailing List
<wikimedia-l(a)lists.wikimedia.org>
Cc: Program committee list <wikimania-program(a)lists.wikimedia.org>
Dear Wikimedia community,
We are pleased to announce that Wikimania 2018 is now accepting
proposals for workshops, discussions, presentations, or research
posters to give during the conference. To read the full instructions
visit the event wiki and click on the link provided there to make your
proposal:
https://wikimania2018.wikimedia.org/wiki/Submissions
The deadline for submissions is 23:59 UTC on Sunday March 18, 2018.
This is approximately 6 weeks away. Whether you are a community member
of one of the Wikimedia projects, or a fellow open content creator or
consumer, we welcome your proposal for a session.
Theme
This year, the conference will be taking place in Cape Town, South
Africa, where the organisers are giving this Wikimania a unique flavor
— an explicit theme based in African philosophy:
“Bridging knowledge gaps, the ubuntu way forward.”
Read more about this theme, why it was chosen, and what it means for
the conference program at the Wikimedia blog:
https://blog.wikimedia.org/2018/02/05/wikimania-cape-town-ubuntu/
Throughout the conference program, this theme will be tightly held but
loosely defined, in order to encourage a diverse range of responses.
It is our hope that this theme will give us
the opportunity to further our goal of creating the “sum of human
knowledge”, by encouraging greater diversity and inclusion in who
participates and what we discuss at Wikimania.
To learn more, and to make a proposal for the conference, please visit:
https://wikimania2018.wikimedia.org/wiki/Submissions
Please forward this announcement to other lists and groups across the
Wikimedia movement.
We look forward to reading your submissions. Sincerely,
Program committee co-chairs Emna Mizouni, Felix Nartey, and Liam Wyatt.
_______________________________________________
Wikimania-l mailing list
Wikimania-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikimania-l
--
Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org>
[[m:User:BDavis_(WMF)]] Manager, Cloud Services Boise, ID USA
irc: bd808 v:415.839.6885 x6855
Tue:
* answered question and reclosed https://phabricator.wikimedia.org/T176926
* looked at https://phabricator.wikimedia.org/T187850 (labsdb
mediawiki-config cleanup)
Wed:
* looked at https://phabricator.wikimedia.org/T187792 (wikihistory
pounding toolsdb)
Thu:
* quiet day for me, but big progress on PAWS things from Chico and Zhu
Fri:
* poked at tools.paws k8s stuff with Chico
Mon:
* SRE meeting
** Filippo promoted to Senior SRE
** Mark starting to plan for "SRE summit" at the end of the fiscal year (June)
** DB backup work progressing, but will be tight for goal depending on
fires that pop up
*** DBAs asking for "shielding" help so that they can focus on their
goals; try to intercept things addressed to them when you can
** Lots of hardware things happening, both procurement and replacement
** misc db cluster decomm <https://phabricator.wikimedia.org/T183469>
flagged as needing WMCS help. (Was this just the backup thing that
Madhu handled?)
** Row B network work coming <https://phabricator.wikimedia.org/T183585>
** [BLOCKED] SMART monitoring: enable on lab* hosts
https://gerrit.wikimedia.org/r/c/412860/
** Raised our procurement, DCOPS tasks
** Mentioned Arturo's nftables discussion
** Alex/Giuseppe to present at KubeCon EU; you will get a preview
presentation https://events.linuxfoundation.org/events/kubecon-cloudnativecon-europe-201…
Bryan
2018-02-27 20:00:02,290 INFO force is enabled
2018-02-27 20:00:02,365 INFO removing tools-project-backup
2018-02-27 20:00:02,490 INFO removing tools-project-backup
2018-02-27 20:00:03,034 INFO creating tools-project-backup at 2T
2018-02-27 20:00:03,852 ERROR b' /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558073344: Input/output error\n /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558130688: Input/output error\n /dev/misc/misc-snap: read failed after 0 of 4096 at 0: Input/output error\n /dev/misc/misc-snap: read failed after 0 of 4096 at 4096: Input/output error\n'
2018-02-27 20:00:03,852 INFO force is enabled
2018-02-27 20:00:03,868 ERROR b' /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558073344: Input/output error\n /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558130688: Input/output error\n /dev/misc/misc-snap: read failed after 0 of 4096 at 0: Input/output error\n /dev/misc/misc-snap: read failed after 0 of 4096 at 4096: Input/output error\n'
2018-02-27 20:00:03,869 INFO removing tools-snap
2018-02-27 20:00:03,885 ERROR b' /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558073344: Input/output error\n /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558130688: Input/output error\n /dev/misc/misc-snap: read failed after 0 of 4096 at 0: Input/output error\n /dev/misc/misc-snap: read failed after 0 of 4096 at 4096: Input/output error\n'
2018-02-27 20:00:03,901 ERROR b' /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558073344: Input/output error\n /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558130688: Input/output error\n /dev/misc/misc-snap: read failed after 0 of 4096 at 0: Input/output error\n /dev/misc/misc-snap: read failed after 0 of 4096 at 4096: Input/output error\n'
2018-02-27 20:00:03,902 INFO removing tools-snap
2018-02-27 20:00:05,675 ERROR b' /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558073344: Input/output error\n /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558130688: Input/output error\n /dev/misc/misc-snap: read failed after 0 of 4096 at 0: Input/output error\n /dev/misc/misc-snap: read failed after 0 of 4096 at 4096: Input/output error\n'
2018-02-27 20:00:05,716 ERROR b' /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558073344: Input/output error\n /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558130688: Input/output error\n /dev/misc/misc-snap: read failed after 0 of 4096 at 0: Input/output error\n /dev/misc/misc-snap: read failed after 0 of 4096 at 4096: Input/output error\n'
2018-02-27 20:00:05,744 ERROR b' /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558073344: Input/output error\n /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558130688: Input/output error\n /dev/misc/misc-snap: read failed after 0 of 4096 at 0: Input/output error\n /dev/misc/misc-snap: read failed after 0 of 4096 at 4096: Input/output error\n'
2018-02-27 20:00:05,745 INFO creating tools-snap at 1T
2018-02-27 20:00:05,997 ERROR b' /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558073344: Input/output error\n /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558130688: Input/output error\n /dev/misc/misc-snap: read failed after 0 of 4096 at 0: Input/output error\n /dev/misc/misc-snap: read failed after 0 of 4096 at 4096: Input/output error\n'
2018-02-27 20:00:06,033 ERROR b' /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558073344: Input/output error\n /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558130688: Input/output error\n /dev/misc/misc-snap: read failed after 0 of 4096 at 0: Input/output error\n /dev/misc/misc-snap: read failed after 0 of 4096 at 4096: Input/output error\n'
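The entries above look like output from a remove-then-recreate snapshot
rotation script, with `lvremove`/`lvcreate` stderr surfacing as the ERROR
lines. A minimal sketch of that loop, assuming hypothetical volume names
and flags (this is not the real script):

```python
import logging
import subprocess

logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s",
                    level=logging.INFO)
log = logging.getLogger("snapshot-rotate")

def rotate_volume(vg, name, size, force=True, dry_run=True):
    """Remove and recreate an LVM volume, logging in the style above.

    The vg/name/size values and the exact lvremove/lvcreate flags are
    assumptions; with dry_run=True the commands are only built, never run.
    Returns the create command for inspection.
    """
    if force:
        log.info("force is enabled")
    remove_cmd = ["lvremove", "--force", f"{vg}/{name}"]
    create_cmd = ["lvcreate", "--size", size, "--name", name, vg]
    log.info("removing %s", name)
    if not dry_run:
        # stderr from the LVM tools (e.g. read failures on a dying
        # device) is what shows up as the ERROR log entries
        result = subprocess.run(remove_cmd, capture_output=True)
        if result.stderr:
            log.error(result.stderr)
    log.info("creating %s at %s", name, size)
    if not dry_run:
        subprocess.run(create_cmd, check=True)
    return create_cmd

cmd = rotate_volume("misc", "misc-snap", "1T")
```

The repeated "read failed ... Input/output error" lines in the log would
then point at the underlying block device, not the rotation logic itself.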
A lot of things are in the works; for each I'll either add an agenda item
to the weekly or hold a follow-up meeting, but it has reached the point
where a preface email ahead of any discussion is more efficient. /Please/
'ack' this with a response, because there are things in here that affect
everyone on the team and are difficult to rewind.
== On OpenStack and Ubuntu/Debian ==
In Austin we said that the long-tailed, delayed, and (some would say)
tortuous march of Neutron meant we should stick with Liberty and Trusty
for the time being, to avoid the historic moving-target problem. In
making the annual plan and lining up the many changes that have to occur
in the next 15 months, it became clear that if we do all of this in
series, instead of in parallel, we will never make it. We have to shift
more sand under our feet than feels entirely comfortable. That means
moving to Mitaka <https://www.openstack.org/software/mitaka/> before, or
as, we target Neutron, in order to mix in Jessie with backports (which
also ships Mitaka). The update to Mitaka has a few challenges --
primarily that the designate project made significant changes
<https://docs.openstack.org/designate/pike/admin/upgrades/mitaka.html>.
I would like to stand up new hypervisors ASAP once the main deployment
is running Mitaka, so we can have customer workloads testing for as long
as possible. In theory this sets us up for an N+1 upgrade path on
Debian through Stretch and Pike.
<https://phabricator.wikimedia.org/T169099#3959060>
== On monitoring and alerting ==
Last Oct I made a task <https://phabricator.wikimedia.org/T178405> to
update some of our alerting logic, and in Austin we talked about how to
improve our coverage and move towards a rotation-based workflow. The move
to having a 'normal' on-call rotation, especially one where we take
better advantage of our time-zone spread, is going to require more
sophisticated management than we have now -- primarily escalations and
complicated alerting and acknowledgement logic.
This came to the forefront again with the recent loss of labvirt1008.
AFAICT the hypervisor rebooted in <=4m
<https://phabricator.wikimedia.org/T187292#3971877> and so did not alert.
There is also the problem of it coming back up and not alerting on the
"bad" state that leaves client instances shut down. We reviewed that
behavior and agree that starting instances by default on hypervisor
startup has more downsides than upsides, but it should still be an
alert-able errant state. I created a wmcs-team contact group
<https://gerrit.wikimedia.org/r/c/410525/> and added a check
<https://gerrit.wikimedia.org/r/c/413452/> that changes our new normal to
be some instance running on every active hypervisor. Then I proceeded
to add a bunch of checks
<https://gerrit.wikimedia.org/r/q/topic:%2522openstack%2522+(status:open%20O…>,
adjusting existing checks to alert wmcs-team, and changing some checks to
be 'critical' that were not.
The icinga setup in some ways makes single-tenant assumptions that we'll
have to work through, such as: 'critical' alerts all of opsen, and is the
only way to override the configuration to never re-alert. At the moment
none of the checks that alert purely wmcs-team, and not all of opsen,
will re-alert. Some checks may double-alert where WMCS roots are in both
groups. There is also a coverage issue: there are checks where it may
make sense for those of us in this group to receive alerts 24/7, or at
lower warning thresholds, but alerting all of opsen would cause fatigue.
I have made a change for myself that has the following effect: regular
critical alerts are on a standard 'awake' schedule and wmcs-team alerts are
still 24/7. Andrew, Madhu, and I have been on a 24/7 alerting
schedule for a long time now, and I think shifting to 24/7 for wmcs-team
things is an interim step for all of us. This has the side effect of
requiring that all things we want to get alerted to 24/7 are alerting the
wmcs-team contact group.
I am going to schedule a meeting to review what is currently alerting
wmcs-team. This is so that we can talk as a group both about what
should alert and about what currently does. I
want everyone to walk away knowing what pages could be sent out and the
basics of what they mean. I want everyone in the group to walk away
feeling comfortable with our transitional strategy, and acknowledging as a
group what things we need to know about 24/7. We can talk about how to
take advantage of our time-zone spread in this arrangement, and briefly
talk about what it would mean to move to something based on
pagerduty/victorops.
The introduction of wmcs-team should also allow us to have our own IRC
alerting sent to #wikimedia-cloud-feed (or wherever) in combination with
#wikimedia-operations. One of the complaints it seems we have all had is
that while treating IRC as persistent for alerting is problematic, it is
even more problematic in a channel as noisy as #wikimedia-operations.
Chico has expressed a desire to contribute while IRC is dormant and we have
begun a series of 1:1 conversations about our environment. He has been
working on logic to alert on a portion of puppet failures
<https://gerrit.wikimedia.org/r/c/411315> rather than every puppet
failure. This, to my mind, does not mean we have solved the puppet
flapping issue, but it's also not doing us any good to be fatigued by an
issue we do not have time to investigate and that has seemed benign for
a year. I am considering whether we should move this to tools.checker,
increase retries on our single puppet alerting logic, and add alerting to
the main icinga for it. Hopefully we can talk about this in our meeting.
== Naming (the worst of all things) ==
==== cloud ====
We have continued to phase out the word 'Lab', and even some networking
equipment <https://phabricator.wikimedia.org/T187933> has made the change.
As part of the Debian and Neutron migrations we need to replace or re-image
many of our servers, and it seems like the ideal time to acknowledge a
'cloud' variant naming replacement. In our weekly meeting I proposed 'cld'
as a replacement for 'lab' outright. In discussions on ops-l it seems
'lab' => 'cloud' is most desired for simplicity and readability. 'cloud'
as a prepend seems fine to me, and I don't anticipate objections within
the team, so I'm considering it decided (most of us are on ops-l).
==== labtest ====
Lab[test]* needs to be changed as well. The 'test' designation here has
confused everyone who is not Andrew or me numerous times over the last
year(s). For clarity, the lab[test] environment is a long-lived staging
and PoC ground for openstack provider testing where we need actual
integration with hardware, or where functionality cannot be tested in an
openstack-on-openstack way. Testing the VXLAN overlay, for instance, is
in this category. Migration strategy for upgrade paths of OpenStack
itself, especially where significant networking changes are made, is in
this category. Hypervisor integration where kernel versions need to be
vetted, and package updates need to be canaried, is in this category.
Lab[test] will never have tenants or projects other than ourselves. This
has not been obvious, and the environment has been thought to be
transient, temporary, and/or customer-facing at various points.
My first instinct was to fold the [test] naming into whatever next-phase
normal prepend we settle on (i.e. cloud). Bryan pointed out that making
it more difficult to discern between customer-facing equipment and
internal equipment is a net negative, even if it did away with the
confusion we are living with now. I propose we add an indicator of [i]
to all "cloud" equipment, where *nothing with this indicator will ever
be customer facing*. The current indicator of [test] is used both for
hiera targeting via regex.yaml and as a human indicator.
lab => cloud
cloudvirt1001
cloudcontrol1001
cloudservices1001
cloudnodepool1001
labtest => cloudi
cloudicontrol2003
cloudivirt2001
cloudivirt2002
I am open to suggestions, but we need to settle on something this week.
==== deployments and regions (oh my) ====
I have struggled with this damn naming thing for so long I am numb to it
:) I have the following theory: there is no defensible naming strategy,
only ones that do not make you vomit.
===== Current situation =====
We have been working with the following assumptions: a "deployment" is a
superset of an openstack setup (keystone, nova, glance, etc.) where each
"deployment" is a functional analog; i.e. even though striker is not an
openstack component, it is a part of our openstack ...stack and as such
is assignable to a particular deployment. deployment => region =>
component(s)[availability-zones]. We currently have 2 full and 1
burgeoning deployment: main (customer facing in eqiad), labtest
(internal use cases in codfw), and labtestn (internal PoC neutron
migration environment). FYI, in purely OpenStack ecosystem terms, the
shareable portions between regions are keystone and horizon.
role::wmcs::openstack::main::control
deployment
-> region
--> availability zone
main
-> eqiad
--> nova
So far this has been fine and was a needed classification system to make
our code multi-tenant at all. We are working with several drawbacks at
the moment: labtest is a terrible name (as described above), labtestn is
difficult to understand, if we pursue the labtest and labtestn strategy
we end up with mainn, regions and availability zones are not coupled to
deployment naming, and these names, while distinct, do not lend
themselves to cohesive expansion. On and on; nothing will be perfect,
but we can do a lot better. I have had a lot of issues in finding a
naming scheme that we can live with here, such as:
* 'db' in the name issue
* 1001 looking like a host issue
* labtest is a prepend (labtestn is not)
* unclarity between internal/staging/PoC usage and customer-facing usage
* schemes that provide hugely long and impractical names
===== proposed situation =====
I am not especially enamored with any naming solution; all the ones I've
tried end up with oddities and particular ugliness.
[site][numeric](deployment)
-> [site][numeric][r postfix for region] (region)
--> [site][numeric][region][letter postfix for row] (availability zone --
an indicator for us that I expect will last a long time)
# eqiad0 is now 'main' and will be retired with neutron. It also will
not match the consistent naming for region, etc.
# legacy to be removed
# role::wmcs::openstack::eqiad0::control
eqiad0
-> eqiad
--> nova
# Once the current nova-network setup is retired we end up at deployment 1
in eqiad
eqiad1
-> eqiad1r
--> eqiad1rb
--> eqiad1rc
# role::wmcs::openstack::codfwi1::control
codfwi1
-> codfwi1r
--> codfwi1rb
codfwi2
-> codfwi2r
--> codfwi2rb
This takes our normal datacenter naming ([dc provider][airport]) and
includes an 'i' for internal use cases, along with a numeric postfix for
the deployment per site, and postfixes for sub-names such as "region" or
"availability-zone". It's not phonetic, but it could work. I am going to
drop a few links I've walked through in the bottom section (#naming). My
only ask is: if you have a concern, please suggest an alternative that is
thought out to at least 3 deployments per site and differentiates
"internal" and "external" use cases. I can change our existing deployments
without too much fanfare. These are basically key namespaces in hiera, and
class namespaces in Puppet at the moment. I won't bother updating the
regions or availability zones in place that exist now -- until
redeployment. It becomes decidedly more fixed as we move into more eqiad
deployments (as I have no plans to change the existing eqiad deployment in
place). This is influenced by my experience in naming things in the
networking world where there are multiple objects tied together to achieve
a desired end, such as: foo-in-rule-set, foo-interface, foo-out-rule-set,
foo-provider-1, etc.
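To make the shape of the proposed names concrete, here is an
illustrative sketch. The helper functions and validation regex are mine,
not an agreed interface; only the example names (eqiad1, codfwi1rb, etc.)
come from this email:

```python
import re

def deployment(site, n, internal=False):
    """[site][optional 'i' for internal][numeric] -- e.g. eqiad1, codfwi1."""
    return f"{site}{'i' if internal else ''}{n}"

def region(dep):
    """Region is the deployment name with an 'r' postfix -- e.g. eqiad1r."""
    return f"{dep}r"

def availability_zone(dep, row):
    """AZ is the region name plus a row letter -- e.g. eqiad1rb for row B."""
    return f"{region(dep)}{row}"

# A validation regex for the full scheme (an assumption, for illustration).
NAME_RE = re.compile(
    r"^(?P<site>[a-z]+?)(?P<internal>i?)(?P<num>\d+)"
    r"(?P<region>r?)(?P<row>[a-z]?)$"
)

assert deployment("eqiad", 1) == "eqiad1"
assert availability_zone(deployment("codfw", 1, internal=True), "b") == "codfwi1rb"
```

One property worth noting: because the names nest by postfix, a single
pattern can pull apart deployment, region, and row, which is handy for
hiera regex targeting.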
I, absurdly, have more to write but this is enough for a single email.
Implications for Neutron actually happening, Debian, next wave of reboots,
team practices, and more will be separate. Please ack this and provide
feedback or I'm a runaway train.
Best,
--
Chase Pettet
chasemp on phabricator <https://phabricator.wikimedia.org/p/chasemp/> and
IRC
#naming
https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions
https://en.wikipedia.org/wiki/NATO_phonetic_alphabet#International_adoption
http://moodycamel.com/blog/2011/alphabetical-fruit-naming-convention
https://softwareengineering.stackexchange.com/questions/143435/what-are-the…
https://en.wikipedia.org/wiki/Naming_convention
https://en.wikipedia.org/wiki/Computer_network_naming_scheme
https://en.wikipedia.org/wiki/Product_naming_convention
https://en.wikipedia.org/wiki/Systematic_name
https://namingschemes.com/Main_Page
https://en.wikipedia.org/wiki/Onomastics
https://namingschemes.com/Phonetic_Alphabet
https://namingschemes.com/Greek_Alphabet
https://xkcd.com/910/
https://namingschemes.com/Periodic_Table_of_Elements
https://tools.ietf.org/html/rfc1178
http://www.domainhandbook.com/humor1.html
http://www.obofoundry.org/principles/fp-002-format.html
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availabil…
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/elastic-ip-addresses-ei…
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availabil…
https://www.cloudconformity.com/conformity-rules/EC2/ec2-instance-naming-co…
https://puppet.com/docs/puppet/5.0/lang_reserved.html
https://www.thebalance.com/military-phonetic-alphabet-3356942
I cannot over-emphasize how much nicer on-call duty is now that Chico
handles most IRC issues. Thank you, Chico!
Tuesday:
- Created a project for Chico
Wednesday:
- That labvirt1008 outage :(
- A user had trouble with Horizon access (mostly handled by Bryan,
resolved when they re-configured 2fa)
- Created the 'wmf-research-tools' project
Thursday:
- NFS troubles with the video renderer -- Madhu fixed this in short order.
- There were a bunch of icinga alarms firing:
-- The labpuppetmasters were complaining because I broke ferm for a bit
-- fixed
-- Labtestvirts were alerting because of Chase's new 'make sure there's
a VM running on every labvirt' test. I ack'd the alerts.
-- Labcontrol1001 was alerting because someone removed the 'novaadmin'
user from the cvn project. I replaced it.
- Approved a tool request and bounced another one
Saturday:
- deleted wikidata-topicmaps!!
Monday:
- Approved a couple of tools accounts
2018-02-21 20:00:02,438 INFO force is enabled
2018-02-21 20:00:02,500 INFO removing misc-project-backup
2018-02-21 20:00:02,624 INFO removing misc-project-backup
2018-02-21 20:00:03,291 INFO creating misc-project-backup at 2T
2018-02-21 20:00:04,164 INFO force is enabled
2018-02-21 20:00:04,182 INFO removing misc-snap
2018-02-21 20:00:04,229 INFO removing misc-snap
2018-02-21 20:00:05,001 INFO creating misc-snap at 1T
2018-02-20 20:00:02,630 INFO force is enabled
2018-02-20 20:00:02,656 INFO removing tools-project-backup
2018-02-20 20:00:02,714 INFO removing tools-project-backup
2018-02-20 20:00:03,261 INFO creating tools-project-backup at 2T
2018-02-20 20:00:04,019 INFO force is enabled
2018-02-20 20:00:04,050 INFO removing tools-snap
2018-02-20 20:00:04,085 INFO removing tools-snap
2018-02-20 20:00:05,452 INFO creating tools-snap at 1T
2018-02-14 20:00:02,918 INFO force is enabled
2018-02-14 20:00:02,982 INFO removing misc-project-backup
2018-02-14 20:00:03,551 INFO removing misc-project-backup
2018-02-14 20:00:04,195 INFO creating misc-project-backup at 2T
2018-02-14 20:00:05,111 INFO force is enabled
2018-02-14 20:00:05,141 INFO removing misc-snap
2018-02-14 20:00:05,191 INFO removing misc-snap
2018-02-14 20:00:05,676 INFO creating misc-snap at 1T
2018-02-13 20:00:03,162 INFO force is enabled
2018-02-13 20:00:03,195 INFO removing tools-project-backup
2018-02-13 20:00:03,245 INFO removing tools-project-backup
2018-02-13 20:00:03,721 INFO creating tools-project-backup at 2T
2018-02-13 20:00:04,525 INFO force is enabled
2018-02-13 20:00:04,554 INFO removing tools-snap
2018-02-13 20:00:04,589 INFO removing tools-snap
2018-02-13 20:00:06,079 INFO creating tools-snap at 1T