[Labs-l] Down instances & proposed Nagios changes/issues

Jason Spriggs jason at jasonspriggs.com
Mon Oct 1 11:42:19 UTC 2012


I am the operator of i-0000040c.pmtpa.wmflabs and i-0000040b.pmtpa.wmflabs.

I personally am not sure why they are showing as offline, as they are not.

~Jason

On Sun, Sep 30, 2012 at 2:44 PM, Damian Zaremba
<damian at damianzaremba.co.uk> wrote:

>  Complaining bit
> ==========
> Once again I'm trying to clear up monitoring so we can improve it. The
> following instances are currently reporting as down (some have been for
> quite a while):
>
> * test5 - i-00000026.eqiad.wmflabs
> ** Ryan - ping/nrpe is restricted from pmtpa to eqiad; is this intended?
> Do we want one Nagios instance per region, or centralized monitoring?
> It's not an issue currently, but it needs deciding/sorting before
> bringing the region online.
> * wlm-mysql-master - i-0000040c.pmtpa.wmflabs
> * wep - i-000000c2.pmtpa.wmflabs
> * analytics - i-000000e2.pmtpa.wmflabs
> * deployment-backup - i-000000f8.pmtpa.wmflabs
> * deployment-feed - i-00000118.pmtpa.wmflabs
> * configtest-main - i-000002dd.pmtpa.wmflabs
> * deployment-cache-bits02 - i-0000031c.pmtpa.wmflabs
> * puppet-abogott - i-00000389.pmtpa.wmflabs
> * mobile-wlm2 - i-0000038e.pmtpa.wmflabs
> * conventionextension-test - i-000003c0.pmtpa.wmflabs
> * lynwood - i-000003e5.pmtpa.wmflabs
> * wlm-apache1 - i-0000040b.pmtpa.wmflabs
>
> If any of these are yours, could you do one of the following:
> a) Reply, if it's still pending file recovery from the block storage
> migration (these should all be done).
> b) Delete it, if it's not used and there's no plan for it to be used.
> c) Start it, if it's still needed and is just stopped from the last
> outage (the block storage migration).
> d) Reply, if it needs to be down for some reason (I'll mark it as such in
> monitoring so it doesn't spam the channel).
> e) Reply, if it's online and functioning as expected (there might be a
> security group or similar issue).
>
> Current problems with monitoring
> =====================
> While monitoring everything based on puppet classes makes perfect sense
> for production, most things here are not packaged/puppetized and we're
> half dev and half 'semi-production', so monitoring rather sucks.
>
> Due to the current state of labs, I suggest that we add an attribute to
> instance entries in LDAP that allows monitoring to be enabled per
> instance, defaulting to not monitored.
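>
> As a minimal sketch (the attribute name 'monitoringEnabled', the server
> URI and the base DN are all assumptions here, not the real schema), the
> Nagios config generator could filter on such a flag with python-ldap:
>
>     # Sketch: emit Nagios host definitions only for instances that have
>     # opted in via a hypothetical 'monitoringEnabled' LDAP attribute.
>     import ldap
>
>     conn = ldap.initialize('ldap://ldap.pmtpa.wmflabs')  # placeholder URI
>     conn.simple_bind_s()  # anonymous bind, for the sketch
>
>     results = conn.search_s(
>         'ou=hosts,dc=wikimedia,dc=org',  # placeholder base DN
>         ldap.SCOPE_SUBTREE,
>         '(monitoringEnabled=TRUE)',      # absent attribute = not monitored
>         ['dc', 'aRecord'],
>     )
>
>     for dn, attrs in results:
>         name = attrs['dc'][0]
>         addr = attrs.get('aRecord', [name])[0]
>         print('define host {')
>         print('    use        generic-host')
>         print('    host_name  ' + name)
>         print('    address    ' + addr)
>         print('}')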
>
> Now, while that may seem silly, we currently can't enable the relay bot
> without flooding the channel with nonsense, which makes the monitoring
> redundant.
>
> By limiting spam to things we care about (public HTTP instances, MySQL
> servers etc.) we can easily see when things are actually breaking.
>
> Downsides
> ---------
> We lose the general overview of instances, which forces a more reactive
> approach - however, we're not very proactive currently anyway.
>
> Implementation choices
> ===============
> a) Based on puppet classes (current usage)
> Pros:
> * Monitoring is standard
> * Monitoring is automatic
>
> Cons:
> * We're supposed to be developing, not standardizing (at this point)
> * Important services get masked by dev instances
> * We're not really monitoring services (they're not puppetized, yet)
>
> b) Based on user input (possibly stored in LDAP as an entry under the host)
> Pros:
> * People can test/develop monitoring
> * We can monitor things not yet puppetized
> * We can ignore unimportant things
>
> Cons:
> * Monitoring isn't standard
> * Monitoring isn't automatic
> * We're breaking from production in style
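>
> As a rough sketch of what (b) could look like, assuming a hypothetical
> multi-valued 'monitoringCheck' attribute on the host entry holding
> 'description!check_command' pairs (the encoding is made up for
> illustration):
>
>     # Sketch: expand hypothetical per-host 'monitoringCheck' LDAP values
>     # into Nagios service definitions.
>     def service_definitions(host, checks):
>         for check in checks:
>             desc, command = check.split('!', 1)
>             yield ('define service {\n'
>                    '    use                  generic-service\n'
>                    '    host_name            %s\n'
>                    '    service_description  %s\n'
>                    '    check_command        %s\n'
>                    '}') % (host, desc, command)
>
>     for svc in service_definitions('wlm-apache1', ['HTTP!check_http']):
>         print(svc)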
>
> While I'd love to spend my time convincing people that puppet is the way
> forward, quite frankly the current state of the repo is a mess. It's
> partly not usable in labs AFAIK (due to the way parameters are handled)
> and is generally a mix of bad/confusing code that's whitespace hell.
>
> As we move over to role classes with parameterized classes in modules,
> it should be easier and quicker to get changes in.
>
> Until there is a push for that, monitoring is either mostly redundant,
> or we can work on improving it. As we have semi-production stuff, I
> think we should improve it.
>
> The question becomes whether we want to enable user-based monitoring and
> treat Nagios as a dev environment alongside puppet classes, keep puppet
> classes exclusively, use user input exclusively, or split the usage in
> two: puppet-based for 'production' services and user-based for dev.
>
> It would be 'easy' to allow 'extra' monitoring data to be specified on
> an instance's subpage, or even bang it in LDAP - however, this could
> encourage a path that we don't want.
>
> Features I'd like to see
> ==============
> * User access to the web interface (LDAP-authenticated, based on project
> membership)
> * More extensive monitoring of services (think about the beta project
> and how crappy its monitoring is currently)
> * Optional subscription to alerts on a per-project basis (think about
> semi-production stuff where it would be nice to get an email saying it's
> borked)
> * Puppetization of the setup
> * Expansion of the groups/templates to include everything in puppet that's
> monitored in production (currently it's a very small common list).
> * Grouping based on region (parseable from the instance FQDN; see the
> sketch after this list)
> * Grouping based on host (this is currently exposed via labsconsole; we
> could scrape it for info or talk to nova directly, I guess. Harder than
> the above)
> * A bot that doesn't die randomly
> * A way to shard monitoring (per region) for when we get so many instances
> it's not possible to have a single crappy box
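>
> For the region grouping, a minimal sketch (assuming the region is always
> the second label of the instance FQDN, as in the instance list above):
>
>     # Sketch: group instance FQDNs like i-00000026.eqiad.wmflabs into
>     # per-region buckets that could back Nagios hostgroups.
>     from collections import defaultdict
>
>     def region_of(fqdn):
>         # i-xxxxxxxx.<region>.wmflabs -> <region>
>         return fqdn.split('.')[1]
>
>     def by_region(fqdns):
>         groups = defaultdict(list)
>         for fqdn in fqdns:
>             groups[region_of(fqdn)].append(fqdn)
>         return groups
>
>     print(by_region(['i-00000026.eqiad.wmflabs',
>                      'i-0000040b.pmtpa.wmflabs']))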
>
> Features I'd be interested in exploring
> =======================
> * Using saltstack to grab monitoring info (for example puppet last run
> time; this can be calculated from a state file and pushed back to the
> monitoring instance, or polled using minion-to-minion salt access). SNMP
> traps kinda suck and rely on people updating their puppet clones etc.
> Without adding sudo access to nrpe and writing a script for it, there's
> no other way to get root-level access to grab the file data. Extending
> saltstack (if we do end up using it widely) and creating a 'feedback
> loop' would be nice; see the sketch after this list.
> * Being able to monitor misc data/servers (think labsconsole - currently
> things like controllers are monitored on production Nagios, but this
> data isn't relayed to #wikimedia-labs or widely open to the labs
> community). While monitoring infrastructure from within itself isn't
> generally a good idea, from a centralized community point of view it
> might be nice.
> * Adding other software (Graphite) to the 'common use' 'monitoring
> stack'. For example, in bots it would be nice to a) monitor the
> processes/random data in Nagios but also b) push metrics out and have
> historical graphs. The downside is that Graphite isn't currently
> packaged for Ubuntu in public repos (it is somewhere for prod, though).
> We'd also need some form of proxy to determine the project name prefix
> for incoming data.
> * Adding a real API to labsconsole to expose the data we have in there,
> as well as allowing the creation/configuration and deletion of
> instances. JSON output of SMW searches rather sucks a little due to the
> filtering etc.
> * Exposing current status/uptime stats per project and instance on
> labsconsole (not sure how easy it would be to transclude this/images
> from ganglia). The instance pages are mostly useless and uninteresting
> to look at. For example, on the beta project it would be interesting to
> be able to say 'it's been up 99.98% this month with a response time of
> xxxms'. With data we can at least have an idea of when things are going
> crappy, rather than 'it's broken', 'now it's not'.
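>
> For the saltstack idea, a minimal sketch of a custom execution module
> (the module name and its placement in _modules/ are my assumptions; the
> state file path is puppet's standard one). Since the minion already runs
> as root, this sidesteps the nrpe/sudo problem:
>
>     # _modules/puppet_state.py (hypothetical name): report the last
>     # puppet run time from puppet's last_run_summary.yaml state file.
>     import yaml
>
>     SUMMARY = '/var/lib/puppet/state/last_run_summary.yaml'
>
>     def last_run():
>         '''Return the epoch time of the last puppet run, or None.'''
>         try:
>             with open(SUMMARY) as fh:
>                 summary = yaml.safe_load(fh)
>             return summary['time']['last_run']
>         except (IOError, KeyError, TypeError):
>             return None
>
> The monitoring instance could then poll it with something like:
>
>     salt '*' puppet_state.last_run
>
> and alert on stale values.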
>
> TL;DR
> Our monitoring currently sucks. We need to get to a place where rolling
> out a cluster based on puppet classes gets auto-monitored, while still
> allowing development without masking useful alerts.
>
> I'm not too sure of the perfect solution right now; however, I'd love
> some feedback/ideas from everyone else, and to publicise what monitoring
> we do have generally.
>
> Damian
>


-- 

*Thank you for contacting Jason Spriggs.*