[Labs-admin] ** PROBLEM alert - tools-exec-1433/Puppet run is CRITICAL **

Chase Pettet cpettet at wikimedia.org
Fri Apr 21 15:01:37 UTC 2017


I think this lessened considerably after
https://phabricator.wikimedia.org/T161898#3197879, but it did happen a few
times in the last 24 hours (then again, we have had an underlying low level
of transient Puppet issues for some time).  I think we'll know in a few
days whether this was a light 24 hours by coincidence or whether we are
chipping away here.

I made some changes in https://gerrit.wikimedia.org/r/#/c/349433/ to get
better insight into what exactly is happening when Puppet is vomiting, and
to get better consistency.

On Thu, Apr 20, 2017 at 8:53 AM, Chase Pettet <cpettet at wikimedia.org> wrote:

>
>>
>> So if I'm reading nfs-mount-manager correctly, it sounds like
>> `/usr/bin/timeout -k 10s 20s ls $2` is failing in the
>> `nfs-mount-manager check` for the Exec resource and then the Exec
>> tries to mount the export again? Would that mount command ever
>> actually work on an active host?
>>
>
>
> This is all a game of mousetrap
> <https://i.ytimg.com/vi/Pk1ue1tolFc/maxresdefault.jpg> where we hope that
> the end result is either a Puppet run that fails in a way that doesn't
> screw things up further because NFS is unavailable (for a variety of
> reasons), or one that ends in successfully ensuring the mounts are ready
> to go.  Then we want that trap to catch the mouse in the missing mount
> case(s), the unhealthy mount case(s), and the absent mount case(s).
>
> So the working theory there is that when the check fails for either
> condition (not mounted, or mounted but unhealthy to the point where it
> appears unmounted) we try to mount.  The idea is that the mount may not
> succeed in the unhealthy case, but it will succeed in the not-yet-mounted
> case, and it will fail as expected in the absent case.  We could create
> two 'check'-like case statements, one for health and one for mount
> status, but historically trying to handle them separately ended up with
> far more edge cases than considering them together.  So I think the
> remount question is answered with: check has two conditions that can
> fail, and the declarative Puppet idiom is to try to mount whenever it
> comes back failing, no matter which condition failed, because mount
> itself is a safe operation.  Safe in the sense that it will fail sanely
> if something happens to be mounted at the time it tries (and that's ok,
> because to get there it already failed to show up as healthy and we are
> just carrying that failure forward).  That is probably more opinion on
> how or whether things should surface than anything else.  This is all a
> huge mess, and I think most of nfs_mount.pp should be rewritten as a
> custom function for our own sanity and future debugging.  Madhu and I
> talked about this previously, but the varied conditions to fail safely
> on, along with the recovery paths, take ages to run through the playbook
> test conditions and there hasn't been time.
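>
> To make that idiom concrete, here is a minimal sketch of the check/mount
> Exec described above.  The resource title, the subcommand wiring, and the
> $mount_path variable are placeholders for illustration, not the actual
> definitions in nfs_mount.pp:
>
>     # Hedged sketch only: names and paths are illustrative.  If 'check'
>     # fails (not mounted, or mounted but unhealthy), Puppet runs the
>     # mount command; mount itself fails sanely if the export is already
>     # mounted or absent.
>     exec { "nfs-mount-${name}":
>         command => "nfs-mount-manager mount ${mount_path}",
>         unless  => "nfs-mount-manager check ${mount_path}",
>         path    => ['/usr/local/sbin', '/usr/bin', '/bin'],
>     }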
>
> So, things I thought of while looking at this:
>
> * rewrite nfs-mount-manager in python
> * break out nfs_mount into a custom Puppet function with more logging and
> debug trappings
> * consider breaking up nfs-mount-manager's 'check' into separate 'health'
> and 'status' (mount) subcommands (rough sketch after this list)
> * update the timeout settings in nfs-mount-manager to match
> https://gerrit.wikimedia.org/r/#/c/348788/
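>
> A rough sketch of the 'health'/'status' split (those subcommands do not
> exist today, and the resource names here are hypothetical):
>
>     # Hypothetical: separate mount-status and health guards instead of
>     # the single combined 'check'; the mount is attempted when either
>     # guard fails.
>     exec { "nfs-remount-${name}":
>         command  => "nfs-mount-manager mount ${mount_path}",
>         unless   => "nfs-mount-manager status ${mount_path} && nfs-mount-manager health ${mount_path}",
>         path     => ['/usr/local/sbin', '/usr/bin', '/bin'],
>         provider => shell,
>     }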
>
> That all said, this isn't the core problem per se, as we have another very
> vanilla "Can NFS service me?" safety check before doing some grid things,
> and that check is intermittently the failing component as well.
>
> In modules/toollabs/manifests/init.pp
>
>     exec {'ensure-grid-is-on-NFS':
>         command => '/bin/false',
>         unless  => "/usr/bin/timeout -k 5s 30s /usr/bin/test -e ${project_path}/herald",
>     }
>
> This is the failure in about half of the spot checks I've done, and
> nfs-mount-manager check is the other half.  I put this second check in a
> while ago because, in some cases of NFS unavailability, we were going
> ahead with the whole insane resource collection madness on what was then
> just the local disk, and doing grid setup things on a node that was in a
> wacky state.  But anyway, this failing is basically dead simple: "I can't
> see a thing on NFS".
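>
> For context on how that guard works: if the unless test cannot see
> ${project_path}/herald within the timeout, Puppet runs /bin/false, the
> Exec fails, and any resource that requires it is skipped with a
> failed-dependency message.  A hedged sketch of such a dependent resource
> (the gridengine-exec package is only an illustrative stand-in for the
> real grid setup resources):
>
>     # Hypothetical dependent resource: skipped when the NFS guard fails.
>     package { 'gridengine-exec':
>         ensure  => present,
>         require => Exec['ensure-grid-is-on-NFS'],
>     }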
>
> I caught the tool in this comment today and it /definitely/ was causing
> these failures, although that doesn't mean it's the only thing doing it.
>
> https://phabricator.wikimedia.org/T161898#3197517
>


-- 
Chase Pettet
chasemp on phabricator <https://phabricator.wikimedia.org/p/chasemp/> and
IRC

