[Labs-l] [Labs-announce] Possible reboots and/or outages -- please read

Andrew Bogott abogott at wikimedia.org
Fri May 20 16:03:37 UTC 2016


On 5/20/16 10:45 AM, Maximilian Doerr wrote:
> Does this unusual behavior also cause erroneous fingerprints to be given during authentication?  Yesterday, when I was SSHing into Cyberbot-exec-01 via the bastion, I got a fingerprint mismatch with the exec node.  This was using WinSCP.
That's almost certainly not related.  I'm not sure why that would 
happen, although there are a few possible failure cases in our network 
setup where traffic gets routed not to your VM but directly to the 
nova-network host.  In that case you'd be seeing the wrong key because 
it's the key for labnet1002 instead of for your VM.  The timing doesn't 
quite line up, but there was a brief network outage a couple of days ago 
that could have caused that.

Do let me know if you see this issue repeatedly -- in particular, if you 
can open a Phabricator ticket that includes both host keys (the 
erroneous one and the correct one), that will help us diagnose things.
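
For anyone who wants to compare the two host keys before filing a ticket, here is a minimal Python sketch that turns an OpenSSH public-key line into the MD5 and SHA256 fingerprint formats that SSH clients commonly display.  It assumes you have each key as a standard public-key line (for example from a known_hosts entry or a host's ssh_host_*_key.pub file); the function names are illustrative and not part of any Labs tooling, and hashed or unusual known_hosts entries may need extra handling.

```python
import base64
import hashlib


def key_blob(public_key_line: str) -> bytes:
    """Return the decoded key blob from an OpenSSH public-key line.

    Accepts "ssh-rsa AAAA... comment" as well as known_hosts-style lines
    ("host ssh-rsa AAAA..."): the blob is the field after the key type.
    Assumption: the hostname field itself does not start with "ssh-" or
    "ecdsa-".
    """
    fields = public_key_line.split()
    for i, field in enumerate(fields[:-1]):
        if field.startswith(("ssh-", "ecdsa-")):
            return base64.b64decode(fields[i + 1])
    raise ValueError("no key type field found")


def fingerprint_md5(blob: bytes) -> str:
    """Colon-separated hex MD5 of the key blob (the older display format)."""
    digest = hashlib.md5(blob).hexdigest()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


def fingerprint_sha256(blob: bytes) -> str:
    """Unpadded base64 SHA256 of the key blob, prefixed with "SHA256:"."""
    digest = hashlib.sha256(blob).digest()
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")
```

Running both functions over the key your client rejected and the key the VM actually holds gives two short strings that are easy to paste into a ticket and compare by eye.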

-Andrew


>
> To clarify, I got into the bastion just fine but not into my node, because I aborted the authentication process.  I waited 5 minutes and tried again, and it worked just fine.
>
> Cyberpower678
> English Wikipedia Account Creation Team
> ACC Mailing List Moderator
> Global User Renamer
>
>> On May 20, 2016, at 11:10, Andrew Bogott <abogott at wikimedia.org> wrote:
>>
>> Note:  Tools users can ignore this message
>>
>>     We are seeing some unusual behavior on labvirt1003, which hosts a large number of labs instances.  The problem is not yet diagnosed, but it is likely a hardware problem that will require reboots or downtime.  Here is a complete list of labs instances currently living on labvirt1003:
>>
>> https://phabricator.wikimedia.org/P3159
>>
>>     If you have any hosts on that box that cannot survive a reboot, please either let me know, or take steps to minimize the damage.  I've removed labvirt1003 from the scheduler, so if you want to build a new instance and migrate services to it you can be assured that the new instance will be isolated from the coming chaos.
>>
>>     A simple reboot shouldn't produce more than 5-10 minutes of downtime.  If a major outage seems likely, I'll follow up with additional warning.
>>
>> -Andrew
>>
>>
>> _______________________________________________
>> Labs-announce mailing list
>> Labs-announce at lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/labs-announce
>> _______________________________________________
>> Labs-l mailing list
>> Labs-l at lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/labs-l