[Labs-l] second attempt to request alternative login server

Petr Bena benapetr at gmail.com
Wed Mar 6 16:54:46 UTC 2013


okay, this is the third time we've had the same outage... bastion2 and 3
were accessible for a short time after bastion1's gluster died, then
they died as well. Public keys weren't accessible on any of them, so
basically labs were inaccessible to everyone.

<3 passwords

anyway, I know you all hate things that work, like passwords, so here
is another idea.

Set up a cron script that syncs a local folder on the bastion with
/public/keys, so that when gluster is down or that folder isn't
working, login to the bastions still works.

On Sun, Mar 3, 2013 at 9:37 PM, Petr Bena <benapetr at gmail.com> wrote:
> YAY. it would be cool if some of them mirrored the keys from
> gluster so that logins work even when gluster is down :>
>
> On Sun, Mar 3, 2013 at 8:43 PM, Ryan Lane <rlane at wikimedia.org> wrote:
>> On Sun, Mar 3, 2013 at 7:51 AM, Petr Bena <benapetr at gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> today is the second time that bastion has been inaccessible:
>>>
>>> If you are having access problems, please see:
>>>
>>> https://wikitech.wikimedia.org/wiki/Access#Accessing_public_and_private_instances
>>> debug1: Authentications that can continue: publickey
>>> debug1: Next authentication method: publickey
>>> debug1: Offering RSA public key: /home/petanb/.ssh/id_rsa
>>> debug2: we sent a publickey packet, wait for reply
>>>
>>>
>>> if we can't have a different way to authenticate than public
>>> keys, which are often broken, can we at least have a second stable
>>> login server?
>>>
>>> BTW I assume that logins didn't work because of gluster, so it
>>> wouldn't have worked anyway, but if gluster sucks so hard, can we at
>>> least have password auth until you fix it? Bad authentication is
>>> better than authentication that doesn't work.
>>>
>>
>> Though I'm usually more than happy to blame gluster, this was not caused by
>> gluster. It was because someone OOM'd the instance.
>>
>> We've actually finally stabilized gluster to a point where we shouldn't
>> be having complete outages any more:
>>
>> https://ganglia.wikimedia.org/latest/?r=month&cs=&ce=&m=cpu_report&s=by+name&c=Glusterfs+cluster+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
>>
>> Note in the above graph that for the past week and a half the memory
>> usage has been mostly flat. There was one spot where the memory
>> ballooned, then a spot where it dropped. That last memory balloon was
>> before the changes we put in place, and the drop was where I restarted
>> the glusterd processes (which doesn't affect filesystem access).
>>
>> There are some split brain issues still around from the most recent round of
>> instability, but the SSH keys are perfectly fine. I will not enable password
>> authentication. It's incredibly insecure.
>>
>> So, to get a little more back on point, I've just created
>> bastion2.wmflabs.org and bastion3.wmflabs.org, in case the bastion instances
>> OOM again.
>>
>> - Ryan
>>


