[Labs-l] Outage report

Fri Jun 1 13:29:07 UTC 2012

Yes. All instances were rebooted. Everything should be working now.

On Fri, Jun 1, 2012 at 2:45 PM, Shujen Chang <i at blue.cat> wrote:
> all instances? is it ok now?
>
>
> On Friday, June 1, 2012, Ryan Lane wrote:
>>
>> I'm now going to reboot the instances, since it'll bring the swapping
>> down for a while.
>>
>> On Fri, Jun 1, 2012 at 12:24 PM, Ryan Lane <rlane32 at gmail.com> wrote:
>> > We're currently having a Labs outage. The NFS server became
>> > non-responsive, causing a cascading failure. I'm currently suspending
>> > instances until load comes down. Once load is under control I'll
>> > slowly resume instances. Soon, we'll be doing the following things to
>> > ensure this doesn't happen again:
>> >
>> > 1. We're moving away from GlusterFS to local storage on the virtual
>> > nodes until we find a more appropriate solution
>> > 2. We're getting rid of the labs-nfs1 instance, and will move the home
>> > directories to project storage
>> > 3. We're adding more (and better) hardware, which will reduce
>> > swapping and, in turn, I/O
>> >
>> > Sorry about the experience lately; I'm looking forward to
>> > improving the situation for us.
>> >
>> > - Ryan
>>
>> _______________________________________________
>> Labs-l mailing list
>> Labs-l at lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/labs-l
>
>
>
> --
> Sincerely,
> Shujen Chang
>
>