[Labs-l] Random issues that require an OPs attention to fix

Damian Zaremba damian at damianzaremba.co.uk
Sat Oct 6 18:53:27 UTC 2012


On 06/10/2012 19:43, Ryan Lane wrote:
> On Sat, Oct 6, 2012 at 9:43 AM, Damian Zaremba
> <damian at damianzaremba.co.uk> wrote:
>> 1) DNS is broken/half working/annoying/argh
>> phoenix:~ damian$ dig wmflabs.org NS @labs-ns0.wikimedia.org
>>
>> ; <<>> DiG 9.6-ESV-R4-P3 <<>> wmflabs.org NS @labs-ns0.wikimedia.org
>> ;; global options: +cmd
>> ;; Got answer:
>> ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 17397
>> ;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
>> ;; WARNING: recursion requested but not available
>>
>> ;; QUESTION SECTION:
>> ;wmflabs.org.            IN    NS
>>
>> ;; Query time: 150 msec
>> ;; SERVER: 208.80.152.33#53(208.80.152.33)
>> ;; WHEN: Sat Oct  6 17:33:03 2012
>> ;; MSG SIZE  rcvd: 29
>>
>> phoenix:~ damian$ dig wmflabs.org NS @labs-ns1.wikimedia.org
>>
>> ; <<>> DiG 9.6-ESV-R4-P3 <<>> wmflabs.org NS @labs-ns1.wikimedia.org
>> ;; global options: +cmd
>> ;; Got answer:
>> ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 46082
>> ;; flags: qr aa rd; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0
>> ;; WARNING: recursion requested but not available
>>
>> ;; QUESTION SECTION:
>> ;wmflabs.org.            IN    NS
>>
>> ;; ANSWER SECTION:
>> wmflabs.org.        3600    IN    NS    labs-ns1.wikimedia.org.
>> wmflabs.org.        3600    IN    NS    labs-ns0.wikimedia.org.
>>
>> ;; Query time: 175 msec
>> ;; SERVER: 208.80.154.19#53(208.80.154.19)
>> ;; WHEN: Sat Oct  6 17:33:09 2012
>> ;; MSG SIZE  rcvd: 85
>>
>> Also, the SOA is wrong, as it still points to virt0:
>> phoenix:~ damian$ dig wmflabs.org SOA @labs-ns1.wikimedia.org
>>
>> ; <<>> DiG 9.6-ESV-R4-P3 <<>> wmflabs.org SOA @labs-ns1.wikimedia.org
>> ;; global options: +cmd
>> ;; Got answer:
>> ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 46569
>> ;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
>> ;; WARNING: recursion requested but not available
>>
>> ;; QUESTION SECTION:
>> ;wmflabs.org.            IN    SOA
>>
>> ;; ANSWER SECTION:
>> wmflabs.org.        3600    IN    SOA    virt0.wikimedia.org. hostmaster.wikimedia.org. 1349449000 1800 3600 86400 7200
>>
>> ;; Query time: 128 msec
>> ;; SERVER: 208.80.154.19#53(208.80.154.19)
>> ;; WHEN: Sat Oct  6 17:33:39 2012
>> ;; MSG SIZE  rcvd: 92
>>
>>
> Seems the DNS servers are only pointing at a single LDAP backend, and
> the LDAP backend went non-responsive for a little while. I added a bug
> for this:
>
> https://bugzilla.wikimedia.org/show_bug.cgi?id=40825
>
>> 2) Instance reboots tend to result in instances never coming back. Could
>> someone please fix bots-cb? (Same as sql2: the first reboot took it down,
>> the second results in 'failed'.)
>>
> This is the same issue as sql2: the instance wasn't defined in libvirt,
> likely as a result of the cold migrations off the old hardware. I'm
> going to run a script on Monday to fix this for any future reboots.
>
>> 3) Logins randomly fail due to key auth timing out (seems to be related to
>> NFS crapping out)
>>
> Due to DNS
>
>> 4) Home dirs sometimes randomly drop their mounts (also seems to be related
>> to NFS crapping out; dmesg just shows RPC timeouts)
>>
> Due to DNS
>
>> (Yes, I know it's a Saturday, but as the guy in Code Rush said: writing
>> software is different from selling real estate. When you sell real estate,
>> the people you sell to sleep at night, and when they go to sleep you have
>> to stop selling real estate. Computers never sleep.)
>>
> Meh. No problems there. If something is broken I'm going to fix it
> whether it's Saturday or not ;).
>
> - Ryan
// Forwarding reply to list

bots-cb still seems to be down; it never started responding to pings
again after you rebooted it. Trying another reboot still reports 'Failed'
with no console output. I assume it failed to boot for some reason, but I
have no other means of poking it :)
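
On the DNS side, here is a rough sketch of how the two symptoms in the dig runs quoted above could be spotted automatically. It is only an illustration, not existing Labs tooling: summarize and soa_primary are hypothetical helper names, and the sample strings are copied from the quoted dig output rather than fetched live.

```python
import re

# Matches the flags/count line that dig prints in its response header.
HEADER_RE = re.compile(
    r"flags: (?P<flags>[a-z ]+); QUERY: \d+, ANSWER: (?P<answers>\d+)"
)


def summarize(dig_output):
    """Return (is_authoritative, answer_count) for one dig response."""
    match = HEADER_RE.search(dig_output)
    if match is None:
        raise ValueError("no dig header line found")
    flags = match.group("flags").split()
    return "aa" in flags, int(match.group("answers"))


def soa_primary(soa_line):
    """Return the primary nameserver (MNAME) from a one-line SOA answer."""
    fields = soa_line.split()
    # Layout: name, TTL, class, "SOA", MNAME, RNAME, serial, ...
    return fields[fields.index("SOA") + 1].rstrip(".")


# Sample data copied from the dig output quoted above.
ns0 = ";; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0"
ns1 = ";; flags: qr aa rd; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0"
soa = ("wmflabs.org. 3600 IN SOA virt0.wikimedia.org. "
       "hostmaster.wikimedia.org. 1349449000 1800 3600 86400 7200")
listed = {"labs-ns0.wikimedia.org", "labs-ns1.wikimedia.org"}

print(summarize(ns0))              # (False, 0): not authoritative, no answer
print(summarize(ns1))              # (True, 2): authoritative, two NS records
print(soa_primary(soa) in listed)  # False: virt0 is not a listed nameserver
```

A server that is listed in the NS set but answers without the aa flag and with zero records (labs-ns0 here) is the thing to alert on, as is an SOA primary that is not in the NS set.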

Damian


