First, have there been any configuration changes shortly before the problem began?
The first rule is "look for stupidity", as in an error in configuration causing
a self-DOS. Many of us have done that to ourselves, to our embarrassment.
If not, go with Tim's suggestion and also look at squid's logs. Are you getting
requests, but no full sessions (a SYN flood)?
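One rough way to check for that is to count TCP sockets by state. The sketch below is only an illustration and assumes a Linux box, where /proc/net/tcp and /proc/net/tcp6 list sockets with a hex state code in the fourth column; the function name is my own. A large and growing SYN_RECV count next to few ESTABLISHED sessions would point towards half-open connections.

```python
from collections import Counter
from pathlib import Path

# Linux TCP state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from the text of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue
        code = fields[3].upper()
        counts[TCP_STATES.get(code, code)] += 1
    return counts


if __name__ == "__main__":
    totals = Counter()
    for name in ("/proc/net/tcp", "/proc/net/tcp6"):
        path = Path(name)
        if path.exists():  # absent on non-Linux systems
            totals += count_tcp_states(path.read_text())
    for state, n in totals.most_common():
        print(f"{state:12} {n}")
```

Run it a few times during a slow period and compare against a quiet period; a single snapshot says little on its own.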
I'm on your site periodically. It has normally run smoothly since you went with
Linode.
The site is well behaved overall.
However, it is one that could easily become the target of a script kiddie.
So, do you have SYN cookies turned on?
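If you're not sure, here is a minimal sketch for checking, assuming a Linux kernel that exposes the setting at /proc/sys/net/ipv4/tcp_syncookies. The helper name and the wording of the descriptions are mine; the value 2 is only understood by more recent kernels.

```python
from pathlib import Path

# Assumed location of the sysctl on Linux.
SYNCOOKIES_PATH = Path("/proc/sys/net/ipv4/tcp_syncookies")

MEANINGS = {
    "0": "disabled",
    "1": "enabled (used when the SYN backlog overflows)",
    "2": "enabled unconditionally",
}


def describe_syncookies(raw_value):
    """Translate the raw sysctl value into a short description."""
    value = raw_value.strip()
    return MEANINGS.get(value, f"unrecognised value {value!r}")


if __name__ == "__main__":
    if SYNCOOKIES_PATH.exists():
        print("tcp_syncookies:", describe_syncookies(SYNCOOKIES_PATH.read_text()))
    else:
        print("tcp_syncookies sysctl not found (not a Linux host?)")
```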
I'm a sysadmin/netadmin, but I'm a bit colored by my information security
experience. Hence, I always have to remind myself that stupidity is the most frequent
cause of a problem, and malicious intent the least frequent.
The large number of httpd daemons can be caused by PHP hits or by SYN flooding, in a
non-squid environment or even with a creatively crafted attack. The latter is beyond
rare for anything that isn't very high profile (think Fortune 500 and government scale).
But the most common cause is a burst of intra-cranial flatulence or a case of fat fingers.
So, look again at the logs and processes during the slug convention. Look from Tim's
suggested perspective. If you can't find anything there, look closer at squid and
connection based events.
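To go with Tim's point below about processes in the "S" state, here is a rough sketch for tallying process states during a slow period. It assumes Linux, where the state letter follows the parenthesised command name in /proc/<pid>/stat, and the function names are hypothetical. On Linux only runnable ("R") and uninterruptible ("D") tasks feed the load average, so a pile of httpd processes in "S" with a low load average suggests they are waiting on something else. Note that it counts processes, not individual threads.

```python
from collections import Counter
from pathlib import Path


def parse_stat_state(stat_text):
    """Return the one-letter state from the contents of /proc/<pid>/stat."""
    # The command name sits in parentheses and may itself contain spaces
    # or ")", so split on the last ")" in the line.
    return stat_text[stat_text.rindex(")") + 1:].split()[0]


def count_states(stat_texts):
    """Count how many processes are in each state."""
    return Counter(parse_stat_state(text) for text in stat_texts)


def read_all_stats(proc=Path("/proc")):
    """Yield the stat text of every process currently visible under /proc."""
    for entry in proc.iterdir():
        if entry.name.isdigit():
            try:
                yield (entry / "stat").read_text(errors="replace")
            except OSError:
                continue  # the process exited while we were scanning


if __name__ == "__main__":
    proc = Path("/proc")
    if proc.is_dir():
        for state, n in count_states(read_all_stats(proc)).most_common():
            print(f"{state}  {n}")
    else:
        print("/proc not available (not a Linux host?)")
```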
When I worked for the US DoD, our most common DoS was self-inflicted, and that was in
an environment where DDoS, general DoS and every other form of attack were being
attempted incessantly.
Two of them were inflicted by my own humble fat fingers. :/
On Apr 21, 2013, at 11:53 PM, Tim Starling wrote:
On 21/04/13 05:29, David Gerard wrote:
So where would I start looking to work out what's going on?
If there is any kind of site issue at WMF, I usually start with
Ganglia. It does take some practise to be able to read it correctly,
but it gives you information far more quickly than just about anything
else. My notes on WMF incident response give some hints about how to
use it, as well as discussing some other tools:
https://wikitech.wikimedia.org/wiki/Incident_response
If the problem seems to be downstream of MediaWiki, then profiling is
usually the next thing to look at. Wikipedia has been using DIY
profiling to diagnose site performance issues since it was on a single
server.
* Sometimes it isn't, e.g. this afternoon when the site was running
like a slug and load average was 0.8 with nothing amiss in top.
Processes in the "S" state do not contribute to the load average,
whether or not users are waiting for them. For example, PHP may be
waiting for Lucene. Try the section in the incident response notes
under "slow backend service".
-- Tim Starling