(I just posted the following to the tech blog,
http://techblog.wikimedia.org)
Last Monday, the Solaris server that stores all image thumbnails began to have problems. It
ran out of memory, became too slow, and eventually started to crash. (For the
technically inclined: we think the kernel is leaking some file system structure in kernel
memory.) As a result, thumbnails went missing across Wikimedia projects.
We addressed these problems in the following ways:
* We decreased the load on this server by adjusting the Squid configuration so that it
has to handle fewer requests.
* We ordered more memory to double the total physical memory in the relevant
systems.
* We set up two new Linux servers that will eventually replace the Solaris server.
At first, adding these Linux servers in a partially caching setup seemed enough to fix
the immediate problem while we gradually copied all thumbnail files over, which would
eventually allow us to replace the Solaris server completely.
However, on Saturday night the Solaris server started crashing repeatedly, making it
necessary to engage the image scalers to regenerate a large part of the missing
thumbnails. As a result, loading and generating new (uncached) thumbnails is currently
slower than usual.
Fortunately, most users have not experienced serious problems while using the site, since
most thumbnails are cached by our HTTP caching layer. We cannot determine exactly how
long it will take for service to return to normal speed, but we expect this to take no
more than a few days.
Over the past months we have been developing a new and more scalable architecture for
media storage, which will solve these problems once and for all. We hope to deploy this
new architecture within a few months, making use of the new data center as well. Please watch the
Tech Blog for updates on this project.
--
Mark Bergsma <mark(a)wikimedia.org>
Operations Engineering Program Manager
Wikimedia Foundation