[Foundation-l] State of technology: 2007

Domas Mituzas midom.lists at gmail.com
Fri Jan 4 02:07:04 UTC 2008


Hello colleagues and shareholders (community :)!

It has been a while since my last review of operations (aka hosting
report), so I will try to give an overview of some of the things we've
been doing =)
First of all, I'd like to thank Mr. Moore for his fabulous law. It
allowed Wikipedia to stay alive - even though we had to grow again in
all directions.

We still have Septembers. It is a nice name for the recurring pattern
that delivers Shock and Awe to us - after a period of stable usage,
every autumn the number of users suddenly jumps up and stays there,
just as we were starting to think we had finally reached some
saturation point and would never grow any more. Until next September.

We still have World Events. People rush to us to read about conflicts
and tragedies, joys and celebrations - sometimes because we have had
the information for ages, sometimes because it all matured in seconds
or minutes. Nowhere else can a document require that much concurrent
collaboration, and nowhere else can it provide as much value
immediately.

We still have history. From day one of the project, we can see people
getting into dramas, discussing, evolving and revolving every idea on
the site. Every edit stays there, accumulating not only the final
pieces of information but the whole process of assembling the content.

We still advance. Tools to facilitate the community get more complex,
and we are growing an ecosystem of tools and processes inside and
outside the core software and platform. Users are the actual
developers of the project; the core technology just lags behind,
assisting.

Our operation becomes more and more demanding - and that's quite a bit
of work to handle.

Ok, enough of the poetic introduction :)

== Growth ==

Over the second half of 2006, traffic and requests to our cluster
doubled (actually, that happened in just a few months).
Over 2007, traffic and requests to our cluster doubled again.

Pics:
	http://www.nedworks.org/~mark/reqstats/trafficstats-yearly.png
	http://www.nedworks.org/~mark/reqstats/reqstats-yearly.png

== Hardware expansion ==

Back in September 2006 we had quite a huge load increase, and we went
for a capacity expansion, which included:
* 20 new Squid servers ($66k)
* 2 storage servers ($24k)
* 60 application servers ($232k)

The German chapter additionally assisted with the purchase of 15 Squid
servers in November for the Amsterdam facility.

Later, in January 2007, we added 6 more database servers (for $39k),
three additional application servers for auxiliary tasks (such as
mail), and some network and datacenter gear.

The growth over autumn/winter led to quite a big ($240k) capacity
expansion back in March, which included:
* 36 very capable 8-core application servers (thank you, Moore, yet
again :) - that was around $120k
* 20 Squid servers for the Tampa facility
* A router for the Amsterdam facility
* Additional networking gear (switches, linecards, etc.) for Tampa

The only serious capacity increase afterwards was another 'German'
batch (thanks yet again, Verein) of 15 Squid servers for Amsterdam in
December 2007.

We do plan to improve our database and storage servers soon - that
would add to the stability of our dump building and processing, and
provide better support for various batch jobs.

We have been especially pushy about exploiting warranties on all
servers, and nearly all machines ever purchased are in working state,
handling one kind of workload or another. All the veterans of 2005 are
still running at amazing speeds, doing the important jobs :)
Rob joining to help us with datacenter operations has given us really
nice turnaround on pretty much every piece of datacenter work, as
volunteer remote hands were no longer available during critical
moments. Oh, and look how tidy the cabling is:
http://flickr.com/photos/midom/2134991985/ !

== Networking ==

This has been mainly in Mark's and River's capable hands, as we
underwent the transition from being a hosting customer to being an
Internet service provider (or at least an equal peer to ISPs)
ourselves. We have our own independent autonomous systems both in
Europe and in the US, allowing us to pick the best available
connectivity options, resolve routing glitches, and get free traffic
peering at Internet exchanges. That provides quite a lot of
flexibility - at the cost, of course, of more work and skills
required.

This is also part of an overall strategy of running a few powerful,
well-managed datacenters. Instead of low-efficiency small datacenters
scattered around the world, a core facility like the one in Amsterdam
provides high availability, close proximity to major Internet hubs
and carriers, and sits generally at the center of the region's
inter-tubes. Though it would be possible to reach out to multiple
donated hosting places, that would just lead to slower service for
our users, and someone would still have to pay for the bandwidth. As
we are pushing nearly 4 Gbps of traffic, there aren't many donors who
wouldn't feel such traffic.
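
For a rough sense of scale, here is a back-of-envelope calculation in
Python (assuming, purely for simplicity, that 4 Gbps were sustained
around the clock, although it is closer to a peak figure):

    # What sustained 4 Gbps would amount to over a day and a month.
    gbps = 4
    bytes_per_second = gbps * 1e9 / 8            # 500 MB/s
    per_day_tb = bytes_per_second * 86400 / 1e12
    per_month_pb = per_day_tb * 30 / 1000
    print(f"{per_day_tb:.0f} TB/day, ~{per_month_pb:.1f} PB/month")
    # -> 43 TB/day, ~1.3 PB/month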

== Software ==

There has been a lot of overall engineering effort, often behind the
scenes. Various bits had to be rewritten to cope properly with user
activity. The most prominent example of such work is Tim's rewrite of
the parser to handle huge template hierarchies more efficiently. In
the perfect case, users will not see any visible change, except
performance that is several times faster on expensive operations.
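
To give a flavour of why huge template hierarchies are expensive, here
is a toy sketch in Python of one general trick - reusing
already-expanded templates instead of re-expanding them on every
inclusion. The mini-syntax and names are invented, and this is not how
the real MediaWiki preprocessor is implemented:

    import re

    # Invented mini "wikitext": {{name}} includes the template "name".
    TEMPLATES = {
        "infobox": "{{header}} {{row}} {{row}} {{row}} {{footer}}",
        "header":  "== box ==",
        "row":     "| {{cell}} {{cell}} |",
        "cell":    "x",
        "footer":  "----",
    }

    def expand_naive(text, depth=0):
        # Re-expands every inclusion from scratch; the cost grows with
        # the total number of inclusions in the whole hierarchy.
        if depth > 40:                    # crude expansion-depth limit
            return text
        return re.sub(r"\{\{(\w+)\}\}",
                      lambda m: expand_naive(TEMPLATES[m.group(1)], depth + 1),
                      text)

    _expanded = {}

    def expand_cached(text, depth=0):
        # Expands each distinct template once and reuses the result, so
        # repeated inclusions of the same template are nearly free.
        if depth > 40:
            return text
        def repl(m):
            name = m.group(1)
            if name not in _expanded:
                _expanded[name] = expand_cached(TEMPLATES[name], depth + 1)
            return _expanded[name]
        return re.sub(r"\{\{(\w+)\}\}", repl, text)

    print(expand_naive("{{infobox}}"))   # same result either way,
    print(expand_cached("{{infobox}}"))  # but far fewer expansions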

In the past year, a lot of activity - the way people use customized
software such as bots, JavaScript extensions, etc. - has changed our
performance profile, and nowadays much of the performance work on the
backend goes into handling various fresh activities - and anomalies.
One of the core activities was polishing the caching of our content,
so that our application layer could concentrate on the most important
process - collaboration - instead of content delivery.
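
The general pattern looks roughly like the minimal Python sketch
below; the class and names are invented for illustration, and the
production setup actually uses Squid caches in front of the
application servers, with far more machinery involved:

    # Minimal sketch of edit-triggered cache invalidation in front of
    # an application layer.  Purely illustrative, not production code.
    class CachingFrontend:
        def __init__(self, render_backend):
            self.render = render_backend   # expensive: parse and build page
            self.cache = {}                # url -> rendered HTML

        def get(self, url, logged_in=False):
            if logged_in:                  # personalized view, bypass cache
                return self.render(url)
            if url not in self.cache:      # miss: ask the application layer
                self.cache[url] = self.render(url)
            return self.cache[url]         # hit: backend never sees it

        def purge(self, url):
            # Called when a page is edited, so readers see the new revision.
            self.cache.pop(url, None)

    frontend = CachingFrontend(lambda url: "<html>rendered %s</html>" % url)
    frontend.get("/wiki/Example")          # rendered by the backend
    frontend.get("/wiki/Example")          # served from the cache
    frontend.purge("/wiki/Example")        # an edit happened
    frontend.get("/wiki/Example")          # re-rendered once, cached again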

Lots and lots of small things have been added or fixed, though some
developments were quite demanding - like multimedia integration,
which was challenging due to our freedom requirements.

Still, there was constant tradeoff management, as not every feature
was worth the performance sacrifice and costs; on the other hand,
having the best possible software for collaboration is also
important :) Introducing new features, or migrating them from outside
into the core platform, has always been a serious engineering effort.
Besides, there is quite a lot of communication involved - explaining
how things have to be built so they don't collapse on the live site,
discussing security implications, changes in usage patterns, ...

Of course, MediaWiki is still one of the most actively developed
pieces of web software - and here Brion and Tim lead the volunteers,
as well as spend their days and nights in the code.

Across the overall stack, we have worked at every layer - tuning
kernels for our high-performance networking, experimenting with
database software (some servers are running our own fork of MySQL,
based on Google's changes), perfecting Squid, our web caching
software (Mark and Tim ended up on the authors list), and digging
into problems and peculiarities of the PHP engine. Quite a lot of the
problems we hit are very specific to huge sites, and even when other
huge shops hit them, we're the ones who are always free to release
our changes and fixes. Still, colleagues from other shops are willing
to assist us too :)

There were lots of tiny architecture tweaks that allowed us to use
resources more efficiently, but none of them were major - pure
engineering all the time. It seems that lately we have stabilized
lots of things in how Wikipedia works, and it all runs quite
smoothly. Of course, one must mention Jens' keen eye, taking care of
various especially important but easily overlooked things.

River has dedicated a lot of attention to supporting the community
tools infrastructure on the Toolserver, and also to maintaining
off-site copies of the projects.

The site doesn't fall down the very minute nobody is looking at it,
and that is quite an improvement over the years :)

== Notes ==

People have been discussing whether running a popular site is really
part of the WMF's mission. Well, users created a magnificent
resource, we try to support it, and we do what we can. Thanks to
everyone involved - though it has been a far less stressful ride than
in previous years, still, nice work. ;-)

== More reading ==

May hurt your eyes: https://wikitech.leuksman.com/view/Server_admin_log
Platform description: http://dammit.lt/uc/workbook2007.pdf

== Disclaimer ==

Some numbers may be wrong, as this review was based not on an audit
but on vague memories :)

-- 
Domas Mituzas -- http://dammit.lt/ -- [[user:midom]]




