Last week I added some statistics reporting to our search daemon. It keeps
1-minute rolling averages of the rate of handled requests, the rate of
discarded requests, the time it takes to service a request, and the number of
simultaneously active threads.
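The post doesn't show how the daemon computes those numbers; one common way to keep a 1-minute rolling rate is a ring of per-second buckets that are discarded as they age out. A minimal sketch (class and method names are my own, not the daemon's actual code):

```java
import java.util.ArrayDeque;

// Hypothetical sketch of a rolling request-rate counter: one bucket per
// second, buckets older than the window are evicted on each update.
class RollingRate {
    private final ArrayDeque<long[]> buckets = new ArrayDeque<>(); // {second, count}
    private final long windowSeconds;

    RollingRate(long windowSeconds) {
        this.windowSeconds = windowSeconds;
    }

    // Record one event at the given timestamp (in seconds).
    void hit(long nowSec) {
        long[] last = buckets.peekLast();
        if (last != null && last[0] == nowSec) {
            last[1]++;                              // same second: bump count
        } else {
            buckets.addLast(new long[]{nowSec, 1}); // new second: new bucket
        }
        evict(nowSec);
    }

    // Average events per second over the window ending at nowSec.
    double rate(long nowSec) {
        evict(nowSec);
        long total = 0;
        for (long[] b : buckets) total += b[1];
        return (double) total / windowSeconds;
    }

    // Drop buckets that have fallen out of the window.
    private void evict(long nowSec) {
        while (!buckets.isEmpty() && buckets.peekFirst()[0] <= nowSec - windowSeconds) {
            buckets.removeFirst();
        }
    }
}
```

The same structure works for the service-time average if each bucket also accumulates a sum of durations.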
This info is reported to Ganglia, and can be watched at, e.g.:
http://ganglia.wikimedia.org/large/?m=search_rate&r=hour&s=descendi…
http://ganglia.wikimedia.org/large/?m=search_time&r=hour&s=descendi…
With a better idea of the actual performance of the system, we've been able to
put some work into optimizing it.
First, several more old Apache boxes have been commandeered, increasing the
search cluster from 3 to 8 machines. Second, Tim has switched the load balancing
from simple round-robin plus failover to a more flexible and cleaner system
using perlbal.
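For the curious, a perlbal setup along these lines defines a pool of backends and a reverse-proxy service in front of it. This is only a rough illustration of the shape of such a config; the pool names, addresses, and ports here are made up, not our actual setup:

```
# Hypothetical perlbal config sketch: one pool of search backends
# behind a reverse-proxy service. Addresses and names are invented.
CREATE POOL search_big
  POOL search_big ADD 10.0.0.1:8123
  POOL search_big ADD 10.0.0.2:8123

CREATE SERVICE balancer_big
  SET role   = reverse_proxy
  SET listen = 0.0.0.0:8124
  SET pool   = search_big
ENABLE balancer_big
```

A second pool/service pair, managed the same way, covers the other group of machines.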
We found that the boxes with 3+ gigabytes of RAM performed significantly better
than the boxes with only 1 gig, probably because the smaller boxes could not
dedicate as much memory to caching the on-disk index files.
As of today, the cluster has been split into two groups, each separately managed
by perlbal. The four 3+-gigabyte machines handle
en.wikipedia.org and
de.wikipedia.org, our two biggest and most active wikis, and the 1-gigabyte
machines handle everything else. We'll know better during peak hours tomorrow,
but so far it looks pretty good; reported dropped connections have nearly
vanished, and average service times are below 50ms for all boxen.
Future work:
River has done some work on fancying up the search for Wikia, but we haven't yet
gotten a clear agreement on whether the company is willing to open-source it.
If they do so soon, we may adopt Wikia's code.
If not, we'll continue working on the base we've got to spiffy it up. The first
order of business is doing another round of comparisons on the base VM:
currently we're running on Mono, which was chosen originally for the combination
of being 1) open source, 2) reasonably performant, 3) not leaking memory. GCJ
was a touch faster, but leaked memory. Sun's JVM didn't leak memory, but isn't
quite open-source. Somewhere along the line, though, the Mono version sprang a
memory leak and we have to restart the daemon regularly to keep it from dying.
I'm uncertain whether this is in our code, in the Lucene port, or in Mono itself.
I'll want to check with a more current update to the C# Lucene port, and update
the Java code to test against current versions of GCJ/Classpath and Sun's JVM,
now that Lucene 2.0 is available.
Another important improvement we could make is better index updating: we
should at least be able to add new pages to the index in close to real time,
even if full rebuilds still happen only intermittently.
And if it's ready for another release, I may check out the Sphinx search engine
as well. It claims better speed and result ordering than Lucene, but when I was
first testing it out it was too much in flux with a lot of new stuff going into
the development version.
-- brion vibber (brion @ pobox.com)