== Access patterns ==
I did some stats ages ago, showing an approximately Zipfian distribution
for page accesses. A bit of calculation shows that for small numbers of
articles, this means a small amount of cache will give a large
performance boost. However, as the number of articles increases, the
vast majority of seldom-accessed articles will start to dominate the
behavior of the system for article fetches. Thus, RAM caching will
decrease in usefulness over time as the project progresses, unless the
RAM cache is close to the size of the entire working set.
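A back-of-the-envelope sketch of this effect (pure Zipf popularity with exponent 1 is an assumption, and the cache size and article counts are made-up round numbers): a cache holding the C most popular of N articles serves a fraction H(C)/H(N) of requests, where H is the harmonic number.

```python
from math import log

GAMMA = 0.5772156649  # Euler-Mascheroni constant

def harmonic(n):
    # Exact sum for small n; ln(n) + gamma approximation for large n.
    if n < 10_000:
        return sum(1.0 / k for k in range(1, n + 1))
    return log(n) + GAMMA

def hit_rate(cache_size, total_articles):
    # Under Zipf (exponent 1) popularity, a cache holding the cache_size
    # most popular of total_articles pages serves a fraction
    # H(cache_size) / H(total_articles) of all requests.
    return harmonic(cache_size) / harmonic(total_articles)

# Same 1000-article cache as the project grows:
for total in (10_000, 100_000, 1_000_000):
    print(f"{total:>9} articles: {hit_rate(1_000, total):.0%} hit rate")
```

With these made-up numbers the hit rate falls from roughly 76% at 10,000 articles to about 52% at a million: the same cache keeps losing ground as the article count grows, which is the decline described above.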
Nick's suggestion of tuning the file system block size to the typical
article size is a good idea; it will tend to make the currently
available RAM cache more effective. I'm rather dubious about some of his
other suggestions.
Where RAM caching is really important is in the "hot" data such as
article timestamps and link tables. These have already been partially
addressed by the use of memcached, I believe. These commonly accessed
pieces of data should be small enough to keep in RAM all the time,
giving a large speedup to the system.
== Seek bound performance ==
Since disk I/O requests are effectively random, the load will be
dominated by seek and rotational latency. It will cost very nearly the
same to pick 64kbytes off the disk for an article as to get 4 bytes for
a timestamp.
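To put rough numbers on that claim (the seek, rotation, and transfer figures below are illustrative assumptions for a 7200 rpm disk, not measurements):

```python
# Illustrative figures for a 7200 rpm disk -- assumptions, not measurements.
SEEK_MS = 8.0          # average seek time
ROTATE_MS = 4.2        # rotational latency: half a revolution at 7200 rpm
TRANSFER_MB_S = 50.0   # sustained transfer rate

def random_read_ms(nbytes):
    # One random read = seek + rotational latency + transfer time.
    # nbytes / (MB/s * 1e6) gives seconds; dividing by 1000 instead gives ms.
    return SEEK_MS + ROTATE_MS + nbytes / (TRANSFER_MB_S * 1000.0)

print(f"4-byte timestamp: {random_read_ms(4):.2f} ms")
print(f"64 kB article:    {random_read_ms(64 * 1024):.2f} ms")
```

With these assumed figures the 64-kbyte read costs only about 10% more than the 4-byte one; the transfer term is noise next to seek and rotation.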
Using high-performance disks and spreading the database across many RAID
spindles should greatly increase performance.
I agree with the posters who are arguing for software RAID: it has
higher performance than hardware RAID in many cases, and again, we can
fine-tune stripe sizes and the like for our application. (Big stripe
sizes are a bad idea for random-seek loads, but give better performance
for streaming loads.) We should also consider kernel 2.6: it brings
major gains in disk I/O performance, and most of its teething troubles
do not affect server workloads.
== Not all disks are equal ==
Consider buying the disks specifically by access time statistics. In
particular, high-performance SCSI disks should greatly out-perform IDE
for random seek access patterns, even though their performance may be
roughly the same for data streaming. SCSI tagged command queueing will
further increase performance where there is concurrency on a single
spindle.
See
http://www.storagereview.com/php/benchmark/bench_sort.php for some
interesting stats:
* a Fujitsu MAS3735 has an average read access time of 5.6 ms, for a
price of $700 for 73 GB;
* a Hitachi Deskstar 7K250 has an average read access time of 12.1 ms,
for a price of $250 for 250 GB;
* a Seagate U6 has an average read access time of 20.0 ms, for a price
of ??? for 80 GB.
According to this, if performance is dominated by read access time, the
most expensive drive should have almost four times the random-read
performance of the cheapest, all else being equal.
Using price and performance figures such as those above, we should be
able to calculate the best price/performance/storage compromise for this
application.
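As a crude starting point, here is that calculation for the two drives above with known prices, approximating random-read IOPS as 1000 / access time (real throughput also depends on queueing and caching):

```python
# Figures from the StorageReview list above; the Seagate U6 is left out
# because its price is unknown.
drives = [
    ("Fujitsu MAS3735",        5.6,  700,  73),
    ("Hitachi Deskstar 7K250", 12.1, 250, 250),
]

for name, access_ms, price_usd, gb in drives:
    iops = 1000.0 / access_ms  # rough upper bound on random reads per second
    print(f"{name}: {iops:.0f} IOPS, "
          f"{iops / price_usd:.3f} IOPS/$, {gb / price_usd:.2f} GB/$")
```

By this crude measure the fast drive wins on raw IOPS per spindle, but the cheap drive wins on both IOPS per dollar and storage per dollar, so more cheap spindles may beat fewer fast ones if drive bays are not the constraint.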
== The Google strategy for article caching ==
Google seem to use a large number of RAM-based cache servers, based on
the observation that network access latency on a small network is tiny,
but disk latency is large. This does not make sense for us at present:
we don't have the resources, unless Google open-source their Google
File System.
For future expansion, it might be cheaper to buy ten commodity machines
with 4 GB of RAM each than one 40 GB enterprise-class machine, and to
spread the load across them. Although this would still be costly, the performance of
serving data directly from RAM would be very high.
-- Neil
*/
/*