Another important point occurs to me:
Article sizes on Wikipedia vary dramatically but average 1400 bytes.
Entries in the old table (the updates) are likely to average even
smaller.
By default, ext2/ext3 filesystem blocks tend to be 4k in size. I haven't
yet investigated how big cached disk pages are in the Linux kernel -
whether they mirror the size of filesystem blocks, are a multiple of it,
or are fixed in number and vary in size with available memory, etc.
If the cache page size can be tuned to the data atom size (which I feel
it can, even if this needs a header value change and a recompile) - that
is, tuned to the median article size - the number of popular articles
held in a given amount of RAM could be increased several fold. This will
have a cost, perhaps not a large one, in scheduler workload and, if the
pages are very small, in the memory the kernel needs to administer a
massive array of cache pages.
This solution would nevertheless be less effective than a suitable
solid-state storage solution for the busiest wikis.
Nick Hill wrote:
I suggest four avenues for investigation:
1) Store articles in the MySQL table in compressed (gzip) format. This
will reduce the size of the articles, letting more of them fit into the
available cache memory and increasing the chance of a cache hit by
almost a factor of two. Perhaps this could be implemented as a patch to
MySQL.
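To illustrate point 1 with Python's zlib (the DEFLATE algorithm gzip
uses) - the sample wikitext below is invented, but markup-heavy article
text typically compresses to half its size or better:

```python
import zlib

# Invented sample of repetitive wiki markup; real articles vary.
article = ("'''Example article''' text with the kind of repetitive "
           "[[wiki markup]] and [[wiki links]] that compresses well. ") * 20
compressed = zlib.compress(article.encode("utf-8"), level=6)
print(len(article), len(compressed))  # compressed is far smaller
```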
2) Investigate ways of prioritising data cached in memory so that
smaller chunks of data are valued more highly than larger chunks and
are not flushed under the basic least-recently-used algorithm,
reflecting the relatively high cost of reading a small chunk of data
from the HDD.
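One possible shape for point 2 (this is my own sketch of a policy, not
how MySQL's cache actually works): an LRU cache whose eviction step
prefers to discard the largest of the least-recently-used entries, on
the theory that re-reading a small row costs nearly a full disk seek
anyway, so small entries are worth keeping longer.

```python
from collections import OrderedDict

class SizeAwareLRU:
    """LRU cache that evicts large entries before small ones."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.entries = OrderedDict()  # key -> value (bytes), LRU first

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as recently used
            return self.entries[key]
        return None

    def put(self, key, value):
        if key in self.entries:
            self.used -= len(self.entries.pop(key))
        while self.used + len(value) > self.capacity and self.entries:
            # Among the least-recently-used half, evict the largest
            # entry first, so small cheap-to-keep entries survive.
            lru_half = list(self.entries)[: max(1, len(self.entries) // 2)]
            victim = max(lru_half, key=lambda k: len(self.entries[k]))
            self.used -= len(self.entries.pop(victim))
        self.entries[key] = value
        self.used += len(value)
```

For example, with a 100-byte cache holding a 60-byte and a 10-byte
entry, inserting a 50-byte entry evicts the 60-byte one and keeps the
small entry, where plain LRU would have evicted both.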
3) If the SQL code underlying Wikipedia relies on temporary tables as
part of its queries, investigate whether the I/O of writing temporary
tables tends to flush data from the disk cache. If so, write temporary
tables to a ramdisk or other storage which does not cause flushing.
More recent versions of MySQL support subqueries, which may obviate
the need for temporary tables.
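A sketch of the subquery idea in point 3, using SQLite as a convenient
stand-in for MySQL (the table and column names are invented for the
example): instead of materialising a temporary table of recently-edited
page ids and joining against it, a single subquery keeps the work
inside one statement, so no temporary table ever hits the disk.

```python
import sqlite3

# In-memory database standing in for the wiki's MySQL server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cur (cur_id INTEGER PRIMARY KEY, cur_title TEXT);
    CREATE TABLE recentchanges (rc_cur_id INTEGER);
    INSERT INTO cur VALUES (1, 'Foo'), (2, 'Bar'), (3, 'Baz');
    INSERT INTO recentchanges VALUES (1), (3);
""")
# One statement with a subquery, rather than a temp table plus a join.
rows = conn.execute(
    "SELECT cur_title FROM cur "
    "WHERE cur_id IN (SELECT rc_cur_id FROM recentchanges)"
).fetchall()
print(rows)  # [('Foo',), ('Baz',)]
```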
4) Judicious use of solid-state storage. This could dramatically reduce
seek times and the I/O bottleneck. There are issues to resolve
regarding flash memory durability, possible MySQL hotspots, and the
cost of mass solid-state storage, but it might be worthwhile for some
wiki data.
_______________________________________________
Wikitech-l mailing list
Wikitech-l@Wikipedia.org
http://mail.wikipedia.org/mailman/listinfo/wikitech-l