Gabriel Wicke wrote:
> On Tue, 13 Jan 2004 00:07:50 +0000, Nick Hill wrote:
>> The most commonly used pages are going to be in the memory of the
>> database server, so these are not costly to serve. The costly pages
>> to serve are those which need disk seeks; the more I/O seek
>> operations a page requires, the more costly it is to serve.
> Yup. So let's avoid them.

Given that popular articles will be in the database memory cache,
requests for popular articles should not lead to database HDD seeking.
I would expect a Squid proxy to be best at serving popular pages and
poor at serving less popular ones, so I can't see how Squid is very
helpful at saving HDD seeks.

>> The proxy server will need to make a database lookup (for the URL)
> Nope. Only if a page is *not* in the cache or marked as not cacheable.

I meant that the Squid server will need to look up its own database (in
whatever form that may be, filesystem or indexed DBMS) to check whether
it has a copy of the required data, using the URL as a key. If it has a
copy, it will need to pull it either out of memory or from the disk. If
not, it forwards the request and then adds the page to its own
database.
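
In rough Python, the flow I mean is something like this (an
illustrative sketch only, not Squid's actual code or data structures):

    cache = {}  # URL -> page body; stands in for Squid's memory/disk store

    def serve(url, fetch_from_origin):
        if url in cache:
            # Hit: this can still cost a disk read on the proxy if the
            # object has been pushed out of memory onto Squid's disk store.
            return cache[url]
        body = fetch_from_origin(url)  # miss: forward to the web server
        cache[url] = body              # then add the page to the cache
        return body
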
If the Squid server needs to pull an article page off the disk, then
disk I/O is required in the same way as when the web server reads an
uncached piece of data from the database.
As I/O is the bottleneck, the Squid server is likely to suffer the same
problems as the underlying database server. These problems are likely
to be bigger, as the fine-grained chunks of data handled by Squid
(compressed, fully formed HTML pages) will be larger than the
fine-grained chunks (article text) handled by the database server.

>> If performance is the criterion, I suggest a proxy isn't a good idea.
> Well- please read up some docs. Or benchmark http://www.aulinx.de/ -
> commodity server (Celeron 2GHz) running Squid.

I am not contending that Squid is anything other than a very
high-performance server. I believe it is, and that it can substantially
reduce the bandwidth ISPs need to serve web surfers.
The issue for Wikipedia is how many disk accesses, in total, are needed
for each article hit.
Wikipedia has millions of discrete pieces of data, most of them
referenced by a unique URL. Squid will not be able to hold a
substantial proportion of these in memory. In a given amount of memory,
Squid will hold fewer of these data chunks (articles as HTML pages)
than a database server could, as the article text stored in the
database is smaller than the same article rendered in HTML. For the
larger articles, the compressed HTML page will be smaller, but for most
articles, the compressed HTML page will be bigger. (The relative weight
of the page HTML is much greater for short articles than for long
articles. Compression reduces page size by about half.)
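
A quick way to check these figures (a sketch in Python; the file names
are placeholders for real samples, not actual Wikipedia data):

    import zlib

    # Compare an article's source text with its rendered HTML page,
    # raw and compressed.
    for name in ("article.wikitext", "article.html"):
        data = open(name, "rb").read()
        print(name, len(data), "bytes raw,",
              len(zlib.compress(data)), "bytes compressed")
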
I assume viewing an article history page requires several pieces of
information, leading to multiple seeks per request. If Squid were able
to serve article histories, then a single I/O on the Squid box could
save several database seeks on the database server, a substantial
economy. However, individual page histories are each viewed fairly
rarely, and forcing a Squid cache reload of the history page every time
an article is updated may be a poor use of resources.

I suggest four avenues for investigation:
1) Store articles in the MySQL table in compressed (gzip) format. This
will reduce the size of the articles, letting them fit more easily into
the available cache memory and almost doubling the chance of a cache
hit. Perhaps this can be made as a patch to MySQL, or done in the
application (see the first sketch after this list).
2) Investigate ways of prioritising data cached in memory such that
smaller chunks of data have a higher value than larger chunks, so that
smaller chunks are not flushed by the basic least-recently-used
algorithm. This would reflect the relative cost of reading a small
chunk of data from the HDD (see the second sketch after this list).
3) If the SQL code underlying Wikipedia relies on temporary tables as
part of its queries, investigate whether writing those tables tends to
flush useful data from the disk cache. If so, write temporary tables to
a ramdisk or other storage which does not cause flushing. More recent
versions of MySQL support subqueries, which may obviate the need for
temporary tables altogether (see the third sketch after this list).
4) Judicious use of solid-state storage. This could dramatically reduce
seek times and the I/O bottleneck. There are issues to resolve
regarding flash memory durability and possible MySQL hotspots, as well
as the cost of mass solid-state storage, but it might be worthwhile for
some wiki data.
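
First sketch, for suggestion 1: rather than patching MySQL, the
compression could be done in the application around the queries. This
assumes the MySQL-python (MySQLdb) module and an invented
articles(title, body BLOB) table - not Wikipedia's actual schema:

    import zlib
    import MySQLdb  # MySQL-python, assumed installed

    db = MySQLdb.connect(db="wikidb", user="wiki", passwd="secret")
    cur = db.cursor()

    def save_article(title, text):
        # gzip/zlib roughly halves typical article text; the body
        # column must be a BLOB to hold the binary result.
        cur.execute("REPLACE INTO articles (title, body) VALUES (%s, %s)",
                    (title, zlib.compress(text.encode("utf-8"))))

    def load_article(title):
        cur.execute("SELECT body FROM articles WHERE title = %s", (title,))
        return zlib.decompress(cur.fetchone()[0]).decode("utf-8")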
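
Second sketch, for suggestion 2: a toy cache whose eviction penalises
large objects, so that many small chunks outlive one big one. The
staleness-times-size score is an assumption for illustration, not a
tuned policy:

    import time

    class CostAwareCache:
        def __init__(self, capacity_bytes):
            self.capacity = capacity_bytes
            self.items = {}  # key -> (data, time last used)
            self.used = 0

        def get(self, key):
            data, _ = self.items[key]
            self.items[key] = (data, time.time())  # refresh recency
            return data

        def put(self, key, data):
            if key in self.items:  # replacing: drop the old size first
                self.used -= len(self.items[key][0])
            self.items[key] = (data, time.time())
            self.used += len(data)
            while self.used > self.capacity:
                self._evict_one()

        def _evict_one(self):
            # Plain LRU would evict the stalest item regardless of size;
            # weighting staleness by size evicts large stale objects
            # first, so small, cheap-to-reload items survive longer.
            now = time.time()
            victim = max(self.items, key=lambda k:
                         (now - self.items[k][1]) * len(self.items[k][0]))
            self.used -= len(self.items.pop(victim)[0])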
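
Third sketch, for suggestion 3: the kind of rewrite subqueries allow.
The table and column names here are invented, not Wikipedia's schema,
and subquery support needs MySQL 4.1 or later:

    # Two statements plus a temporary table written to disk...
    step1 = ("CREATE TEMPORARY TABLE hot "
             "SELECT page_id FROM views WHERE hits > 1000")
    step2 = ("SELECT title FROM pages, hot "
             "WHERE pages.page_id = hot.page_id")

    # ...versus a single pass with a subquery:
    combined = ("SELECT title FROM pages WHERE page_id IN "
                "(SELECT page_id FROM views WHERE hits > 1000)")

    # Failing that, pointing MySQL's tmpdir variable at a ramdisk keeps
    # explicit and implicit temporary tables off the data disk.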