Gabriel Wicke wrote:
> On Tue, 13 Jan 2004 00:07:50 +0000, Nick Hill wrote:
>> The most commonly used pages are going to be in the memory of the
>> database server, so these are not costly to serve. The costly pages
>> to serve are those which need disk seeks; the more I/O seek
>> operations a page requires, the more costly it is to serve.
> Yup. So let's avoid them.

Given that popular articles will be in the database memory cache,
requests for popular articles should not lead to database HDD seeking.
I would expect a Squid proxy to be best at serving popular pages and
poor at serving less popular ones, so I can't see how Squid is very
helpful at saving HDD seeks.

>> The proxy server will need to make a database lookup (for the URL)
> Nope. Only if a page is *not* in the cache or marked as not cacheable.

I meant that the Squid server will need to look up its own database (in
whatever form that may be, filesystem or indexed DBMS) to check whether
it has a copy of the required data, using the URL as a key. If it has a
copy, it will need to pull it either out of memory or from the disk. If
not, it forwards the request and then adds the page to its own
database.
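
In rough Python, the flow I mean is something like this (an
illustrative sketch only, not Squid's actual code or data structures):

    cache = {}  # URL -> page body; stands in for Squid's memory/disk store

    def serve(url, fetch_from_origin):
        if url in cache:
            # Hit: this can still cost a disk read on the proxy if the
            # object has been pushed out of memory onto Squid's disk store.
            return cache[url]
        body = fetch_from_origin(url)  # miss: forward to the web server
        cache[url] = body              # then add the page to the cache
        return body
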
If the Squid server needs to pull an article page off the disk, then
disk I/O is required in the same way as when the web server reads an
uncached piece of data from the database.
As I/O is the bottleneck, the Squid server is likely to suffer the same
problems as the underlying database server. These problems are likely
to be bigger, as the fine-grained chunks of data handled by Squid
(compressed, fully formed HTML pages) will be larger than the
fine-grained chunks (article text) handled by the database server.

>> If performance is the criterion, I suggest a proxy isn't a good idea.
> Well- please read up some docs. Or benchmark http://www.aulinx.de/ -
> commodity server (Celeron 2GHz) running Squid.

I am not contending that Squid is anything other than a very
high-performance server. I believe it is, and that it can substantially
reduce the bandwidth ISPs need to serve web surfers.
The issue for Wikipedia is how many disk accesses, in total, are needed
for each article hit.
Wikipedia has millions of discrete pieces of data, most of them
referenced by a unique URL. Squid will not be able to hold a
substantial proportion of these in memory. In a given amount of memory,
Squid will hold fewer of these data chunks (articles as HTML pages)
than a database server could, as the article text stored in the
database is smaller than the same article rendered in HTML. For the
larger articles, the compressed HTML page will be smaller, but for most
articles, the compressed HTML page will be bigger. (The relative weight
of the page HTML is much greater for short articles than for long
articles. Compression reduces page size by about half.)
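
A quick way to check these figures (a sketch in Python; the file names
are placeholders for real samples, not actual Wikipedia data):

    import zlib

    # Compare an article's source text with its rendered HTML page,
    # raw and compressed.
    for name in ("article.wikitext", "article.html"):
        data = open(name, "rb").read()
        print(name, len(data), "bytes raw,",
              len(zlib.compress(data)), "bytes compressed")
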
I assume viewing an article history page requires several pieces of
information, leading to multiple seeks per request. If Squid were able
to serve article histories, then a single I/O on the Squid box could
save several database seeks on the database server, a substantial
economy. However, individual page histories are each viewed fairly
rarely, and forcing a Squid cache reload of the history page every time
an article is updated may be a poor use of resources.

I suggest four avenues for investigation:
1) Store articles in the MySQL table in compressed (gzip) format. This
will reduce the size of the articles, letting them fit more easily into
the available cache memory and almost doubling the chance of a cache
hit. Perhaps this can be made as a patch to MySQL, or done in the
application (see the first sketch after this list).
2) Investigate ways of prioritising data cached in memory such that
smaller chunks of data have a higher value than larger chunks, so that
smaller chunks are not flushed by the basic least-recently-used
algorithm. This would reflect the relative cost of reading a small
chunk of data from the HDD (see the second sketch after this list).
3) If the SQL code underlying Wikipedia relies on temporary tables as
part of its queries, investigate whether writing those tables tends to
flush useful data from the disk cache. If so, write temporary tables to
a ramdisk or other storage which does not cause flushing. More recent
versions of MySQL support subqueries, which may obviate the need for
temporary tables altogether (see the third sketch after this list).
4) Judicious use of solid-state storage. This could dramatically reduce
seek times and the I/O bottleneck. There are issues to resolve
regarding flash memory durability and possible MySQL hotspots, as well
as the cost of mass solid-state storage, but it might be worthwhile for
some wiki data.
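
First sketch, for suggestion 1: rather than patching MySQL, the
compression could be done in the application around the queries. This
assumes the MySQL-python (MySQLdb) module and an invented
articles(title, body BLOB) table - not Wikipedia's actual schema:

    import zlib
    import MySQLdb  # MySQL-python, assumed installed

    db = MySQLdb.connect(db="wikidb", user="wiki", passwd="secret")
    cur = db.cursor()

    def save_article(title, text):
        # gzip/zlib roughly halves typical article text; the body
        # column must be a BLOB to hold the binary result.
        cur.execute("REPLACE INTO articles (title, body) VALUES (%s, %s)",
                    (title, zlib.compress(text.encode("utf-8"))))

    def load_article(title):
        cur.execute("SELECT body FROM articles WHERE title = %s", (title,))
        return zlib.decompress(cur.fetchone()[0]).decode("utf-8")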
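
Second sketch, for suggestion 2: a toy cache whose eviction penalises
large objects, so that many small chunks outlive one big one. The
staleness-times-size score is an assumption for illustration, not a
tuned policy:

    import time

    class CostAwareCache:
        def __init__(self, capacity_bytes):
            self.capacity = capacity_bytes
            self.items = {}  # key -> (data, time last used)
            self.used = 0

        def get(self, key):
            data, _ = self.items[key]
            self.items[key] = (data, time.time())  # refresh recency
            return data

        def put(self, key, data):
            if key in self.items:  # replacing: drop the old size first
                self.used -= len(self.items[key][0])
            self.items[key] = (data, time.time())
            self.used += len(data)
            while self.used > self.capacity:
                self._evict_one()

        def _evict_one(self):
            # Plain LRU would evict the stalest item regardless of size;
            # weighting staleness by size evicts large stale objects
            # first, so small, cheap-to-reload items survive longer.
            now = time.time()
            victim = max(self.items, key=lambda k:
                         (now - self.items[k][1]) * len(self.items[k][0]))
            self.used -= len(self.items.pop(victim)[0])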
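
Third sketch, for suggestion 3: the kind of rewrite subqueries allow.
The table and column names here are invented, not Wikipedia's schema,
and subquery support needs MySQL 4.1 or later:

    # Two statements plus a temporary table written to disk...
    step1 = ("CREATE TEMPORARY TABLE hot "
             "SELECT page_id FROM views WHERE hits > 1000")
    step2 = ("SELECT title FROM pages, hot "
             "WHERE pages.page_id = hot.page_id")

    # ...versus a single pass with a subquery:
    combined = ("SELECT title FROM pages WHERE page_id IN "
                "(SELECT page_id FROM views WHERE hits > 1000)")

    # Failing that, pointing MySQL's tmpdir variable at a ramdisk keeps
    # explicit and implicit temporary tables off the data disk.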