On Mon, Nov 17, 2003 at 06:06:00PM +0100, Vincent Ramos wrote:
> Cons:
> * any text containing non-ASCII characters would increase
> its weight: instead of one byte for a single <c with cedilla>, it
> would require two; French uses lots of non-ASCII characters, such as
> é è ç à ù;
It's not obvious that it's going to be bigger at all - you need only 2-3 bytes
instead of about 8 (&#x1234;) for characters not in Latin 1.
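
The per-character arithmetic can be checked with a short sketch. Python is
my choice here, not something used in this thread, and the helper name
sizes() and the two sample characters are only illustrative:

```python
def sizes(ch):
    """Bytes needed to store one character in each representation."""
    # Numeric character reference, as it would appear in a Latin 1 page.
    entity = "&#x%X;" % ord(ch)
    try:
        latin1 = len(ch.encode("latin-1"))
    except UnicodeEncodeError:
        latin1 = None  # not directly representable in Latin 1
    return {
        "latin1": latin1,
        "utf8": len(ch.encode("utf-8")),
        "entity": len(entity.encode("ascii")),
    }

print(sizes("\u00e7"))  # c with cedilla, inside Latin 1
print(sizes("\u1234"))  # a character outside Latin 1
```

So a character inside Latin 1 costs one extra byte in UTF-8, while a
character outside it gets cheaper once its numeric reference is replaced by
the character itself.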
French main page (size in bytes):
  Latin 1 (as is):                 28035
  Naively converted to UTF-8:      28235 (0.7% bigger)
  With all &entities; UTF-8-ized:  27867 (0.6% smaller)
The last number is smaller than the Latin 1 one, but some of that is cheating -
the French main page contained &codes; for some characters that could have been
represented directly (&uuml;), and some of the conversions weren't completely
legal (&amp; -> &).
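
A comparison of this kind can be sketched roughly as follows. This is not
the script that produced the numbers above; the function names, the regular
expression and the sample string are my own assumptions. Unlike the
measurement above, this version leaves references to markup-significant
characters (such as &amp;) alone, so the result stays legal:

```python
import re
from html.entities import name2codepoint  # HTML 4 named entities

# Characters that must stay escaped for the markup to remain valid.
MARKUP = {"&", "<", ">", '"', "'"}
ENTITY = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")

def expand_entities(text):
    """Replace character references by the characters they stand for."""
    def repl(m):
        ref = m.group(1)
        if ref[:2] in ("#x", "#X"):
            cp = int(ref[2:], 16)
        elif ref.startswith("#"):
            cp = int(ref[1:])
        elif ref in name2codepoint:
            cp = name2codepoint[ref]
        else:
            return m.group(0)  # unknown entity name: leave untouched
        if cp == 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            return m.group(0)  # not a storable character: leave untouched
        ch = chr(cp)
        return m.group(0) if ch in MARKUP else ch
    return ENTITY.sub(repl, text)

def compare_sizes(page_latin1):
    """Byte sizes of a Latin 1 page under the three treatments."""
    text = page_latin1.decode("latin-1")
    return {
        "latin1": len(page_latin1),
        "utf8_naive": len(text.encode("utf-8")),
        "utf8_expanded": len(expand_entities(text).encode("utf-8")),
    }

# Made-up sample text, not the real French main page.
sample = "Fran&ccedil;ais &amp; fran\u00e7ais: &#x1234; &uuml;".encode("latin-1")
print(compare_sizes(sample))
```

The more references a page contains, the more the third figure drops below
the first; a page with few references and many raw 128-255 characters goes
the other way.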
Still, there's no reason to believe UTF-8 is going to be significantly worse
space-wise than ISO-8859-1. For the English Wikipedia it's obviously going to be
smaller, as it uses even fewer characters from the 128-255 range.