On Wed, 2002-02-20 at 11:33, lcrocker(a)nupedia.com wrote:
>No! No! The text stored in the database is _always_ single-byte
>ISO-8859-1, no exceptions, even for the foreign wikis. Some of
>those ISO-8859-1 characters may spell out HTML entity references
>to Unicode characters outside the set, but the database should not
>know or care about that.
I'm sorry you feel that way, but that is in fact NOT TRUE. Please take a
look at the non-English non-ISO-8859-1 wikipedias sometime.
Hundreds of pages, with correct charset headers:
ISO-8859-2:
http://pl.wikipedia.com/
UTF-8 with a custom conversion function for certain character
sequences:
http://eo.wikipedia.com/
Stubs:
CP-1251:
http://ru.wikipedia.com/
Shift-JIS:
http://ja.wikipedia.com/
GB-2312 with a few character references thrown in:
http://zh.wikipedia.com/
Not sure which encodings, but certainly not ISO-8859-1:
http://ar.wikipedia.com/
http://he.wikipedia.com/
Now, if you honestly think that people are going to edit text that
consists *entirely* of HTML character entity references, you're
obviously not concerned about anything like "ease of use".
On top of which, the consensus seems to be to not allow &s (and thus
character entities) into page titles, which would effectively require
all page titles to be in ASCIIized roman characters. Can you imagine
this being acceptable on, say, the Chinese wiki if anyone actually used
it?
Gee, maybe someone *would* use it if they could use an appropriate
character set for their language!
>This policy might have to be changed for the Asian wikis if something
>like shift-JIS is universal enough and dealing with HTML entities
>problematic enough to make working with it difficult,
The mind boggles that you might imagine the situation to be otherwise.
>but in that case we'll still standardize on one and only one internal
>character representation for that particular wiki. For all others, that
>internal representation (and also the encoding which is served via
>HTTP) is ISO-8859-1.
Bullshit. Ask the Poles if they'd like to convert their wikipedia to
ISO-8859-1 with HTML character entities.
>If you need to "uppercase" words in titles (as our consensus on
>canonization of titles specifies), go ahead and hard-code the
>function to deal with ISO-8859-1.
Gee, that would be great if such a function would do anything at all for
anything other than ISO-8859-1 characters. But, somehow I can't quite
see a function hardcoded to deal with ISO-8859-1 being the slightest bit
useful for anything else.
-- brion vibber (brion @ pobox.com)
You Wrote:
>I've noticed that the traditional locale-based case conversion
>functions (ucfirst(), strtolower(), etc.) aren't too reliable for
>anything but English. Even when they do work, it's very dependent on
>the system configuration, and thus isn't really transparently portable.
>
>So, I've added new case conversion functions ucfirstIntl(),
>strtoupperIntl(), and strtolowerIntl() which can more or less properly
>convert cases in a system-independent manner. For single-byte character
>encodings this is very simple, based on the PHP strtr() function; just
>define strings $wikiUpperChars containing all the uppercase characters
>and $wikiLowerChars containing all the lowercase chars. (See example for
>iso-8859-1 in wikiTextEn.php)
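
To make the single-byte idea concrete, here is a minimal sketch, written in Python rather than PHP so it can be run standalone. It is not the code from wikiTextEn.php: the table contents and the function names are assumptions for illustration, and bytes.translate() stands in for PHP's strtr() with two equal-length strings.

```python
# Hypothetical stand-ins for $wikiUpperChars / $wikiLowerChars.
# ISO-8859-1 uppercase letters: A-Z plus 0xC0-0xDE, skipping 0xD7
# (the multiplication sign). Each lowercase partner sits 0x20 higher.
WIKI_UPPER_CHARS = bytes(range(0x41, 0x5B)) + bytes(
    b for b in range(0xC0, 0xDF) if b != 0xD7
)
WIKI_LOWER_CHARS = bytes(b + 0x20 for b in WIKI_UPPER_CHARS)

# Byte-for-byte translation tables, analogous to strtr($s, $from, $to).
TO_UPPER = bytes.maketrans(WIKI_LOWER_CHARS, WIKI_UPPER_CHARS)
TO_LOWER = bytes.maketrans(WIKI_UPPER_CHARS, WIKI_LOWER_CHARS)


def strtoupper_intl(text: bytes) -> bytes:
    """Uppercase ISO-8859-1 bytes without consulting the system locale."""
    return text.translate(TO_UPPER)


def strtolower_intl(text: bytes) -> bytes:
    """Lowercase ISO-8859-1 bytes without consulting the system locale."""
    return text.translate(TO_LOWER)


def ucfirst_intl(text: bytes) -> bytes:
    """Uppercase only the first byte; the rest is left untouched."""
    return text[:1].translate(TO_UPPER) + text[1:]
```

Because the tables are plain data, the result is the same on every host. Characters without a partner in the character set (such as 0xDF, the sharp s) simply pass through unchanged.
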
>
>For multibyte character sets it's a little more complex, using the same
>function in an array mode that associates byte sequences. Most multibyte
>character sets are for Asian languages which don't have a case
>distinction, so it's not likely to come up often except for those using
>UTF-8. I've included conversion arrays for UTF-8 in utf8Case.php which
>should cover just about everything, so any future 'pedias that may use
>UTF-8 need just include that (as does wikiTextEo.php).
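
The multibyte case can be sketched the same way. Again this is Python, not the contents of utf8Case.php: the table below is a tiny assumed subset (ASCII plus the Esperanto accented letters), and the helper names are hypothetical. The single-pass, longest-match-first replacement is meant to mirror how strtr() behaves when given an array of byte-sequence pairs.

```python
import re
import string

# Assumed illustrative subset; a real table would cover far more letters.
LOWER = string.ascii_lowercase + "ĉĝĥĵŝŭ"
UPPER = string.ascii_uppercase + "ĈĜĤĴŜŬ"

# Map UTF-8 byte sequences to byte sequences, one entry per letter.
UTF8_TO_UPPER = {
    lo.encode("utf-8"): up.encode("utf-8") for lo, up in zip(LOWER, UPPER)
}
UTF8_TO_LOWER = {up: lo for lo, up in UTF8_TO_UPPER.items()}


def translate_sequences(data: bytes, table: dict) -> bytes:
    """Replace every byte sequence found in `table` in one pass.

    Longer keys are tried first and replaced text is never rescanned.
    """
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile(b"|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: table[m.group(0)], data)


def strtoupper_utf8(data: bytes) -> bytes:
    return translate_sequences(data, UTF8_TO_UPPER)


def strtolower_utf8(data: bytes) -> bytes:
    return translate_sequences(data, UTF8_TO_LOWER)
```

Matching on raw bytes is safe for valid UTF-8 input because the encoding of one character never appears inside the encoding of another. Bytes not listed in the table are passed through untouched.
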
>
>Also, it should be possible to extend ucfirstIntl() a bit to allow for
>multiple-character first letter sequences (for instance treating ij -> IJ
>as one letter, which I believe is the officially correct behavior for
>Dutch).
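
One way the multiple-character extension might look, sketched in Python under the assumption that a per-language table of leading digraphs is available. The table and the function name are hypothetical, and for brevity this version works on decoded text rather than raw bytes.

```python
# Hypothetical per-language table of leading sequences that are
# capitalized as a unit. Only the Dutch ij mentioned above is listed.
DUTCH_DIGRAPHS = {"ij": "IJ"}


def ucfirst_with_digraphs(text: str, digraphs: dict = DUTCH_DIGRAPHS) -> str:
    """Capitalize the first letter, treating listed digraphs as one letter."""
    for lower, upper in digraphs.items():
        if text.startswith(lower):
            return upper + text[len(lower):]
    # Fall back to capitalizing a single leading character.
    return text[:1].upper() + text[1:]


print(ucfirst_with_digraphs("ijsselmeer"))  # IJsselmeer
print(ucfirst_with_digraphs("amsterdam"))   # Amsterdam
```

Keeping the digraph list as data means a wiki for another language could pass an empty table and get plain first-letter capitalization.
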
>
>-- brion vibber (brion @ pobox.com)
>
>_______________________________________________
>Wikitech-l mailing list
>Wikitech-l(a)ross.bomis.com
>http://ross.bomis.com/mailman/listinfo/wikitech-l