MySQL 5 is scheduled to come out of beta next month, and we're going to
be looking at upgrading sometime in the coming months. Among other
things we're probably going to want to start making use of the support
for Unicode collation, so we can get better sorting and perhaps use it
for case-insensitive matching.
There is, however, a compatibility issue: MySQL's Unicode support is
limited to the 16-bit character range (the Basic Multilingual Plane, or
BMP), for both the ucs2 and utf8 storage modes.
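To make the limit concrete, here is a minimal Python sketch of how one might test whether a title or username stays within the BMP. The function names (non_bmp_chars, fits_in_bmp) are made up for illustration and are not part of MediaWiki or MySQL; the check is simply whether any code point is above U+FFFF, which is the same as asking whether the character needs four bytes in UTF-8.

```python
from typing import List


def non_bmp_chars(text: str) -> List[str]:
    """Return the characters in text whose code point lies above U+FFFF."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


def fits_in_bmp(text: str) -> bool:
    """True if every character is inside the Basic Multilingual Plane."""
    return not non_bmp_chars(text)


# A BMP character needs at most 3 bytes in UTF-8; anything higher needs 4.
assert len("\u4e2d".encode("utf-8")) == 3       # CJK ideograph, in the BMP
assert len("\U00010330".encode("utf-8")) == 4   # Gothic letter, plane 1
```

A title for which fits_in_bmp() returns False is one that would not survive storage in a BMP-only column.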
Characters beyond the BMP are relatively rare, but they do occur. They
are mostly from ancient/dead scripts and some invented scripts, plus a
bunch of rare Han characters which sometimes turn up in Chinese and
Japanese.
This won't affect page _contents_; our content is stored in binary blobs
and can have any wacky characters we want. But to support these high
characters in page titles, usernames, and such might require jumping
through a lot of hoops.
It would be relatively simple to disable use of titles and usernames
with these high characters; to assess possible impact I did a check
through all our current wikis and found 99 extant pages:
43 in en.wiktionary.org
31 in got.wikipedia.org
10 in la.wiktionary.org
9 in zh.wikipedia.org
3 in so.wikipedia.org
1 in en.wikibooks.org
1 in ja.wikipedia.org
1 in nl.wikibooks.org
I've put the full list of pages here:
http://meta.wikimedia.org/wiki/User:Brion_VIBBER/Unicode_high_chars
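For anyone wanting to repeat this kind of survey, a per-wiki tally could look roughly like the sketch below. This is not the script actually used for the numbers above; count_high_titles and the titles_by_wiki mapping are hypothetical, and the sketch assumes the titles have already been pulled out of each wiki's database as Python strings.

```python
from collections import Counter
from typing import Dict, Iterable


def has_high_char(title: str) -> bool:
    """True if the title contains any code point above U+FFFF."""
    return any(ord(ch) > 0xFFFF for ch in title)


def count_high_titles(titles_by_wiki: Dict[str, Iterable[str]]) -> Counter:
    """Count, per wiki, the titles containing characters beyond the BMP.

    titles_by_wiki is a hypothetical mapping of wiki name -> titles;
    wikis with no affected titles are left out of the result.
    """
    counts = Counter()
    for wiki, titles in titles_by_wiki.items():
        n = sum(1 for title in titles if has_high_char(title))
        if n:
            counts[wiki] = n
    return counts


# Made-up sample data, for illustration only.
sample = {
    "wiki-a": ["\U00010330\U00010331", "Main Page"],
    "wiki-b": ["Foo", "Bar"],
}
for wiki, n in count_high_titles(sample).most_common():
    print(n, "in", wiki)
```

most_common() gives the wikis ordered by count, highest first, which is the order used in the list above.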
Most of the en.wiktionary entries are individual letters in the Deseret
and Shavian alphabets (invented alphabets for English; historical
curiosities).
The Gothic alphabet is entirely in the high-character area, but it's a
long-dead language and not exactly an active wiki. Perhaps we should
just close it down...
Latin Wiktionary contains several Gothic terms...
The Chinese Wikipedia contains several apparently legitimate articles
(from what I can tell) using high characters; these might have to be
moved. The Japanese Wikipedia has one redirect with such a character.
The Somali Wikipedia contains three one-sentence stub pages using
the Osmanya script; Omniglot's article on it says this script is no
longer in use since adoption of the Latin alphabet in 1972.
English Wikibooks has a user account with a Gothic-script name, which
has edited a number of pages about the Gothic language and has a user page.
Dutch Wikibooks has one Gothic-titled redirect.
-- brion vibber (brion @ pobox.com)