On 4/14/06, Erik Moeller <eloquence(a)gmail.com> wrote:
On 4/13/06, Gregory Maxwell <gmaxwell(a)gmail.com>
wrote:
AFAIR, most string matches in mysql are case
insensitive, which would
mean that we could have indexed case insensitive matches quickly...
but I'm guessing that our use of binary fields for titles (which is
required because no version of mysql has complete UTF-8 support) most
likely breaks that.
Yes, they are, and yes, it does. Could someone explain what the exact
reason is that we're using varchar binary in MediaWiki for page
titles? I've been using regular varchars for my WiktionaryZ tables,
and so far it seems to work fine with UTF-8. Where exactly does a
non-binary varchar break?
[snip]
Might it make sense to use non-binary strings and use
binary
comparisons instead where case-sensitivity is required?
It is my understanding that MySQL doesn't correctly handle UTF-8 where
the characters are outside of the basic multilingual plane (BMP).
These are the characters which can not be represented by UCS-16 (i.e.
more then two bytes in the UTF-16 representation) and this includes a
number of characters which are required for complete support of some
asian languages.
I'm not sure how it fails (does it munge, or warn, .. since it's mysql
I'm sure it doesn't reject bad data...), since I haven't tried it
personally but thats where I understood the current issue was.
PostgreSQL had simmlar issues with non-bmp characters until version 8
(prior to version 8, PG would reject inserts with most non-bmp
characters as invalid).
There are other issues as well.. for example, last I checked on enwiki
there were some titles (and page/image links) which contained
completely invalid characters... (back from when enwiki was latin1 and
more permissive about binary input not so long ago). I see that a
number of bad pages have been moved to broken/whatever .. so perhaps
thats been cleaned up. ... a switch to utf-8 might potentially run
into problems with this pages, depending on how strict mysqls input
validation is (historically not strict at all... but...), I know it's
always caused me troubles importing into postgresql (because in PG I
don't use binary fields for mediawiki imported data, and PG's input
validation is fairly strict).