Erik Moeller wrote:
On 4/13/06, Gregory Maxwell <gmaxwell(a)gmail.com>
wrote:
AFAIR, most string matches in mysql are case
insensitive, which would
mean that we could have indexed case insensitive matches quickly...
but I'm guessing that our use of binary fields for titles (which is
required because no version of mysql has complete UTF-8 support) most
likely breaks that.
Yes, they are, and yes, it does. Could someone explain what the exact
reason is that we're using varchar binary in MediaWiki for page
titles? I've been using regular varchars for my WiktionaryZ tables,
and so far it seems to work fine with UTF-8. Where exactly does a
non-binary varchar break?
It could break string matching, but would definitely break sorting. (Sorting by
codepoint may suck, but at least it's predictable.)
More generally, deliberately choosing a non-binary collation which applies to a
*different character set* from the one really you're using seems pretty silly.
You get unpredictable, incorrect sorting and potentially have strings rejected
as invalid.
Might it make sense to use non-binary strings and use
binary
comparisons instead where case-sensitivity is required?
When MySQL finally reaches the mid-1990s and supports full Unicode, we'll be
able to do that.
-- brion vibber (brion @
pobox.com)