On 4/13/06, Gregory Maxwell <gmaxwell(a)gmail.com> wrote:
AFAIR, most string matches in mysql are case
insensitive, which would
mean that we could have indexed case insensitive matches quickly...
but I'm guessing that our use of binary fields for titles (which is
required because no version of mysql has complete UTF-8 support) most
likely breaks that.
Yes, they are, and yes, it does. Could someone explain what the exact
reason is that we're using varchar binary in MediaWiki for page
titles? I've been using regular varchars for my WiktionaryZ tables,
and so far it seems to work fine with UTF-8. Where exactly does a
non-binary varchar break?
Also, according to the MySQL 5.0 documentation:
http://dev.mysql.com/doc/refman/5.0/en/binary-varbinary.html
<quote>
The BINARY and VARBINARY data types are distinct from the CHAR BINARY
and VARCHAR BINARY data types. For the latter types, the BINARY
attribute does not cause the column to be treated as a binary string
column. Instead, it causes the binary collation for the column
character set to be used, and the column itself contains non-binary
character strings rather than binary byte strings. For example,
CHAR(5) BINARY is treated as CHAR(5) CHARACTER SET latin1 COLLATE
latin1_bin, assuming that the default character set is latin1. This
differs from BINARY(5), which stores 5-bytes binary strings that have
no character set or collation.
</quote>
Since we are using VARCHAR(255) BINARY for page titles, rather than
VARBINARY(255), does that mean that "it causes the binary collation
for the column character set to be used"? If so, does the use of
VARCHAR(255) BINARY, as opposed to a simple VARCHAR(255), affect
anything but the sorting order (collation)?
Might it make sense to use non-binary strings and use binary
comparisons instead where case-sensitivity is required?
Erik