Neil Harris wrote:
Is it that MySQL 5 cannot support characters outside the BMP at all, or
just that it can't collate them properly? If it just handles > BMP UTF-8
sequences as binary data, might it simply sort them in Unicode code
point order?
Or does it do something worse, and actively convert the Unicode
characters into a 16-bit range, thus nuking characters outside the BMP,
rather than storing, and largely processing, them as binary-encoded data
for purposes other than collating?
I tested this yesterday, hence my post. To summarize the results:
Using a literal UTF-8 4-byte character in an SQL statement, with the
connection in 'SET NAMES utf8' mode:
* utf8 column: string is truncated at the problem character
* ucs2 column: "????" is stored in place of problem character
* blob column: works just fine (but no collation)
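The utf8 column result is consistent with MySQL 5's utf8 charset storing at most three bytes per character, which only covers the BMP. A minimal Python sketch for spotting the characters that trigger it follows. The helper names are made up for illustration, and emulate_utf8_column only mimics the truncation reported above; it says nothing about how MySQL does it internally.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes a single character takes in real UTF-8."""
    return len(ch.encode("utf-8"))


def first_non_bmp(text: str) -> int:
    """Index of the first character above U+FFFF, or -1 if there is none."""
    for i, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            return i
    return -1


def emulate_utf8_column(text: str) -> str:
    """Mimic the observed behavior: cut the string at the problem character."""
    i = first_non_bmp(text)
    return text if i < 0 else text[:i]


sample = "abc\U00010400def"  # U+10400 lies outside the BMP
print(utf8_len("\U00010400"))             # byte length of the problem character
print(repr(emulate_utf8_column(sample)))  # what would survive in a utf8 column
```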
Using pseudo-UTF-8 with UTF-16 surrogate pair halves individually encoded:
* utf8 column: works, but now we have bad encoding
* ucs2 column: works, but now we have bad encoding
* blob column: works, but now we have bad encoding
I'm sure they won't be properly collated, either.
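For reference, the pseudo-UTF-8 form used in the second test can be sketched in Python as below: each character above U+FFFF is split into its UTF-16 surrogate pair, and each half is then encoded as if it were an ordinary three-byte UTF-8 character. The function name to_fake_utf8 is hypothetical, and this is only an illustration of the transformation, not the code used for the test.

```python
def to_fake_utf8(text: str) -> bytes:
    """Encode text as UTF-8, but write non-BMP characters as two
    separately encoded UTF-16 surrogate halves (3 bytes each)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            cp -= 0x10000
            high = 0xD800 | (cp >> 10)    # leading surrogate
            low = 0xDC00 | (cp & 0x3FF)   # trailing surrogate
            # "surrogatepass" lets Python emit the otherwise illegal bytes
            out += chr(high).encode("utf-8", "surrogatepass")
            out += chr(low).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


real = "\U00010400".encode("utf-8")   # one 4-byte sequence
fake = to_fake_utf8("\U00010400")     # two 3-byte sequences
print(real.hex(), fake.hex())
```

A strict UTF-8 decoder rejects the second form, which is the "bad encoding" in the results above.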
In theory we could apply this transformation, but this will add a bunch of
unnecessary and unreliable junk to the code. Automatically applying the
transformation on all data could badly break binary storage (eg
compressed text, the stuff we Really Don't Want To Lose).
If we apply it to page titles only, we might be able to get away with
adding the transformation in eg the Title class:
* $title->getText() -> proper UTF-8, with spaces
* $title->getUrl() -> proper UTF-8, with underscores
* $title->getDbKey() -> fake UTF-8, with underscores
This of course means there's a nasssssty database dependency in the
database-independent code, and could still break other things.
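A rough sketch of how that split might look, written in Python rather than PHP to keep it short. The class below is a hypothetical stand-in, not the actual Title class: the snake_case method names mirror the three accessors listed above, and from_db_key is an assumed reverse path for rows read back from the database.

```python
class Title:
    """Hypothetical stand-in for the Title class; not MediaWiki code."""

    def __init__(self, text: str):
        self._text = text.replace("_", " ")

    def get_text(self) -> str:
        # Proper Unicode text, with spaces
        return self._text

    def get_url(self) -> str:
        # Proper Unicode text, with underscores
        return self._text.replace(" ", "_")

    def get_db_key(self) -> bytes:
        # Fake UTF-8: every UTF-16 code unit encoded on its own, so a
        # non-BMP character becomes two 3-byte surrogate sequences.
        units = self.get_url().encode("utf-16-be")
        return b"".join(
            chr(int.from_bytes(units[i:i + 2], "big")).encode("utf-8", "surrogatepass")
            for i in range(0, len(units), 2)
        )

    @classmethod
    def from_db_key(cls, key: bytes) -> "Title":
        # Decode the surrogate halves, then let a UTF-16 round trip
        # join each pair back into a single character.
        halves = key.decode("utf-8", "surrogatepass")
        text = halves.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
        return cls(text)


t = Title("Deseret \U00010400")
print(t.get_db_key())
print(Title.from_db_key(t.get_db_key()).get_text() == t.get_text())
```

Every caller that touches the stored key would have to go through the last two methods, which is where the database dependency leaks in.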
My preference, if possible, would be to get MySQL to fix their Unicode
support to allow for either storage of full UTF-8 or proper
transformation of UTF-8 to UTF-16. UCS-2 collation with UTF-16
conversion semantics would be "good enough" for us, I think, and avoids
the 4-byte-per-character index bloat of extending the UTF-8 support.
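To illustrate what UTF-16 conversion with UCS-2-style comparison would mean for ordering, here is a small Python sketch. It compares strings by their 16-bit code units, so a non-BMP character sorts as its surrogate pair. This is plain binary comparison under that assumption; real collations use weight tables, and utf16_units is a made-up helper name.

```python
def utf16_units(text: str) -> list:
    """Return the 16-bit UTF-16 code units of text as integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


words = ["\uff5e", "\U00010400", "a"]

by_code_point = sorted(words)                   # Python compares by code point
by_utf16_unit = sorted(words, key=utf16_units)  # compares 16-bit units instead

print([hex(ord(w)) for w in by_code_point])
print([hex(ord(w)) for w in by_utf16_unit])
```

The two orders differ only in where the non-BMP character lands: by code unit it sorts in the surrogate range, ahead of BMP characters above U+DFFF such as U+FF5E. BMP characters still take a single 16-bit unit each.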
-- brion vibber (brion @ pobox.com)