Re: [Wikitech-l] Re: Unicode high characters versus MySQL 5

15 Oct 2005

Neil Harris wrote:
...
 Is it that MySQL 5 cannot support characters outside the BMP at all, or
 just that it can't collate them properly? If it just handles > BMP UTF-8
 sequences as binary data, might it simply sort them in Unicode code
 point order?

 Or does it do something worse, and actively convert the Unicode
 characters into a 16-bit range, thus nuking characters outside the BMP,
 rather than storing, and largely processing, them as binary-encoded data
 for purposes other than collating?

I tested this yesterday, hence my post. To summarize the results:

Using a literal 4-byte UTF-8 character in the SQL statement, with the
connection in 'SET NAMES utf8' mode:
* utf8 column: string is truncated at the problem character
* ucs2 column: "????" is stored in place of problem character
* blob column: works just fine (but no collation)

Using pseudo-UTF-8 with UTF-16 surrogate pair halves individually encoded:
* utf8 column: works, but now we have bad encoding
* ucs2 column: works, but now we have bad encoding
* blob column: works, but now we have bad encoding

I'm sure they won't be properly collated, either.
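
To make the two test inputs concrete, here is a minimal Python sketch (not
the code used for the tests above) that builds both byte sequences for one
non-BMP character. The helper name surrogate_halves_utf8 is made up for
illustration; the "pseudo-UTF-8" it produces is what is elsewhere called
CESU-8. The last two lines show one reason collation goes wrong: a byte-wise
comparison of the pseudo-UTF-8 form no longer follows code point order.

```python
def surrogate_halves_utf8(text: str) -> bytes:
    """Encode each UTF-16 code unit of text as its own UTF-8 sequence.

    Hypothetical helper: non-BMP characters become two 3-byte sequences
    (one per surrogate half) instead of one proper 4-byte sequence.
    """
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # "surrogatepass" lets Python emit the otherwise invalid
        # UTF-8 encoding of a lone surrogate code unit.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


clef = "\U0001D11E"                   # MUSICAL SYMBOL G CLEF, outside the BMP
real = clef.encode("utf-8")           # proper 4-byte UTF-8
fake = surrogate_halves_utf8(clef)    # surrogate halves encoded individually

print(real.hex(" "))   # f0 9d 84 9e
print(fake.hex(" "))   # ed a0 b4 ed b4 9e

# U+FFFD is a lower code point than U+1D11E. Proper UTF-8 bytes sort in
# code point order; the pseudo-UTF-8 bytes do not.
bmp = "\uFFFD"
print(bmp.encode("utf-8") < real)            # True
print(surrogate_halves_utf8(bmp) < fake)     # False
```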

In theory we could apply this transformation, but it would add a bunch of
unnecessary and unreliable junk to the code. Automatically applying the
transformation to all data could badly break binary storage (e.g.
compressed text, the stuff we Really Don't Want To Lose).

If we apply it to page titles only, we might be able to get away with
adding the transformation in, e.g., the Title class:

* $title->getText() -> proper UTF-8, with spaces
* $title->getUrl() -> proper UTF-8, with underscores
* $title->getDbKey() -> fake UTF-8, with underscores
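
As a rough illustration of that split, here is a sketch in Python rather
than the real PHP Title class. The class, the method names and the
to_db_encoding / from_db_encoding helpers are all hypothetical stand-ins;
only the division of labour between the three accessors comes from the list
above.

```python
def to_db_encoding(text: str) -> bytes:
    """Hypothetical: proper text -> fake UTF-8 (surrogate halves encoded)."""
    utf16 = text.encode("utf-16-be")
    units = (int.from_bytes(utf16[i:i + 2], "big")
             for i in range(0, len(utf16), 2))
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)


def from_db_encoding(data: bytes) -> str:
    """Hypothetical inverse: fake UTF-8 from the database -> proper text."""
    halves = data.decode("utf-8", "surrogatepass")
    # Round-tripping through UTF-16 rejoins each surrogate pair.
    return halves.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


class Title:
    """Illustrative stand-in, not MediaWiki's actual Title class."""

    def __init__(self, text: str):
        self._text = text.replace("_", " ")

    @classmethod
    def from_db_key(cls, key: bytes) -> "Title":
        return cls(from_db_encoding(key))

    def get_text(self) -> str:      # proper UTF-8, with spaces
        return self._text

    def get_url(self) -> str:       # proper UTF-8, with underscores
        return self._text.replace(" ", "_")

    def get_db_key(self) -> bytes:  # fake UTF-8, with underscores
        return to_db_encoding(self.get_url())


title = Title("G clef \U0001D11E")
key = title.get_db_key()
print(key)
print(Title.from_db_key(key).get_text() == title.get_text())
```

Only get_db_key() and from_db_key() touch the fake encoding here, which is
where the database dependency would live.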

This of course means there's a nasssssty database dependency in the
database-independent code, and could still break other things.

My preference, if possible, would be to get MySQL to fix their Unicode
support to allow for either storage of full UTF-8 or proper
transformation of UTF-8 to UTF-16. UCS-2 collation with UTF-16
conversion semantics would be "good enough" for us, I think, and avoids
the 4-byte-per-character index bloat of extending the UTF-8 support.

-- brion vibber (brion @ pobox.com)
