On 4/14/06, Brion Vibber <brion(a)pobox.com> wrote:
[snip]
It could break string matching, but would definitely break sorting. (Sorting by
codepoint may suck, but at least it's predictable.)
More generally, deliberately choosing a non-binary collation which applies to a
*different character set* from the one you're really using seems pretty silly.
You get unpredictable, incorrect sorting and potentially have strings rejected
as invalid.
The collation problem is a hard problem in general, as I understand
it, since the collation of some Unicode characters changes depending
on the language. For example, the position of ø in Danish differs from
its position in most other languages. Doing it wrong but mostly right
isn't too hard, though.
Thus supporting multiple languages correctly in a single database
becomes a little difficult. I don't think it's reasonable to expect
the database to allow you to magically specify a new collation on the
fly for each query, since index order depends on collation.
Instead, given sufficient support in the database, you could create a
function enumerate_collation(language,string) which returns an
integer array (or a mangled string), with one value for the absolute
collation position of each character in the string. You could then
define an index on that function applied to the title column for each of
the collations you will be using, and ORDER BY
enumerate_collation('en',title) in your queries.
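
To make the idea concrete, here is a rough sketch in Python rather than in
any particular database's function language. The alphabet tables and the
folding of ø to o for 'en' are toy assumptions of mine, not real collation
data; a real implementation would take its tables from something like the
Unicode collation data. Only the name enumerate_collation comes from the
proposal above.

```python
# Toy collation tables (assumed for illustration, not real locale data).
ALPHABETS = {
    "en": "abcdefghijklmnopqrstuvwxyz",
    "da": "abcdefghijklmnopqrstuvwxyzæøå",  # Danish puts æ, ø, å after z
}

# Assumption: under 'en' the extra Danish letters are folded onto base letters.
FOLDING = {
    "en": {"ø": "o", "æ": "a", "å": "a"},
    "da": {},
}


def enumerate_collation(language, string):
    """Return one integer per character: its position in the language's order."""
    alphabet = ALPHABETS[language]
    folding = FOLDING[language]
    key = []
    for ch in string.lower():
        ch = folding.get(ch, ch)
        pos = alphabet.find(ch)
        # Characters outside the table sort after it, in codepoint order.
        key.append(pos if pos >= 0 else len(alphabet) + ord(ch))
    return key


titles = ["Øl", "Zebra", "Ost", "Apple"]

# Equivalent of ORDER BY enumerate_collation('en', title), and the same for 'da'.
sorted_en = sorted(titles, key=lambda t: enumerate_collation("en", t))
sorted_da = sorted(titles, key=lambda t: enumerate_collation("da", t))

print(sorted_en)
print(sorted_da)
```

The same titles come out in a different order for each language, which is
why each collation would need its own index on the function's result.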