On Sat, 2002-03-09 at 18:50, Lars Aronsson wrote:
Brion L. VIBBER wrote:
That's only relevant for accented Latin characters, obviously. Hebrew,
Arabic, Cyrillic, Greek, Chinese and Japanese characters still need to
be retained and searchable.
Are we talking about Greek/Hebrew characters in the English/German
Wikipedia now? I think users of the English/German Wikipedia won't
have Greek/Hebrew keyboards,
Excepting Greeks and Israelis, obviously. ;)
so ASCII searching would do just fine.
But why bother creating a special separate ASCII-only search, when the
non-Latin code is necessary for other languages and we're using a
unified character set?
Why *shouldn't* I be able to search for the occasional Greek, Hebrew, or
Japanese word in the original spelling on the English wikipedia, if we
allow people to put them in in the first place?
I have no idea how to implement search in the
Greek/Hebrew Wikipedia.
As stated above: do whatever accent/case/other equivalent conversion is
necessary (exactly as you propose for Latin characters), and perform
some conversion so that MySQL doesn't treat the UTF-8 non-ASCII
characters as word separators (in an ideal world, we'd just configure
MySQL to understand UTF-8; otherwise, replacing raw bytes with hex codes
should work fine).
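
To make the two steps concrete, here's a rough sketch in Python. It is only an illustration under my own assumptions, not a description of any existing Wikipedia code: the helper names (fold, hex_encode_word, index_form) are made up, and the folding rule (case-fold, then drop combining marks) is a blunt stand-in for whatever per-language equivalence rules we actually settle on. It would, for instance, also drop Hebrew vowel points and Japanese voicing marks, which may or may not be what we want.

```python
import unicodedata


def fold(text: str) -> str:
    """Case-fold and strip combining marks (hypothetical folding rule)."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def hex_encode_word(word: str) -> str:
    """Keep ASCII bytes as-is; replace each non-ASCII UTF-8 byte with
    two lowercase hex digits, so the indexer sees only ASCII."""
    out = []
    for byte in word.encode("utf-8"):
        if byte < 0x80:
            out.append(chr(byte))
        else:
            out.append(f"{byte:02x}")
    return "".join(out)


def index_form(text: str) -> str:
    """Apply the same conversion to article text and to search queries."""
    return " ".join(hex_encode_word(w) for w in fold(text).split())


if __name__ == "__main__":
    for sample in ("Café", "ΛΌΓΟΣ", "λόγος"):
        print(sample, "->", index_form(sample))
```

The important part is that the page text and the search string both go through the same conversion, so a Greek word typed in either case or accentuation ends up as the same ASCII token. One known weakness of this naive form: a hex run could collide with a genuine ASCII word, so a real version would probably want some distinguishing prefix.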
-- brion vibber (brion @ pobox.com)