On Sat, 2002-03-09 at 14:21, Lars Aronsson wrote:
Jan Hidders wrote:
We cannot index UTF-8.
We shouldn't. We should strip down to 7-bit US-ASCII before
indexing. Searching for o should find any occurrence of ö, ó or ô.
This works great for English, Swedish, Norwegian, Danish, Finnish, and
German. I have successfully tried this on other websites before, but
I cannot speak for other languages. Of course, the search expression
must be stripped in the same way before the search is performed.
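
A minimal sketch of the stripping described above, in Python. The
function names, the EXTRA table for letters that have no Unicode
decomposition (ø, æ, ß), and the lowercasing are assumptions made
for illustration; none of them come from the thread. The same
function is applied to the indexed text and to the search expression.

```python
import unicodedata

# Assumed extras: letters with no Unicode decomposition need a table.
EXTRA = str.maketrans({"ø": "o", "Ø": "O", "æ": "ae", "Æ": "AE", "ß": "ss"})

def strip_to_ascii(text):
    """Reduce accented Latin text to 7-bit ASCII."""
    text = text.translate(EXTRA)
    # NFKD splits 'ö' into 'o' plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside ASCII is dropped.
    return base.encode("ascii", "ignore").decode("ascii")

def build_index(titles):
    """Map each stripped, lowercased word to the titles containing it."""
    index = {}
    for title in titles:
        for word in strip_to_ascii(title).lower().split():
            index.setdefault(word, set()).add(title)
    return index

def search(index, query):
    # The query is stripped in the same way as the indexed text.
    return index.get(strip_to_ascii(query).lower(), set())
```

Note that the final encode step throws away every non-Latin
character, so this only suits the languages listed above.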
That's only relevant for accented Latin characters, obviously. Hebrew,
Arabic, Cyrillic, Greek, Chinese and Japanese characters still need to
be retained and searchable. (However we can similarly fold together
cases and accents for Greek, perhaps final/medial forms for Greek,
Hebrew, and Arabic, and possibly katakana/hiragana for Japanese.)
So yes, we need to index UTF-8 if we're using it.
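
As a rough sketch of folding that keeps non-Latin scripts
searchable, again in Python: accents are removed only from Latin and
Greek letters, case is folded (which also maps the Greek final sigma
to the ordinary one), Hebrew final letters are mapped to their
regular forms, and katakana is mapped to hiragana. The name
fold_for_index and the two tables are illustrative assumptions, and
Arabic is left out because its letter shaping is normally handled at
display time rather than in the stored text.

```python
import unicodedata

# Hebrew final forms sit one code point before their regular forms.
HEBREW_FINALS = {0x05DA: 0x05DB, 0x05DD: 0x05DE, 0x05DF: 0x05E0,
                 0x05E3: 0x05E4, 0x05E5: 0x05E6}
# Katakana U+30A1..U+30F6 lines up with hiragana U+3041..U+3096.
KATAKANA_TO_HIRAGANA = {cp: cp - 0x60 for cp in range(0x30A1, 0x30F7)}

def fold_for_index(text):
    """Fold case and accents but keep every script's base letters."""
    out = []
    strip_marks = False
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            if strip_marks:
                continue  # drop accents on Latin and Greek letters only
        else:
            name = unicodedata.name(ch, "")
            strip_marks = name.startswith(("LATIN", "GREEK"))
        out.append(ch)
    folded = "".join(out).casefold()
    folded = folded.translate(HEBREW_FINALS).translate(KATAKANA_TO_HIRAGANA)
    # Recompose marks that were kept, e.g. Japanese voiced kana.
    return unicodedata.normalize("NFC", folded)
```

Marks on other scripts are deliberately kept, since removing them
there (for example the voicing mark on Japanese kana) would merge
letters that are genuinely different.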
Also, in the stripping down, any E following a vowel could be
removed, to avoid the confusion between spellings like Gottingen,
Goettingen, and Göttingen, or the name of that Danish poet
Oehlenschläger.
This sort of search will yield a few hits too many, which is good.
I'm not advocating soundex matching here, but soundex could be
implemented in the same way.
I have no objection to the above. Would match potato/potatoe, too. :)
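
The "E after a vowel" rule could be sketched like this in Python.
The function name search_key, the choice of a, e, i, o, u as the
vowel set, and the lowercasing are assumptions for illustration.

```python
import re
import unicodedata

def search_key(word):
    """Strip accents, lowercase, then drop any 'e' that follows a vowel."""
    decomposed = unicodedata.normalize("NFKD", word)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # The lookbehind removes the 'e' but keeps the vowel before it.
    return re.sub(r"(?<=[aeiou])e", "", plain.lower())

# All three spellings collapse to the same key.
keys = {search_key(w) for w in ("Gottingen", "Goettingen", "Göttingen")}
```

The rule does over-match: under this sketch "poet" and "pot" get the
same key, which is the "few hits too many" trade-off mentioned above.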
-- brion vibber (brion @ pobox.com)