Re: [Wikitech-l] International upgrades

9 Mar 2002

On Sat, 2002-03-09 at 14:21, Lars Aronsson wrote:
...
  Jan Hidders wrote:
  We cannot index UTF-8.  
 We shouldn't.  We should strip down to 7-bit U.S. ASCII before
 indexing.  Searching for o should find any occurrence of ö, ó or ô.
 This works great for English, Swedish, Norwegian, Danish, Finnish, and
 German.  I have successfully tried this on other websites before, but
 I cannot speak for other languages.  Of course, the search expression
 must be stripped in the same way before the search is performed. 
That's only relevant for accented Latin characters, obviously. Hebrew,
Arabic, Cyrillic, Greek, Chinese and Japanese characters still need to
be retained and searchable. (However, we can similarly fold together
cases and accents for Greek, perhaps final/medial forms for Greek,
Hebrew, and Arabic, and possibly katakana/hiragana for Japanese.)

So yes, we need to index UTF-8 if we're using it.
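A rough sketch of what such folding could look like, assuming Python and its standard unicodedata module (neither is mentioned in this thread, and fold_for_index is a made-up name). It strips combining accents and folds case, but leaves Hebrew, Cyrillic, Greek and CJK base characters in place rather than reducing everything to ASCII:

```python
import unicodedata

def fold_for_index(text: str) -> str:
    """Fold case and strip combining accents, keeping non-Latin letters.

    Hypothetical helper for illustration only.
    """
    # NFKD splits an accented letter into base letter + combining mark,
    # e.g. "ö" becomes "o" followed by U+0308.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks (Unicode category Mn); keep everything else.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # casefold() folds case and also maps Greek final sigma to plain sigma.
    return stripped.casefold()

# The search expression has to go through the same function as the
# indexed text, otherwise the two will not compare equal.
print(fold_for_index("Göttingen") == fold_for_index("GOTTINGEN"))
```

Letters that have no Unicode decomposition (ø and æ, for instance) pass through unchanged, and Hebrew final forms or katakana/hiragana are not folded here; those would need explicit mapping tables.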

...
 Also, in the stripping down, any E following a vowel could be removed,
 to avoid the confusion between spellings like Gottingen, Goettingen,
 and Göttingen, and that Danish poet Oehlenschläger.

 This sort of search will yield a few hits too many, which is good.
 I'm not advocating soundex matching here, but soundex could be
 implemented in the same way. 
I have no objection to the above. Would match potato/potatoe, too. :)
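For illustration, the "drop any E after a vowel" rule might be sketched like this. It assumes Python, treats only a, e, i, o, u as vowels, and applies the rule after accents and case have been folded away; index_key is a made-up name and none of these choices come from the thread:

```python
import re
import unicodedata

# An "e" that directly follows one of these (assumed) vowels is dropped.
E_AFTER_VOWEL = re.compile(r"(?<=[aeiou])e")

def index_key(text: str) -> str:
    """Build a loose search key: no accents, no case, no E after a vowel.

    Hypothetical helper for illustration only.
    """
    # Split accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks (Unicode category Mn), then fold case.
    folded = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    ).casefold()
    return E_AFTER_VOWEL.sub("", folded)

for word in ("Gottingen", "Goettingen", "Göttingen", "potatoe", "poet"):
    print(word, "->", index_key(word))
```

All three spellings of Göttingen come out as the same key, and potatoe matches potato. The rule also turns "poet" into "pot", which is the "few hits too many" side of the trade-off.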

-- brion vibber (brion @ pobox.com)
