Alexander Prudnikov wrote:
Hello.
Can you explain to me in a few words how the Wiki engine performs full-text search in
UTF-8 encoded articles?
This is a very important problem for me. I have a database in UTF-8. MySQL prior to 4.1
doesn't support full-text search in UTF-8 text. Only an alpha version of MySQL
4.1 is available at the moment, so I don't want to install it.
I tried to look for the answer in the Wiki sources. But I realized
that this would take a rather long time. The only thing I understood is
that search keys are somehow stored in the table 'searchindex'.
So can anyone tell me the basic idea of how Wiki performs the full-text search?
Thanks for your time.
Best regards,
Alexander Prudnikov.
The handling depends on the language. The basic UTF-8 handling is to
convert to lower case using an internal table, then to encode any
non-ASCII characters as hexadecimal using bin2hex(). The Chinese and
Japanese language files have special routines to insert spaces into
strings, since MySQL uses a word search and those languages don't
usually use spaces.
The relevant functions are doUpdate() in includes/SearchUpdate.php, and
stripForSearch() in languages/LanguageUtf8.php.
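
To make the idea concrete, here is a rough sketch of the two steps described
above, written in Python rather than PHP. It is not the actual wiki code: the
function names, the "u8" prefix (used here only so the hex output stays
distinguishable from ordinary ASCII words) and the CJK character ranges are
all assumptions for illustration, and Python's built-in str.lower() stands in
for the internal lower-case table.

```python
import re

# Any character outside the 7-bit ASCII range.
_NON_ASCII = re.compile(r"[^\x00-\x7f]")

# Rough, assumed ranges: hiragana, katakana, and common CJK ideographs.
_CJK = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")


def strip_for_search(text: str, prefix: str = "u8") -> str:
    """Lower-case the text, then replace each non-ASCII character with
    `prefix` followed by the hex of its UTF-8 bytes, so that a byte-oriented
    full-text index only ever sees plain ASCII tokens."""
    lowered = text.lower()  # stand-in for the internal case-folding table
    return _NON_ASCII.sub(
        lambda m: prefix + m.group(0).encode("utf-8").hex(),
        lowered,
    )


def segment_cjk(text: str) -> str:
    """Put spaces around each CJK character so that a word-based search
    engine treats every character as a separate word."""
    spaced = _CJK.sub(lambda m: " " + m.group(0) + " ", text)
    return " ".join(spaced.split())  # collapse the extra whitespace


if __name__ == "__main__":
    print(strip_for_search("CAFÉ"))
    print(segment_cjk("日本語 text"))
```

The same transformation has to be applied both to the text stored in the
search index and to the user's query, so that the two sides match.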
-- Tim Starling