Domas Mituzas wrote:
various opensource software - snowball (http://snowball.tartarus.org/).
It has English, French, Spanish, Portuguese, Italian, German, Dutch,
I have had bad experiences with Snowball for the German language. For
example, the word "Vater" (father) becomes "vat" (a whisky label :-),
"Mutter" (mother) turns into "mutt" (a mail program), and "Müller"
(miller) changes into "mull" or - when the umlaut 'ü' is converted to
"ue" - into "muell" (waste). These are ridiculous semantic results, which
blur search results considerably.
On the other hand, many plural words like "Autos" or "Fotos" (cars,
photos) are not reduced to the desired singular form by Snowball.
Therefore I decided to build my fulltext database "joda" without
stemming. The cost is low: only a few more megabytes of disk space are
needed for the BTree which handles the first level of the retrieval
process. The performance loss is nearly immeasurable. Search results are
considerably better (sharper).
For retrieval, a wildcard at the end of a word (*) helps in most cases
(at least in German) and is a tool which every user understands and
accepts. Maybe there are better stemming tools than Snowball for the
German language, but in practice there is no big need for them: please
note that most search terms are nouns or proper names, which often need
no stemming or are even intolerant of any stemming.
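
To illustrate the trailing-wildcard idea, here is a minimal Python sketch.
It is not joda's code: a sorted list stands in for the BTree, the function
name prefix_search and the sample terms are made up for the example, and
terms are assumed to be stored in lowercase.

```python
from bisect import bisect_left

def prefix_search(sorted_terms, pattern):
    """Look up `pattern` in a sorted term list (a stand-in for a BTree).

    A trailing '*' requests a prefix match; otherwise the match is exact.
    """
    if not pattern.endswith("*"):
        # Exact lookup: binary search, then compare the term found.
        i = bisect_left(sorted_terms, pattern)
        found = i < len(sorted_terms) and sorted_terms[i] == pattern
        return [pattern] if found else []

    prefix = pattern[:-1]
    # All terms sharing a prefix are adjacent in sorted order, so one
    # binary search finds the start and a short scan collects the rest.
    i = bisect_left(sorted_terms, prefix)
    matches = []
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        matches.append(sorted_terms[i])
        i += 1
    return matches

# Hypothetical index contents, unstemmed and lowercased.
terms = sorted(["auto", "autobahn", "autos", "foto", "fotos",
                "mutt", "mutter", "vat", "vater"])

print(prefix_search(terms, "foto*"))   # singular and plural both match
print(prefix_search(terms, "vater"))   # exact match, not confused with "vat"
```

Without stemming, "vater" and "vat" stay separate terms, while "foto*"
still finds "foto" and "fotos". The price is that a wildcard such as
"auto*" also returns longer words like "autobahn".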
Cheers
jo