Gerard Meijssen wrote:
Nikola Smolenski wrote:
You don't want to duplicate the entire Russian corpus (with
inflections, it could easily rise to ten million words), so
that you could have each one of them with and without
diacritics. It makes sense to have only canonical spellings in
the dictionary, and a bit of code to offer nearest match when
someone tries to retrieve a word spelled in a different way.
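
A rough sketch of what such a "bit of code" might look like, in Python. It is only an illustration under my own assumptions: the function names (strip_diacritics, build_index, lookup) are made up, and "nearest match" is reduced to the simple case of ignoring combining marks and letter case. The dictionary stores canonical spellings only; the query is folded at lookup time.

```python
import unicodedata

def strip_diacritics(word):
    # Decompose characters (NFD) and drop the combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_index(canonical_words):
    # Map each folded (mark-free, lowercase) form to its canonical spellings.
    index = {}
    for word in canonical_words:
        key = strip_diacritics(word).lower()
        index.setdefault(key, []).append(word)
    return index

def lookup(index, query):
    # Fold the query the same way, then return all canonical candidates.
    return index.get(strip_diacritics(query).lower(), [])
```

Note that this folding is crude: in Russian it would also merge й with и and ё with е, since those letters decompose into a base letter plus a combining mark. A real implementation would probably need a per-language list of marks that are safe to ignore.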
As a matter of fact I do want all inflections even if they are
ten million words. Now I do not expect to have all these
inflections to start off with but as far as I am concerned I
want them all. I already have 222.930 Dutch words and they do
include many inflections.
There are two approaches to dictionaries: (1) The encyclopedic
approach, trying to find (define, spellcheck, explain, ...) "all"
words (and their inflections), or (2) the statistics based
approach, trying to find the most commonly used words. I think
the OED is of the first kind, while many dictionaries in recent
decades (built with the help of computers, extracting word
frequency statistics from large text corpora) have been of the
latter kind. Some would call (1) a 19th century approach.
The real difference is their handling of the least common words.
The encyclopedic approach sees every missing word as a failure,
while the statistics based approach recognizes that there is an
infinite number of words anyway (new ones are created every day)
and some might be too uncommon to deserve a mention.
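
To make the statistics based approach concrete, here is a minimal Python sketch. The names (word_frequencies, frequency_dictionary) and the min_count cutoff are my own assumptions for illustration, and the tokenization is deliberately naive. Words seen fewer than min_count times in the corpus are simply left out.

```python
import re
from collections import Counter

def word_frequencies(texts):
    # Count every word form in the corpus (naive tokenization on \w+).
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts

def frequency_dictionary(texts, min_count=2):
    # Keep only the words that occur at least min_count times.
    return {word: n for word, n in word_frequencies(texts).items()
            if n >= min_count}
```

Where to put the cutoff is an editorial decision; the code only shows that the rarest words drop out by design rather than by oversight.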
As a consequence, spellchecking in the statistics based approach
can never say that a spelling is "wrong" when it is missing from
the dictionary, only that it probably is "uncommon" and thus
suspect. The remedy for this is a statistics based dictionary of
common misspellings. Wikipedia article history can be used as a
source for this. Just find all edits that changed one word, e.g.
speling -> spelling, and you will have a fine dictionary of common
spelling mistakes.
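
A minimal sketch of that extraction in Python, using the standard difflib module. It assumes you already have pairs of (old text, new text) for consecutive revisions; how to get those from the article history is not shown, and the function names are hypothetical. Only edits that replace exactly one word with one other word are counted.

```python
from collections import Counter
from difflib import SequenceMatcher

def single_word_change(old_text, new_text):
    # Return (old_word, new_word) if the edit replaced exactly one word,
    # otherwise None.
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    changes = [op for op in matcher.get_opcodes() if op[0] != "equal"]
    if len(changes) != 1:
        return None
    tag, i1, i2, j1, j2 = changes[0]
    if tag == "replace" and i2 - i1 == 1 and j2 - j1 == 1:
        return (old_words[i1], new_words[j1])
    return None

def misspelling_counts(revision_pairs):
    # Count how often each one-word correction occurs across all edits.
    counts = Counter()
    for old_text, new_text in revision_pairs:
        change = single_word_change(old_text, new_text)
        if change is not None:
            counts[change] += 1
    return counts
```

Not every one-word edit is a spelling correction (some change the meaning), so the counts matter: a pair like speling -> spelling that shows up many times is probably a genuine common mistake, while a pair seen once is not good evidence.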
From a database point of view a Word has one Spelling.
This would be an example of the encyclopedic approach.
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se