On 26/07/05, Lars Aronsson <lars(a)aronsson.se> wrote:
> There are two approaches to dictionaries: (1) the encyclopedic
> approach, trying to find (define, spellcheck, explain, ...) "all"
> words (and their inflections), or (2) the statistics-based
> approach, trying to find the most commonly used words. I think
> the OED is of the first kind, while many dictionaries in recent
> decades (built with the help of computers, extracting word
> frequency statistics from large text corpora) have been of the
> latter kind. Some would call (1) a 19th-century approach.
>
> The real difference is their handling of the least common words.
> The encyclopedic approach sees every missing word as a failure,
> while the statistics-based approach recognizes that there is an
> infinite number of words anyway (new ones are created every day)
> and some might be too uncommon to deserve a mention.
>
> As a consequence, spellchecking in the statistics-based approach
> can never say that a spelling is "wrong" when it is missing from
> the dictionary, only that it is probably "uncommon" and thus
> suspect. The remedy for this is a statistics-based dictionary of
> common misspellings. Wikipedia article history can be used as a
> source for this: just find all edits that changed one word, e.g.
> speling -> spelling, and you will have a fine dictionary of common
> spelling mistakes.
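The single-word-diff idea could be sketched like this in Python. The helper names are my own, and the revision texts would in practice come from a Wikipedia history dump (not shown here); this only illustrates detecting edits that replaced exactly one word:

```python
import difflib
from collections import Counter

def single_word_changes(old_text, new_text):
    """Return (before, after) if the edit replaced exactly one word, else None."""
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words)
    replacements = [
        (old_words[i1:i2], new_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag == "replace"
    ]
    # Keep only revisions where a single word became a single word
    if len(replacements) == 1:
        before, after = replacements[0]
        if len(before) == 1 and len(after) == 1:
            return before[0], after[0]
    return None

def misspelling_dictionary(revision_pairs):
    """Count (misspelling, correction) pairs across many revisions."""
    counts = Counter()
    for old, new in revision_pairs:
        change = single_word_changes(old, new)
        if change:
            counts[change] += 1
    return counts

# e.g. single_word_changes("a speling mistake", "a spelling mistake")
# → ("speling", "spelling")
```

Aggregating these pairs over a whole edit history, and keeping only the pairs that recur often, would give the "dictionary of common spelling mistakes" described above.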
> From a database point of view a Word has one Spelling.
This would be an example of the encyclopedic approach.
It is clear to me that the approach we want to take is the
"encyclopedic" one, simply because we can handle it. The Oxford
dictionary on paper cannot handle it "elegantly": it becomes
unwieldy and spans a whole shelf. A good database can.
It is unacceptable for a Word to have only one Spelling, for reasons
described previously (German a-with-umlaut, Hebrew niqqud and optional
vowels, etc.), but I am unable to find out who originally wrote that.
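A minimal sketch of that one-to-many relationship, with hypothetical `Word` and `Spelling` classes (the names, fields, and example data are my own, not from the thread):

```python
from dataclasses import dataclass, field

@dataclass
class Spelling:
    text: str
    note: str = ""  # e.g. "standard orthography", "umlaut transliterated"

@dataclass
class Word:
    lemma: str
    spellings: list = field(default_factory=list)  # one Word, many Spellings

def is_known_form(word, form):
    # A surface form is accepted if it matches any recorded spelling
    return any(s.text == form for s in word.spellings)

# German umlaut example: "ä" may be written "ae" where umlauts are unavailable
maedchen = Word("Mädchen", [
    Spelling("Mädchen", "standard orthography"),
    Spelling("Maedchen", "umlaut transliterated as 'ae'"),
])
```

Under this model a spellchecker accepts any listed variant, while a word-to-one-spelling schema would be forced to reject "Maedchen" outright.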