Nikola Smolenski wrote:
Though I am not active in Wiktionary, I have been thinking about this for a
while. I think that MediaWiki software, great as it is, is not adequate for
creating a dictionary and that new software has to be written from scratch. I
call this kind of software WikiBase: a database in which (more or less) each
field acts like a wiki page; it can be changed at any time, has an edit
history, can be discussed, etc.
In its simplest form, the database of a dictionary might have four tables:
words: ID|word
concepts: ID|concept
languages: ID|language
meanings: wordID|languageID|conceptID
They could be connected like this:
words:
1|egg
2|jaje
concepts:
1|Something laid by a hen
languages:
1|English
2|Serbian
meanings:
1|1|1
2|2|1
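A minimal sketch of these four tables, assuming SQLite through Python's
sqlite3 module (the column names word_id, language_id and concept_id are
illustrative, not part of the proposal). Joining "meanings" back to the other
three tables spells out each row:

```python
import sqlite3

# In-memory database holding the four simple tables and the example rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE words     (id INTEGER PRIMARY KEY, word TEXT);
    CREATE TABLE concepts  (id INTEGER PRIMARY KEY, concept TEXT);
    CREATE TABLE languages (id INTEGER PRIMARY KEY, language TEXT);
    CREATE TABLE meanings  (word_id INTEGER, language_id INTEGER,
                            concept_id INTEGER);
    INSERT INTO words     VALUES (1, 'egg'), (2, 'jaje');
    INSERT INTO concepts  VALUES (1, 'Something laid by a hen');
    INSERT INTO languages VALUES (1, 'English'), (2, 'Serbian');
    INSERT INTO meanings  VALUES (1, 1, 1), (2, 2, 1);
""")

# Resolve every ID in "meanings" to its human-readable value.
rows = conn.execute("""
    SELECT w.word, l.language, c.concept
    FROM meanings m
    JOIN words     w ON w.id = m.word_id
    JOIN languages l ON l.id = m.language_id
    JOIN concepts  c ON c.id = m.concept_id
    ORDER BY w.id
""").fetchall()

for word, language, concept in rows:
    print(f"{word} ({language}): {concept}")
```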
That is, the word "egg" in the language called "English" has the meaning
"Something laid by a hen", and the word "jaje" in the language called
"Serbian" has the same meaning. Now, this has obvious flaws, but I have not
envisioned the database to be so simple. To cut to the point, I think that the
following structure would be enough to satisfy all the needs of a dictionary
(WARNING: long and sometimes confusing text ahead):
writings: ID|spelling
readings: ID|reading
languages: ID|language|dialect|place|time|group
basics: ID|basic
words: ID|languageID|writingID|readingID
grammar: wordID|relation|wordID
concepts: basicID|relation|basicID
meanings: wordID|basicID
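One way this eight-table layout might be declared, again as a sketch assuming
SQLite. The column types, the foreign keys, the names target_word_id and
target_basic_id, and the rename of "group" to social_group (GROUP is an SQL
keyword) are all assumptions made for illustration:

```python
import sqlite3

SCHEMA = """
CREATE TABLE writings  (id INTEGER PRIMARY KEY, spelling TEXT NOT NULL);
CREATE TABLE readings  (id INTEGER PRIMARY KEY, reading TEXT NOT NULL);
CREATE TABLE languages (id INTEGER PRIMARY KEY, language TEXT NOT NULL,
                        dialect TEXT, place TEXT, time TEXT,
                        social_group TEXT);
CREATE TABLE basics    (id INTEGER PRIMARY KEY, basic TEXT NOT NULL);
-- writing_id and reading_id may be NULL while they are still unknown
CREATE TABLE words     (id INTEGER PRIMARY KEY,
                        language_id INTEGER NOT NULL REFERENCES languages(id),
                        writing_id  INTEGER REFERENCES writings(id),
                        reading_id  INTEGER REFERENCES readings(id));
CREATE TABLE grammar   (word_id INTEGER REFERENCES words(id),
                        relation TEXT,
                        target_word_id INTEGER REFERENCES words(id));
CREATE TABLE concepts  (basic_id INTEGER REFERENCES basics(id),
                        relation TEXT,
                        target_basic_id INTEGER REFERENCES basics(id));
CREATE TABLE meanings  (word_id  INTEGER REFERENCES words(id),
                        basic_id INTEGER REFERENCES basics(id));
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# List the tables that were created, as a quick sanity check.
tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```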
Now, how would all this work? I will use as an example the English word "hair",
for which three Serbian words exist: "kosa" (hair on one's head), "dlaka" (a
hair) or "malja" (a hair on the body).
Table "writings" contains the exact words as written on paper:
writings:
1|hair
2|kosa
3|dlaka
4|malja
5|hairs
Table "readings" contains readings of the words. I guess that it might be the
easiest to use an internal format for this, which could be externally
represented as IPA or SAMPA. Of course, for some languages, the readings
could be autogenerated.
readings:
1|hejr
2|kosa
3|dlaka
4|mal<sup>j</sup>a
5|mal<sup>j</sup>e
(Note that here some IDs are the same; this of course need not be the case.)
Table "languages" contains data about languages. I was thinking big and
allowed for defining various dialects, regions, the exact (or not exact) time
at which a word was in use, and slang (of a certain social group). Perhaps
this table needs a bit more work, but the basic idea is there. In this
example, I'll use only the language name and forget about the rest:
languages:
1|English
2|Serbian
The last of the basic tables, "basics", describes basic concepts.
basics:
1|A bunch of hairs on someone's head
2|A keratinous outgrowth that covers the human body
3|A single hair on someone's head
4|A single hair that is not on someone's head
The rest of the tables show relations among the IDs of these tables. Table
"words" shows which writing has which reading in which language.
words:
1|1|1|1
2|2|2|2
3|2|3|3
4|2|4|4
5|1|5|NULL
6|2|NULL|5
I'll expand the table:
1|English|hair|hejr
2|Serbian|kosa|kosa
3|Serbian|dlaka|dlaka
4|Serbian|malja|mal<sup>j</sup>a
5|English|hairs|NULL
6|Serbian|NULL|mal<sup>j</sup>e
Note that how the English word "hairs" is read is currently not known, and how
a certain Serbian word is actually written is also currently not known. It
doesn't matter.
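The expansion above can be sketched as a query, assuming SQLite (the column
names are illustrative). LEFT JOIN is used for writings and readings so that
words whose spelling or reading is still unknown are kept, with NULL in the
missing column:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE writings  (id INTEGER PRIMARY KEY, spelling TEXT);
    CREATE TABLE readings  (id INTEGER PRIMARY KEY, reading TEXT);
    CREATE TABLE languages (id INTEGER PRIMARY KEY, language TEXT);
    CREATE TABLE words     (id INTEGER PRIMARY KEY, language_id INTEGER,
                            writing_id INTEGER, reading_id INTEGER);
    INSERT INTO writings  VALUES (1, 'hair'), (2, 'kosa'), (3, 'dlaka'),
                                 (4, 'malja'), (5, 'hairs');
    INSERT INTO readings  VALUES (1, 'hejr'), (2, 'kosa'), (3, 'dlaka'),
                                 (4, 'mal<sup>j</sup>a'),
                                 (5, 'mal<sup>j</sup>e');
    INSERT INTO languages VALUES (1, 'English'), (2, 'Serbian');
    INSERT INTO words     VALUES (1, 1, 1, 1), (2, 2, 2, 2), (3, 2, 3, 3),
                                 (4, 2, 4, 4), (5, 1, 5, NULL),
                                 (6, 2, NULL, 5);
""")

# Inner join on languages (always present), left joins on the optional parts.
expanded = conn.execute("""
    SELECT w.id, l.language, wr.spelling, r.reading
    FROM words w
    JOIN languages l      ON l.id  = w.language_id
    LEFT JOIN writings wr ON wr.id = w.writing_id
    LEFT JOIN readings r  ON r.id  = w.reading_id
    ORDER BY w.id
""").fetchall()

for row in expanded:
    print("|".join("NULL" if value is None else str(value) for value in row))
```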
Now, table "grammar" explains grammatical relations between the words:
1|root|1
2|root|2
3|root|3
4|root|4
5|plural|1
6|plural|4
Expanded:
hair|root|hair
kosa|root|kosa
dlaka|root|dlaka
malja|root|malja
hairs|plural|hair
malje|plural|malja
I will explain what this "root" property means later when I explain how to
actually query the database.
Table "concepts" is similar, except that it explains relations of the
basic concepts:
1|mass/root|1
2|root|2
3|root|3
4|root|4
2|includes|3
2|includes|4
I will not expand the table but rather show the table "basics" again.
basics:
1|A bunch of hairs on someone's head
2|A keratinous outgrowth that covers the human body
3|A single hair on someone's head
4|A single hair that is not on someone's head
FINALLY, table "meanings" attaches words to concepts.
1|1
2|1
2|2
3|3
4|4
5|5
Now, how to read the dictionary. Suppose that you want to know what the word
"hairs" means in the English language. You go to the table "writings" and find
"hairs", which has an ID of 5. Then you go to the table "languages" and find
"English" with an ID of 1. You go to the table "words" and see that the ID for
this word is 5 (along the way you might pick up the reading of the word). Now
you go to the table "grammar" and see that word 5 is actually the plural form
of word 1. In the same table you examine word 1 and see that it is a root
word; that is, one attached to a concept. Having found that out, you search in
"meanings" for the concepts attached to root word 1 and see that there are
two: concepts 1 and 2. In the table "concepts" you search for them and find
out that concept 2 is a root concept and concept 1 is also a root concept, but
a mass one; that is, a thing that comes in an undistinguished mass. Finally
you have:
'''hairs''':
1. ((rarely) plural denoting different kinds of) A bunch of hairs on someone's head
2. (plural) A keratinous outgrowth that covers the human body
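The lookup just described can be sketched in plain Python. The dictionaries
and the function look_up are hypothetical names used only for illustration.
One loud assumption: the "meanings" rows below follow the walkthrough (word 1
attached to concepts 1 and 2, "kosa" to 1, "dlaka" to 3, "malja" to 4), which
is not quite what the table as listed says.

```python
WRITINGS = {1: "hair", 2: "kosa", 3: "dlaka", 4: "malja", 5: "hairs"}
LANGUAGES = {1: "English", 2: "Serbian"}
BASICS = {
    1: "A bunch of hairs on someone's head",
    2: "A keratinous outgrowth that covers the human body",
    3: "A single hair on someone's head",
    4: "A single hair that is not on someone's head",
}
# word ID -> (language ID, writing ID); readings are left out for brevity
WORDS = {1: (1, 1), 2: (2, 2), 3: (2, 3), 4: (2, 4), 5: (1, 5), 6: (2, None)}
GRAMMAR = [(1, "root", 1), (2, "root", 2), (3, "root", 3), (4, "root", 4),
           (5, "plural", 1), (6, "plural", 4)]
CONCEPTS = [(1, "mass/root", 1), (2, "root", 2), (3, "root", 3),
            (4, "root", 4), (2, "includes", 3), (2, "includes", 4)]
# ASSUMPTION: (word ID, basic ID) rows taken from the walkthrough, see above
MEANINGS = [(1, 1), (1, 2), (2, 1), (3, 3), (4, 4)]


def look_up(spelling, language):
    """Return (grammatical relation, concept kind, definition) tuples."""
    writing_id = next(k for k, v in WRITINGS.items() if v == spelling)
    language_id = next(k for k, v in LANGUAGES.items() if v == language)
    results = []
    for word_id, key in WORDS.items():
        if key != (language_id, writing_id):
            continue
        # "grammar" points from the inflected form to its root word
        for source, relation, root_id in GRAMMAR:
            if source != word_id:
                continue
            # "meanings" attaches the root word to basic concepts
            for attached_word, basic_id in MEANINGS:
                if attached_word != root_id:
                    continue
                # the self-referencing row in "concepts" gives the kind
                kind = next(r for b, r, t in CONCEPTS
                            if b == basic_id and t == basic_id)
                results.append((relation, kind, BASICS[basic_id]))
    return results


for relation, kind, definition in look_up("hairs", "English"):
    print(f"({relation}, {kind}) {definition}")
```

Turning "plural" plus "mass/root" into the wording "(rarely) plural denoting
different kinds of" would be a separate presentation step, not shown here.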
Want to get the Serbian translation? You go back through the tables, starting
from basic concepts 1 and 2. You see in table "concepts" that concept 2
includes concepts 3 and 4. You go to the table "meanings" and see that
concepts 1, 2, 3 and 4 correspond to words 1, 2, 3 and 4; grammar is not
important now. In table "words" you see that only words 2, 3 and 4 are
Serbian, and in "writings" that they are "kosa", "dlaka" and "malja". You may
now go back, get their exact meanings and find the exact translation that you
need: more than a usual dictionary has to offer.
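The reverse walk can be sketched the same way. The function translate is a
hypothetical name; the "meanings" rows are again assumed from the walkthrough
rather than copied from the table as listed, and only one level of "includes"
is followed.

```python
WRITINGS = {1: "hair", 2: "kosa", 3: "dlaka", 4: "malja", 5: "hairs"}
BASICS = {
    1: "A bunch of hairs on someone's head",
    2: "A keratinous outgrowth that covers the human body",
    3: "A single hair on someone's head",
    4: "A single hair that is not on someone's head",
}
# word ID -> (language ID, writing ID); language 1 = English, 2 = Serbian
WORDS = {1: (1, 1), 2: (2, 2), 3: (2, 3), 4: (2, 4), 5: (1, 5), 6: (2, None)}
CONCEPTS = [(1, "mass/root", 1), (2, "root", 2), (3, "root", 3),
            (4, "root", 4), (2, "includes", 3), (2, "includes", 4)]
# ASSUMPTION: (word ID, basic ID) rows taken from the walkthrough
MEANINGS = [(1, 1), (1, 2), (2, 1), (3, 3), (4, 4)]


def translate(basic_ids, target_language_id):
    """Return sorted (spelling, definition) pairs in the target language."""
    # Widen the starting concepts with the concepts they include.
    wanted = set(basic_ids)
    wanted |= {target for basic, relation, target in CONCEPTS
               if relation == "includes" and basic in basic_ids}
    pairs = []
    for word_id, basic_id in MEANINGS:
        language_id, writing_id = WORDS[word_id]
        if (basic_id in wanted and language_id == target_language_id
                and writing_id is not None):
            pairs.append((WRITINGS[writing_id], BASICS[basic_id]))
    return sorted(pairs)


# Concepts 1 and 2 are what the lookup of "hairs" arrived at; 2 is Serbian.
for spelling, definition in translate({1, 2}, 2):
    print(f"{spelling}: {definition}")
```

Returning the definition next to each spelling is what lets a reader pick the
exact translation rather than just a list of candidates.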
This database system allows for much more than the current free-form
Wiktionary or the usual dictionaries. It would be easy to create software
suited to specific needs that would browse the database on or off line. All
grammatical forms of a word are noted, and it is easy to make a basic machine
translation from one language to another. It is easy to look up a word in an
unknown language when you don't know its root (this is often a problem,
especially with electronic dictionaries). I have not shown it in this example,
but table "concepts" would include .. which would enable searching for them
and creation of basic thesauri for all languages. It would also be possible to
extract separate professional subdictionaries etc. etc.
Now, if you have bothered to read all this, you might as well spend just a bit
more time to tell me what you think about it. I would be especially grateful
if someone could find something that cannot fit into this kind of a
dictionary. I already see a possible flaw; that is, that concept 1 in some
cultures is not a mass concept; but I am certain that this could be overcome.
Respectfully, may I call this scheme hair-brained, though I know that "hair"
in that expression is a common error when "hare" should be used.
At least it's naïve. People don't read instructions except as an
absolute last resort. If they needed such complicated instructions to
understand a difference in meaning focused on one concept in only 2
languages, they would put the explanation down and do something else.
Serbian and Croatian are much more closely related, but I'm sure that
the subtleties that make them different would be hard to explain,
especially to an English speaker who doesn't know much about either one.
Among the expressions which use hair in English we have
The gun has a hair trigger = The trigger mechanism is very sensitive
to the slightest pressure
He's got him by the short hairs = he's got him in a difficult
position that is equal to pulling on his pubic hairs
I had some of the hair of the dog that bit me = I had some of what I
drank last night to help relieve the hangover.
How's that for examples to start with? :-)
The point is that language structure is extremely complicated, and in
much of what matters there is rarely a simple one-to-one correspondence
between languages. That's often why machine translations look so much
like they came from a machine.
Final note: I think that the dictionary should not be under the GFDL; rather,
under a similar licence which would allow full copyright of a work derived
from a subset of the database, but not of one derived from its superset; in
other words, it would be possible for someone to take this dictionary, lay the
words out on paper, print it, and sell it, and no one would have the right to
photocopy or reprint it; but if that someone wants to add some words to the
dictionary, he must add them to the database first, which would then enable
anyone else to add them to their dictionaries.
Copyrights on dictionaries are highly disputable. They are often a
combination of material that may be both in and out of copyright. The
1913 Webster which is being used as a starting point for many of the
English words is well out of copyright. Single definitions can always
be considered fair use. Claiming copyrights as you suggest may be more
complicated than it's worth.
Ec