Re: [Wikitech-l] wiktionary

12 Feb 2004

Nikola Smolenski wrote:

...
Though I am not active in Wiktionary, I have been thinking about this for a
while. I think that the MediaWiki software, great as it is, is not adequate
for creating a dictionary and that new software has to be made from scratch.
I call this kind of software WikiBase - a database in which (more or less)
each field acts like a wiki page: it could be changed at any time, has an
edit history, could be discussed, etc.
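
To make the "each field acts like a wiki page" idea a little more concrete, here is a minimal sketch of one such field. The names `Revision` and `WikiField`, and the editor names, are hypothetical; nothing here is existing MediaWiki or WikiBase code.

```python
from dataclasses import dataclass, field


@dataclass
class Revision:
    """One saved state of a field (hypothetical structure)."""
    value: str
    editor: str
    comment: str = ""


@dataclass
class WikiField:
    """A single database field that keeps its own history and discussion."""
    revisions: list = field(default_factory=list)
    discussion: list = field(default_factory=list)

    def edit(self, value, editor, comment=""):
        # Nothing is overwritten: every edit appends a new revision.
        self.revisions.append(Revision(value, editor, comment))

    @property
    def value(self):
        # The current value is simply the latest revision, if any.
        return self.revisions[-1].value if self.revisions else None


spelling = WikiField()
spelling.edit("hare", "alice", "first draft")
spelling.edit("hair", "bob", "fix spelling")
spelling.discussion.append("Should the plural get its own row?")
```

A real system would also need revert, diffs and permissions; the sketch only shows that history lives with the field rather than with a whole page.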

In its simplest form, the database of a dictionary might have four tables:

words: ID|word
concepts: ID|concept
languages: ID|language
meanings: wordID|languageID|conceptID

They could be connected like this:

words:
1|egg
2|jaje

concepts:
1|Something laid by a hen

languages:
1|English
2|Serbian

meanings:
1|1|1
2|2|1

That is, the word "egg" in the language called "English" has the meaning
"Something laid by a hen", and the word "jaje" in the language called
"Serbian" has the same meaning. Now, this has obvious flaws, but I have not
envisioned the database to be so simple. To cut to the point, I think that
the following structure would be enough to satisfy all the needs of a
dictionary (WARNING: long and sometimes confusing text ahead):

writings: ID|spelling
readings: ID|reading
languages: ID|language|dialect|place|time|group
basics: ID|basic
words: ID|languageID|writingID|readingID
grammar: wordID|relation|wordID
concepts: basicID|relation|basicID
meanings: wordID|basicID
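
One possible way to write the eight tables above down as a real schema, using SQLite from Python. This is only a sketch: the choice of SQLite, the snake_case column names, the names `period` and `social_group` for the "time" and "group" columns, the `target_*` names for the second ID in the relation tables, and the foreign-key constraints are all assumptions, and the per-field edit history is left out.

```python
import sqlite3

# Hypothetical schema; table names follow the post, column names are assumed.
SCHEMA = """
CREATE TABLE writings  (id INTEGER PRIMARY KEY, spelling TEXT NOT NULL);
CREATE TABLE readings  (id INTEGER PRIMARY KEY, reading TEXT NOT NULL);
CREATE TABLE languages (id INTEGER PRIMARY KEY, language TEXT NOT NULL,
                        dialect TEXT, place TEXT, period TEXT,
                        social_group TEXT);
CREATE TABLE basics    (id INTEGER PRIMARY KEY, basic TEXT NOT NULL);
-- writing_id and reading_id may be NULL when not yet known.
CREATE TABLE words     (id INTEGER PRIMARY KEY,
                        language_id INTEGER NOT NULL REFERENCES languages(id),
                        writing_id INTEGER REFERENCES writings(id),
                        reading_id INTEGER REFERENCES readings(id));
CREATE TABLE grammar   (word_id INTEGER NOT NULL REFERENCES words(id),
                        relation TEXT NOT NULL,
                        target_word_id INTEGER NOT NULL REFERENCES words(id));
CREATE TABLE concepts  (basic_id INTEGER NOT NULL REFERENCES basics(id),
                        relation TEXT NOT NULL,
                        target_basic_id INTEGER NOT NULL REFERENCES basics(id));
CREATE TABLE meanings  (word_id INTEGER NOT NULL REFERENCES words(id),
                        basic_id INTEGER NOT NULL REFERENCES basics(id));
"""


def create_dictionary_db(path=":memory:"):
    """Create an empty dictionary database and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


conn = create_dictionary_db()
table_names = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(table_names)
```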

Now, how would all this work? As an example I will use the English word
"hair", for which three Serbian words exist: "kosa" (hair on one's head),
"dlaka" (a hair) or "malja" (a hair on body).

Table "writings" contains the exact words as written on paper:

writings:
1|hair
2|kosa
3|dlaka
4|malja
5|hairs

Table "readings" contains readings of the words. I guess that it might be the 
easiest to use an internal format for this, which could be externally 
represented as IPA or SAMPA. Of course, for some languages, the readings 
could be autogenerated.

readings:
1|hejr
2|kosa
3|dlaka
4|mal<sup>j</sup>a
5|mal<sup>j</sup>e

(Note that here some IDs are the same; this of course need not be the case.)

Table "languages" contains data about languages. I was thinking big and 
allowed for ability of defining various dialects, regions, exact (or not 
exact) time at which a word was in use, and slang (of a certain social 
group). Perhaps this table needs a bit more work, but the basic idea is 
there. In this example, I'll use only the language name and forget about the 
rest:

languages:
1|English
2|Serbian

The last of basic tables, "basics", describes basic concepts.

basics:
1|A bunch of hairs on someone's head
2|A keratinous outgrowth that covers human body
3|A single hair on someone's head
4|A single hair that is not on someone's head

The rest of the tables show relations among the IDs of these tables. Table
"words" shows which writing has which reading in which language.

words:
1|1|1|1
2|2|2|2
3|2|3|3
4|2|4|4
5|1|5|NULL
6|2|NULL|5

I'll expand the table:

1|English|hair|hejr
2|Serbian|kosa|kosa
3|Serbian|dlaka|dlaka
4|Serbian|malja|mal<sup>j</sup>a
5|English|hairs|NULL
6|Serbian|NULL|mal<sup>j</sup>e

Note that how the English word "hairs" is read is currently not known, and
how a certain Serbian word is actually written is also currently not known.
It doesn't matter.
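
The expansion of "words" is a lookup of each ID in the table it points to, with unknown entries staying empty. A small sketch using plain Python dictionaries in place of database tables (the function name `expand_words` is made up for illustration; the data are the rows given above):

```python
writings = {1: "hair", 2: "kosa", 3: "dlaka", 4: "malja", 5: "hairs"}
readings = {1: "hejr", 2: "kosa", 3: "dlaka",
            4: "mal<sup>j</sup>a", 5: "mal<sup>j</sup>e"}
languages = {1: "English", 2: "Serbian"}

# words: (ID, languageID, writingID, readingID); None stands for NULL.
words = [
    (1, 1, 1, 1),
    (2, 2, 2, 2),
    (3, 2, 3, 3),
    (4, 2, 4, 4),
    (5, 1, 5, None),
    (6, 2, None, 5),
]


def expand_words(rows):
    """Replace each ID with the value it refers to; unknown parts stay None."""
    expanded = []
    for word_id, language_id, writing_id, reading_id in rows:
        expanded.append((
            word_id,
            languages[language_id],
            writings.get(writing_id),   # .get(None) returns None
            readings.get(reading_id),
        ))
    return expanded


for row in expand_words(words):
    print("|".join("NULL" if part is None else str(part) for part in row))
```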

Now, table "grammar" explains grammatical relations between the words:

1|root|1
2|root|2
3|root|3
4|root|4
5|plural|1
6|plural|4

Expanded:

hair|root|hair
kosa|root|kosa
dlaka|root|dlaka
malja|root|malja
hairs|plural|hair
malje|plural|malja

I will explain what this "root" property means later when I explain how to 
actually query the database.

Table "concepts" is similar, except that it explains relations of the 
basic concepts:

1|mass/root|1
2|root|2
3|root|3
4|root|4
2|includes|3
2|includes|4

I will not expand the table but rather show the table "basics" again.

basics:
1|A bunch of hairs on someone's head
2|A keratinous outgrowth that covers human body
3|A single hair on someone's head
4|A single hair that is not on someone's head

FINALLY, table "meanings" attaches words to concepts.

1|1
2|1
2|2
3|3
4|4
5|5

Now, how to read the dictionary. Suppose that you want to know what the word
"hairs" means in the English language. You go to the table "writings" and
find "hairs", which has an ID of 5. Then you go to the table "languages" and
find "English" with an ID of 1. You go to the table "words" and see that the
ID for this word is 5 (along the way you might pick up the reading of the
word). Now you go to the table "grammar" and see that word 5 is actually the
plural form of word 1. In the same table you examine word 1 and see that it
is a root word; that is, one attached to a concept. Having found that out,
you search in "meanings" for the concepts attached to root word 1 and see
that there are two: concepts 1 and 2. In the table "concepts" you look them
up and find that concept 2 is a root concept and concept 1 is also a root
concept, but a mass one; that is, a thing that comes in an undistinguished
mass. Finally you have:

'''hairs''':
1. ((rarely) plural denoting different kinds of) A bunch of hairs on
someone's head
2. (plural) A keratinous outgrowth that covers human body
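
The lookup just described can be sketched in a few lines of Python. Note one assumption: the rows listed for "meanings" do not quite agree with the walk-through (which says root word 1 is attached to concepts 1 and 2), so the `meanings` rows below are chosen to match the prose and are not the rows as posted. The function name `look_up` is also made up.

```python
writings = {1: "hair", 2: "kosa", 3: "dlaka", 4: "malja", 5: "hairs"}
languages = {1: "English", 2: "Serbian"}
basics = {
    1: "A bunch of hairs on someone's head",
    2: "A keratinous outgrowth that covers human body",
    3: "A single hair on someone's head",
    4: "A single hair that is not on someone's head",
}
# words: (ID, languageID, writingID, readingID)
words = [(1, 1, 1, 1), (2, 2, 2, 2), (3, 2, 3, 3), (4, 2, 4, 4),
         (5, 1, 5, None), (6, 2, None, 5)]
# grammar: (wordID, relation, wordID)
grammar = [(1, "root", 1), (2, "root", 2), (3, "root", 3), (4, "root", 4),
           (5, "plural", 1), (6, "plural", 4)]
# concepts: (basicID, relation, basicID)
concepts = [(1, "mass/root", 1), (2, "root", 2), (3, "root", 3),
            (4, "root", 4), (2, "includes", 3), (2, "includes", 4)]
# ASSUMPTION: (wordID, basicID) rows picked to agree with the walk-through.
meanings = [(1, 1), (1, 2), (2, 1), (3, 3), (4, 4)]


def look_up(spelling, language):
    """Return (grammatical form, concept kind, definition) for a written word."""
    writing_id = next(i for i, s in writings.items() if s == spelling)
    language_id = next(i for i, name in languages.items() if name == language)
    word_id = next(w for w, lang, wr, _ in words
                   if lang == language_id and wr == writing_id)
    # "grammar" tells us the form and which root word carries the meaning.
    form, root_id = next((rel, target) for source, rel, target in grammar
                         if source == word_id)
    senses = []
    for meaning_word, basic_id in meanings:
        if meaning_word != root_id:
            continue
        # A concept's own kind is the row that relates it to itself.
        kind = next(rel for source, rel, target in concepts
                    if source == basic_id and target == basic_id)
        senses.append((form, kind, basics[basic_id]))
    return senses


for form, kind, text in look_up("hairs", "English"):
    print(f"({form}, {kind}) {text}")
```

Turning "plural" plus "mass/root" into the wording "(rarely) plural denoting different kinds of" would be a separate presentation step that this sketch does not attempt.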

Want to get the Serbian translation? You go back through the tables, starting
from basic concepts 1 and 2. You see in table "concepts" that concept 2
includes concepts 3 and 4. You go to the table "meanings" and see that
concepts 1, 2, 3 and 4 correspond to words 1, 2, 3 and 4; grammar is not
important now, and in table "words" you see that only words 2, 3 and 4 are
Serbian, and in "writings" that they are "kosa", "dlaka" and "malja". You may
now go back and get their exact meanings and find the exact translation that
you need - more than the usual dictionary has to offer.
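
The reverse walk, from concepts to words in another language, might look like this. It is a sketch only: `translate` is a made-up name, and the `meanings` rows are an assumption chosen so that concepts 1 to 4 lead to words 1 to 4 as the paragraph above describes.

```python
writings = {1: "hair", 2: "kosa", 3: "dlaka", 4: "malja", 5: "hairs"}
languages = {1: "English", 2: "Serbian"}
basics = {
    1: "A bunch of hairs on someone's head",
    2: "A keratinous outgrowth that covers human body",
    3: "A single hair on someone's head",
    4: "A single hair that is not on someone's head",
}
# words: (ID, languageID, writingID, readingID)
words = [(1, 1, 1, 1), (2, 2, 2, 2), (3, 2, 3, 3), (4, 2, 4, 4),
         (5, 1, 5, None), (6, 2, None, 5)]
# concepts: (basicID, relation, basicID)
concepts = [(1, "mass/root", 1), (2, "root", 2), (3, "root", 3),
            (4, "root", 4), (2, "includes", 3), (2, "includes", 4)]
# ASSUMPTION: (wordID, basicID) rows picked to agree with the prose.
meanings = [(1, 1), (1, 2), (2, 1), (3, 3), (4, 4)]


def translate(basic_ids, language):
    """Return (written word, definitions) pairs in `language` for the concepts."""
    # Widen the search to everything the starting concepts include.
    wanted = set(basic_ids)
    for source, relation, target in concepts:
        if relation == "includes" and source in basic_ids:
            wanted.add(target)
    language_id = next(i for i, name in languages.items() if name == language)
    translations = []
    for word_id, lang, writing_id, _ in words:
        if lang != language_id or writing_id is None:
            continue  # wrong language, or no known written form
        senses = [basics[b] for w, b in meanings if w == word_id and b in wanted]
        if senses:
            translations.append((writings[writing_id], senses))
    return translations


for word, senses in translate({1, 2}, "Serbian"):
    print(word, "=", "; ".join(senses))
```

Returning the definitions alongside each word is what lets the reader pick the exact translation rather than just a list of candidates.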

This database system allows for much more than the current free-form
Wiktionary or the usual dictionaries. It would be easy to create software
suited to specific needs that would browse the database on or off line. All
grammatical forms of a word are noted, and it is easy to make basic machine
translation from one language to another. It is easy to look up a word in an
unknown language when you don't know its root (this is often a problem,
especially with electronic dictionaries). I have not shown it in this
example, but table "concepts" would include .. which would enable searching
for them and creation of basic thesauri for all languages. It would also be
possible to extract separate professional subdictionaries etc. etc.

Now, if you have bothered to read all this, you might as well spend just a
bit more time to tell me what you think about it. I would be especially
grateful if someone could find something that cannot fit into this kind of a
dictionary. I already see a possible flaw; that is, that concept 1 in some
cultures is not a mass concept; but I am certain that this could be overcome.

Respectfully, may I call this scheme hair-brained, though I know that "hair"
in that expression is a common error when "hare" should be used. At least
it's naïve.  People don't read instructions except as an absolute last
resort. If they needed such complicated instructions to understand a
difference in meaning focused on one concept in only 2 languages, they would
put the explanation down and do something else. Serbian and Croatian are much
more closely related, but I'm sure that the subtleties that make them
different would be difficult to explain, especially to an English speaker
who doesn't know much about either one.

Among the expressions which use hair in English we have
    The gun has a hair trigger = The trigger mechanism is very sensitive 
to the slightest pressure
    He's got him by the short hairs = he's got him in a difficult 
position that is equal to pulling on his pubic hairs
    I had some of the hair of the dog that bit me = I had some of what I 
drank last night to help relieve the hangover.
How's that for examples to start with? :-)

The point is that language structure is extremely complicated, and in 
much of what matters there is rarely a simple one-to-one correspondence 
between languages.  That's often why machine translations look so much 
like they came from a machine.

...
Final note: I think that the dictionary should not be under the GFDL; rather,
under a similar licence which would allow full copyright of a work derived
from a subset of the database, but not of one derived from its superset. In
other words, it would be possible for someone to take this dictionary, lay
the words out on paper, print it, and sell it, and no one would have the
right to photocopy or reprint it; but if that someone wants to add some words
to the dictionary, he must add them to the database first, which would then
enable anyone else to add them to their dictionaries.

Copyrights on dictionaries are highly disputable.  They are often a
combination of material that may be both in and out of copyright.  The 
1913 Webster which is being used as a starting point for many of the 
English words is well out of copyright.  Single definitions can always 
be considered fair use.  Claiming copyrights as you suggest may be more 
complicated than it's worth.

Ec
