On Friday 22 July 2005 13:25, Gerard Meijssen wrote:
Nikola Smolenski wrote:
On Wednesday 20 July 2005 21:29, Gerard Meijssen wrote:
Related to this, I'd suggest adding an "ISO15924" column to the
"characterset" (or future "script") table. This way a script can be
formally specified and looked up, regardless of its name.
Done; however, the name of the script is a record in the database in its
own right, so it may have as many translations as we care to enter. The
code is just to anchor it. I understand from Erik's notes that a
language can be indicated as the default value. The default value will
be English. So, add a translation of the English word and from then on
the User Interface will show it localised.
I'll talk about this below.
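A minimal sketch of the "code as anchor, name as translatable record" idea, using SQLite from Python. The table and column names (script, script_name, language) and the helper script_name() are assumptions made for illustration only; in the actual design the script name is an ordinary word record in the database rather than a row in a dedicated name table.

```python
import sqlite3

def build_script_db():
    """Create an in-memory database with a hypothetical 'script' table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE script (
            script_id INTEGER PRIMARY KEY,
            iso15924  TEXT NOT NULL UNIQUE      -- stable anchor, e.g. 'Latn'
        );
        CREATE TABLE script_name (
            script_id INTEGER NOT NULL REFERENCES script(script_id),
            language  TEXT NOT NULL,            -- language of this name
            name      TEXT NOT NULL,
            PRIMARY KEY (script_id, language)
        );
    """)
    conn.executemany("INSERT INTO script VALUES (?, ?)",
                     [(1, "Latn"), (2, "Cyrl")])
    conn.executemany("INSERT INTO script_name VALUES (?, ?, ?)", [
        (1, "en", "Latin"),
        (2, "en", "Cyrillic"),
        (2, "sr", "ћирилица"),
    ])
    return conn

def script_name(conn, iso15924, language, default="en"):
    """Look a script up by its code; fall back to the default language."""
    for lang in (language, default):
        row = conn.execute(
            "SELECT n.name FROM script s "
            "JOIN script_name n ON n.script_id = s.script_id "
            "WHERE s.iso15924 = ? AND n.language = ?",
            (iso15924, lang),
        ).fetchone()
        if row:
            return row[0]
    return None  # unknown code, or no name in either language
```

Looking up "Cyrl" for a Serbian user yields the Serbian name, while "Latn" falls back to the English default because no Serbian name has been entered yet.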
Why is there a column "gender" in table "word"? If a word can exist in
multiple genders, shouldn't that rather be represented in the "inflection"
table? If a word has a gender on its own, wouldn't that rather be
represented in the WordType table? If not, there are other properties of
words (for example, number) which could also be represented in the "word"
table, so why is gender singled out?
When a word is inflected to a particular form, that word is a word in
its own right and consequently will be found in the UW. The inflection
is there because it does provide information and this information is
relevant for the inflections and the headword.
Now I'm not so sure that I understand which table is for what. Could you give
an example? For example, the word "white" is a base word and the word
"whiter" is its inflection. How would these two words fit into the database?
A Wordtype indicates a noun, a verb, an adjective, etc.
I still don't understand why gender is singled out of all the properties a
word could have. For example, a verb could be transitive or intransitive, and
this information is important. To give an example:
word: horse
gender: male
partofspeech: noun

word: ship
gender: female
partofspeech: noun

word: to drive
gender: none
partofspeech: transitive verb

word: to swim
gender: none
partofspeech: intransitive verb

See what I mean? If you have to specify the transitivity of a verb in the
"partofspeech" table, you may as well specify the gender of a noun in that
table. It would be consistent either to remove the "gender" column from the
"word" table:
word: horse
partofspeech: male noun

word: ship
partofspeech: female noun

word: to drive
partofspeech: transitive verb

word: to swim
partofspeech: intransitive verb

or to rename it to, for example, "subtype":
word: horse
subtype: male
partofspeech: noun

word: ship
subtype: female
partofspeech: noun

word: to drive
subtype: transitive
partofspeech: verb

word: to swim
subtype: intransitive
partofspeech: verb

If you are going to change this, I'd suggest the first solution. Firstly,
because there may be words which would have more than one subtype; secondly,
because it eliminates the possibility of having an invalid mix of subtypes
(horse: intransitive noun...).
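To illustrate why the first solution rules out invalid mixes, here is a small sketch in Python. The enumeration PartOfSpeech and the WORDS mapping are hypothetical stand-ins for the "partofspeech" and "word" tables; the point is only that when every allowed combination is listed explicitly, a combination such as "intransitive noun" simply has no value to refer to.

```python
from enum import Enum

class PartOfSpeech(Enum):
    """Hypothetical closed list of valid part-of-speech values."""
    MALE_NOUN = "male noun"
    FEMALE_NOUN = "female noun"
    TRANSITIVE_VERB = "transitive verb"
    INTRANSITIVE_VERB = "intransitive verb"

# Each word points at exactly one valid entry; there is no separate
# "gender" or "subtype" column that could contradict it.
WORDS = {
    "horse": PartOfSpeech.MALE_NOUN,
    "ship": PartOfSpeech.FEMALE_NOUN,
    "to drive": PartOfSpeech.TRANSITIVE_VERB,
    "to swim": PartOfSpeech.INTRANSITIVE_VERB,
}

def parse_part_of_speech(label):
    """Return the matching entry, or raise ValueError if it is not listed."""
    return PartOfSpeech(label)
```

With the second solution (separate "subtype" and "partofspeech" columns) the same guarantee would need an extra consistency check between the two columns.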
I'd strongly suggest adding a column "inflection" to either the "wordtype"
or the "word" table; this would specify which inflection a word uses, and
whether it is regular or irregular. If it is known which inflection a
word uses and if it is regular, then all its inflected forms could be
generated automatically.
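A rough sketch of what automatic generation could look like, using the "white"/"whiter" example from earlier in this thread. The paradigm name "en-adj-er", the REGULAR_PARADIGMS table and the inflect() helper are all assumptions for illustration. The rule is deliberately simplistic: it only drops a final "e" and ignores consonant doubling and other spelling changes.

```python
# Hypothetical paradigm table: inflection name -> {form: suffix}.
REGULAR_PARADIGMS = {
    "en-adj-er": {"comparative": "er", "superlative": "est"},
}

def inflect(word, paradigm, irregular=None):
    """Return the inflected forms of a word.

    If the word is marked irregular, its stored forms are returned as-is;
    otherwise the forms are generated from the paradigm's suffixes.
    """
    if irregular is not None:
        return dict(irregular)
    # Simplified stem rule: drop a final 'e' before adding the suffix.
    stem = word[:-1] if word.endswith("e") else word
    suffixes = REGULAR_PARADIGMS[paradigm]
    return {form: stem + suffix for form, suffix in suffixes.items()}
```

For a regular word only the paradigm name needs to be stored, while an irregular word such as "good" still needs its forms ("better", "best") entered by hand.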
I see that there is a column "languageid" in the "meaningtext" table. If I
understand this, it means that the meaning of a word could be written down in
various languages, and I second this. But I wonder how you are going to do
the same for other data which might need to be translated (for example, the
column "characterset" of the "characterset" table; I understand that this
is pure text?). Were you thinking about this?
The fields "Sign" :) "Gender" "WordType" all relate to meaning; Ultimate
Wiktionary will eat its own dogfood, so when a translation of a word like
"noun" is added, as I did for Afrikaans recently, this translation is the
one that will be used in the User Interface.
OK, but what if you have a longer phrase as a table field? For example, an
"inflection" in table "inflection" might be "male genitive superlative" or
"3rd person plural female past". I don't think it makes sense to add such
phrases to the dictionary as proper entries, only so that the dictionary
would have translations of them.
Are some table fields inherently translatable? Is this what you had in mind
above?
Were you thinking about a way to register examples of use, similar to
meaning? Or would examples of use simply be raw text in the "meaningtext"
table?
Idioms and proverbs will be a "wordtype" as much as noun is. Therefore a
proverb will relate to a keyword through WordRelation. I have updated
the table Relation with a new field "SameLanguageOnly"; this ensures that
the relation is applicable only within the same language. So the relation
would be "proverb" and it would combine "apple" with "an apple a day
keeps the doctor away".
MeaningText would be just the definition of a meaning in a given language.
I was thinking about something else; for example, on
http://en.wiktionary.org/wiki/account there is this example: "A beggarly
account of empty boxes. - Shakespeare, Romeo and Juliet, V-i"; but I
understand now it is going to be just a part of "meaningtext". I'm not so
certain, but maybe it would be good to create a separate table for examples,
because the same examples could (and probably will) be used in "meaningtext"s
in different languages. It would also make it easier to automatically add new
examples (for example, by grepping Project Gutenberg ;)
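A minimal sketch of such a separate examples table, using SQLite from Python. Only "meaningtext" and "languageid" come from the schema under discussion; the "example" and "meaningtext_example" tables, their columns, and the placeholder definition texts are assumptions for illustration.

```python
import sqlite3

def build_examples_db():
    """In-memory schema where one example is shared by several meaningtexts."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE meaningtext (
            meaningtext_id INTEGER PRIMARY KEY,
            languageid     TEXT NOT NULL,
            text           TEXT NOT NULL
        );
        CREATE TABLE example (
            example_id INTEGER PRIMARY KEY,
            text       TEXT NOT NULL,
            source     TEXT
        );
        -- Link table: the same example may illustrate many meaningtexts.
        CREATE TABLE meaningtext_example (
            meaningtext_id INTEGER NOT NULL REFERENCES meaningtext(meaningtext_id),
            example_id     INTEGER NOT NULL REFERENCES example(example_id),
            PRIMARY KEY (meaningtext_id, example_id)
        );
    """)
    conn.executemany("INSERT INTO meaningtext VALUES (?, ?, ?)", [
        (1, "en", "(definition of 'account' in English)"),
        (2, "sr", "(definition of 'account' in Serbian)"),
    ])
    conn.execute(
        "INSERT INTO example VALUES (1, ?, ?)",
        ("A beggarly account of empty boxes.",
         "Shakespeare, Romeo and Juliet, V-i"),
    )
    # The example is stored once and linked to both definitions.
    conn.executemany("INSERT INTO meaningtext_example VALUES (?, ?)",
                     [(1, 1), (2, 1)])
    return conn

def examples_for(conn, meaningtext_id):
    """Return (text, source) pairs attached to one meaningtext."""
    return conn.execute(
        "SELECT e.text, e.source FROM example e "
        "JOIN meaningtext_example me ON me.example_id = e.example_id "
        "WHERE me.meaningtext_id = ? ORDER BY e.example_id",
        (meaningtext_id,),
    ).fetchall()
```

Because the quotation is stored only once, an automated importer would need to add one row to "example" and then link it wherever it applies.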
One reason why it is not as well documented as I would like is that
I am still working on the structure. At this moment I am thinking hard
about how to include signed languages and the spoken dialects of the
Chinese and Arabic written languages.
The last one is easy. Add ISO3166_1 and ISO3166_2 columns to the language
table. That way, you could formally represent a dialect within a certain
country (as for Arabic) or even a region (as for Chinese). Perhaps an even
better solution would be to include a RegionID column which would point
to a table "regions" (relation 1-many), which would have RegionID,
ISO3166_1 and ISO3166_2 columns; that way you could specify wider regions
in which a dialect is spoken, even if they cross country boundaries.
The country code is irrelevant as far as this database is concerned.
This database is about words in languages and dialects.
I still think that this would be a useful way of formally specifying a
dialect. For example, British English would have ISO639_2 code "en" and
ISO3166_1 code "uk" while Australian English would have ISO639_2 code "en"
but ISO3166_1 code "au".
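A small sketch of the region idea in Python. The Region and Language classes, their field names, and the find_dialects() helper are hypothetical; only the RegionID, ISO3166_1 and ISO3166_2 columns come from the proposal. Note that the sketch uses "GB", which is the ISO 3166-1 alpha-2 code assigned to the United Kingdom, and that "en" is the two-letter ISO 639-1 code for English.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Region:
    region_id: int
    iso3166_1: str                   # country code, e.g. "GB"
    iso3166_2: Optional[str] = None  # subdivision code; None = whole country

@dataclass(frozen=True)
class Language:
    name: str
    iso639: str
    region_ids: Tuple[int, ...] = ()  # one language, many regions

REGIONS = {
    1: Region(1, "GB"),
    2: Region(2, "AU"),
}

LANGUAGES = [
    Language("English", "en"),
    Language("British English", "en", (1,)),
    Language("Australian English", "en", (2,)),
]

def find_dialects(iso639, iso3166_1):
    """Names of all dialects of a language spoken in the given country."""
    return [
        lang.name
        for lang in LANGUAGES
        if lang.iso639 == iso639
        and any(REGIONS[rid].iso3166_1 == iso3166_1 for rid in lang.region_ids)
    ]
```

A dialect that crosses country boundaries would simply list several region ids.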
Were you thinking about a way to formally define a dialect? Ideally,
besides a region (by the way, ISO3166-2 is not granular enough, and there
should be a better way of expressing the region in which a dialect is spoken,
perhaps even down to the level of a village), there should be specified a
time period in which a dialect was spoken, the social layer which was
speaking it, and perhaps even a particular person or entity using it. OK, I
got carried away a bit but such things might be important :) Though some of
this should perhaps be (also) tied to a word and not a dialect.
Ultimate Wiktionary is about words (written, spoken or signed); that is
the starting point. There will be a need for some formality; once a
dialect is recognised, it will be hard to take it away. Therefore, in my
opinion, a dialect will be recognised only after some discussion, and it
will be added as such by an admin. I would think that we need to consider
what it takes before we add a dialect. Tentatively I would go for at least
100 words defined as such. With a dialect I would assume that words that
are not defined are those of the higher level language.
When I was referring to "dialect", I did not have in mind a dialect that is
officially recognised, but simply a set of words which could be identified as
belonging to a certain group. So if you want to say that this word was part of
London dockworkers' slang in the 1800s, you should be able to do so, and not
just stamp it with "British English".
As a simple example, in Serbia, there are several publishing houses that were
publishing Asterix, and in some translations "Idefix" is named "Garoviks" and
"Panoramix" is named "Aspiriniks" while in others "Idefix" is named "Idefiks"
and "Panoramix" is named "Panoramiks"; and this is consistent. If you are
going to translate something about Asterix to Serbian, you should pick one of
the translations, but you should be consistent in using only the words from
the translation which you have picked, and they should somehow be marked as
belonging to the same translation. There surely are more important things than
Asterix where something similar might apply.
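A minimal sketch of how words could be marked as belonging to the same translation, in Python. The labels "set-1" and "set-2", the TRANSLATIONS list and both helper functions are hypothetical; the names themselves are the ones from the Asterix example above.

```python
# (source name, Serbian name, translation set) -- set labels are made up.
TRANSLATIONS = [
    ("Idefix", "Garoviks", "set-1"),
    ("Panoramix", "Aspiriniks", "set-1"),
    ("Idefix", "Idefiks", "set-2"),
    ("Panoramix", "Panoramiks", "set-2"),
]

def glossary(translation_set):
    """All source -> target pairs that belong to one translation set."""
    return {src: tgt for src, tgt, tset in TRANSLATIONS
            if tset == translation_set}

def translate_names(names, translation_set):
    """Translate names using one set only, so the result stays consistent."""
    terms = glossary(translation_set)
    missing = [name for name in names if name not in terms]
    if missing:
        raise KeyError(f"not in {translation_set}: {missing}")
    return [terms[name] for name in names]
```

Refusing to fall back to another set when a name is missing is a design choice: it keeps a text from silently mixing "Garoviks" with "Panoramiks".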