Nikola Smolenski wrote:
On Friday 22 July 2005 13:25, Gerard Meijssen wrote:
Nikola Smolenski wrote:
On Wednesday 20 July 2005 21:29, Gerard Meijssen wrote:
Related to this, I'd suggest adding an "ISO15924" column to the "characterset"
(or future "script") table. This way a script can be formally specified
and looked up, regardless of its name.
Done, however the name of the script is a record in the database in its
own right so it may have as many translations as we care to enter. The
code is just to anchor it. I understand from Erik's notes that a
language can be indicated as the default value. The default value will
be English. So, add a translation to the English word and from then on
the User Interface will show it localised.
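A minimal sketch of the "code as anchor, name as translatable record" idea described above. The table and column names (`script`, `script_name`) and the sample rows are assumptions made for illustration; they are not taken from the actual ERD. The ISO 15924 code identifies the script, and the displayed name falls back to English when no translation exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE script (script_id INTEGER PRIMARY KEY, iso15924 TEXT UNIQUE);
CREATE TABLE script_name (
    script_id INTEGER REFERENCES script(script_id),
    language  TEXT,   -- language of the translated name
    name      TEXT
);
INSERT INTO script VALUES (1, 'Cyrl'), (2, 'Latn');
INSERT INTO script_name VALUES
    (1, 'en', 'Cyrillic'), (1, 'nl', 'Cyrillisch'), (2, 'en', 'Latin');
""")

def script_name(iso15924, language, default="en"):
    """Return the localised script name, falling back to the default language."""
    for lang in (language, default):
        row = conn.execute(
            "SELECT n.name FROM script s JOIN script_name n USING (script_id) "
            "WHERE s.iso15924 = ? AND n.language = ?",
            (iso15924, lang),
        ).fetchone()
        if row:
            return row[0]
    return None  # unknown code, or no name in either language
```

With this data, `script_name("Latn", "nl")` falls back to the English name because no Dutch translation has been entered yet.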
I'll talk about this below.
Why is there a "gender" column in the "word" table? If a word can exist in
multiple genders, shouldn't that rather be represented in the "inflection"
table? If a word has a gender of its own, wouldn't that rather be
represented in the WordType table? If not, there are other properties of
words (for example, number) which could also be represented in the "word"
table, so why is gender singled out?
When a word is inflected to a particular form, that word is a word in
its own right and consequently will be found in the UW. The inflection
is there because it does provide information and this information is
Now I'm not so sure that I understand which table is for what. Could you give
an example? For example, the word "white" is a base word and the word
"whiter" is its inflection. How would these two words fit into the database?
Both words will exist as a Spelling, as a Word and they may share a
Meaning. When the inflections are added, in the Inflection-Word, all the
missing words will be created and they will all be related to each other
through this table. Contrary to a paper dictionary we want them all.
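As a rough illustration of the answer above, the following sketch uses plain in-memory structures instead of database tables. The class, its method names and the inflection labels ("comparative", "superlative") are hypothetical. It only shows the behaviour described: adding an inflection creates any missing word, and all forms end up related through the inflection records.

```python
from dataclasses import dataclass, field

@dataclass
class Dictionary:
    words: set = field(default_factory=set)          # every form is a Word
    inflections: list = field(default_factory=list)  # (headword, form, label)

    def add_word(self, spelling):
        self.words.add(spelling)

    def add_inflection(self, headword, form, label):
        # Missing words are created, so every inflected form exists
        # as a Word in its own right.
        self.add_word(headword)
        self.add_word(form)
        self.inflections.append((headword, form, label))

    def forms_of(self, headword):
        return {form: label for h, form, label in self.inflections if h == headword}

d = Dictionary()
d.add_word("white")
d.add_inflection("white", "whiter", "comparative")
d.add_inflection("white", "whitest", "superlative")
```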
relevant for the inflections and the headword. A Wordtype indicates a
noun, a verb, an adjective, etc.
I still don't understand why gender is singled out of all the properties a word
could have. For example, a verb could be transitive or intransitive, and this
information is important. To give an example:
word: horse
gender: male
partofspeech: noun
word: ship
gender: female
partofspeech: noun
word: to drive
gender: none
partofspeech: transitive verb
word: to swim
gender: none
partofspeech: intransitive verb
See what I mean? If you have to specify the transitivity of a verb in the
"partofspeech" table, you may as well specify the gender of a noun in that table.
It would be consistent either to remove the "gender" column from the "word" table:
word: horse
partofspeech: male noun
word: ship
partofspeech: female noun
word: to drive
partofspeech: transitive verb
word: to swim
partofspeech: intransitive verb
or to rename it to, for example, "subtype":
word: horse
subtype: male
partofspeech: noun
word: ship
subtype: female
partofspeech: noun
word: to drive
subtype: transitive
partofspeech: verb
word: to swim
subtype: intransitive
partofspeech: verb
If you are going to change this, I'd suggest the first solution. Firstly,
because there may be words which would have more than one subtype; secondly,
because it eliminates the possibility of having an invalid mix of subtypes
(horse: intransitive noun...).
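The second argument can be sketched in code. This is only an illustration of the first solution, with a hypothetical `PartOfSpeech` enumeration standing in for the "partofspeech" table: because only valid combinations are listed, an invalid mix such as "intransitive noun" cannot be stored.

```python
from enum import Enum

class PartOfSpeech(Enum):
    # Only combinations that make sense are listed (illustrative subset).
    MALE_NOUN = "male noun"
    FEMALE_NOUN = "female noun"
    TRANSITIVE_VERB = "transitive verb"
    INTRANSITIVE_VERB = "intransitive verb"

def make_word(word, partofspeech):
    # Enum lookup by value raises ValueError for anything not listed.
    return {"word": word, "partofspeech": PartOfSpeech(partofspeech)}

horse = make_word("horse", "male noun")

try:
    make_word("horse", "intransitive noun")
    rejected = False
except ValueError:
    rejected = True  # the invalid mix is refused
```

With separate "subtype" and "partofspeech" columns, the same guarantee would need an extra validation rule that pairs each subtype with the parts of speech it may apply to.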
At this moment in time I would not have intransitive verbs or transitive
verbs at all. To me they are verbs. When they are transitive, they have
a different meaning from when they are intransitive, so to me the
distinction is in the meaning.
I'd strongly suggest adding a column "inflection" to either the "wordtype" or
the "word" table; this would specify which inflection a word uses, and
whether it is regular or irregular. If it is known which inflection a
word uses and if it is regular, then all its inflected forms could be
generated automatically.
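A minimal sketch of what such automatic generation could look like. The inflection class name "adj-er", the rule itself and the irregular table are all assumptions for illustration; the rule is deliberately simplistic (it does not handle consonant doubling as in "big", "bigger").

```python
# Hypothetical inflection classes: each maps a base form to its inflected forms.
def adj_er(base):
    stem = base[:-1] if base.endswith("e") else base
    return {"comparative": stem + "er", "superlative": stem + "est"}

INFLECTION_CLASSES = {"adj-er": adj_er}

# Irregular words carry their forms explicitly instead of being generated.
IRREGULAR_FORMS = {"good": {"comparative": "better", "superlative": "best"}}

def inflect(word, inflection_class, regular=True):
    """Generate forms for a regular word, or look them up for an irregular one."""
    if not regular:
        return IRREGULAR_FORMS[word]
    return INFLECTION_CLASSES[inflection_class](word)
```

Here `inflect("white", "adj-er")` generates "whiter" and "whitest" from the rule, while `inflect("good", "adj-er", regular=False)` returns the stored forms.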
I see that there is a column "languageid" in the "meaningtext" table. If I
understand this, it means that the meaning of a word could be written down in
various languages, and I second this. But I wonder how you are going to do
the same for other data which might need to be translated (for example, the
column "characterset" of the "characterset" table; I understand that this
is pure text?). Were you thinking about this?
The fields "Sign" :) "Gender" "WordType" all relate to meaning; Ultimate
Wiktionary will eat its own dogfood, so when a translation of a word like
"noun" is added, as I did for Afrikaans recently, this translation is the
one that will be used in the User Interface.
OK, but what if you have a longer phrase as a table field? For example, an
"inflection" in table "inflection" might be "male genitive superlative" or
"3rd person plural female past". I don't think it makes sense to add such
phrases to the dictionary as proper entries, only so that the dictionary
would have translations of them.
Are some table fields inherently translatable? Is this what you had in mind
above?
Most if not all text fields will be inherently translatable; this is
what I have very much in mind. The name of a font will not be translated,
but that is the only one at this point in time. It makes perfect sense
to have this in the UW as it allows us to have a self-learning User
Interface. The thing is, it has a function.
Were you thinking about a way to register examples of use, similar to
meaning? Or would examples of use be simply raw text in the "meaningtext"
table?
Idioms and proverbs will be a "wordtype" as much as noun is. Therefore a
proverb will relate to a keyword through WordRelation. I have updated
the table Relation with a new field "SameLanguageOnly"; this ensures that
the relation is applicable only within the same language, so the relation
would be "proverb" and it would combine "apple" with "an apple a day
keeps the doctor away".
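A small sketch of how a "SameLanguageOnly" flag on a relation could be enforced. The classes and the `relate` helper are hypothetical and stand in for the Relation and WordRelation tables; only the flag itself comes from the message above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Word:
    spelling: str
    language: str

@dataclass(frozen=True)
class Relation:
    name: str
    same_language_only: bool

def relate(relation, a, b):
    """Return a (relation, word, word) link, enforcing the language constraint."""
    if relation.same_language_only and a.language != b.language:
        raise ValueError(f"'{relation.name}' only links words of one language")
    return (relation.name, a, b)

proverb = Relation("proverb", same_language_only=True)
apple = Word("apple", "en")
saying = Word("an apple a day keeps the doctor away", "en")
link = relate(proverb, apple, saying)
```

Trying to link "apple" to a word in another language through the "proverb" relation would raise a `ValueError` in this sketch.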
MeaningText would be just the definition of a meaning in a given language.
I was thinking about something else; for example, on
http://en.wiktionary.org/wiki/account there is this example: "A beggarly
account of empty boxes. - Shakespeare, Romeo and Juliet, V-i"; but I
understand now it is going to be just a part of "meaningtext". I'm not so
certain, but maybe it would be good to create a separate table for examples,
because the same examples could (and probably will) be used in "meaningtext"s in
different languages. It would also make it easier to automatically add new
examples (for example, by grepping Project Gutenberg ;)
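To make the suggestion concrete, here is a sketch of examples stored once and shared by the definitions in every language. The class names and the two definition texts are illustrative assumptions; only the quotation and its attribution come from the Wiktionary page mentioned above.

```python
from dataclasses import dataclass, field

@dataclass
class Example:
    text: str
    source: str

@dataclass
class Meaning:
    definitions: dict = field(default_factory=dict)  # language -> definition text
    examples: list = field(default_factory=list)     # shared by all languages

    def render(self, language):
        lines = [self.definitions[language]]
        lines += [f'"{e.text}" ({e.source})' for e in self.examples]
        return lines

account = Meaning(
    definitions={
        "en": "a statement of facts or events",          # illustrative text
        "nl": "een verslag van feiten of gebeurtenissen",  # illustrative text
    },
    examples=[Example("A beggarly account of empty boxes.",
                      "Shakespeare, Romeo and Juliet, V-i")],
)
```

Both `account.render("en")` and `account.render("nl")` end with the same quotation, which is entered only once.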
"A beggarly account of empty boxes" is a quote, so why not have it as a
separate Word and marked as such? It would be an idiom for "account"
and this is linked through Relation. Many famous quotes have been
translated and we could have them all. (Een paard, een paard, een
koninkrijk voor een paard)
I do not think grepping Project Gutenberg makes much sense. If anything
it helps you find occurrences of the word, but you have to be selective about
what to include. That is an editorial process, and just the fact that a
word is used does not make for a good idiom in the UW.
One reason why it is not as much documented as I would like is because
I am still working on the structure. At this moment I am thinking hard
on how to include signed languages and the spoken dialects of the
Chinese and Arabic written language.
The last one is easy. Add ISO3166_1 and ISO3166_2 columns to the language
table. That way, you could formally represent a dialect within a certain
country (as for Arabic) or even a region (as for Chinese). Perhaps an even
better solution would be to include a RegionID column which would point
to a table "regions" (relation 1-many), which would have RegionID, ISO3166_1 and
ISO3166_2 columns; that way you could specify wider regions in which a
dialect is spoken, even if they go over country boundaries.
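The RegionID variant could be sketched as follows. The column names follow the ones proposed above; everything else (table layout details, the sample languages and rows) is an assumption for illustration. One RegionID maps to several rows in "regions", so a single dialect can span country boundaries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Several rows may share one region_id (the 1-many relation).
CREATE TABLE regions (region_id INTEGER, iso3166_1 TEXT, iso3166_2 TEXT);
CREATE TABLE language (
    language_id INTEGER PRIMARY KEY,
    name        TEXT,
    region_id   INTEGER
);
INSERT INTO regions VALUES
    (1, 'NL', 'NL-LI'), (1, 'BE', 'BE-VLI'), (2, 'AU', NULL);
INSERT INTO language VALUES
    (1, 'Limburgish', 1), (2, 'Australian English', 2);
""")

def countries(language_name):
    """Return the ISO 3166-1 codes of all countries where a language is spoken."""
    rows = conn.execute(
        "SELECT r.iso3166_1 FROM language l "
        "JOIN regions r ON r.region_id = l.region_id "
        "WHERE l.name = ? ORDER BY r.iso3166_1",
        (language_name,),
    )
    return [row[0] for row in rows]
```

The ISO3166_2 column is left empty where only the country is known, which keeps the plain "language plus country" case as simple as in the two-column proposal.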
The country code is irrelevant as far as this database is concerned.
This database is about words in languages and dialects.
I still think that this would be a useful way of formally specifying a
dialect. For example, British English would have ISO639_2 code "en" and
ISO3166_1 code "uk" while Australian English would have ISO639_2 code "en"
but ISO3166_1 code "au".
Even the ISO-639 codes in the table are there to connect what we are
doing in the Wikipedias and other projects. As it is a standard I added
it but in the database the ISO 639 fields are not compulsory, the "WMF
key" is. If we "need" these ISO639_2 codes, then we would adhere to the
principle that a language is a dialect with an army. Have a look at
http://nl.wiktionary.org/wiki/WikiWoordenboek:Lijst_van_messages#Schrijfwij…
and you will see how we do some of the uk and au stuff for you. This is
however not a great example because it is a mix of different spelling
but also vocabulary and scripts. As I was not content with this I came
up with the current ERD.
Were you thinking about a way to formally define a dialect? Ideally,
beside a region (by the way, ISO3166-2 is not granular enough, and there
should be a better way of expressing the region in which a dialect is spoken,
perhaps even to the level of a village), there should be specified a time
period in which a dialect was spoken, the social layer which was speaking it,
and perhaps even a particular person or entity using it. OK, I got
carried away a bit but such things might be important :) Though some of
this should perhaps be (also) tied to a word and not a dialect.
Ultimate Wiktionary is about words (written, spoken or signed); that is
the starting point. There will be a need for some formality; once a
dialect is recognised, it will be hard to take it away. Therefore, in my
opinion, dialects will be added as such by an admin, after some
discussion. I would think that we need to consider what it takes before we
add a dialect. Tentatively I would go for at least 100 words defined as
such. With a dialect I would assume that words that are not defined are
those of the higher level language.
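The last assumption amounts to a lookup with fallback, which can be sketched like this. The language keys, the parent mapping and the sample entries are hypothetical; the point is only that a word not defined for the dialect is resolved in the higher-level language.

```python
# Hypothetical data: each dialect names its higher-level language (or None).
PARENT = {"en-AU": "en", "en": None}
ENTRIES = {
    "en": {"friend": "a person one knows and likes"},
    "en-AU": {"arvo": "afternoon"},
}

def lookup(word, language):
    """Look a word up in a dialect, walking up to the higher-level language."""
    while language is not None:
        entry = ENTRIES.get(language, {}).get(word)
        if entry is not None:
            return language, entry
        language = PARENT.get(language)
    return None  # not defined at any level
```

The fallback runs in one direction only: a dialect word such as "arvo" is not found when looking it up in the higher-level language.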
When I was referring to "dialect", I did not have in mind a dialect that is
officially recognised, but simply a set of words which could be identified as
belonging to a certain group. So if you want to say that this word was part of
London dockworkers' slang in the 1800s, you should be able to do so, and not just
stamp it with "British English".
When there are words that are specific to London dockworkers in the
1800s, I would not call it a dialect because, like many professions, they
have their own vocabulary. These I would mark within a collection, as the
bulk of what they say would be London English of the 1800s. Now there is
one thing that is relevant: the UW wants all words of all languages, but
its primary purpose is to have the current vocabulary. So yes, these
words exist and have their place, but when they are not used anymore they
should be marked as such.
As a simple example, in Serbia, there are several publishing houses that were
publishing Asterix, and in some translations "Idefix" is named "Garoviks" and
"Panoramix" is named "Aspiriniks" while in others "Idefix" is named "Idefiks"
and "Panoramix" is named "Panoramiks"; and this is consistent. If you are
going to translate something about Asterix to Serbian, you should pick one of
the translations, but you should be consistent in using only the words from
the translation which you have picked, and they should somehow be marked as
belonging to the same translation. There surely are more important things than
Asterix where something similar might apply.
Garoviks and Idefiks are synonyms in the Serbian language, and as such I
do not have to choose because they are both correct. As a matter of
interest you could explain things either in the etymology or in the
meaning of the word.