Re: [Wikitech-l] Re: Case sensitivity on Afrikaans Wiktionary

23 Jul 2005

Nikola Smolenski wrote:

...
 On Friday 22 July 2005 13:25, Gerard Meijssen wrote:

 Nikola Smolenski wrote:

 On Wednesday 20 July 2005 21:29, Gerard Meijssen wrote:
Related to this, I'd suggest to add "ISO15924" column to "characterset"
(or future "script") table. This way a script can be formally specified
and looked up, regardless of its name.

 Done, however the name of the script is a record in the database in its
own right so it may have as many translations as we care to enter. The
code is just to anchor it. I understand from Erik's notes that a
language can be indicated as the default value. The default value will
be English. So, add a translation to the English word and from then on
the User Interface will show it localised.
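
A minimal sketch of the lookup described above, assuming the ISO 15924 code is the anchor and each script name can carry several translations. The names SCRIPT_NAMES and script_name and the sample rows are hypothetical, not part of the actual schema.

```python
# Hypothetical stand-in for the "script" table plus its translated names.
# Outer key: ISO 15924 code. Inner dict: language code -> localised name.
SCRIPT_NAMES = {
    "Latn": {"en": "Latin", "sr": "латиница"},
    "Cyrl": {"en": "Cyrillic"},  # no Serbian translation entered yet
}


def script_name(iso15924, ui_language, default="en"):
    """Return the script name in the UI language, else in the default language."""
    names = SCRIPT_NAMES[iso15924]
    return names.get(ui_language, names[default])
```

Until someone enters a translation, the interface shows the English name; once the translation exists, the same lookup returns it without any change to the code.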

I'll talk about this below.

  Why is there column "gender" in table "word"? If a word can exist in
multiple genders, shouldn't that rather be represented in "inflection"
table? If a word has a gender on its own, wouldn't that rather be
represented in WordType table? If not, there are other properties of
words (for example, number) which can also be represented in "word"
table, why is gender singled out?

 When a word is inflected to a particular form, that word is a word in
its own right and consequently will be found in the UW. The inflection
is there because it does provide information and this information is

Now I'm not so sure that I understand which table is for what. Could you give 
an example? For example, the word "white" is a base word and the word 
"whiter" is its inflection. How would these two words fit into the database?

 Both words will exist as a Spelling, as a Word and they may share a 
Meaning. When the inflections are added, in the Inflection-Word, all the 
missing words will be created and they will all be related to each other 
through this table. Contrary to a paper dictionary we want them all.
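
As a rough illustration of this answer, here is a sketch in which "white" and "whiter" each exist as a Spelling and a Word, share a Meaning, and are tied together by an inflection record. The class and field names are assumptions made for the example, and sharing the base word's meaning is only one reading of "they may share a Meaning".

```python
from dataclasses import dataclass, field


@dataclass
class Dictionary:
    spellings: set = field(default_factory=set)
    words: dict = field(default_factory=dict)        # (spelling, language) -> meaning id
    inflections: list = field(default_factory=list)  # (base, inflected, label)

    def add_word(self, spelling, language, meaning=None):
        # Every word gets a Spelling; an existing Word keeps its meaning.
        self.spellings.add(spelling)
        self.words.setdefault((spelling, language), meaning)

    def add_inflection(self, base, inflected, language, label):
        # Missing words are created on the fly, then related to each other.
        self.add_word(base, language)
        self.add_word(inflected, language, self.words[(base, language)])
        self.inflections.append((base, inflected, label))
```

Adding the inflection ("white", "whiter", "comparative") to a dictionary that only knows "white" creates "whiter" as a word in its own right.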

...

 relevant for the inflections and the headword. A Wordtype indicates a
noun, a verb, an adjective, etc.

I still don't understand why is gender singled out of all properties a word 
could have. For example, a verb could be transitive or intransitive, and this 
information is important. To give an example:

word: horse
gender: male
partofspeech: noun

word: ship
gender: female
partofspeech: noun

word: to drive
gender: none
partofspeech: transitive verb

word: to swim
gender: none
partofspeech: intransitive verb

See what I mean? If you have to specify transitivity of a verb in 
"partofspeech" table, you may as well specify gender of a noun in that table. 
It would be consistent either to remove "gender" column from "word"
table:

word: horse
partofspeech: male noun

word: ship
partofspeech: female noun

word: to drive
partofspeech: transitive verb

word: to swim
partofspeech: intransitive verb

or to rename it to, for example, "subtype":

word: horse
subtype: male
partofspeech: noun

word: ship
subtype: female
partofspeech: noun

word: to drive
subtype: transitive
partofspeech: verb

word: to swim
subtype: intransitive
partofspeech: verb

If you are going to change this, I'd suggest the first solution. Firstly, 
because there may be words which would have more than one subtype; secondly, 
because it eliminates the possibility of having invalid mix of subtypes 
(horse: intransitive noun...).
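
The second argument can be sketched as follows. With one combined "partofspeech" value only the listed combinations exist, while two independent columns accept any pairing. The value sets and function names are illustrative assumptions.

```python
# Layout 1: a single combined "partofspeech" value.
PARTS_OF_SPEECH = {"male noun", "female noun",
                   "transitive verb", "intransitive verb"}


def valid_combined(partofspeech):
    return partofspeech in PARTS_OF_SPEECH


# Layout 2: separate "subtype" and "partofspeech" columns.
SUBTYPES = {"male", "female", "transitive", "intransitive"}
PARTS = {"noun", "verb"}


def valid_split(subtype, partofspeech):
    # Each column is checked on its own, so nothing ties the two together.
    return subtype in SUBTYPES and partofspeech in PARTS
```

Here valid_combined("intransitive noun") is rejected, but valid_split("intransitive", "noun") passes unless an extra constraint between the two columns is added.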

 At this moment in time I would not have intransitive verbs or transitive 
verbs at all. To me they are verbs. When they are transitive, they have 
a different meaning from when they are intransitive so to me the 
distinction is in the meaning.

...
   I'd strongly suggest to add a column "inflection", to either "wordtype" or
"word" table; this would specify which inflection does a word use, and
whether it is regular or irregular. If it is known which inflexion a
word uses and if it is regular, then all its inflected forms could be
generated automatically.
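
To illustrate the idea, a sketch of how an "inflection" column plus a regular/irregular flag could drive generation. The inflection class name "adjective-er" and both functions are hypothetical, and the rule is deliberately naive (it ignores consonant doubling and similar spelling changes).

```python
def inflect_regular_adjective(word):
    # Naive English rule: drop a final "e", then add "-er" / "-est".
    stem = word[:-1] if word.endswith("e") else word
    return {"comparative": stem + "er", "superlative": stem + "est"}


def inflected_forms(word, inflection, regular, stored=None):
    """Generate forms for regular words; irregular ones must be stored by hand."""
    if regular and inflection == "adjective-er":
        return inflect_regular_adjective(word)
    return stored or {}
```

For "white" marked as regular this yields "whiter" and "whitest"; for an irregular word such as "good" the forms have to come from manually entered records.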

I see that there is column "languageid" in "meaningtext" table. If I
understand this, it means that meaning of a word could be written down in
various languages, and I second this. But I wonder how are you going to do
the same for other data which might need to be translated (for example,
column "characterset" of the "characterset" table - I understand that this
is pure text)? Were you thinking about this?

 The fields "Sign" :) "Gender" "WordType" all relate to meaning; Ultimate
Wiktionary will eat its own dogfood: when a translation of a word like
noun is added, like I did for Afrikaans recently, this translation is the
one that will be used in the User Interface.

OK, but what if you have a longer phrase as a table field? For example, an 
"inflection" in table "inflection" might be "male genitive superlative" or 
"3rd person plural female past". I don't think it makes sense to add such 
phrases to the dictionary as proper entries, only so that the dictionary 
would have translations of them.

Are some table fields inherently translatable? Is this what you had in mind 
above?

 Most if not all text fields will be inherently translatable, this is 
what I have very much in mind. The name of a font will not be translated 
but that is the only one at this point in time. It makes perfect sense 
to have this in the UW as it allows us to have a self learning User 
Interface. The thing is, it has a function.

...
   Were you thinking about a way to register examples of use, similar to
meaning? Or would examples of use be simply a raw text in "meaningtext"
table?

 Idioms and proverbs will be a "wordtype" as much as noun is. Therefore a
proverb will relate to a keyword through WordRelation. I have updated
the table Relation with a new field "SameLanguageOnly"; this ensures that
the relation is applicable within the same language, so the relation
would be "proverb" and it would combine "apple" with "an apple a day
keeps the doctor away".
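
A sketch of how such a flag might be enforced when a WordRelation is stored. The function name, the tuple layout and the "translation" relation are assumptions added for contrast; only "proverb" and "SameLanguageOnly" come from the message above.

```python
# Hypothetical contents of the Relation table.
RELATIONS = {
    "proverb": {"same_language_only": True},
    "translation": {"same_language_only": False},  # assumed, for contrast
}


def add_word_relation(store, relation, source, target):
    """source and target are (text, language) pairs."""
    if RELATIONS[relation]["same_language_only"] and source[1] != target[1]:
        raise ValueError(f"{relation!r} may only link words of one language")
    store.append((relation, source, target))
```

Linking ("apple", "en") to ("an apple a day keeps the doctor away", "en") as a "proverb" is accepted, while the same relation across two languages is refused.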

MeaningText would be just the definition of a meaning in a given language.

I was thinking about something else; for example, on 
http://en.wiktionary.org/wiki/account there is this example: "A beggarly 
account of empty boxes. - Shakespeare, Romeo and Juliet, V-i"; but I 
understand now it is going to be just a part of "meaningtext". I'm not so 
certain, but maybe it would be good to create a separate table for examples, 
because same examples could (and probably will) be used in "meaningtext"s in 
different languages. It would also make it easier to automatically add new 
examples (for example, by grepping Project Gutenberg ;)

 "A beggarly account of empty boxes" is a quote and why not have it as a 
separate Word and marked as such? It would be an idiom for "account" 
and this is linked through Relation. Many famous quotes have been 
translated and we could have them all. (Een paard, een paard, een 
koninkrijk voor een paard)

I do not think grepping Project Gutenberg makes much sense. If anything 
it helps you find occurrences of the word but you have to be selective of 
what to include. That is an editorial process and just the fact that a 
word is used does not make for a good idiom in the UW.

...

   One reason why it is not as much documented as I would like is because
I am still working on the structure. At this moment I am thinking hard
on how to include signed languages and the spoken dialects of the
Chinese and Arabic written language.

 The last one is easy. Add ISO3166_1 and ISO3166_2 columns to language
table. That way, you could formally represent a dialect within a certain
country (as for Arabic) or even a region (as for Chinese). Perhaps even
better solution would be to include a RegionID column which would point
to table "regions" (relation 1-many), which would have RegionID, ISO3166_1 and
ISO3166_2 columns; that way you could specify wider regions in which a
dialect is spoken, even if they go over country boundaries.
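
A small sketch of the "regions" proposal, assuming one RegionID can own several rows so that a dialect can span country boundaries. The table contents and the function name are illustrative only.

```python
# Hypothetical "regions" rows: (RegionID, ISO3166_1, ISO3166_2).
REGION_ROWS = [
    (1, "EG", None),
    (2, "LB", None),
    (2, "SY", None),
    (2, "JO", None),
]

# Hypothetical language rows: dialect name -> RegionID.
LANGUAGES = {"Egyptian Arabic": 1, "Levantine Arabic": 2}


def countries_of(dialect):
    """Return the sorted ISO 3166-1 codes attached to a dialect's region."""
    region_id = LANGUAGES[dialect]
    return sorted({country for rid, country, _ in REGION_ROWS if rid == region_id})
```

A dialect tied to a single country gets one row; a dialect spoken across borders simply gets more rows under the same RegionID.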

 The country code is irrelevant as far as this database is concerned.
This database is about words in languages and dialects.

I still think that this would be a useful way of formally specifying a 
dialect. For example, British English would have ISO639_2 code "en" and 
ISO3166_1 code "uk" while Australian English would have ISO639_2 code "en" 
but ISO3166_1 code "au".

 Even the ISO-639 codes in the table are there to connect what we are 
doing in the Wikipedias and other projects. As it is a standard I added 
it but in the database the ISO 639 fields are not compulsory, the "WMF 
key" is. If we "need" these ISO639_2 codes, then we would adhere to the 
principle that a language is a dialect with an army. Have a look at 
http://nl.wiktionary.org/wiki/WikiWoordenboek:Lijst_van_messages#Schrijfwij…

and you will see how we do some of the uk and au stuff for you. This is 
however not a great example because it is a mix of different spelling 
but also vocabulary and scripts. As I was not content with this I came 
up with the current ERD.

...
   Were you thinking about a way to formally define a dialect? Ideally,
beside a region (by the way, ISO3166-2 is not granular enough, and there
should be a better way of expressing region in which a dialect is spoken,
perhaps even to the level of a village), there should be specified a time
period in which a dialect was spoken, social layer which was speaking it,
and perhaps even a particular person or entity using it. OK, I got
carried away a bit but such things might be important :) Though some of
this should perhaps be (also) tied to a word and not a dialect.

 Ultimate Wiktionary is about words (written, spoken or signed); that is
the starting point. There will be a need for some formality; once a
dialect is recognised, it will be hard to take it away. Therefore in my
opinion it will be after some discussion. They will be added as such by
an admin. I would think that we need to consider what it takes before we
add a dialect. Tentatively I would go for at least 100 words defined as
such. With a dialect I would assume that words that are not defined are
those of the higher level language.
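
The last assumption, that words not defined for a dialect are those of the higher level language, could be sketched as a lookup that walks up from the dialect to its parent. The keys "en" and "en-AU", the sample entries and the function name are hypothetical.

```python
# Hypothetical parent links: dialect -> higher level language.
PARENT = {"en-AU": "en"}

# Hypothetical word records: (language or dialect, word) -> definition.
WORDS = {
    ("en", "afternoon"): "the part of the day after noon",
    ("en-AU", "arvo"): "afternoon",
}


def lookup(word, variety):
    """Return (variety where found, definition), or None if nowhere defined."""
    while variety is not None:
        if (variety, word) in WORDS:
            return variety, WORDS[(variety, word)]
        variety = PARENT.get(variety)  # fall back to the higher level language
    return None
```

Looking up "afternoon" in the dialect falls back to the parent language, while "arvo" stays specific to the dialect and is not found in the parent.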

When I was referring to "dialect", I did not have in mind a dialect that is 
officially recognised, but simply a set of words which could be identified as 
belonging to a certain group. So if you want to say that this word was part of 
London dockworkers' slang in 1800s, you should be able to do so, and not just 
stamp it with "British English".

 When there are words that are specific to London dockworkers in the 
1800s, I would not call it a dialect because like many professions they 
have their own vocabulary. These I would mark within a collection as the 
bulk of what they say would be London English of the 1800s. Now there is 
one thing that is relevant, the UW wants all words of all languages but 
its primary purpose is to have the current vocabulary. So yes, these 
words exist and have their place but when they are not used anymore they 
should be marked as such.

...
 As a simple example, in Serbia, there are several publishing houses that were 
publishing Asterix, and in some translations "Idefix" is named "Garoviks" and 
"Panoramix" is named "Aspiriniks" while in others "Idefix" is named "Idefiks" 
and "Panoramix" is named "Panoramiks"; and this is consistent. If you are 
going to translate something about Asterix to Serbian, you should pick one of 
the translations, but you should be consistent in using only the words from 
the translation which you have picked, and they should somehow be marked as 
belonging to the same translation. There surely are more important things than 
Asterix where similar might apply.

 Garoviks and Idefiks are synonyms in the Serbian language and as such I 
do not have to choose because they are both correct. As a matter of 
interest you could explain things either in the etymology or in the 
meaning of the word.
