Re: [Wikitech-l] Re: Case sensitivity on Afrikaans Wiktionary

23 Jul 2005

On Friday 22 July 2005 13:25, Gerard Meijssen wrote:
...
  Nikola Smolenski wrote:
 On Wednesday 20 July 2005 21:29, Gerard Meijssen wrote:
 Related to this, I'd suggest adding an "ISO15924" column to the
 "characterset" (or future "script") table. This way a script can be
 formally specified and looked up, regardless of its name.

 Done, however the name of the script is a record in the database in its
 own right so it may have as many translations as we care to enter. The
 code is just to anchor it. I understand from Erik's notes that a
 language can be indicated as the default value. The default value will
 be English. So, add a translation to the English word and from then on
 the User Interface will show it localised.
I'll talk about this below.

...
 Why is there column "gender" in table "word"? If a word can exist in
 multiple genders, shouldn't that rather be represented in "inflection"
 table? If a word has a gender on its own, wouldn't that rather be
 represented in WordType table? If not, there are other properties of
 words (for example, number) which can also be represented in "word"
 table, why is gender singled out? 
 When a word is inflected to a particular form, that word is a word in
 its own right and consequently will be found in the UW. The inflection
 is there because it does provide information and this information is 
Now I'm not so sure that I understand which table is for what. Could you give 
an example? For example, the word "white" is a base word and the word 
"whiter" is its inflection. How would these two words fit into the database?

...
 relevant for the inflections and the headword. A Wordtype indicates a
 noun, a verb, an adjective etc.
I still don't understand why gender is singled out of all the properties a
word could have. For example, a verb could be transitive or intransitive, and
this information is important. To give an example:

word: horse
gender: male
partofspeech: noun

word: ship
gender: female
partofspeech: noun

word: to drive
gender: none
partofspeech: transitive verb

word: to swim
gender: none
partofspeech: intransitive verb

See what I mean? If you have to specify the transitivity of a verb in the
"partofspeech" table, you may as well specify the gender of a noun in that
table. It would be consistent either to remove the "gender" column from the
"word" table:

word: horse
partofspeech: male noun

word: ship
partofspeech: female noun

word: to drive
partofspeech: transitive verb

word: to swim
partofspeech: intransitive verb

or to rename it to, for example, "subtype":

word: horse
subtype: male
partofspeech: noun

word: ship
subtype: female
partofspeech: noun

word: to drive
subtype: transitive
partofspeech: verb

word: to swim
subtype: intransitive
partofspeech: verb

If you are going to change this, I'd suggest the first solution: firstly,
because there may be words that have more than one subtype; secondly,
because it eliminates the possibility of an invalid mix of subtypes
(horse: intransitive noun...).
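
To make the difference between the two solutions concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the names PartOfSpeech, words and is_valid_pair are hypothetical and do not come from the actual database design.

```python
from enum import Enum


class PartOfSpeech(Enum):
    """First solution: gender or transitivity folded into one value.

    Member names are hypothetical. Only these combinations exist, so an
    "intransitive noun" cannot be stored at all.
    """
    MALE_NOUN = "male noun"
    FEMALE_NOUN = "female noun"
    TRANSITIVE_VERB = "transitive verb"
    INTRANSITIVE_VERB = "intransitive verb"


# First solution: a single field per word.
words = {
    "horse": PartOfSpeech.MALE_NOUN,
    "ship": PartOfSpeech.FEMALE_NOUN,
    "to drive": PartOfSpeech.TRANSITIVE_VERB,
    "to swim": PartOfSpeech.INTRANSITIVE_VERB,
}


def is_valid_pair(subtype: str, partofspeech: str) -> bool:
    """Second solution: two independent fields need an explicit check."""
    valid = {p.value for p in PartOfSpeech}
    return f"{subtype} {partofspeech}" in valid
```

With the single field, invalid combinations simply have no value to point to; with separate "subtype" and "partofspeech" fields, something like is_valid_pair would have to be run on every edit.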

...
 I'd strongly suggest adding a column "inflection" to either the
 "wordtype" or the "word" table; this would specify which inflection a
 word uses, and whether it is regular or irregular. If it is known which
 inflection a word uses and if it is regular, then all its inflected
 forms could be generated automatically.

 I see that there is a column "languageid" in the "meaningtext" table. If
 I understand this, it means that the meaning of a word could be written
 down in various languages, and I second this. But I wonder how you are
 going to do the same for other data which might need to be translated
 (for example, column "characterset" of the "characterset" table - I
 understand that this is pure text)? Were you thinking about this?
 The fields "Sign" :) "Gender" "WordType" all relate to meaning; Ultimate
 Wiktionary will eat its own dogfood or when a translation to a word like
 noun is added like I did for Afrikaans recently, this translation is the
 one that will be used in the User Interface 
OK, but what if you have a longer phrase as a table field? For example, an
"inflection" in table "inflection" might be "male genitive superlative" or
"3rd person plural female past". I don't think it makes sense to add such
phrases to the dictionary as proper entries, only so that the dictionary 
would have translations of them.

Are some table fields inherently translatable? Is this what you had in mind 
above?

...
 Were you thinking about a way to register examples of use, similar to
 meaning? Or would examples of use be simply a raw text in "meaningtext"
 table?

 Idiom, proverbs will be a "wordtype" as much as noun is. Therefore a
 proverb will relate to a keyword through WordRelation. I have updated
 the table Relation with a new field "SameLanguageOnly"; this ensures that
 the relation is applicable within the same language, so the relation
 would be "proverb" and it would combine "apple" with "an apple a day
 keeps the doctor away".

 MeaningText would be just the definition of a meaning in a given language. 
I was thinking about something else; for example, on 
http://en.wiktionary.org/wiki/account there is this example: "A beggarly 
account of empty boxes. - Shakespeare, Romeo and Juliet, V-i"; but I 
understand now it is going to be just a part of "meaningtext". I'm not so 
certain, but maybe it would be good to create a separate table for examples, 
because the same examples could (and probably will) be used in "meaningtext"s in
different languages. It would also make it easier to automatically add new 
examples (for example, by grepping Project Gutenberg ;)
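
A rough sketch of what I mean by a separate table, using Python's sqlite3 module with an in-memory database. The table and column names other than "meaningtext" and "languageid" ("example", "meaningexample", "exampleid", "source") are my own assumptions, the language codes are placeholders, and the definition texts are dummies:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE meaningtext (meaningid INTEGER, languageid TEXT, text TEXT);
-- Hypothetical tables: one row per example, linked to a meaning.
CREATE TABLE example (exampleid INTEGER PRIMARY KEY, text TEXT, source TEXT);
CREATE TABLE meaningexample (meaningid INTEGER, exampleid INTEGER);
""")

# The example is stored once, independent of any definition language.
conn.execute(
    "INSERT INTO example VALUES (1, ?, ?)",
    ("A beggarly account of empty boxes.",
     "Shakespeare, Romeo and Juliet, V-i"),
)
# Placeholder definitions of the same meaning in two languages.
conn.executemany(
    "INSERT INTO meaningtext VALUES (?, ?, ?)",
    [(1, "en", "(definition in English)"),
     (1, "sr", "(definition in Serbian)")],
)
conn.execute("INSERT INTO meaningexample VALUES (1, 1)")


def examples_for(meaningid, languageid):
    """Return (definition, example text, source) rows for one language."""
    return conn.execute(
        """SELECT m.text, e.text, e.source
           FROM meaningtext m
           JOIN meaningexample me ON me.meaningid = m.meaningid
           JOIN example e ON e.exampleid = me.exampleid
           WHERE m.meaningid = ? AND m.languageid = ?""",
        (meaningid, languageid),
    ).fetchall()
```

Both examples_for(1, "en") and examples_for(1, "sr") return the same quotation, although it is stored only once; an automatic importer would only ever have to write to "example" and "meaningexample".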

...
 One reason why it is not as much documented as I would like is, because
 I am still working on the structure. At this moment I am thinking hard
 on how to include signed languages and the spoken dialects of the
 Chinese and Arabic written language.

 The last one is easy. Add ISO3166_1 and ISO3166_2 columns to language
 table. That way, you could formally represent a dialect within a certain
 country (as for Arabic) or even a region (as for Chinese). Perhaps even
 better solution would be to include a RegionID column which would point
 to table "regions" (relation 1-many), which would have RegionID,
 ISO3166_1 and ISO3166_2 columns; that way you could specify wider
 regions in which a dialect is spoken, even if they go over country
 boundaries.

 The country code is irrelevant as far as this database is concerned.
 This database is about words in languages and dialects.
I still think that this would be a useful way of formally specifying a
dialect. For example, British English would have ISO639_2 code "eng" and
ISO3166_1 code "gb", while Australian English would have ISO639_2 code "eng"
but ISO3166_1 code "au".
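
As a sketch of how such a formal specification could look, assuming a hypothetical Dialect record whose fields mirror the proposed columns (the class and its key() method are my own illustration, not part of the schema). Note that ISO 639-2 identifies English as "eng" and ISO 3166-1 assigns "gb" to the United Kingdom, so those are the values used here:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Dialect:
    """Hypothetical record: a language plus an optional country/region."""
    iso639_2: str                     # language code, e.g. "eng"
    iso3166_1: Optional[str] = None   # country code, e.g. "gb"
    iso3166_2: Optional[str] = None   # subdivision code, if any

    def key(self) -> str:
        """Join the codes that are present into one lookup key."""
        parts = (self.iso639_2, self.iso3166_1, self.iso3166_2)
        return "-".join(p for p in parts if p)


british = Dialect("eng", "gb")
australian = Dialect("eng", "au")
```

The two records share the language code but differ in country code, so they can be looked up separately (british.key() versus australian.key()) without relying on the dialect's name in any particular language.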

...
 Were you thinking about a way to formally define a dialect? Ideally,
 beside a region (by the way, ISO3166-2 is not granular enough, and there
 should be a better way of expressing region in which a dialect is spoken,
 perhaps even to the level of a village), there should be specified a time
 period in which a dialect was spoken, social layer which was speaking it,
 and perhaps even a particular person or entity using it. OK, I got
 carried away a bit but such things might be important :) Though some of
 this should perhaps be (also) tied to a word and not a dialect. 
 Ultimate wiktionary is about words (written, spoken or signed) that is
 the starting point. There will be a need for some formality; once a
 dialect is recognised, it will be hard to take it away. Therefore in my
 opinion it will be after some discussion. They will be added as such by
 an admin. I would think that we need to consider what it takes before we
 add a dialect. Tentatively I would go for at least 100 words defined as
 such. With a dialect I would assume that words that are not defined are
 those of the higher level language. 
When I was referring to "dialect", I did not have in mind a dialect that is 
officially recognised, but simply a set of words which could be identified as 
belonging to a certain group. So if you want to say that this word was part of
London dockworkers' slang in the 1800s, you should be able to do so, and not
just stamp it with "British English".

As a simple example, in Serbia, there are several publishing houses that were
publishing Asterix, and in some translations "Idefix" is named "Garoviks" and
"Panoramix" is named "Aspiriniks" while in others "Idefix" is named "Idefiks"
and "Panoramix" is named "Panoramiks"; and this is consistent. If you are
going to translate something about Asterix to Serbian, you should pick one of
the translations, but you should be consistent in using only the words from
the translation which you have picked, and they should somehow be marked as
belonging to the same translation. There surely are more important things than
Asterix where something similar might apply.
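
To illustrate the kind of marking I have in mind, here is a small Python sketch. The labels "A" and "B" are placeholders for the two translation traditions, and the function names are made up for this example:

```python
# Each translation set groups the names that belong together.
# "A" and "B" are hypothetical labels for the two traditions.
TRANSLATIONS = {
    "A": {"Idefix": "Garoviks", "Panoramix": "Aspiriniks"},
    "B": {"Idefix": "Idefiks", "Panoramix": "Panoramiks"},
}


def translate(names, translation_set):
    """Translate every name using one chosen translation set only."""
    return [TRANSLATIONS[translation_set][name] for name in names]


def consistent_sets(translated):
    """Return the labels of the sets that contain all the given words."""
    return [
        label
        for label, mapping in TRANSLATIONS.items()
        if set(translated) <= set(mapping.values())
    ]
```

A text using "Garoviks" and "Aspiriniks" maps back to a single set, while one mixing "Garoviks" with "Panoramiks" maps to none, which is exactly the inconsistency that the marking should make detectable.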
