Re: [Wikitech-l] Re: Case sensitivity on Afrikaans Wiktionary

25 Jul 2005

Nikola Smolenski wrote:

...
 On Saturday 23 July 2005 10:13, Gerard Meijssen wrote:

 Nikola Smolenski wrote:

 On Friday 22 July 2005 13:25, Gerard Meijssen wrote:

 Nikola Smolenski wrote:

>Why is there column "gender" in table "word"? If a word can exist in
>multiple genders, shouldn't that rather be represented in "inflection"
>table? If a word has a gender on its own, wouldn't that rather be
>represented in WordType table? If not, there are other properties of
>words (for example, number) which can also be represented in "word"
>table, why is gender singled out?
>          
>
When a word is inflected to a particular form, that word is a word in
its own right and consequently will be found in the UW. The inflection
is there because it does provide information and this information is

 Now I'm not so sure that I understand which table is for what. Could you
give an example? For example, the word "white" is a base word and the
word "whiter" is its inflection. How would these two words fit into the
database?

 Both words will exist as a Spelling, as a Word and they may share a
Meaning. When the inflections are added, in the Inflection-Word, all the
missing words will be created and they will all be related to each other
through this table. Contrary to a paper dictionary we want them all.

Then I have misunderstood the database design :( I believed at first that 
inflections would be stored in "inflection" table. Now when I understand the 
design better, I don't think that it is a good idea to have a separate "word" 
for each inflection, because it brings a lot of unnecessary redundancy and 
much room for error. For example, it would be possible to mark "whiter" as an 
adverb and "white" as a verb! And then, imagine the horror which would ensue 
if someone used the wrong PartOfSpeech for a base word and it now has to be 
changed for 100 inflections...

Though this would be a crucial change, please think about it. I think that 
"word" table should contain only lemmas.
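
To make the lemma-only proposal concrete, here is a minimal sketch using Python's sqlite3 module. The table and column names (lemma, inflection, part_of_speech, form, label) are assumptions made for illustration; they are not the actual UW schema. The point it tries to show is that the part of speech is stored once on the lemma, so a mistake is corrected in one place.

```python
import sqlite3

# Hypothetical lemma-only layout; names are illustrative, not the UW schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lemma (
        lemma_id       INTEGER PRIMARY KEY,
        spelling       TEXT NOT NULL,
        part_of_speech TEXT NOT NULL
    );
    CREATE TABLE inflection (
        lemma_id INTEGER NOT NULL REFERENCES lemma(lemma_id),
        form     TEXT NOT NULL,
        label    TEXT NOT NULL
    );
""")

# The base word is entered with the wrong part of speech by mistake.
conn.execute("INSERT INTO lemma VALUES (1, 'white', 'verb')")
conn.executemany(
    "INSERT INTO inflection VALUES (1, ?, ?)",
    [("whiter", "comparative"), ("whitest", "superlative")],
)


def part_of_speech(form):
    """Look up the part of speech of an inflected form through its lemma."""
    row = conn.execute(
        """SELECT l.part_of_speech
           FROM inflection i JOIN lemma l ON l.lemma_id = i.lemma_id
           WHERE i.form = ?""",
        (form,),
    ).fetchone()
    return row[0] if row else None


# A single UPDATE on the lemma corrects every inflection at once.
conn.execute("UPDATE lemma SET part_of_speech = 'adjective' WHERE lemma_id = 1")
```

With one "word" record per inflection, the same correction would need one update per inflected form.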

 Right, well this is very much a design decision. The inflections will 
have to be entered by hand. And if some poor sod does enter all these 
inflections and they are wrong, there will be the need for another poor 
sod to remove them.

...

>>relevant for the inflections and the headword. A Wordtype indicates a
>>noun, a verb, an adjective etc.
>>        
>>
>I still don't understand why gender is singled out of all properties a
>word could have. For example, a verb could be transitive or intransitive,
>and this information is important. To give an example:
>
>      
> [...]

 At this moment in time I would not have intransitive verbs or transitive
verbs at all. To me they are verbs. When they are transitive, they have
a different meaning from when they are intransitive so to me the
distinction is in the meaning.

OK, for a better example, why not number? Perhaps transitivity doesn't, but 
number also affects inflection, much as gender does.

 When it comes to meaning, all the inflections can share the same 
meaning. The number (first, second, third person) will be implied by the 
Inflection in the table Inflection-Word. (at this moment it still says 
Conjugation in this table)

...

 OK, but what if you have a longer phrase as a table field? For example, an
"inflection" in table "inflection" might be "male genitive superlative" or
"3rd person plural female past". I don't think it makes sense to add such
phrases to the dictionary as proper entries, only so that the dictionary
would have translations of them.

Are some table fields inherently translatable? Is this what you had in
mind above?

 Most if not all text fields will be inherently translatable, this is
what I have very much in mind. The name of a font will not be translated
but that is the only one at this point in time. It makes perfect sense
to have this in the UW as it allows us to have a self learning User
Interface. The thing is, it has a function.

OK, so this solves it :)

 I was thinking about something else; for example, on
http://en.wiktionary.org/wiki/account there is this example: "A beggarly
account of empty boxes. - Shakespeare, Romeo and Juliet, V-i"; but I
understand now it is going to be just a part of "meaningtext". I'm not so
certain, but maybe it would be good to create a separate table for
examples, because same examples could (and probably will) be used in
"meaningtext"s in different languages. It would also make it easier to
automatically add new examples (for example, by grepping Project
Gutenberg ;)

 "A beggarly account of empty boxes" is a quote, so why not have it as a
separate Word and marked as such? It would be an idiom for "account"
and this is linked through Relation. Many famous quotes have been
translated and we could have them all. (Een paard, een paard, een
koninkrijk voor een paard)

Because, ideally each word (in each language) should have an example or two, 
and so the number of examples would approach the number of words; and, it 
would become impossible to distinguish between notable quotes (Kingdom for a 
horse!), which occur frequently, need a description, and need to be 
canonically translated, and non-notable quotes, which are in the wiktionary 
only to be used as examples of use for other words, need not have a 
description, and translators won't encounter them at all.

 The idioms, proverbs and quotes will be "Word" records in their own 
right. So we have to be selective in the idioms that we choose. What is 
new? That is what the editorial process is for. For instance, for 
Dutch people who speak French the phrase "Papa fume une pipe" is famous 
and as such it is noteworthy, but its significance will bewilder the 
French.. :)

...

 I do not think grepping Project Gutenberg makes much sense. If anything
it helps you find occurrences of the word but you have to be selective of
what to include. That is an editorial process and just the fact that a
word is used does not make for a good idiom in the UW.

I think it would make sense for rarer words, which might occur once or a few 
times in the entire Gutenberg corpus. Of course, in the end a human editor has 
to decide whether a quote is really relevant.

Related to grepping Project Gutenberg, have you considered adding information 
on word frequency? Only a single new table is needed, "frequency", with 
fields "spellingID", "corpus" and "frequency"; eventually "corpus" could 
become "corpusID".
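
A rough sketch of the proposed "frequency" table, again in Python with sqlite3. The three field names come from the proposal above; the layout of the "spelling" table, the helper functions and the tokenising pattern are assumptions made for illustration. Counts here are per spelling and per corpus only, so they say nothing about which meaning of a word was used.

```python
import re
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed minimal layout of the "spelling" table.
    CREATE TABLE spelling (
        spellingID INTEGER PRIMARY KEY,
        spelling   TEXT UNIQUE NOT NULL
    );
    -- Fields as named in the proposal.
    CREATE TABLE frequency (
        spellingID INTEGER NOT NULL REFERENCES spelling(spellingID),
        corpus     TEXT NOT NULL,
        frequency  INTEGER NOT NULL,
        PRIMARY KEY (spellingID, corpus)
    );
""")


def record_frequencies(text, corpus):
    """Count word spellings in text and store the counts for one corpus."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    for word, n in counts.items():
        conn.execute(
            "INSERT OR IGNORE INTO spelling (spelling) VALUES (?)", (word,)
        )
        (spelling_id,) = conn.execute(
            "SELECT spellingID FROM spelling WHERE spelling = ?", (word,)
        ).fetchone()
        conn.execute(
            "INSERT OR REPLACE INTO frequency VALUES (?, ?, ?)",
            (spelling_id, corpus, n),
        )


def frequency_of(word, corpus):
    """Return the stored count of a spelling in a corpus, or 0 if absent."""
    row = conn.execute(
        """SELECT f.frequency
           FROM frequency f JOIN spelling s ON s.spellingID = f.spellingID
           WHERE s.spelling = ? AND f.corpus = ?""",
        (word, corpus),
    ).fetchone()
    return row[0] if row else 0


record_frequencies("A horse, a horse, my kingdom for a horse!", "sample")
```

Replacing the text column "corpus" with a "corpusID" later would only change the key of the "frequency" table, not the way counts are gathered.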

 There is more to frequency than that. If anything grep may find it but 
you still need to know the meaning of the word in that text. When the 
word gets a new meaning, that is what you want to know. I will speak 
to the people of Rotterdam CS (developers of Lucene) about just these 
kinds of issues.

...
 Once the UW is up and running, how hard would it be to
make such changes?

 This is the time when it is easy to make fundamental changes to the 
design of the UW; it is also still the time to come up with an alternative 
to the design I propose and, as you have noticed, I do change things when 
there is a good argument to do so. When the UW is live, changing the 
software will be more difficult.

...

 When I was referring to "dialect", I did not have in mind a dialect that
is officially recognised, but simply a set of words which could be
identified as belonging to a certain group. So if you want to say that
this word was part of London dockworkers' slang in 1800s, you should be
able to do so, and not just stamp it with "British English".

 When there are words that are specific to London dockworkers in the
1800s, I would not call it a dialect because like many professions they
have their own vocabulary. These I would mark within a collection as the
bulk of what they say would be London English of the 1800s. Now there is

I agree, it is not a dialect, but if some words are recognisable as belonging 
to a distinctive group of words, they should somehow be marked as belonging 
to it, and I was suggesting that they be marked in the same way they would be 
marked as belonging to a certain dialect. Another solution would be to use 
"wordrelation" table instead, even though it isn't meant to be used in that 
way :)

 Collection is the mechanism of choice for this. Relation is to indicate 
thesaurus-like structures, including antonyms.

...
 one thing that is relevant, the UW wants all words of all languages but
its primary purpose is to have the current vocabulary. So yes, these
words exist and have their place but when they are not used anymore they
should be marked as such.

Well, just replace 1800s with 2000s and you still have the same problem :)

 These words are still welcome and the Collection is there for it.

...

 As a simple example, in Serbia, there are several publishing houses that
were publishing Asterix, and in some translations "Idefix" is named
"Garoviks" and "Panoramix" is named "Aspiriniks" while in others "Idefix"
is named "Idefiks" and "Panoramix" is named "Panoramiks"; and this is
consistent. If you are going to translate something about Asterix to
Serbian, you should pick one of the translations, but you should be
consistent in using only the words from the translation which you have
picked, and they should somehow be marked as belonging to the same
translation. There surely are more important things than Asterix where
similar might apply.

 Garoviks and Idefiks are synonyms in the Serbian language and as such I
do not have to choose because they are both correct. As a matter of
interest you could explain things either in the etymology or in the
meaning of the word.

They are synonyms, but they are stylistically marked: it would be wrong to 
translate Idefix first as Garoviks and later as Idefiks, or to consistently 
translate Idefix with Idefiks but Panoramix with Aspiriniks, much as it would 
be wrong to write "I recognise you recognized me"; a translator has to choose 
and make the choice consistent.

 A translator has to make a consistent choice; Collections of translated 
names of Asterix characters can be used for that. We have the 
technology. :)
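
How a Collection might support consistent translation can be sketched in a few lines of Python. The name pairs are the ones given in the Asterix example; the collection names and the helper function are hypothetical, and how Collections are actually stored in the UW is not assumed here.

```python
# Hypothetical collections, one per published Serbian translation.
collections = {
    "Asterix (translation A)": {"Idefix": "Garoviks", "Panoramix": "Aspiriniks"},
    "Asterix (translation B)": {"Idefix": "Idefiks", "Panoramix": "Panoramiks"},
}


def matching_collections(choices):
    """Return the collections that agree with every translation chosen so far."""
    return [
        name
        for name, mapping in collections.items()
        if all(mapping.get(source) == target for source, target in choices.items())
    ]
```

An empty result means the translator has mixed names from different translations, for example "Idefiks" together with "Aspiriniks".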

...
 Unrelated to any of the above, could you move the "word" table a bit to the 
right, because currently it is hard to see what the relation is between the 
"word", "spelling" and "etymology" tables; the lines overlap.

 I did put the table Word out of whack to show its importance. I put 
Collection on the same level as Meaning because that one too is very 
important for several applications. Table is technically challenging and 
that is why it is also given some prominence.

Thanks,
    GerardM
