Re: [Wikitech-l] Re: Case sensitivity on Afrikaans Wiktionary

27 Jul 2005

Nikola Smolenski wrote:

...
 On Monday 25 July 2005 11:04, Gerard Meijssen wrote:

 Nikola Smolenski wrote:

 On Saturday 23 July 2005 10:13, Gerard Meijssen wrote:

 Nikola Smolenski wrote:

>On Friday 22 July 2005 13:25, Gerard Meijssen wrote:
>          
>
>>Nikola Smolenski wrote:
>>            
>> Then I have misunderstood the database design :( I believed at first that
inflections would be stored in the "inflection" table. Now that I understand
the design better, I don't think that it is a good idea to have a separate
"word" for each inflection, because it brings a lot of unnecessary
redundancy and much room for error. For example, it would be possible to
mark "whiter" as an adverb and "white" as a verb! And then, imagine the
horror which would ensue if someone used the wrong PartOfSpeech for a base
word and now it has to be changed for 100 inflections...

Though this would be a crucial change, please think about it. I think that
the "word" table should contain only lemmas.

 Right, well this is very much a design decision. The inflections will
have to be entered by hand. And if some poor sod does enter all these
inflections and they are wrong, there will be the need for another poor
sod to remove them.

Well, I see it as a bad design decision.

First, the inflections don't have to be entered by hand. If a word is not 
irregular, the inflections could, and should, be entered automatically.

Second, I don't understand this boasting of a flaw. If a problem with database 
structure is noticed, it should be solved. At the very very least it should be 
concluded that the problem can't be solved. Instead you are telling me that 
users will have to work around the problem. I knew that already, but do you 
see a solution?
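
One possible shape of an answer to the question above, given only as a minimal sketch: keep PartOfSpeech on the lemma and let inflections refer to it. The table and column names here are assumptions made for illustration, not the schema under discussion.

```python
import sqlite3

# Hypothetical lemma-only layout; names are illustrative assumptions,
# not the actual database design discussed in this thread.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE word (
    word_id        INTEGER PRIMARY KEY,
    lemma          TEXT NOT NULL,
    part_of_speech TEXT NOT NULL      -- stored once, on the lemma
);
CREATE TABLE inflection (
    inflection_id INTEGER PRIMARY KEY,
    word_id       INTEGER NOT NULL REFERENCES word(word_id),
    form          TEXT NOT NULL,
    label         TEXT NOT NULL       -- e.g. 'comparative'
);
""")

# The lemma is entered with the wrong part of speech on purpose.
conn.execute("INSERT INTO word VALUES (1, 'white', 'verb')")
conn.executemany(
    "INSERT INTO inflection (word_id, form, label) VALUES (1, ?, ?)",
    [("whiter", "comparative"), ("whitest", "superlative")],
)

# One UPDATE on the lemma corrects what every inflection reports.
conn.execute("UPDATE word SET part_of_speech = 'adjective' WHERE word_id = 1")

rows = conn.execute("""
    SELECT i.form, w.part_of_speech
    FROM inflection AS i JOIN word AS w USING (word_id)
    ORDER BY i.form
""").fetchall()
print(rows)
```

In this layout "whiter" cannot disagree with "white" about its part of speech, because the value exists in one place only. The cost is that an inflection is no longer a full "word" record of its own.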

 First of all, if creating inflections is done programmatically, it is 
not part of the database design. The database design says that there 
will be a record for each inflection. The inflections are translated as 
every other word is, and there is a Spelling for each of them. This means that 
these words have an importance in their own right, and that is more than just 
the sharing of the meaning with a headword. So I do not share your 
argument at all. Yes, we can generate inflections, but this WILL result 
in new Spelling - Word - Meaning. And as long as we do not have software 
to do this for us, we will have to do it by hand.
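
As a rough illustration of the point above, generated inflections can still end up as full records of their own. The sketch below assumes a deliberately naive English rule and a hypothetical `WordRecord` type; neither comes from the actual design, and irregular words would still need manual entry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordRecord:
    # Hypothetical record: each inflected form is a word in its own right.
    spelling: str
    part_of_speech: str
    inflection: str
    headword: str


def regular_adjective_forms(lemma):
    """Naive rule for regular English adjectives (illustration only)."""
    stem = lemma[:-1] if lemma.endswith("e") else lemma
    return {
        "positive": lemma,
        "comparative": stem + "er",
        "superlative": stem + "est",
    }


def generate_records(lemma, part_of_speech):
    # The part of speech is passed in once and copied to every form,
    # so generated forms cannot disagree with the headword.
    return [
        WordRecord(form, part_of_speech, label, lemma)
        for label, form in regular_adjective_forms(lemma).items()
    ]


records = generate_records("white", "adjective")
for record in records:
    print(record)
```

The redundancy is still there in the stored data, but it is produced by one rule rather than typed in form by form.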

...

>>relevant for the inflections and the headword. A Wordtype indicates a
>>noun, a verb, an adjective, etc.
>>            
>>
>I still don't understand why gender is singled out of all the properties a
>word could have. For example, a verb could be transitive or
>intransitive, and this information is important. To give an example:
>          
> [...]

 At this moment in time I would not have intransitive verbs or transitive
verbs at all. To me they are verbs. When they are transitive, they have
a different meaning from when they are intransitive, so to me the
distinction is in the meaning.

 OK, for a better example, why not number? Perhaps transitivity doesn't,
but number also affects inflection, much as gender does.

 When it comes to meaning, all the inflections can share the same
meaning. The number (first, second, third person) will be implied by the
Inflection in the table Inflection-Word. (at this moment it still says
Conjugation in this table)

By number I meant singular/plural. But regardless, why wouldn't gender then be 
specified in inflection-word?

 Because it is important to know for a noun what its gender is. When you 
know "probleem" (neuter) you know by inference that the idiom "het 
probleem is groter dan ik dacht" is correct, because the neuter implies 
"het". This is the base knowledge that may be expected; or, when we go the 
extra mile and you do not know about genders, you may be led to an 
article about gender in a particular language.

...
>I was thinking about something else; for example, on
>http://en.wiktionary.org/wiki/account there is this example: "A beggarly
>account of empty boxes. - Shakespeare, Romeo and Juliet, V-i"; but I
>understand now it is going to be just a part of "meaningtext". I'm not
>so certain, but maybe it would be good to create a separate table for
>examples, because the same examples could (and probably will) be used in
>"meaningtext"s in different languages. It would also make it easier to
>automatically add new examples (for example, by grepping Project
>Gutenberg ;)
>          
>
"A beggarly account of empty boxes" is a quote and why not have it as a
separate Word and marked as such ?? It would be an idiom for "account"
and this is linked through Relation. Many famous quotes have been
translated and we could have them all. (Een paard, een paard, een
koninkrijk voor een paard)

 Because, ideally each word (in each language) should have an example or
two, and so the number of examples would approach the number of words;
and, it would become impossible to distinguish between notable quotes
(Kingdom for a horse!), which occur frequently, need a description, and
need to be canonically translated, and non-notable quotes, which are in
the wiktionary only to be used as examples of use for other words, need
not have a description, and translators won't encounter them at all.

 The idioms, proverbs and quotes will be "Word" records in their own
right. So we have to be selective in the idiom that we choose. What is
new ?? That is what the editorial process is for. For instance, for
French-speaking Dutch people the phrase "Papa fume une pipe" is famous,
and as such it is noteworthy, but its significance will bewilder the
French.. :)

The problem is, for the vast majority of words we will have to choose a 
non-notable quote as an example.

Maybe we don't understand each other: maybe this isn't the case with other 
languages, but in a dictionary of Serbian that I have, *EACH* word has at 
least one, usually two, sometimes even more examples, from common words like 
"what" to rare and complex words. At least for Serbian and other languages 
with the same lexicographic tradition we will want to do the same in the 
Wiktionary.

 That is fine with me. Use notable quotes if possible, and when you do not 
have them, use non-notable quotes.

...

   I do not think grepping Project Gutenberg makes much sense. If anything
it helps you find occurrences of the word, but you have to be selective about
what to include. That is an editorial process, and just the fact that a
word is used does not make for a good idiom in the UW.

 I think it would make sense for rarer words, which might occur once or a
few times in the entire Gutenberg corpus. Of course, at the end a human
editor has to decide whether a quote is really relevant.

Related to grepping Project Gutenberg, have you considered adding
information on word frequency? Only a single new table is needed,
"frequency", with fields "spellingID", "corpus" and "frequency";
eventually "corpus" could become "corpusID".

 There is more to frequency than that. If anything grep may find it but
you still need to know the meaning of the word in that text. When the

This is why "frequency" is related to "spelling" and not to "meaning". Change 
of meaning is not the only useful thing which could be gathered from a 
frequency analysis.

 word gets a new meaning, that is what you want to know .. I will speak
to the people of Rotterdam CS (developers of Lucene) about just these
kinds of issues.

A corpus could (would) be as small as a single text, usually a book. So, you 
would be able to extract frequency in any desired timespan, or observe how it 
changes over time.
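
The proposed table can be sketched as follows. The field names "spellingID", "corpusID" and "frequency" come from the proposal above; the separate "corpus" table with a "year" column, and all the rows, are assumptions invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Assumed helper table: one row per text, dated by an assumed "year" column.
CREATE TABLE corpus (
    corpusID INTEGER PRIMARY KEY,
    title    TEXT NOT NULL,
    year     INTEGER NOT NULL
);
-- The proposed table: counts hang off the spelling, not the meaning.
CREATE TABLE frequency (
    spellingID INTEGER NOT NULL,
    corpusID   INTEGER NOT NULL REFERENCES corpus(corpusID),
    frequency  INTEGER NOT NULL,
    PRIMARY KEY (spellingID, corpusID)
);
""")

# Invented sample data: three single-text corpora and one spelling (ID 42).
conn.executemany(
    "INSERT INTO corpus VALUES (?, ?, ?)",
    [(1, "Text A", 1850), (2, "Text B", 1900), (3, "Text C", 1950)],
)
conn.executemany(
    "INSERT INTO frequency VALUES (?, ?, ?)",
    [(42, 1, 3), (42, 2, 7), (42, 3, 1)],
)


def total_frequency(spelling_id, first_year, last_year):
    """Sum the counts of one spelling over corpora dated in a timespan."""
    row = conn.execute(
        """
        SELECT COALESCE(SUM(f.frequency), 0)
        FROM frequency AS f JOIN corpus AS c USING (corpusID)
        WHERE f.spellingID = ? AND c.year BETWEEN ? AND ?
        """,
        (spelling_id, first_year, last_year),
    ).fetchone()
    return row[0]


print(total_frequency(42, 1850, 1900))
```

Because each corpus is a single dated text, narrowing or shifting the year range is enough to observe how the count changes over time.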

  I agree, it is not a dialect, but if some words are recognisable as
belonging to a distinctive group of words, they should somehow be marked
as belonging to it, and I was suggesting that they are marked in the same
way they would be marked as belonging to a certain dialect. Another
solution would be to use the "wordrelation" table instead, even though it
isn't meant to be used in that way :)

 Collection is the mechanism of choice for this. Relation is to indicate
thesaurus-like structures, including antonyms..

Wait, "collection" is related to "meaning" and not to "word". I don't see how 
it could be used for such things. It would be possible to have names of 
Asterix characters/Disney characters/whatever grouped together, and that is 
good. But it still isn't possible to distinguish between two groups of 
translations of names of Asterix characters. It would be possible to have all 
words related to seamanship grouped together, but it would not be possible to 
mark which of these are dockworkers' slang, which are sailors' slang, and 
which are not slang.
 You can have multiple collections; the names of one translation 
tradition can be one collection, those of the other tradition another. The two 
words for Asterix can be used as synonyms.
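
A small sketch of the "multiple collections" idea, using the seamanship example raised earlier. It is an in-memory stand-in only: the collection names and entry labels are invented placeholders, and it deliberately leaves open whether a collection member is a word or a meaning.

```python
from collections import defaultdict

# Hypothetical stand-in for a collection membership table.
# All names and entries below are invented placeholders.
members = defaultdict(set)


def add(collection, *entries):
    """Put one or more entries into the named collection."""
    members[collection].update(entries)


def in_all(*names):
    """Entries that belong to every one of the named collections."""
    sets = [members[name] for name in names]
    return sorted(set.intersection(*sets)) if sets else []


add("seamanship", "term-1", "term-2", "term-3")
add("sailors' slang", "term-1")
add("dockworkers' slang", "term-2")

# Seamanship terms that are also sailors' slang.
sailor_terms = in_all("seamanship", "sailors' slang")

# Seamanship terms that appear in neither slang collection.
slang = members["sailors' slang"] | members["dockworkers' slang"]
plain_terms = sorted(members["seamanship"] - slang)

print(sailor_terms, plain_terms)
```

An entry can sit in several collections at once, so the slang distinction becomes a matter of which collections overlap rather than a field on the entry itself.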

Thanks,
    GerardM
