I am forwarding you the first (incomplete) version of the page
http://meta.wikimedia.org/wiki/User:Millosh/Dictionaries .
At the end of this month I'll have some software for generating
dictionaries this way. So it would be good to hear what you think
about it, and whether anyone is interested in joining this project.
Maybe Gerard can think about how to implement such a thing in
OmegaWiki, too :)
The page is not complete (stages 2 and 3 are not described), but I
think you can follow the idea anyway. I'll complete the page over the
next few weeks and let you know.
* * *
At the moment I am working on a Serbian dictionary of synonyms.
During that work I got some ideas about work on the Wiktionaries:
let's say that one word with its synonyms/translations is enough for
one entry in a Wiktionary. (Maybe I should read some Wiktionary
documentation, but I suppose that this is the minimum.)
In short, this could be done for dozens of languages on dozens of
Wiktionaries.
==Stage 1, one language dictionary==
*Take some dictionary between English (or whatever language) and your
language. Of course, take it in a machine-readable format (not
encrypted).
*Take the first word in (let's say) English.
*Take its first translation in your language. Connect this word in
your language with the other translations of the English word.
*Find which other English words have the same translation. Connect
the word with the other translations of those words.
*You will get a list of connected words. There will be a lot of
noise, but you will be able to devise some simple methods for
cleaning most of it. The rest of the noise will be cleaned by humans,
because this is a wiki :)
*Of course, you may do this with many different dictionaries...
Imagine that we analyzed two words from language A in the dictionary
"language B -> language A" and got the following results (of course,
this is a simplified table):
<pre>
A58 - B65 - A58, A43, A21, A63
- B69 - A58, A28, A21, A38
- B71 - A58, A43, A21, A88
- B89 - A58, A43, A21, A63
A21 - B31 - A21, A43, A76, A20
- B44 - A21, A43, A39, A22
- B65 - A58, A43, A21, A63
- B69 - A58, A28, A21, A38
- B71 - A58, A43, A21, A88
- B89 - A58, A43, A21, A63
</pre>
We may say that whenever a word from language A appears together with
the word A58 among the translations of some word in language B, that
connection gets one point. Counting points for the words A58 and A21
gives the following situation:
<pre>
A58(A21) = 4
A58(A43) = 3
A58(A63) = 2
A58(A28) = 1
A58(A38) = 1
A58(A88) = 1
A21(A43) = 5
A21(A58) = 4
A21(A63) = 2
A21(A28) = 1
A21(A38) = 1
A21(A88) = 1
A21(A76) = 1
A21(A39) = 1
A21(A20) = 1
A21(A22) = 1
</pre>
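For readers who prefer code, the counting above can be sketched in a few lines of Python. The dictionary below is just the sample table from this mail; the function name and the data shape are mine, not part of any existing tool.

```python
from collections import Counter

# The "language B -> language A" dictionary from the example table:
# each B word maps to its list of A translations.
b_to_a = {
    "B65": ["A58", "A43", "A21", "A63"],
    "B69": ["A58", "A28", "A21", "A38"],
    "B71": ["A58", "A43", "A21", "A88"],
    "B89": ["A58", "A43", "A21", "A63"],
    "B31": ["A21", "A43", "A76", "A20"],
    "B44": ["A21", "A43", "A39", "A22"],
}

def connection_scores(word, dictionary):
    """Count, for every other A word, how many B entries list it
    together with `word` among their translations."""
    scores = Counter()
    for translations in dictionary.values():
        if word in translations:
            for other in translations:
                if other != word:
                    scores[other] += 1
    return scores

# Reproduces the table above, e.g. A58(A21) = 4 and A21(A43) = 5.
scores_a58 = connection_scores("A58", b_to_a)
scores_a21 = connection_scores("A21", b_to_a)
```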
To begin with, this may mean:
*The closest synonym to the word A58 is the word A21.
*The closest synonym to the word A21 is the word A43.
*The words A21, A58, A43 and A63 are synonyms (a group which we may call "G(As)1").
*It seems that the words A28, A38, A88, A76, A39, A20 and A22 are not
related to the group G(As)1. We will keep those connections in
memory, but we will not write them into the dictionary. Imagine that
the word ''blood'' literally means "red bird" in some language. Of
course, there are some ''red birds'' in the area where that language
is spoken. So, in this sense, blood will be connected with the word
"bird" and, almost surely, with some species of bird. However, this
will be the only connection to birds; the other connections will be
inside the descriptions for erythrocyte, lymphocyte, heart and so
on. Of course, mistakes are possible, but we can analyze the results :)
*This may be very useful for smaller languages which have some
bilingual dictionaries (where language B is English). We may be
able to generate monolingual Wiktionaries for all such languages.
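The grouping rule in the bullets above could be sketched like this, assuming a simple score threshold. The cutoff of 2 is my reading of the example, not something the mail fixes; real data would need tuning.

```python
def synonym_group(word, scores, threshold=2):
    """Keep the word plus every connection scoring at least
    `threshold`; weaker connections stay in memory but are not
    written into the dictionary."""
    return {word} | {w for w, n in scores.items() if n >= threshold}

# Scores for A58 taken from the example above.
scores_a58 = {"A21": 4, "A43": 3, "A63": 2, "A28": 1, "A38": 1, "A88": 1}
group = synonym_group("A58", scores_a58)
# group is the set called G(As)1 in the text: A58, A21, A43 and A63.
```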
==Stage 2, two languages dictionary==
(To be continued.)
==Stage 3, cross language dictionaries==
(To be continued.)
http://de.wikipedia.org/wiki/Liste_der_byzantinischen_Kaiser
This is a list that can be worked on with translations and basic
data. Where should we place links to the lists we can work on?
Ciao, Sabine
Basically I am with Sabine and support the idea.
Yet I want to warn against doing it now and doing it quickly, so as
to avoid certain pitfalls.
I suggest, rather, developing and training bots, data, and algorithms
with the test wikipedia only for the time being, where specific
situations can easily be (re)created without risk of havoc.
What follows are details and reasons; you can safely stop reading
here if you are not interested.
I've been mass-inserting data into the Ripuarian test wikipedia in a
semi-automated way, generated from several small database-like
collections, such as:
- names and ISO codes of languages having Wikipedias,
- dates and mottos of carnival parades in the city of Cologne for
the last 185 years,
- redirects for dialectal and spelling variants,
- etc.
So I have (limited) experience.
Pitfalls to be avoided.
----------------------
If we have already inserted data into a WP, and later a refined
version of that data becomes available, we want to pass it on to
the WP. This becomes complicated when an article already exists
for a record. Thus we may strategically choose to export data as
late as possible, in as complete a state as possible, when general
additions and amendments have become unlikely and the data
structure is stable.
We can safely replace articles when we can determine that they have
been unaltered since our own last update - i.e. we need to be
able to look at the version history for those cases.
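That check can be sketched in a few lines, assuming the revision history is available as a list of usernames, newest first. The function and the data shape are mine, not an existing bot API.

```python
def safe_to_replace(history, bot_user):
    """The article may be overwritten only if our own bot made the
    most recent edit, i.e. nobody has altered it since our last
    update. `history` lists revision authors, newest first."""
    return bool(history) and history[0] == bot_user

# Untouched since our last run: safe to replace.
safe_to_replace(["MyBot", "Editor1", "MyBot"], "MyBot")   # True
# A human edited after us: hands off, review needed.
safe_to_replace(["Editor1", "MyBot"], "MyBot")            # False
```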
When an article has been conventionally updated by an editor, that
may mean he altered data which we originally supplied, and that
we have to update our source before we may re-export data to the
WPs again. It is possible that an update made in one WP should
influence others as well, though this is not necessarily so.
When we say we supply only some specific data to an article, e.g.
an infobox, then we can re-read the infobox and, if it has not
been altered, rewrite it for an update.
We can also use such infoboxes to import new data from the WPs when
they have been altered, e.g. when someone has died. We should,
however, have some protection against collecting errors, garbage,
and vandal drivel.
Both such uses should imho be documented by comments in the
wikicode of the articles in question. Editors must know the
implications of their edits.
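One way the marked-section idea could work is sketched below. The comment markers are invented for illustration; nothing here is an existing convention or library call.

```python
import re

# Hypothetical comment markers around the bot-maintained part.
START = "<!-- BOT-DATA START -->"
END = "<!-- BOT-DATA END -->"

def extract_bot_section(wikicode):
    """Return the text between the markers, or None if absent."""
    m = re.search(re.escape(START) + r"(.*?)" + re.escape(END),
                  wikicode, re.S)
    return m.group(1) if m else None

def update_bot_section(wikicode, last_export, new_data):
    """Rewrite the marked section only if it still matches our last
    export; if an editor changed it, return None so a human (or an
    import step) can look at it instead."""
    current = extract_bot_section(wikicode)
    if current is None or current != last_export:
        return None
    return wikicode.replace(START + current + END,
                            START + new_data + END)
```

Usage: if `update_bot_section(page_text, "|pop=100", "|pop=120")` returns a string, the bot may save it; a `None` result routes the page to review.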
Summarizing all this, I'd suggest carefully planning, and test
driving, all applications that have the least chance of being more
than sheer create-articles-and-leave-them-alone-forever projects.
Language.
--------
Another field needing attention is language.
A pretty huge number of names (of persons, places, languages, etc.)
are identical between languages, are transliterated somehow, or
undergo systematic transformations (e.g. of the kind that Estonian
versions of male names have 'as' appended to them, afaik), etc.
The rule of thumb is that for lesser-known distant things (places,
languages, persons, etc.) the existence of special or irregular
translations is very unlikely.
That may mean we can compile a set of transformation rules and
an exception lookup mechanism (e.g. in WiktionaryZ) and assume
fairly safely that, when no exception is found, we can use the
rules.
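A minimal sketch of the rules-plus-exceptions idea, assuming a small per-language rule table. The Estonian 'as'-suffix rule comes from the text above; the exception entry and all names are invented examples, and in practice the exception list would live somewhere like WiktionaryZ.

```python
# Exceptions are looked up first (invented sample entry).
EXCEPTIONS = {("et", "Peter"): "Peeter"}

# Fallback systematic transformation per language code.
RULES = {"et": lambda name: name + "as"}

def localized_name(lang, name):
    """Exception lookup first, then the language's rule,
    else the name is taken to be identical between languages."""
    if (lang, name) in EXCEPTIONS:
        return EXCEPTIONS[(lang, name)]
    rule = RULES.get(lang)
    return rule(name) if rule else name
```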
Naturally, when this assumption fails, we need a feedback path
from the respective language community that allows us to
"repair" errors. Since in most Wikipedias there are editors
reviewing all, or most, new articles, we can assume feedback will
be rather quick and reliable.
Finding the right grammar, wording, etc. for automatically
generated non-tabular content is quite an interesting task which
I'll not address here any further ;-)
Community aspects.
-----------------
Wikis not having alert proofreaders should imho not be filled
with much automated content, since this might be a remarkable
hindrance to community buildup.
The amount of newly inserted automated data should be determinable
by wiki admins, and generally it might be wise to make it somehow
proportional to the number of edits in a given time period, so as
not to overload the community.
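As a toy illustration of such a throttle: the ratio and the ceiling below are invented knobs an admin might set, not figures anyone has proposed.

```python
def bot_edit_budget(recent_community_edits, ratio=0.1, cap=50):
    """Allow automated insertions up to some fraction of the
    community's recent edit count, with an absolute ceiling, so
    bot activity never swamps the human activity."""
    return min(int(recent_community_edits * ratio), cap)

bot_edit_budget(200)    # a small wiki: 20 automated pages allowed
bot_edit_budget(10000)  # a busy wiki: capped at 50
```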
How wiki admins find the right figures should imho be left to
them; valid suggestions might be public voting, or experience of
how thoroughly data can be verified.
Also, keeping data up to date needs imho to be negotiated with the
communities. I bet we'll receive several interesting ideas on how
this could be accomplished without interfering too much with
potential human editors.
Greetings to all
Purodha
-- e-mail: <wikidata-l.mail.wikimedia.org(a)publi.purodha.net>
Well, I am forwarding my blog post here. It is probably not perfect,
but a start on what I have in mind. I just get interrupted all too
often.
best, sabine
-------- Original Message --------
Subject: [words & more] Creating contents for many Wikipedias
Date: Thu, 31 Aug 2006 01:41:45 -0700 (PDT)
From: Sabine Cretella <sabine_cretella(a)yahoo.it>
To: sabine_cretella(a)yahoo.it
The basis of this is a project about mass content creation on meta
and Wikidata. Mass content creation is an idea of user Millosh, and
yes, he is right about it: this is how I already did certain things
for the Neapolitan wikipedia.
There is so much easy-to-create content out there that Wikipedias
could share easily, and even if we do not get Wikidata implemented
in the wikipedias, we can use the data in databases to create stubs
using mail merge (in OpenOffice.org or Word) and upload them with a
bot (see my other post of today).
This means: if we now start to add all names of:
continents
countries
cities
rivers
mountains
monuments
places
yes, even streets, because some of them have translations
lakes
seas
animals
plants
names of people (these are translated too)
etc.
And then we start to translate them. At the same time, people take
care of adding statistical data to a table that is exactly about
this (if we cannot do this with a separate wikidata installation ...
anyway, we do not need relational information for now, just
information).
How many articles (stubs) can be created this way, and how many
people can work on them?
We also should not forget about film and book titles, and the Greek
and Roman gods (I suppose other parts of the world will have other
material on such things).
It is really a huge project, but it is feasible ... there are many
of us who have similar goals.
Where to start: well, we need the infoboxes translated into as many
languages as possible, and then we need the place names etc.
translated. This must be combined with a datasheet.
Example:
Castua: http://it.wikipedia.org/wiki/Castua
We have the box on the right side with all the statistical/basic
information; all of that can be translated into many languages. The
first sentence of the stub will simply be the definition from
WiktionaryZ.
So most of it, like stato (state), regione (region), etc., can be
translated within WZ, and in that way we can populate the templates
used on all the wikis. As for the not-visible part of the template,
I would use either a lingua franca (English) or simply the same
names that are visible.
There's not much to it: we need the lists to start off with. If we
use pagefromfile.py to upload the ready pages, existing pages with
the same name will be skipped and written to a logfile; these are
then the only ones someone has to look after manually.
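The mail-merge step need not involve OpenOffice at all. This sketch fills a stub template from a list of records and emits the {{-start-}}/{{-stop-}} page format that pagefromfile.py reads (marker names from the pywikipedia tool's defaults, as I recall them); the template text and the sample figures are invented for illustration.

```python
# Hypothetical stub template; the bold first words give the page title.
TEMPLATE = "'''{name}''' is a town in {region} with {population} inhabitants."

def make_pagefromfile(records):
    """Render one {{-start-}}...{{-stop-}} block per record, the
    input format expected by pagefromfile.py."""
    pages = []
    for rec in records:
        body = TEMPLATE.format(**rec)
        pages.append("{{-start-}}\n" + body + "\n{{-stop-}}")
    return "\n".join(pages)

# Sample record; figures invented for illustration.
towns = [{"name": "Castua", "region": "Istria", "population": "10000"}]
print(make_pagefromfile(towns))
```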
If sooner or later we get a pure wikidata application that takes the
translations from WZ and combines them with the rest of the data,
that would be great ... since that would spare us from correcting
the entries by hand when there are corrections.
Using the Geoboxes, we already have a good way to compare lists ...
but does it make sense to do it that way right now? Or does it make
more sense to prepare all possible translations now, to be ready
once we can have wikidata for geographical entries?
Hmmm ... I was interrupted quite often while writing this blog post,
and I don't have time to re-read it now. So sorry if things seem a
bit mixed up.
--
Posted by Sabine Cretella to words & more
<http://sabinecretella.blogspot.com/2006/08/creating-contents-for-many-wikip…>
at 8/31/2006 10:41:00 AM