I am forwarding you the first (incomplete) version of the page
http://meta.wikimedia.org/wiki/User:Millosh/Dictionaries .
At the end of this month I'll have some software for generating
dictionaries this way. So it would be good to hear what you think
about it, and whether anyone is interested in joining this project.
Maybe Gerard can think about how to implement such a thing in
OmegaWiki, too :)
The page is not complete (stages 2 and 3 are not described), but I
think you can follow the idea anyway. I'll complete the page over the
next few weeks and let you know.
* * *
At the moment I am working on a Serbian dictionary of synonyms.
During that work I got some ideas about work on the Wiktionaries:
let's say that one word with its synonyms/translations is enough for
one entry in a Wiktionary. (Maybe I should read some Wiktionary
documentation, but I suppose that this is the minimum.)
In short, this could be done for dozens of languages on dozens of
Wiktionaries.
==Stage 1, one language dictionary==
*Take some dictionary between English (or whatever language) and your
language. Of course, take it in a machine-readable format (not
encrypted).
*Take the first word in (let's say) English.
*Take its first translation in your language. Connect this word in
your language with the other translations of the English word.
*Find which other English words have the same translation. Connect
the word with the other translations of those words.
*You will get a list of connected words. There will be a lot of
noise, but you will be able to devise some simple methods for
cleaning most of it. The rest of the noise will be cleaned by humans,
because this is a wiki :)
*Of course, you may do this with many different dictionaries...
Imagine that we analyzed two words from language A in the dictionary
"language B -> language A" and got the following results (of course,
this is a simplified table):
<pre>
A58 - B65 - A58, A43, A21, A63
- B69 - A58, A28, A21, A38
- B71 - A58, A43, A21, A88
- B89 - A58, A43, A21, A63
A21 - B31 - A21, A43, A76, A20
- B44 - A21, A43, A39, A22
- B65 - A58, A43, A21, A63
- B69 - A58, A28, A21, A38
- B71 - A58, A43, A21, A88
- B89 - A58, A43, A21, A63
</pre>
We may say that whenever a word from language A appears together with
the word A58 among the translations of some word in language B, that
connection gets one point. Counting points for the words A58 and A21
gives the following situation:
<pre>
A58(A21) = 4
A58(A43) = 3
A58(A63) = 2
A58(A28) = 1
A58(A38) = 1
A58(A88) = 1
A21(A43) = 5
A21(A58) = 4
A21(A63) = 2
A21(A28) = 1
A21(A38) = 1
A21(A88) = 1
A21(A76) = 1
A21(A39) = 1
A21(A20) = 1
A21(A22) = 1
</pre>
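For readers who prefer code, the counting above can be sketched in a few lines of Python. The dictionary below is just the sample table from this mail; the function name and the data shape are mine, not part of any existing tool.

```python
from collections import Counter

# The "language B -> language A" dictionary from the example table:
# each B word maps to its list of A translations.
b_to_a = {
    "B65": ["A58", "A43", "A21", "A63"],
    "B69": ["A58", "A28", "A21", "A38"],
    "B71": ["A58", "A43", "A21", "A88"],
    "B89": ["A58", "A43", "A21", "A63"],
    "B31": ["A21", "A43", "A76", "A20"],
    "B44": ["A21", "A43", "A39", "A22"],
}

def connection_scores(word, dictionary):
    """Count, for every other A word, how many B entries list it
    together with `word` among their translations."""
    scores = Counter()
    for translations in dictionary.values():
        if word in translations:
            for other in translations:
                if other != word:
                    scores[other] += 1
    return scores

# Reproduces the table above, e.g. A58(A21) = 4 and A21(A43) = 5.
scores_a58 = connection_scores("A58", b_to_a)
scores_a21 = connection_scores("A21", b_to_a)
```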
To begin with, this may mean:
*The closest synonym to the word A58 is the word A21.
*The closest synonym to the word A21 is the word A43.
*The words A21, A58, A43 and A63 are synonyms (a group which we may call "G(As)1").
*It seems that the words A28, A38, A88, A76, A39, A20 and A22 are not
related to the group G(As)1. We will keep those connections in
memory, but we will not write them into the dictionary. Imagine that
the word ''blood'' literally means "red bird" in some language. Of
course, there are some ''red birds'' in the area where that language
is spoken. So, in this sense, blood will be connected with the word
"bird" and, almost surely, with some species of bird. However, this
will be the only connection to birds; the other connections will be
inside the descriptions for erythrocyte, lymphocyte, heart and so
on. Of course, mistakes are possible, but we can analyze the results :)
*This may be very useful for smaller languages which have some
bilingual dictionaries (where language B is English). We may be
able to generate monolingual Wiktionaries for all such languages.
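The grouping rule in the bullets above could be sketched like this, assuming a simple score threshold. The cutoff of 2 is my reading of the example, not something the mail fixes; real data would need tuning.

```python
def synonym_group(word, scores, threshold=2):
    """Keep the word plus every connection scoring at least
    `threshold`; weaker connections stay in memory but are not
    written into the dictionary."""
    return {word} | {w for w, n in scores.items() if n >= threshold}

# Scores for A58 taken from the example above.
scores_a58 = {"A21": 4, "A43": 3, "A63": 2, "A28": 1, "A38": 1, "A88": 1}
group = synonym_group("A58", scores_a58)
# group is the set called G(As)1 in the text: A58, A21, A43 and A63.
```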
==Stage 2, two languages dictionary==
(To be continued.)
==Stage 3, cross language dictionaries==
(To be continued.)
http://de.wikipedia.org/wiki/Liste_der_byzantinischen_Kaiser
This is a list that can be worked on with translations and basic
data. Where should we place links to the lists we can work on?
Ciao, Sabine
Basically I am with Sabine and support the idea.
Yet I want to warn against doing it now and doing it quickly, so as
to avoid certain pitfalls.
I suggest, rather, developing and training bots, data, and algorithms
with the test wikipedia only for the time being, where specific
situations can easily be (re)created without risk of havoc.
What follows are details and reasons; you can safely stop reading
here if you are not interested.
I've been mass-inserting data into the Ripuarian test wikipedia in a
semi-automated way, generated from several small database-like
collections, such as:
- names and ISO codes of languages having Wikipedias,
- dates and mottos of carnival parades in the city of Cologne for
the last 185 years,
- redirects for dialectal and spelling variants,
- etc.
So I have (limited) experience.
Pitfalls to be avoided.
----------------------
If we have already inserted data into a WP, and later a refined
version of that data becomes available, we want to pass it on to
the WP. This becomes complicated when an article already exists
for a record. Thus we may strategically choose to export data as
late as possible, in as complete a state as possible, when general
additions and amendments have become unlikely and the data
structure is stable.
We can safely replace articles when we can determine that they have
been unaltered since our own last update - i.e. we need to be
able to look at the version history for those cases.
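That check can be sketched in a few lines, assuming the revision history is available as a list of usernames, newest first. The function and the data shape are mine, not an existing bot API.

```python
def safe_to_replace(history, bot_user):
    """The article may be overwritten only if our own bot made the
    most recent edit, i.e. nobody has altered it since our last
    update. `history` lists revision authors, newest first."""
    return bool(history) and history[0] == bot_user

# Untouched since our last run: safe to replace.
safe_to_replace(["MyBot", "Editor1", "MyBot"], "MyBot")   # True
# A human edited after us: hands off, review needed.
safe_to_replace(["Editor1", "MyBot"], "MyBot")            # False
```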
When an article has been conventionally updated by an editor, that
may mean he altered data which we originally supplied, and that
we have to update our source before we may re-export data to the
WPs again. It is possible that an update made in one WP should
influence others as well, though this is not necessarily so.
When we say we supply only some specific data to an article, e.g.
an infobox, then we can re-read the infobox and, if it has not
been altered, rewrite it for an update.
We can also use such infoboxes to import new data from the WPs when
they have been altered, e.g. when someone has died. We should,
however, have some protection against collecting errors, garbage,
and vandal drivel.
Both such uses should imho be documented by comments in the
wikicode of the articles in question. Editors must know the
implications of their edits.
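One way the marked-section idea could work is sketched below. The comment markers are invented for illustration; nothing here is an existing convention or library call.

```python
import re

# Hypothetical comment markers around the bot-maintained part.
START = "<!-- BOT-DATA START -->"
END = "<!-- BOT-DATA END -->"

def extract_bot_section(wikicode):
    """Return the text between the markers, or None if absent."""
    m = re.search(re.escape(START) + r"(.*?)" + re.escape(END),
                  wikicode, re.S)
    return m.group(1) if m else None

def update_bot_section(wikicode, last_export, new_data):
    """Rewrite the marked section only if it still matches our last
    export; if an editor changed it, return None so a human (or an
    import step) can look at it instead."""
    current = extract_bot_section(wikicode)
    if current is None or current != last_export:
        return None
    return wikicode.replace(START + current + END,
                            START + new_data + END)
```

Usage: if `update_bot_section(page_text, "|pop=100", "|pop=120")` returns a string, the bot may save it; a `None` result routes the page to review.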
Summarizing all this, I'd suggest carefully planning, and test
driving, all applications that have the least chance of being more
than sheer create-articles-and-leave-them-alone-forever projects.
Language.
--------
Another field needing attention is language.
A pretty huge number of names (of persons, places, languages, etc.)
are identical between languages, are transliterated somehow, or
undergo systematic transformations (e.g. of the kind that Estonian
versions of male names have 'as' appended to them, afaik), etc.
The rule of thumb is that for lesser-known distant things (places,
languages, persons, etc.) the existence of special or irregular
translations is very unlikely.
That may mean we can compile a set of transformation rules and
an exception lookup mechanism (e.g. in WiktionaryZ) and assume
fairly safely that, when no exception is found, we can use the
rules.
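A minimal sketch of the rules-plus-exceptions idea, assuming a small per-language rule table. The Estonian 'as'-suffix rule comes from the text above; the exception entry and all names are invented examples, and in practice the exception list would live somewhere like WiktionaryZ.

```python
# Exceptions are looked up first (invented sample entry).
EXCEPTIONS = {("et", "Peter"): "Peeter"}

# Fallback systematic transformation per language code.
RULES = {"et": lambda name: name + "as"}

def localized_name(lang, name):
    """Exception lookup first, then the language's rule,
    else the name is taken to be identical between languages."""
    if (lang, name) in EXCEPTIONS:
        return EXCEPTIONS[(lang, name)]
    rule = RULES.get(lang)
    return rule(name) if rule else name
```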
Naturally, when this assumption fails, we need a feedback path
from the respective language community that allows us to
"repair" errors. Since in most Wikipedias there are editors
reviewing all, or most, new articles, we can assume feedback will
be rather quick and reliable.
Finding the right grammar, wording, etc. for automatically
generated non-tabular content is quite an interesting task which
I'll not address here any further ;-)
Community aspects.
-----------------
Wikis not having alert proofreaders should imho not be filled
with much automated content, since this might be a remarkable
hindrance to community buildup.
The amount of newly inserted automated data should be determinable
by wiki admins, and generally it might be wise to make it somehow
proportional to the number of edits in a given time period, so as
not to overload the community.
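As a toy illustration of such a throttle: the ratio and the ceiling below are invented knobs an admin might set, not figures anyone has proposed.

```python
def bot_edit_budget(recent_community_edits, ratio=0.1, cap=50):
    """Allow automated insertions up to some fraction of the
    community's recent edit count, with an absolute ceiling, so
    bot activity never swamps the human activity."""
    return min(int(recent_community_edits * ratio), cap)

bot_edit_budget(200)    # a small wiki: 20 automated pages allowed
bot_edit_budget(10000)  # a busy wiki: capped at 50
```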
How wiki admins find the right figures should imho be left to
them; valid suggestions might be public voting, or experience of
how thoroughly data can be verified.
Also, keeping data up to date needs imho to be negotiated with the
communities. I bet we'll receive several interesting ideas on how
this could be accomplished without interfering too much with
potential human editors.
Greetings to all
Purodha
-- e-mail: <wikidata-l.mail.wikimedia.org(a)publi.purodha.net>
Well, I am forwarding my blog post here. It is probably not perfect,
but a start on what I have in mind. I just get interrupted all too
often.
best, sabine
-------- Original Message --------
Subject: [words & more] Creating contents for many Wikipedias
Date: Thu, 31 Aug 2006 01:41:45 -0700 (PDT)
From: Sabine Cretella <sabine_cretella(a)yahoo.it>
To: sabine_cretella(a)yahoo.it
The basis of this is a project about mass content creation on meta
and Wikidata. Mass content creation is an idea of user Millosh, and
yes, he is right about it: this is how I already did certain things
for the Neapolitan wikipedia.
There is so much easy-to-create content out there that Wikipedias
could share easily, and even if we do not get Wikidata implemented
in the wikipedias, we can use the data in databases to create stubs
using mail merge (in OpenOffice.org or Word) and upload them with a
bot (see my other post of today).
This means: if we now start to add all names of:
continents
countries
cities
rivers
mountains
monuments
places
yes, even streets, because some of them have translations
lakes
seas
animals
plants
names of people (these are translated too)
etc.
And then we start to translate them. At the same time, people take
care of adding statistical data to a table that is exactly about
this (if we cannot do this with a separate wikidata installation ...
anyway, we do not need relational information for now, just
information).
How many articles (stubs) can be created this way, and how many
people can work on them?
We also should not forget about film and book titles, and the Greek
and Roman gods (I suppose other parts of the world will have other
material on such things).
It is really a huge project, but it is feasible ... there are many
of us who have similar goals.
Where to start: well, we need the infoboxes translated into as many
languages as possible, and then we need the place names etc.
translated. This must be combined with a datasheet.
Example:
Castua: http://it.wikipedia.org/wiki/Castua
We have the box on the right side with all the statistical/basic
information; all of that can be translated into many languages. The
first sentence of the stub will simply be the definition from
WiktionaryZ.
So most of it, like stato (state), regione (region), etc., can be
translated within WZ, and in that way we can populate the templates
used on all the wikis. As for the not-visible part of the template,
I would use either a lingua franca (English) or simply the same
names that are visible.
There's not much to it: we need the lists to start off with. If we
use pagefromfile.py to upload the ready pages, existing pages with
the same name will be skipped and written to a logfile; these are
then the only ones someone has to look after manually.
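The mail-merge step need not involve OpenOffice at all. This sketch fills a stub template from a list of records and emits the {{-start-}}/{{-stop-}} page format that pagefromfile.py reads (marker names from the pywikipedia tool's defaults, as I recall them); the template text and the sample figures are invented for illustration.

```python
# Hypothetical stub template; the bold first words give the page title.
TEMPLATE = "'''{name}''' is a town in {region} with {population} inhabitants."

def make_pagefromfile(records):
    """Render one {{-start-}}...{{-stop-}} block per record, the
    input format expected by pagefromfile.py."""
    pages = []
    for rec in records:
        body = TEMPLATE.format(**rec)
        pages.append("{{-start-}}\n" + body + "\n{{-stop-}}")
    return "\n".join(pages)

# Sample record; figures invented for illustration.
towns = [{"name": "Castua", "region": "Istria", "population": "10000"}]
print(make_pagefromfile(towns))
```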
If sooner or later we get a pure wikidata application that takes the
translations from WZ and combines them with the rest of the data,
that would be great ... since that would spare us from correcting
the entries by hand when there are corrections.
Using the Geoboxes, we already have a good way to compare lists ...
but does it make sense to do it that way right now? Or does it make
more sense to prepare all possible translations now, to be ready
once we can have wikidata for geographical entries?
Hmmm ... I was interrupted quite often while writing this blog post,
and I don't have time to re-read it now. So sorry if things seem a
bit mixed up.
--
Posted by Sabine Cretella to words & more
<http://sabinecretella.blogspot.com/2006/08/creating-contents-for-many-wikip…>
at 8/31/2006 10:41:00 AM