[Foundation-l] WikiTrans Support for All Languages

Jeffrey V. Merkey jmerkey at wolfmountaingroup.com
Fri Aug 18 04:06:27 UTC 2006


Jeffrey V. Merkey wrote:

>Sabine Cretella wrote:
>
>>Yes, this is a Wikimedia project, but it has fewer than 10,000 
>>articles. Therefore your entry does not go into the first part of 
>>the list but under "other languages", and that is where I have now 
>>moved your entry.
>>
>>I hope you understand that we know very well how relevant Cherokee 
>>is among the regional languages (please don't misunderstand the 
>>term "regional"). Nonetheless, all of us have to follow these 
>>rules, and many languages are still below the 10,000 hurdle. I hope 
>>to see the number 1,000 soon - let me know when you reach it ... it 
>>is hard work for a small community. Once you have 1,000, the step 
>>to 5,000 is much faster, and from there to 10,000 faster again.
>>
>>I know you are doing loads of work with machine translation on the 
>>non-Wikimedia wiki - unfortunately, up to now I have not managed to 
>>install a proper font to be able to read that wiki. I tried more 
>>than once, but for some reason my computer refuses to install the 
>>font :-(
>>
>>Best wishes from Italy,
>>
>>Sabine
>
>Sabine,
>
>I think it's time to approach this subject based on your explanation 
>of language evolution. I'd like to make you an offer to speed up 
>Italian translation (and any other language) for your wiki.
>
>At present, I have WikiTrans adapted to read XML dumps from the 
>English Wikipedia. It dissects English using the CMU link grammar 
>parser, performs a word-by-word lexicon translation, then runs a 
>conjugator and verb constructor over the translated sentences, 
>reordering noun-verb pairs and morphemes and outputting into a 
>target language with proper person, tense, and plurality. WikiTrans 
>is set up to use hierarchical lexicons and hierarchical thesauruses 
>for English. I can convert over 98% of English at present. The 
>lexicon, thesaurus, conjugators, and inference engine are language 
>neutral by design -- I can use them on any language.
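>
>Roughly, the per-sentence flow looks like the sketch below (heavily 
>simplified Python; the helper names are illustrative, not the actual 
>WikiTrans internals):
>
>    import re
>
>    def split_sentences(text):
>        return re.split(r'(?<=[.!?])\s+', text.strip())
>
>    def lexicon_lookup(tokens, lexicon):
>        # Word-by-word substitution; unknown words pass through.
>        return [lexicon.get(t.lower(), t) for t in tokens]
>
>    def translate_article(text, lexicon, reorder, conjugate):
>        out = []
>        for sentence in split_sentences(text):
>            tokens = sentence.split()
>            tokens = lexicon_lookup(tokens, lexicon)
>            tokens = reorder(tokens)    # noun-verb reordering
>            tokens = conjugate(tokens)  # person, tense, plurality
>            out.append(' '.join(tokens))
>        return ' '.join(out)
>
>    # e.g. translate_article("The dog runs",
>    #                        {"the": "il", "dog": "cane",
>    #                         "runs": "corre"},
>    #                        lambda t: t, lambda t: t)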
>
>I also have an AI inference engine which can tune the output to specific 
>dialects. It works as follows:
>
>After I output into the target language, in order to "teach" the 
>inference engine a set of rules, I take a dozen or so articles, go 
>through them by hand and correct them, then run the inference 
>engine, which compares the two versions and records tense and other 
>minor corrections -- in other words, it "learns" how to construct 
>properly for a unique dialect. I can currently tune it to output 
>Giduwa or Otali in Cherokee by altering the lexicon hierarchy for a 
>target dialect.
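>
>The learning step amounts to diffing the machine output against the 
>hand-corrected version and recording each substitution as a table 
>rule. A simplified sketch (illustrative only; this version works on 
>raw tokens rather than parse structure):
>
>    from difflib import SequenceMatcher
>
>    def learn_rules(machine, corrected, rules):
>        # Record each hand correction as a phrase -> phrase rule.
>        m = SequenceMatcher(None, machine, corrected)
>        for op, i1, i2, j1, j2 in m.get_opcodes():
>            if op == 'replace':
>                rules[tuple(machine[i1:i2])] = corrected[j1:j2]
>        return rules
>
>    def apply_rules(tokens, rules):
>        # Greedy longest-match rewrite using the learned table.
>        out, i = [], 0
>        while i < len(tokens):
>            for n in range(len(tokens) - i, 0, -1):
>                key = tuple(tokens[i:i+n])
>                if key in rules:
>                    out.extend(rules[key])
>                    i += n
>                    break
>            else:
>                out.append(tokens[i])
>                i += 1
>        return out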
>
>I can parse and translate the entire 7 GB Wikipedia XML dump in 
>about 15 minutes. What can be done here is for me to output into, 
>say, Italian, speeding up the process of editing for a target 
>language, with a proofreader inputting the articles.
>
>In the current version, every fifth article or so in Cherokee 
>requires me to proofread it and correct very subtle errors in tense 
>or noun disambiguation, so it's not perfect, but it's better than 
>writing them by hand, and most of the output is over 95% correct.
>
>If you are interested, download the lexicons and Roget thesaurus 
>from the FTP server at ftp.wikigadugi.org/wiki, replace the Cherokee 
>words with Italian (or whatever), and let me know where to get the 
>files. I will then post runs for any target language; you can 
>correct a dozen or so long articles and send the files back, and I 
>will run the inference engine and create table rules that should 
>give you close to 98% accuracy when compared against the English 
>Wikipedia. It will rapidly speed translation into other languages 
>and give the non-English wikis a better chance of catching up. 
>After we build the syntax rule databases for each language, I'll 
>give them back to you with an additional extension that will let 
>you not only translate Wikipedia, but also put a front end on a 
>proxy server so web browsers can access websites and have pages and 
>web content translated in real time, similar to what Google is 
>doing now.
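>
>Re-targeting the lexicon is just a matter of re-populating the 
>translation side of each entry. A sketch, assuming a simple 
>tab-separated "english<TAB>target" file layout (an assumption for 
>illustration, not necessarily the format on the FTP server):
>
>    def load_lexicon(path):
>        # Assumed layout: one "english<TAB>target" pair per line.
>        lexicon = {}
>        with open(path, encoding='utf-8') as f:
>            for line in f:
>                parts = line.rstrip('\n').split('\t')
>                if len(parts) == 2:
>                    lexicon[parts[0]] = parts[1]
>        return lexicon
>
>    def save_lexicon(lexicon, path):
>        with open(path, 'w', encoding='utf-8') as f:
>            for english, target in sorted(lexicon.items()):
>                f.write(english + '\t' + target + '\n')
>
>    # cherokee = load_lexicon('lexicon-chr.txt')
>    # italian = {en: '' for en in cherokee}  # fill in by hand
>    # save_lexicon(italian, 'lexicon-it.txt')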

I want to use this with MediaWiki for external links from articles, 
allowing real-time translation of linked content from the other 
sites Wikipedia refers to, if they are in English. I have not 
addressed going from non-English to other languages, but these 
extensions could be developed, and I am not opposed to opening up 
the translator. I want to get further along with the 
language-neutral abstraction layers in WikiTrans before we open it 
up.
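
The proxy idea is simple: fetch the external page, run the text 
through the translator, and serve the result. A minimal sketch in 
Python (standard library only; an illustration, not the planned 
MediaWiki extension itself):

    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.request import urlopen
    from urllib.parse import urlparse, parse_qs

    def translate(text):
        # Placeholder: the real translation pipeline runs here.
        return text

    class TranslatingProxy(BaseHTTPRequestHandler):
        def do_GET(self):
            # Expects requests like /?url=http://example.com/page
            query = parse_qs(urlparse(self.path).query)
            target = query.get('url', [None])[0]
            if not target:
                self.send_error(400, 'missing url parameter')
                return
            page = urlopen(target).read()
            body = translate(page.decode('utf-8', 'replace'))
            self.send_response(200)
            self.send_header('Content-Type',
                             'text/html; charset=utf-8')
            self.end_headers()
            self.wfile.write(body.encode('utf-8'))

    # HTTPServer(('', 8080), TranslatingProxy).serve_forever()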

Jeff

>If you and the other non-English editors are interested, download 
>and populate the lexicons (I am at 230,000 words and phrases at 
>present), and every time Wikipedia posts an XML dump, I'll 
>machine-translate it; you can then download the output and use it 
>to populate the other wikis. We will need a system to exclude 
>articles already reviewed and translated.
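>
>Tracking what has been reviewed could be as simple as keeping a 
>file of article titles and skipping those on the next dump run -- a 
>sketch (keyed on titles here; revision IDs would be more robust):
>
>    def load_reviewed(path):
>        # One already-reviewed article title per line.
>        try:
>            with open(path, encoding='utf-8') as f:
>                return set(l.strip() for l in f if l.strip())
>        except FileNotFoundError:
>            return set()
>
>    def articles_to_translate(dump_titles, reviewed):
>        # Skip anything a human has already reviewed.
>        return [t for t in dump_titles if t not in reviewed]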
>
>My wife and son-in-law are going to help me get the German lexicons 
>and rule sets created; the other languages are wide open.
>
>Jeff



