Hi Olya, Lucie, and Wikidatans,
Very interesting projects. And thanks for publishing, Lucie - very helpful!
With regard to Swahili and Arabic (both African languages!) and Esperanto,
and to leveraging Google Translate / GNMT, I've been looking at this Google
GNMT gif image -
https://1.bp.blogspot.com/-jwgtcgkgG2o/WDSBrwu9jeI/AAAAAAAABbM/2Eobq-N9_nYe…
- and wondering how the triples of the Linked Open Data in Wikidata's
structured Knowledge Base (KB) would stream through it in multiple
smaller languages.
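(To make concrete what I mean by triples here - purely my own sketch, not
something from the paper or from GNMT - the snippet below pulls a handful of
statements about one item from the public Wikidata SPARQL endpoint, with
labels in a smaller language such as Swahili where they exist, falling back
to English; those label pairs are the kind of structured facts I imagine
streaming through a translation model. The item Q42 and the user-agent
string are just placeholders.)

import requests

# Illustrative sketch only: fetch a few Wikidata statements about an item
# (Q42, Douglas Adams, as a stand-in) with Swahili labels where available,
# falling back to English where no Swahili label exists yet.
ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?propertyLabel ?valueLabel WHERE {
  wd:Q42 ?p ?value .
  ?property wikibase:directClaim ?p .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "sw,en". }
}
LIMIT 10
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "triple-streaming-sketch/0.1 (example)"},
)
for row in response.json()["results"]["bindings"]:
    # Each row is roughly one (property, object) pair for the fixed subject Q42.
    print(row["propertyLabel"]["value"], "->", row["valueLabel"]["value"])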
I couldn't deduce from this paper -
https://arxiv.org/pdf/1803.07116.pdf -
for example, from this passage ...
2.1 Encoding the Triples
The encoder part of the model is a feed-forward architecture that encodes
the set of input triples into a fixed-dimensionality vector, which is
subsequently used to initialise the decoder. Given a set of un-ordered
triples F_E = {f_1, f_2, ..., f_R : f_j = (s_j, p_j, o_j)}, where s_j, p_j
and o_j are the one-hot vector representations of the respective subject,
property and object of the j-th triple, we compute an embedding h_{f_j} for
the j-th triple by forward propagating as follows:
  h_{f_j} = q(W_h [W_{in} s_j ; W_{in} p_j ; W_{in} o_j]) ,   (1)
  h_{F_E} = W_F [h_{f_1} ; ... ; h_{f_{R-1}} ; h_{f_R}] ,     (2)
where h_{f_j} is the embedding vector of each triple f_j, and h_{F_E} is a
fixed-length vector representation for all the input triples F_E. q is a
non-linear activation function, and [... ; ...] represents vector
concatenation. W_{in}, W_h, W_F are trainable weight matrices. Unlike
(Chisholm et al., 2017), our encoder is agnostic with respect to the order
of input triples. As a result, the order of a particular triple f_j in the
triples set does not change its significance towards the computation of the
vector representation of the whole triples set, h_{F_E}.
... whether this would address streaming triples through GNMT. Would it?
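(For what it's worth, here is roughly how I read the encoder equations
above - a minimal sketch only, where the vocabulary and layer sizes, the
choice of tanh for q, and the fixed number of triples R are my own
assumptions rather than values from the paper:)

import numpy as np

# Rough sketch of the triple encoder in Eqs. (1)-(2) above; all sizes, the
# choice of tanh for q, and the fixed number of triples R are assumptions.
VOCAB, EMB, HID, OUT, R = 1000, 64, 128, 256, 3

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(EMB, VOCAB))    # shared one-hot -> embedding
W_h = rng.normal(scale=0.1, size=(HID, 3 * EMB))   # per-triple projection
W_F = rng.normal(scale=0.1, size=(OUT, R * HID))   # fixed-length set encoding

def one_hot(idx):
    v = np.zeros(VOCAB)
    v[idx] = 1.0
    return v

def encode_triple(s_idx, p_idx, o_idx):
    # Eq. (1): h_{f_j} = q(W_h [W_in s_j ; W_in p_j ; W_in o_j]), with q = tanh here.
    s, p, o = (W_in @ one_hot(i) for i in (s_idx, p_idx, o_idx))
    return np.tanh(W_h @ np.concatenate([s, p, o]))

def encode_triple_set(triples):
    # Eq. (2): h_{F_E} = W_F [h_{f_1} ; ... ; h_{f_R}], used to initialise the decoder.
    return W_F @ np.concatenate([encode_triple(*t) for t in triples])

# Example: R = 3 (subject, property, object) triples given as vocabulary indices.
h_FE = encode_triple_set([(1, 2, 3), (4, 5, 6), (7, 8, 9)])
print(h_FE.shape)  # (256,)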
And since Swahili, Arabic and Esperanto are all active languages in -
https://translate.google.com/ - no further coding on the
GNMT side would be necessary. (I'm also curious how best WUaS might grow
small languages not yet among Wikipedia/Wikidata's 287-301 languages or
GNMT's ~100+ languages.)
How could your Wikidata / Wikibabel work interface with Google GNMT more
fully over time, building on your great Wikidata coding/papers?
Cheers,
Scott
https://en.wikipedia.org/wiki/User:Scott_WUaS
On Mon, Jun 18, 2018 at 5:17 AM, Gerard Meijssen <gerard.meijssen(a)gmail.com>
wrote:
Hoi,
On average there is little or no support for subjects that have to do with
Africa. When I check the articles for politicians, for instance, I find that
even current presidents, let alone ministers, are missing from African
Wikipedias. So it is wonderful that there have been projects that deal with
gaps, but what if there is hardly anything to begin with?
What this approach brings us is at least information: basic information in
lists, infoboxes, and maybe an additional line of text.
What we apparently have not done is learn from the Cebuano experience. The
biggest issue there was not the quality of the new information; it was the
lack of integration with Wikidata. Everything was new and did not link with
what we already knew. What we bring in this way is integrated information,
and as long as the data is not saved as an article, the quality provided
improves as Wikidata gains better intel.
If anything, the experience of the Welsh Wikipedia brings us more than
GapFinder or the Tiger editathon, because it is more in line with this
approach.
Thanks,
GerardM
On 18 June 2018 at 13:19, Amir E. Aharoni <amir.aharoni(a)mail.huji.ac.il>
wrote:
2018-06-18 2:12 GMT+03:00 Olya Irzak <oirzak(a)gmail.com>:
Dear Wikidata community,
We're working on a project called Wikibabel to machine-translate parts
of Wikipedia into underserved languages, starting with Swahili.
In hopes that some of our ideas can be helpful to machine translation
projects, we wrote a blogpost about how we prioritized which pages to
translate, and what categories need a human in the loop:
https://medium.com/@oirzak/wikibabel-equalizing-information-access-on-a-budget-4038f750e90e
Rumor has it that the Wikidata community has thought deeply about
information access. We'd love your feedback on our work. Please let us know
about past / ongoing machine translation related projects so we can learn
from & collaborate with them.
I'm not sure how deeply the Wikidata community has thought about it.
One project that does something related to what you're doing is GapFinder
( https://www.mediawiki.org/wiki/GapFinder ). As far as I know, the
GapFinder frontend is not actively developed, but the recommendation API
behind it is being actively maintained and developed; you should ask
the Research team for more info (see
https://www.mediawiki.org/wiki/Wikimedia_Research ).
Project Tiger is also doing something similar:
https://meta.wikimedia.org/wiki/Project_Tiger_Editathon_2018
As a general comment, displaying machine-translated text in a way that
makes it appear to have been written by humans is misleading and damaging.
I don't know any Swahili, but in the languages that I can read (Russian,
Hebrew, Catalan, Spanish, French, German), the quality of machine
translation is at best good as an aid to a human writing a translation, and
it's never good for actual reading. I also don't understand why you invest
credits into pre-machine-translating articles that people can
machine-translate for free, but maybe I'm missing something about how your
project works.
_______________________________________________
Wikidata mailing list
Wikidata(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
--
- Scott MacLeod - Founder & President
-
https://twitter.com/WorldUnivAndSch
- World University and School
-
http://worlduniversityandschool.org
-
http://scottmacleod.com
- CC World University and School - like CC Wikipedia with best STEM-centric
CC OpenCourseWare - is incorporated as a nonprofit university and school in
California, and is a U.S. 501(c)(3) tax-exempt educational organization.