Hey Everyone,

David dug into how the Cybozu ES language detection plugin works in more detail, and worked out how to limit the set of languages it considers and how to retrain its models.

The result is a big improvement that brings performance more in line with what we're seeing from TextCat.

Initial results are below, for queries with spaces added before and after (which improved performance with the old models; I'll verify that's still the case).

Below are the summary stats for all the old language models, for the old models limited to "useful" languages, and for new models retrained on the (admittedly messy) query data used for TextCat training. The evaluation set is the manually tagged enwiki sample.

The full details will be posted on this page shortly:
    https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_Evaluation

All languages, old models
f0.5    f1      f2      recall  prec   total   hits    misses
54.4%   47.4%   41.9%   39.0%   60.4%  775     302     198

Limited languages, old models (en,es,zh-cn,zh-tw,pt,ar,ru,fa,ko,bn,bg,hi,el,ta,th)
f0.5    f1      f2      recall  prec   total   hits    misses
75.6%   71.0%   67.0%   64.5%   79.0%  775     500     133

Retrained languages (en,es,zh,pt,ar,ru,fa,ko,bn,bg,hi,el,ta,th)
f0.5    f1      f2      recall  prec   total   hits    misses
81.8%   79.2%   76.9%   75.4%   83.5%  775     584     115
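For anyone who wants to recheck the F-scores above: they're the standard weighted harmonic mean of precision and recall, where β < 1 favors precision and β > 1 favors recall. A quick sketch in Python (not from our actual evaluation scripts), using the precision and recall from the first table as a sanity check:

```python
def f_beta(precision, recall, beta):
    """F-beta score: weighted harmonic mean of precision and recall.
    beta < 1 weights precision more heavily; beta > 1 weights recall more."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# precision and recall from the "All languages, old models" row
prec, rec = 0.604, 0.390
for beta in (0.5, 1, 2):
    print(f"f{beta}: {f_beta(prec, rec, beta):.1%}")
```

Each computed value matches the corresponding table entry to within rounding of the displayed precision and recall.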

David suggests that this means we should go with TextCat, since it's easier to integrate, and I agree. However, this test was pretty quick and easy to run, so if we improve the training data, we can easily rebuild the models and test again.

Overall, it's clear that limiting languages to the "useful" ones for a given wiki makes sense, and training on query data rather than generic language data helps, too!

—Trey

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation