Hey Everyone,

David dug into how the Cybozu ES language detection plugin works in more detail, and worked out how to limit the set of languages it considers and how to retrain its models.

The result is a big improvement that brings performance more in line with what we're seeing from TextCat.

Initial results are below, for queries with spaces added before and after (which improved performance with the old models; I'll verify that's still the case).

Below are the summary stats for all the old language models, for the old models limited to "useful" languages, and for new models retrained on the (admittedly messy) query data used for TextCat training. The evaluation set is the manually tagged enwiki sample.

The full details will be posted on this page shortly:
    https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_Evaluation

All languages, old models
f0.5    f1      f2      recall  prec   total   hits    misses
54.4%   47.4%   41.9%   39.0%   60.4%  775     302     198

Limited languages, old models (en,es,zh-cn,zh-tw,pt,ar,ru,fa,ko,bn,bg,hi,el,ta,th)
f0.5    f1      f2      recall  prec   total   hits    misses
75.6%   71.0%   67.0%   64.5%   79.0%  775     500     133

Retrained languages (en,es,zh,pt,ar,ru,fa,ko,bn,bg,hi,el,ta,th)
f0.5    f1      f2      recall  prec   total   hits    misses
81.8%   79.2%   76.9%   75.4%   83.5%  775     584     115
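For anyone who wants to recheck the F-scores above: they're the standard weighted harmonic mean of precision and recall, where β < 1 favors precision and β > 1 favors recall. A quick sketch in Python (not from our actual evaluation scripts), using the precision and recall from the first table as a sanity check:

```python
def f_beta(precision, recall, beta):
    """F-beta score: weighted harmonic mean of precision and recall.
    beta < 1 weights precision more heavily; beta > 1 weights recall more."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# precision and recall from the "All languages, old models" row
prec, rec = 0.604, 0.390
for beta in (0.5, 1, 2):
    print(f"f{beta}: {f_beta(prec, rec, beta):.1%}")
```

Each computed value matches the corresponding table entry to within rounding of the displayed precision and recall.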

David suggests that this means we should go with TextCat, since it's easier to integrate, and I agree. However, this test was pretty quick and easy to run, so if we improve the training data, we can easily rebuild the models and test again.

Overall, it's clear that limiting languages to the "useful" ones for a given wiki makes sense, and training on query data rather than generic language data helps, too!

—Trey

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation