Hi Everyone,
I've just finished my write-up on optimizing the set of languages that
could eventually be used for language detection on French Wikipedia.
(Spanish, Italian, and German are still to come.)
The full write-up
<https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_Optimization_for_frwiki_eswiki_itwiki_and_dewiki>
gives details on corpus creation and cleanup, performance stats, and more.
Briefly, about 15% of "low-performing" queries (those with < 3 results) are
easily filtered junk, and 65% of the remainder are not in an identifiable
language (e.g., names, acronyms, and more junk).
Based on a sample of 682 low-performing queries on frwiki that are in some
language, about 70% are in French, 10-15% are in English, about 7-12% are
in Arabic, fewer than 3% are in Portuguese, German, and Spanish, and there
are a handful of other languages present.
Because of the relatively low percentage of low-performing queries that are
relevant, we will still need to do an A/B test before discussing deploying
this to frwiki. An A/B test on enwiki
<https://phabricator.wikimedia.org/T121542> is in the works at the moment.
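
To make the "relatively low percentage" concrete, here is a rough
back-of-the-envelope calculation in Python. It is only a sketch: it treats
the approximate figures above as point estimates, and it assumes that
"relevant" means "in an identifiable language other than French", which is
my reading rather than a definition from the write-up.

```python
# Rough funnel arithmetic using the approximate figures quoted above.
# All three values are rounded estimates, not exact measurements.
junk_share = 0.15            # low-performing queries that are easily filtered junk
unidentifiable_share = 0.65  # share of the remainder with no identifiable language
french_share = 0.70          # share of identifiable-language queries that are French

# Queries that survive junk filtering AND are in some identifiable language.
in_some_language = (1 - junk_share) * (1 - unidentifiable_share)

# Assumption: only non-French queries can benefit from language detection.
non_french = in_some_language * (1 - french_share)

print(f"in an identifiable language: {in_some_language:.1%}")
print(f"identifiable and not French: {non_french:.1%}")
```

Under those assumptions, roughly 30% of low-performing queries are in some
identifiable language, and only about 9% are in a language other than
French.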
The optimal settings for frwiki, based on these experiments, would be to
use the TextCat query-based models for French, English, Arabic, Russian,
Chinese, Thai, Greek, Armenian, Hebrew, and Korean (fr, en, ar, ru, zh, th,
el, hy, he, ko), using the default 3000-ngram models.
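
For anyone unfamiliar with how this kind of detection works, below is a
minimal Python sketch of rank-order n-gram classification, the general
technique TextCat is based on. It is not the TextCat code used in
production: the function names, the toy training sentences, and the
two-language setup are all made up for illustration, and real models are
built from much larger query corpora.

```python
from collections import Counter

MODEL_SIZE = 3000  # mirrors the default 3000-ngram model size mentioned above


def ngram_profile(text, max_n=5, size=MODEL_SIZE):
    """Return {ngram: rank} for the `size` most frequent 1..max_n-grams."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Rank 0 is the most frequent n-gram.
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(size))}


def out_of_place(query, model, penalty=MODEL_SIZE):
    """Sum of rank differences; n-grams missing from the model cost `penalty`."""
    return sum(
        abs(rank - model[gram]) if gram in model else penalty
        for gram, rank in query.items()
    )


def detect(text, models):
    """Return the language code whose model is closest to the query text."""
    query = ngram_profile(text)
    return min(models, key=lambda lang: out_of_place(query, models[lang]))


# Hypothetical toy training text, far too small for real use.
TOY_TEXT = {
    "fr": "la recherche sur wikipédia en français donne peu de résultats pour cette requête",
    "en": "the search on wikipedia in english gives few results for this query",
}
models = {lang: ngram_profile(text) for lang, text in TOY_TEXT.items()}

print(detect("résultats de la recherche", models))
```

Restricting the set of models to the languages listed above means a query
can only ever be assigned to one of those languages.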
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation