I think the final idea would be very similar to wikidata search.
Concerning wikidata search: could we reuse the code for language search
and trigger the search on the backend?
The JS code will send a query (using action=query) and
will generate a new search request on Wikidata. Analyzing these logs
will be hard because we won't be able to associate the original query
with the query sent to Wikidata.
By importing Wikidata into the relevancy lab we could get a rough idea
of the impact on the ZRR.
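If the backend were to issue that Wikidata request itself, it could look roughly like this. The endpoint and action=query parameters are the standard public MediaWiki search API; the helper name is purely illustrative:

```python
# Sketch: build the action=query full-text search request that a
# server-side Wikidata integration might send, instead of having the
# JS code fire it from the browser.
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def wikidata_search_url(query, language="en", limit=5):
    """Return the API URL for a full-text search on Wikidata."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
        "uselang": language,
    }
    return WIKIDATA_API + "?" + urlencode(params)

print(wikidata_search_url("Douglas Adams"))
```

Because the request would be built and sent server-side, the original query and the query sent to Wikidata would appear in the same request context, which is one way around the log-association problem.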
On 04/11/2015 at 21:07, Erik Bernhardson wrote:
Taking the above into consideration and reviewing what we have in the
brainstorming session, the set of ideas seems to be the following:
1. Do language detection on more than just zero-result queries; how about
queries that only return 1 or 2 results?
* Seems useful and doable, but will only affect satisfaction and not
the zero result rate. Still possibly worthwhile.
* This should be relatively easy to test with relevancy lab
2. Determine the language to search in via something other than
language detection (headers, geolocation, etc)
* Working up a couple heuristics wouldn't be too hard. The
webrequests table in hive has the accept language header and
geolocation info as well as the query string, so we could extract
a set of queries to test with
3. Integrate wikidata search
* This looks to be
https://en.wikipedia.org/wiki/MediaWiki:Wdsearch.js
* We could integrate that more directly, can't be tested by
relevancy lab. It is basically just an additional set of results
below the existing results.
* Would need a significant cleanup to pass code review, but it's not
particularly hard to do
4. Translate the query from the provided language into the language
of the wiki being searched on
* This seems "very hard". Not only do we have to correctly detect
the language the user input, but then we have to translate that
into a second language
* The CX service might be able to provide us a translation endpoint
that works with whatever they are currently using, but will likely
have high latency. Our inability (currently) to do async requests
in php makes it harder to hide that latency.
5. Build an index that contains the titles from all wikis, but not much
else. This could be used to suggest the user search on other wikis (or
to inform the code that does actual searches on other wikis)
* This could be somewhat tested in relevancy lab, but first we would
have to build something to actually combine all the titles into
the same index.
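For idea 2, one cheap heuristic is simply to rank the languages in the Accept-Language header by q-value. A minimal sketch, assuming we have the raw header string from the webrequests table:

```python
# Sketch of one heuristic from the brainstorm: guess the user's preferred
# languages from the Accept-Language header, ordered by quality value.
def parse_accept_language(header):
    """Return deduplicated base language codes, highest q-value first."""
    langs = []
    for part in header.split(","):
        pieces = part.strip().split(";")
        code = pieces[0].strip().lower().split("-")[0]  # "en-US" -> "en"
        q = 1.0
        for param in pieces[1:]:
            param = param.strip()
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        if code:
            langs.append((code, q))
    # stable sort keeps header order for equal q-values
    langs.sort(key=lambda cq: -cq[1])
    seen, ordered = set(), []
    for code, _ in langs:
        if code not in seen:
            seen.add(code)
            ordered.append(code)
    return ordered

print(parse_accept_language("fr-CH, fr;q=0.9, en;q=0.8, de;q=0.7"))
# -> ['fr', 'en', 'de']
```

Geolocation could feed a second, lower-confidence signal in the same spirit (country to likely languages), and the two could be compared on the extracted query set.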
I think any of the top three could be worked on; the first and the
second can be validated through relevancy lab. The third takes a
completely different approach and is not easily testable outside of
production, but may be useful. The fourth is "very hard" and I think
we should leave it alone for now. The fifth and final idea was only
put forth once, but is interesting. I'm not sure how valuable it would
be, though.
On Tue, Nov 3, 2015 at 3:55 PM, Erik Bernhardson <ebernhardson@wikimedia.org> wrote:
In terms of user language data we have, within the webrequests
table in hive we have the accept language header and we have
geolocation information. This table also contains the query
strings so we can extract the exact search terms and feed that
information into relevancy lab.
On Tue, Nov 3, 2015 at 3:29 PM, Kevin Smith <ksmith@wikimedia.org> wrote:
So do we think we should favor the "try to guess the user's
language(s)" item over others that would benefit from the
relevance lab? Are there steps we could/should take in
advance, such as analyzing whatever user language data we
have, or instrumenting to get more if we don't have enough?
Kevin Smith
Agile Coach, Wikimedia Foundation
On Tue, Nov 3, 2015 at 2:25 PM, Trey Jones <tjones@wikimedia.org> wrote:
Sorry I didn't respond to this sooner!
I really like the idea of trying to detect what languages
the user can read, and searching in (a subset of) those.
This wouldn't benefit from relevance lab testing, though.
It'll need to be measured against the user satisfaction
metric. (BTW, Do we have a sense of how many users have
info we can detect for this?)
I think the biggest problem with language detection is the
quality of the language detector. The Elasticsearch
plugin we tested has a strong Romanian bias when run on our
queries (Erik got about 38% Romanian on 100K enwiki
searches, which is crazy, and I got 0% accuracy for
Romanian on my much smaller tagged corpus of failed (zero
results) queries to enwiki). Most of the time, I would
expect queries sent to the wrong wiki to fail (though
there are some exceptions), but a query in English that
does get hits in rowiki is going to just look wrong most
of the time.
There are several proposals for improving language
detection in the etherpad, and we can work on them in
parallel, since any given one could be better than any
other one. (We don't want to make 100 of them, but a few
to test and compare would be nice—there may also be
reasonable speed/accuracy tradeoffs to be made, e.g., 2%
decrease in accuracy for 2x speed is a good deal.)
We need training and evaluation data. I see a few ways of
getting it. The easy, lower-quality way is to just take
queries from a given wiki and assume they are in the
language in question (e.g., eswiki queries are in
Spanish). Easy, not 100% accurate, unlimited supply. The
hard, higher-quality way is to hand annotate a corpus of
queries. This is slow, but doable. I can do on the order
of 1000 queries in a day—more if I were less accurate and
more willing to toss stuff into the junk pile. I couldn't
do it for a week straight, though, without going crazy. A
possible middle of the road approach would be to create a
feedback loop and run detectors on our training data and
review and remove items that are not in the desired
language (we could also start by filtering things that are
not in the right character set, like removing all Arabic,
Cyrillic, and Chinese from enwiki, frwiki, and eswiki
queries). If we want thousands of hand-annotated queries,
we need to get annotating!
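The character-set filtering step mentioned above could be as simple as checking the dominant Unicode script of each query. A rough sketch, where the Latin-wiki list and the 50% threshold are assumptions:

```python
# Sketch of the "filter by character set" idea for cleaning training
# data: drop queries whose dominant script can't match the wiki's
# language (e.g. Cyrillic or CJK text collected from enwiki).
import unicodedata

LATIN_WIKIS = {"enwiki", "frwiki", "eswiki"}

def dominant_script_is_latin(query):
    """True if more than half of the alphabetic characters are Latin."""
    letters = [c for c in query if c.isalpha()]
    if not letters:
        return False
    latin = sum(1 for c in letters
                if unicodedata.name(c, "").startswith("LATIN"))
    return latin / len(letters) > 0.5

def filter_training_queries(wiki, queries):
    """Keep only queries plausibly in the wiki's script."""
    if wiki in LATIN_WIKIS:
        return [q for q in queries if dominant_script_is_latin(q)]
    return queries

queries = ["unix time stamp", "метод опорных векторов", "机器学习"]
print(filter_training_queries("enwiki", queries))
# -> ['unix time stamp']
```

This only removes the obvious cross-script junk; distinguishing, say, French from Spanish queries would still need the detector-feedback loop described above.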
I think we can use the relevance lab to help evaluate a
language detector (at least with respect to zero results
rate). We could run the detector against a pile of
zero-results queries, then group the queries by detected
language, and run them against the relevant wiki (if we
have room in labs for the indexes, and we update the
relevance lab tools to support choosing a target wiki to
search). We wouldn't be comparing "before" and "after",
but just measuring the zero results rate against the
target wiki. As always when we're using the zero-results
rate, there's no guarantee that we'll be giving good results,
just results (e.g., "unix time stamp" queries with English
words fail on enwiki but sometimes work on zhwiki for some
reason, but that's not really better).
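The grouping-and-re-running procedure could be sketched like this; `detect` and `search` stand in for the real language detector and the relevance-lab search against the target wiki, both hypothetical callables here so the bookkeeping is visible:

```python
# Sketch of the proposed evaluation: group zero-results queries by
# detected language, rerun each group against that language's wiki,
# and report the per-language zero results rate.
from collections import defaultdict

def zrr_by_detected_language(queries, detect, search):
    """detect(q) -> language code or None; search(q, lang) -> hit count."""
    grouped = defaultdict(list)
    for q in queries:
        lang = detect(q)
        if lang:  # queries with no confident detection are skipped
            grouped[lang].append(q)
    report = {}
    for lang, qs in grouped.items():
        zero = sum(1 for q in qs if search(q, lang) == 0)
        report[lang] = {"queries": len(qs), "zrr": zero / len(qs)}
    return report
```

As Trey notes, this gives an absolute zero-results rate per target wiki rather than a before/after comparison, but it would still rank detectors against each other cheaply.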
I'm somewhat worried about being able to reduce the
targeted zero results rate by 10%. In my test[1], only 12%
of non-DOI zero-results queries were "in a language", and
only about a third got results when searched in the
"correct" (human-determined) wiki. I didn't filter bots
other than the DOI bot, and some non-language queries
(e.g., names) might get results in another wiki, but there
may not be enough wiggle room. There's a lot of junk in
other languages, too, but maybe filtering bots will help
more than I dare presume.
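A quick back-of-the-envelope check makes the worry concrete: with roughly 12% of zero-results queries "in a language" and about a third of those rescued by the right wiki, the ceiling on the targeted zero-results-rate reduction is around 4%, well short of the 10% goal:

```python
# Upper bound on ZRR reduction from redirecting language-detected
# queries, using the approximate figures from Trey's test.
in_a_language = 0.12   # fraction of non-DOI zero-results queries
rescued = 1 / 3        # fraction of those that get results on the right wiki
max_reduction = in_a_language * rescued
print(f"{max_reduction:.1%}")  # -> 4.0%
```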
[1] https://www.mediawiki.org/wiki/User:TJones_%28WMF%29/Notes/Cross_Language_Wiki_Searching#Perfect_identification.2C_ignoring_non-language_queries
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
On Mon, Nov 2, 2015 at 9:03 PM, Erik Bernhardson <ebernhardson@wikimedia.org> wrote:
It measures the zero results rate for 1 in 10 search
requests via the CirrusSearchUserTesting log that we used
last quarter.
On Mon, Nov 2, 2015 at 6:01 PM, Oliver Keyes <okeyes@wikimedia.org> wrote:
Define this "does it do anything?" test?
On 2 November 2015 at 19:58, Erik Bernhardson <ebernhardson@wikimedia.org> wrote:
Now that we have the feature deployed (behind a feature flag), and
have an initial "does it do anything?" test going out today, along
with an upcoming integration with our satisfaction metrics, we need
to come up with how we will try to further move the needle forward.
For reference, these are our Q2 goals:
* Run an A/B test for a feature that:
  * Uses a library to detect the language of a user's search query.
  * Adjusts results to match that language.
* Determine from the A/B test results whether this feature is fit to
  push to production, with the aim to:
  * Improve search user satisfaction by 10% (from 15% to 16.5%).
  * Reduce the zero results rate for non-automata search queries by 10%.
We brainstormed a number of possibilities here:
https://etherpad.wikimedia.org/p/LanguageSupportBrainstorming
We now need to decide which of these ideas we should prioritize. We
might want to take into consideration which of these can be
pre-tested with our relevancy lab work, such that we can prefer to
work on the things we think will move the needle the most. I'm really
not sure which of these to push forward on, so let us know which you
think can have the most impact, or where the expected impact could be
measured with relevancy lab with minimal work.
_______________________________________________
discovery mailing list
discovery@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/discovery
--
Oliver Keyes
Count Logula
Wikimedia Foundation