Hey Wikimedia-search!
I’m Trey Jones, and I’m new to WMF (this is only my third week), and I
started this thread, though David really got it going.
There’s lots to digest here, and I’m sure I’ll retread certain ground
already covered, but below are my initial thoughts. Let me know if you
think any of these notes should end up in a wiki or Phab ticket
somewhere—I'm still trying to grok where to best document things. (And
think about everyone's comments, too, and whether they should be copied
elsewhere—it’s always a shame to lose track of good ideas.)
=Meta stuff=
Sorry this message is so long. I didn’t have time to write a short one.
(Alas, this is my greatest weakness, but at least I can admit it.)
I’ve tried to label ideas that could use some additional discussion with
(L)etters at the beginning of the first relevant paragraph.
=Results from other wikis=
I agree with the general consensus that n-grams aren’t great for language
detection on short strings. A quick skim of literature related to Oliver’s
cite (Kolkus and Rehurek 2009) points to Naive Bayes as a good method on
short strings.
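To make that concrete, here’s a toy sketch of the Naive Bayes idea
(character-level rather than the word/dictionary model Kolkus and Rehurek
actually describe, and with made-up function names) just to show how cheap
the scoring can be:

    import math
    from collections import Counter

    def train(samples_by_lang):
        """samples_by_lang: e.g. {'en': ['villalvernia', ...], 'ru': [...]}.
        Returns per-language character log-probabilities (Laplace-smoothed)."""
        models = {}
        for lang, samples in samples_by_lang.items():
            counts = Counter(ch for s in samples for ch in s.lower())
            total = sum(counts.values())
            vocab = len(counts) + 1  # +1 so unseen characters keep a small probability
            models[lang] = {
                'logprob': {ch: math.log((c + 1.0) / (total + vocab))
                            for ch, c in counts.items()},
                'unseen': math.log(1.0 / (total + vocab)),
            }
        return models

    def detect(query, models):
        """Return the language whose character model scores the query highest."""
        def score(lang):
            m = models[lang]
            return sum(m['logprob'].get(ch, m['unseen']) for ch in query.lower())
        return max(models, key=score)

    # e.g. models = train({'en': en_titles, 'ru': ru_titles})
    #      detect('Виллальверния', models)  ->  most likely 'ru'

The real thing would want word-level dictionaries (trained, say, from the
ns0 title dumps), but the scoring loop stays about this simple.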
I did notice that the slides attached to the old Cybozu lang-detect project
home page mention that short strings are a problem—but the slides are from
2010. David also mentioned that in his comments on T104505. Is Cybozu
lang-detect still a contender? Has anyone had a chance to run either the
latest version or the ES plugin on anything?
(A) I like the idea of running a cross-wiki test, though I can think of a
couple more ways to analyze the results than those listed in T104505. I assume
there are plenty of repeats in the top-N “no-results” queries, and probably
a Zipf/power law distribution. (I’m very curious to see what the
distribution actually looks like. What’s the max frequency / percentage
over a day for a given zero-results query?)
So, it would make sense to me to track not only raw numbers, but also
weighted numbers if the distribution in the top-N is very unequal.[1] And
of course, the “zero result” decrease should be weighted. It might also
make sense to look at the distribution of “zero result decrease” by number
of additional wikis searched. For example, what if all 234 results from
the French wiki for English queries (in David’s example table in T104505)
are subsumed by the 324 German wiki results. Is it still worth searching in
French?
[1] Caveat: it wouldn’t hurt to review the very top queries in any sample
by hand to look for trending topics that could skew the results over a
small time period. During the Women’s World Cup, I bet there were more
searches for names of various players, for example, than there normally
would be.
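Something like this is what I have in mind (a rough sketch, assuming we’ve
already parsed a day of logs into (query, result count) pairs; all the names
are made up):

    from collections import Counter

    def zero_result_report(rows, top_n=25):
        """rows: list of (query, result_count) pairs from one day of search logs.
        Prints the raw zero-result rate plus how concentrated the zero-result
        queries are, so we can decide whether weighting matters."""
        zero = Counter(q for q, n in rows if n == 0)
        total = len(rows)
        zero_total = sum(zero.values())
        print('zero-result rate (per search): %.2f%%' % (100.0 * zero_total / total))
        print('distinct zero-result queries:  %d' % len(zero))
        # Eyeball the head of the distribution: if it looks Zipf-y, the top few
        # queries account for a disproportionate share of the zero results.
        for rank, (q, count) in enumerate(zero.most_common(top_n), start=1):
            print('%2d. %6d  (%.3f%% of all searches)  %s'
                  % (rank, count, 100.0 * count / total, q))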
On the other hand—I read French much better than I read German—so I’d
prefer French results even if all the French results are duplicates of the
German results. Are results in a language I can’t read really any better
than no results?
This leads to a few new (to me) ideas:
(B) Make multilingual results configurable—If we know, say, the top four
wikis likely to give good results for queries from the English wiki are
Spanish, French, German, and Japanese, we could have an expanding section
(excuse the UI ugliness—someone with UI smarts can help us figure out how to
make it pretty, right?) to enable multilingual searching, so on English
Wikipedia I could ask for “backup results” in Spanish and French, but not German and
Japanese. Store those settings in a cookie for later, too, possibly with
some UI indicator that multilingual backup results are enabled. (Also, if
the cookie is available at query time, we could save unnecessary cross-wiki
searches the user couldn’t possibly use.)
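Very hand-wavy sketch of the query-time side, assuming we stored the opt-in
choices in a cookie (the cookie name and format here are invented, and the
real thing would live in the frontend/CirrusSearch rather than a standalone
script):

    def backup_wikis(cookies, offered=('es', 'fr', 'de', 'ja')):
        """Decide which other-language wikis to query for backup results, based
        on a hypothetical 'backup-langs' cookie whose value looks like 'es|fr'."""
        raw = cookies.get('backup-langs', '')
        if not raw:
            return []  # user never opted in, so skip cross-wiki searching entirely
        return [lang for lang in raw.split('|') if lang in offered]

    # e.g. backup_wikis({'backup-langs': 'es|fr'})  ->  ['es', 'fr']
    #      backup_wikis({})                         ->  []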
(C) And/or, multilingual results could be an extra click—“we didn’t find
English wiki results, but we found results that match your query in Spanish
and German, would you like to see them?” with links on “Spanish” and
“German”. I’d click the Spanish link, not the German link.
(D) Another sneakier idea that came to mind—which may not be technically
plausible—would be to find good results in another language and then check
for links back to wiki articles in the wiki the search came from. I do this
manually when I find something Google translate can’t handle in a
confidence-inspiring way: I search on Russian or Arabic Wikipedia, then
look on the nav bar for the “English” link. There are lots of options
here—showing just the English results with a link back to the language it
went through, or showing summaries for both, etc.
A silly example: a search for “Виллальверния” on en wiki gives no results.
But there is a ru wiki page with that exact title. It has a link to the
English wiki page for “Villalvernia”. (Don’t ask why someone is searching
for the Russian name of a tiny Italian commune on the English Wikipedia.
The answer is “because multilingualism”.)
Search: Виллальверния
Results: Villalvernia (crosswiki link from *Виллальверния*)
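The interlanguage links are already exposed through the MediaWiki API
(prop=langlinks), so a lookup like this sketch (function name made up) might
be all the sneaky part needs:

    import requests

    def link_back(title, source='ru', target='en'):
        """Given a title that matched on the source-language wiki, return the
        corresponding target-language article via interlanguage links, if any."""
        resp = requests.get(
            'https://%s.wikipedia.org/w/api.php' % source,
            params={'action': 'query', 'titles': title, 'prop': 'langlinks',
                    'lllang': target, 'format': 'json'},
            timeout=5)
        for page in resp.json()['query']['pages'].values():
            for link in page.get('langlinks', []):
                return link['*']
        return None

    # e.g. link_back('Виллальверния')  ->  'Villalvernia'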
(E) Another simpler idea than language detection would be basic character
set detection. A query in Cyrillic might get better results from the
Russian, Ukrainian, and Bulgarian wikis than the French and German ones,
even if French and German do better overall. Similarly Arabic script and
perhaps the Arabic, Persian, and Urdu wikis.
This might also be a reason why decent language detection is okay if it is
computationally much cheaper than excellent detection—we don’t have to
commit to “the one true answer”; maybe we could search the top two or three
other wikis.
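A sketch of what I mean, using the Unicode character names to guess the
dominant script (the script-to-wiki mapping here is just a guess, not based
on any data):

    import unicodedata
    from collections import Counter

    # Very rough script -> candidate-wiki mapping; the language choices are guesses.
    SCRIPT_TO_WIKIS = {
        'CYRILLIC': ['ru', 'uk', 'bg'],
        'ARABIC': ['ar', 'fa', 'ur'],
        'GREEK': ['el'],
        'HANGUL': ['ko'],
    }

    def candidate_wikis(query):
        """Guess which wikis to try based on the dominant script of the query."""
        scripts = Counter()
        for ch in query:
            if ch.isalpha():
                # Unicode names start with the script, e.g. "CYRILLIC SMALL LETTER A"
                scripts[unicodedata.name(ch, 'UNKNOWN').split(' ')[0]] += 1
        if not scripts:
            return []
        dominant, _ = scripts.most_common(1)[0]
        return SCRIPT_TO_WIKIS.get(dominant, [])

    # e.g. candidate_wikis('Виллальверния')  ->  ['ru', 'uk', 'bg']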
=Misspellings=
(F) I had a good chat with Erik earlier this afternoon, and I just
mentioned his “saerch” example that’s in T104468. Having recently looked
at the ES suggester docs at David’s suggestion, I asked Erik about the
prefix length… he was able to quickly find that it’s set to 2, so only
words that start with the two letters “sa” could ever be suggested. As Erik
suggested in T104468, this would be a great less-performant option to try
if we get no results (or crappy results)—we could loosen the params, for
example going back to prefix=1. For zero results, this may make sense—but
the old suggestion Erik noted, *saeqeh,* and the current one, *samech,*
both seem kinda unlikely—we could probably quantify that, esp. with some
user feedback.
And we should definitely look at the various params and decide what are
reasonable settings for “cheap and good” and what’s “more expensive but
better”.
David’s idea of a spelling dictionary makes sense, in that it limits the
scope of possibilities to compare against. But it probably won’t handle
names, or, probably, technical terms (e.g., “phonestheme”—or, in hard mode,
its plural).
It would be interesting to see the results of dropping the long tail from
what ES considers a match—min_doc_freq (
https://www.elastic.co/guide/en/elasticsearch/reference/1.6/search-suggeste…
) would help with that.
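Roughly, I’m imagining two sets of suggester params along these lines
(prefix_length and min_doc_freq are real term-suggester options; the index
name, field name, and threshold values below are made up for illustration,
not what Cirrus actually sends):

    import json
    import requests

    def did_you_mean(text, cheap=True, es='http://localhost:9200', index='enwiki_content'):
        """Illustrative 'did you mean' request with a cheap and an expensive profile."""
        opts = ({'prefix_length': 2, 'min_doc_freq': 0.001} if cheap
                else {'prefix_length': 1, 'min_doc_freq': 0.0})  # looser, pricier retry
        body = {'dym': {'text': text,
                        'term': dict({'field': 'suggest'}, **opts)}}
        resp = requests.post('%s/%s/_suggest' % (es, index), data=json.dumps(body))
        return resp.json()

    # Try the cheap settings first; if the query got zero results and no usable
    # suggestion came back, retry with did_you_mean(text, cheap=False).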
(How concerned are we with finding spelling errors in the wiki based on a
properly spelled search term? I used to hunt for and correct commonly
misspelled words on en wiki as a hobby.)
=Misc=
(G) Another interesting question: if we end up implementing several options
for improving search results, we will have to figure out how to stage them
and in what order to try/test for them.
And of course almost all of these will make more sense once we've looked at
some query data. That's my next task—to get access myself and start trying
to decide what seems most likely to have most impact.
Okay… I’m running out of steam a little, so I’m going to wrap it up for
now. I’ll think more about David’s comments on the three Epics and maybe
some other replies later.
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
On Wed, Jul 22, 2015 at 2:57 PM, Erik Bernhardson <
ebernhardson(a)wikimedia.org> wrote:
This thread started between a few of us, but has some
good ideas and
thoughts. Forwarding into the search mailing list (where we will endeavour
to have these conversations in the future).
Erik B
---------- Forwarded message ----------
From: Oliver Keyes <okeyes(a)wikimedia.org>
Date: Wed, Jul 22, 2015 at 8:31 AM
Subject: Re: Zero search results—how can I help?
To: David Causse <dcausse(a)wikimedia.org>
Cc: Trey Jones <tjones(a)wikimedia.org>, Erik Bernhardson <
ebernhardson(a)wikimedia.org>
Whoops; I guess point 4 is the second list ;p.
On 22 July 2015 at 11:30, Oliver Keyes <okeyes(a)wikimedia.org> wrote:
On 22 July 2015 at 10:55, David Causse
<dcausse(a)wikimedia.org> wrote:
> On 22/07/2015 15:21, Oliver Keyes wrote:
>>
>> Thanks; much appreciated. Point 3 directly relates to my work so it's
>> good to be CCd :).
>>
>> FWIW, this kind of detail on the specific things we're doing is
>> missing from the main search mailing list and could be used very much
>> there to inform people.
>
>
> I agree, my intent right now is still to learn from each other and
> build/use a friendly environment where engineers with an NLP background like
> Trey can work efficiently. When things are clearer it'd be great to
> share our plan.
>
>>
>> Oliver is already handling the executor IDs and distinguishing full
>> and prefix search, so nyah ;p.
>
> Great!
>
> Just to be sure: does this mean that a search count will be reduced to its
> executorID:
> - all requests with the same executorID return zero results -> add 1 to the
> zero result counter
> - if one of the requests returns a result -> do not increment the zero
> result counter
>
> If yes I think this will be the killer patch for Q1 :)
Executor IDs are stored and if a match is found in executor IDs <=120
seconds after that one, the later outcome is considered "the outcome".
If not, we assume no second round-trip was made and so go with
whatever happened first.
So if you make a request and it round-trips once and fails, failure.
Round-trip once and succeeds, success. Round-trip twice and fail both
times, failure. Round-trip twice and fail the first time and succeed
the second - one success, zero failures :). Erik wrote it, and I grok
the logic.
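In sketch form (names invented, and this is just my reading of the logic
above):

    def final_outcomes(events, window=120):
        """events: (timestamp, executor_id, result_count) tuples, sorted by time.
        If the same executor ID shows up again within `window` seconds, the later
        request's outcome replaces the earlier one; otherwise the first stands."""
        outcome = {}  # executor_id -> (timestamp, got_results)
        for ts, ex_id, hits in events:
            if ex_id in outcome and ts - outcome[ex_id][0] <= window:
                outcome[ex_id] = (ts, hits > 0)  # later round-trip wins
            elif ex_id not in outcome:
                outcome[ex_id] = (ts, hits > 0)
        failures = sum(1 for _, ok in outcome.values() if not ok)
        return failures, len(outcome)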
>> On the language detection - actually
>> Kolkus and Rehurek published a work in 2009 that handles small amounts
>> of text really really well (n-gram based approaches /suck at this/)
>> and there's a Java implementation I've been playing with. Want me to
>> run it across some search strings and we can look at the results? Or
>> just send the code across.
>
> If you ask I'd say both! ;)
>
> We evaluated this kind of dictionary-based language detection (but not
> this one specifically), the problem for us was mostly due to performance: it
> takes time to tokenize the input string correctly and the dictionary we used
> was rather big. But we worked mainly on large content (webnews, press
> articles).
> In our case input strings should be very small so it makes more sense. We
> should be able to train the dictionary against the "all titles in ns0" dumps
> though.
>
> This is also a great example to explain why I feel stuck sometimes:
> How will you be able to test it?
> - I'm not allowed to download search logs locally.
> - I think I won't be able to install java and play with this kind of
> tools on fluorine.
Ahh, but! You're NDAd, your laptop is a work laptop, and you have FDE,
right? If yes to all three, I don't see a problem with me squirting
you a sample of logs (and the Java). I figure if we find the
methodology works we can look at speedups to the code, which is a lot
easier a task than looking at fast code and trying to improve the
methodology.
> Another point:
> concerning the following tasks described below, I think it overlaps
> analytics tasks (because it's mainly related to learning from search logs).
> I don't know how you work today and maybe this is something you've already
> done or is obviously wrong.
> I think you're one of the best people today to help us sort this out, so
> your feedback concerning the following lines will be greatly appreciated :)
Thanks!
Yes! Okay, thoughts on the below:
1. Build a search log parser - we sort of have that through the
streaming python script. It depends whether you mean a literal parser
or something to pick out all the "important" bits. See point 4.
2. Big machine: I'd love this. But see point 4.
3. Improve search logs for us: when we say improve for us do we mean
for analytics/improvements purposes? Because if so we've been talking
about having the logs in HDFS which would make things pretty easy for
all and sundry and avoid the need for a parser.
One way of neatly handling all of this would be:
1. Get the logs in a format that has the fields we want and stream it
into Hadoop. No parser necessary.
2. Stick the big-ass machine in the analytics cluster, where it has
default access to Hadoop and can grab data trivially, but doesn't have
to break anyone else's stuff.
3. Fin.
What am I missing? Other than "setting up a MediaWiki kafka client is
going to be kind of a bit of work".
>>>
>>> On 22/07/2015 14:38, David Causse wrote:
>>>>
>>>> It's still not very clear in my mind but things could look like:
>>>>
>>>> * Epic: Build a toolbox to learn from search logs
>>>> - Create a script to run search queries against the production
>>>> index
>>>> - Build a search log parser that provides all the needed details: time,
>>>> search type, wiki origin, target search index, search query, search query
>>>> ID, number of results, offset of the results (search page)
>>>>   (side note: Erik, will it be possible to pass the queryID from
>>>>   page to page when the user clicks "next page"?)
>>>> - Have a decent machine (64g RAM would be great) in the production
>>>> cluster where we can
>>>>     - download production search logs
>>>>     - install the tools we want
>>>>     - stress it, not being afraid to kill it
>>>>     - do all the stuff we want to learn from data and search logs
>>>>
>>>> * Epic: Improve search logs for us
>>>> - Add an "incognito parameter" to cirrus that could be
used by
the
>>>> toolbox script not to pollute our
search logs when running our
"search
>>>> script".
>>>> - Add a log when the user clicks on a search result to have a mapping
>>>> between the queryID, the result chosen and the offset of the chosen link
>>>> in the result list.
>>>> - This task is certainly complex and highly depends on the client;
>>>> I don't know if we will be able to track this down on all clients but
>>>> it'd be great for us.
>>>> - More things will be added as we learn
>>>>
>>>> * Epic: start to measure and control relevance
>>>> - Create a corpus of search queries for each wiki with their
>>>> expected
>>>> results
>>>> - Run these queries weekly/monthly and compute the F1-Score for
>>>> each
>>>> wiki
>>>> - Continuously enhance the search queries corpus
>>>> - Provide a weekly/monthly perf score for each wiki
>>>>
>>>> As you can see this is mostly about tools; I propose to start with batch
>>>> tools and think later about how we could make this more real-time.
>>
>>
--
Oliver Keyes
Research Analyst
Wikimedia Foundation
--
Oliver Keyes
Research Analyst
Wikimedia Foundation
_______________________________________________
Wikimedia-search mailing list
Wikimedia-search(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikimedia-search