[Wikimedia-search] Page rank - Discovery

21 Sep 2015

Late last week while looking over our existing scoring methods i was
thinking that while counting incoming links is nice, a couple guys
dominated search with (among other things) a better way to judge the
quality of incoming links, aka PageRank.

PageRank takes a very simple input, it just needs a list of all links
between pages. We happen to already store all of these in elasticsearch. I
wrote a few scripts to suck out the full enwiki graph (~400M edges), ship
it over to stat1002, throw it into hadoop, and crunch it with a few hundred
cores. The end result is a score for every NS_MAIN page in enwiki based on
the quality of incoming links.

I've taken these calculated pagerank's and used them as the scoring method
for search-as-you-type for http://en-suggesty.wmflabs.org.

Overall this seems promising as another scoring metric to integrate to our
search results. Not sure yet how to figure out things like how much weight
does pagerank have in the score? This might be yet another thing where
building out our relevance lab would enable us to make more informed
decisions.

Overall i think some sort of pipeline from hadoop into our scoring system
could be quite useful.  The initial idea seems to be to crunch data in
hadoop, stuff it into a read-only api, and then query it back out at
indexing time in elasticsearch to be held within the ES docs. I'm not sure
what the best way will be, but having a simple and repeatable way to
calculate scoring info in hadoop and ship that into ES will probably become
more and more important.