Sorry, got myself confused here. PageRank and page views are different
concepts; I was thinking about a related task
<https://phabricator.wikimedia.org/T113439> to use page views to improve
scoring when writing this.
Thanks,
Dan
On 24 September 2015 at 20:22, Dan Garry <dgarry(a)wikimedia.org> wrote:
Thanks for the summary, Erik! It sounds very promising to me, and it
seems logical that we should use page views to affect the weight of the
results. But, of course, we should be careful that we don't weight the
page views so highly that we end up pushing the system into criticality
and creating a positive feedback loop, where random fluctuations in page
views push up irrelevant results in the scoring, which gets them more
page views, which pushes them up further, and so on.
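One common way to guard against that kind of runaway loop is to dampen
the page-view signal, e.g. with a log scale, so random traffic swings
barely move the ranking. A minimal sketch (the blending formula and the
0.2 weight here are illustrative assumptions, not anything we've decided
on):

```python
import math

def combined_score(relevance, page_views, view_weight=0.2):
    """Blend a text-relevance score with a page-view signal.

    Log-scaling the views means a page needs roughly 10x more traffic
    for each equal bump in score, so small random fluctuations barely
    move the ranking and can't easily feed back on themselves.
    (Illustrative only; the formula and weight are assumptions.)
    """
    return relevance * (1.0 + view_weight * math.log10(1 + page_views))

# A 10% random swing in views changes the final score by well under 1%
base = combined_score(1.0, 100_000)
bumped = combined_score(1.0, 110_000)
```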
Dan
On 21 September 2015 at 08:07, Erik Bernhardson <
ebernhardson(a)wikimedia.org> wrote:
Late last week, while looking over our existing scoring methods, I was
thinking that while counting incoming links is nice, a couple of guys
dominated search with (among other things) a better way to judge the
quality of incoming links, a.k.a. PageRank.
PageRank takes a very simple input: it just needs a list of all links
between pages. We happen to already store all of these in elasticsearch. I
wrote a few scripts to suck out the full enwiki graph (~400M edges), ship
it over to stat1002, throw it into hadoop, and crunch it with a few hundred
cores. The end result is a score for every NS_MAIN page in enwiki based on
the quality of its incoming links.
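For anyone who hasn't seen it in the small, the computation is just
power iteration over that edge list. A self-contained sketch (the
damping factor and iteration count are the usual textbook defaults, not
what actually ran on the cluster):

```python
from collections import defaultdict

def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = set()
    outlinks = defaultdict(list)
    for src, dst in edges:
        nodes.update((src, dst))
        outlinks[src].append(dst)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        # Each page splits its current rank among its outgoing links
        for src, dsts in outlinks.items():
            share = damping * rank[src] / len(dsts)
            for dst in dsts:
                new_rank[dst] += share
        # Rank held by dangling pages (no outlinks) is spread evenly
        dangling = damping * sum(
            rank[node] for node in nodes if node not in outlinks
        )
        for node in nodes:
            new_rank[node] += dangling / n
        rank = new_rank
    return rank

# Toy graph: C is linked from both A and B, so it ranks highest
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]
ranks = pagerank(edges)
```

The real job is the same idea distributed across those few hundred
cores, with the edge list partitioned instead of held in one dict.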
I've taken these calculated PageRanks and used them as the scoring
method for search-as-you-type for
http://en-suggesty.wmflabs.org.
Overall this seems promising as another scoring metric to integrate into
our search results. I'm not sure yet how to figure out things like how
much weight PageRank should have in the score. This might be yet another
thing where building out our relevance lab would enable us to make more
informed decisions.
Overall I think some sort of pipeline from hadoop into our scoring system
could be quite useful. The initial idea is to crunch data in hadoop,
stuff it into a read-only API, and then query it back out at indexing
time in elasticsearch so it's held within the ES docs. I'm not sure
what the best way will be, but having a simple and repeatable way to
calculate scoring info in hadoop and ship that into ES will probably
become more and more important.
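Once a score like that is sitting in the doc, using it at query time
could look something like a function_score query. A sketch, assuming a
per-document field named "pagerank" (the field name and boost choices
are assumptions; field_value_factor and the log1p modifier are standard
elasticsearch function_score options):

```python
def boosted_query(user_query):
    """Build a function_score query that multiplies the text-relevance
    score by a dampened per-document pagerank value.

    The "pagerank" field name is hypothetical; it would be whatever the
    hadoop pipeline writes into the doc at indexing time.
    """
    return {
        "query": {
            "function_score": {
                "query": {"match": {"title": user_query}},
                "field_value_factor": {
                    "field": "pagerank",   # stored in the doc at index time
                    "modifier": "log1p",   # dampen large values
                    "missing": 0,          # pages the pipeline never scored
                },
                "boost_mode": "multiply",  # text score * pagerank factor
            }
        }
    }

q = boosted_query("main page")
```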
_______________________________________________
Wikimedia-search mailing list
Wikimedia-search(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikimedia-search
--
Dan Garry
Lead Product Manager, Discovery
Wikimedia Foundation