Recently I've been investigating how we can collect enough data to plausibly train an ML model for search re-ranking. As with all ML training, the labeled dataset to train against is an important piece. Many approaches seem to use human-labeled relevance judgements, and we have a platform for collecting that data which has proven to have decent predictive power in offline tests of changes to our search. But it simply doesn't produce the volume of data needed to train ML models.

In my research I've come across the paper "A Dynamic Bayesian Network Click Model for Web Search Ranking"[1] and a related implementation[2] that seem promising. Machine-generated relevance labels are attractive because I can collect a reasonable amount of information about clickthroughs and the search results that were shown to users.
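To make sure I understand the basic idea, here is a toy sketch of the simplified DBN special case (gamma = 1), which has closed-form estimates rather than the paper's full EM procedure. The session format and function name are my own invention, not the linked implementation's API:

```python
from collections import defaultdict

def sdbn_relevance(sessions):
    """Estimate relevance labels from click logs with simplified DBN (gamma=1).

    Each session is (query, [result_page_ids in rank order], set_of_clicked_ids).
    Returns {(query, page_id): attractiveness * satisfaction}.
    """
    views = defaultdict(int)        # shown at or above the last-clicked rank
    clicks = defaultdict(int)       # clicked
    last_clicks = defaultdict(int)  # was the lowest-ranked clicked result
    for query, results, clicked in sessions:
        clicked_ranks = [i for i, pid in enumerate(results) if pid in clicked]
        if not clicked_ranks:
            continue  # sDBN only learns from sessions with at least one click
        last = max(clicked_ranks)
        for pid in results[: last + 1]:
            views[(query, pid)] += 1
            if pid in clicked:
                clicks[(query, pid)] += 1
        last_clicks[(query, results[last])] += 1

    labels = {}
    for key, v in views.items():
        a = clicks[key] / v  # attractiveness a_u: P(click | examined)
        # satisfaction s_u: P(satisfied | clicked)
        s = last_clicks[key] / clicks[key] if clicks[key] else 0.0
        labels[key] = a * s  # relevance target ~ a_u * s_u
    return labels
```

With gamma = 1 every result at or above the last click is assumed examined, which is what makes the counting above valid; the full model in the paper infers examination probabilistically.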

For one week of enwiki traffic I have ~20k queries that were each issued by more than 10 identities (an identity is roughly a distinct search session). This works out to around 135k distinct (query, identity) pairs, 140k distinct (query, identity, click page id) pairs, and 414k distinct (query, result page id) pairs, covering ~3M results (~20 per page) that were shown to users and could be converted into relevance judgements. I'm not sure which to train the final model on, though: the 414k distinct (query, result_page_id) pairs, or the full ~3M impressions, which duplicate pairs from the 414k whenever the same (query, result_page_id) was shown more than once.
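One observation on the two options, sketched with hypothetical toy data (not my actual pipeline): for most loss functions, keeping the duplicates is equivalent to training on the distinct pairs with a per-pair sample weight equal to the impression count, so the choice may reduce to whether frequently-shown pairs should count more.

```python
from collections import Counter

# One row per time a (query, result_page_id) was shown to a user.
# Toy values; the real data is ~3M rows collapsing to ~414k pairs.
impressions = [
    ("foo", 101), ("foo", 101), ("foo", 202),
    ("bar", 101),
]

# Option 1: train on distinct pairs only.
distinct_pairs = set(impressions)

# Option 2: keep duplicates -- or equivalently, collapse them into
# per-pair counts and pass those as sample weights to the learner,
# which most training libraries support.
pair_weights = Counter(impressions)

print(len(distinct_pairs))         # distinct (query, page_id) pairs
print(pair_weights[("foo", 101)])  # impression count for one pair
```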


I was also curious about a part of the paper's appendix, labeled Confidence. It states:

Remember that the latent variables au and su will later be used as targets for learning a ranking function. It is thus important to know the confidence associated with these values

Why is it important to know the confidence, and how does it play into training a model? This is probably basic ML stuff, but I'm new to all of this.
 
And finally, are there better ways of generating relevance labels from clickthrough data, ideally with open-source implementations? This is just something I happened to stumble upon in my research, and it's certainly not the only approach out there.

[1] http://www2009.eprints.org/1/1/p1.pdf
[2] https://github.com/varepsilon/clickmodels