To follow up a little here, I implemented Krippendorff's Alpha and ran it against all the data we currently have in Discernatron. The distribution looks something like this:

constraint               count
alpha >= 0.80               11
0.667 <= alpha < 0.80       18
0.500 <= alpha < 0.667      20
0.333 <= alpha < 0.500      26
0 <= alpha < 0.333          43
alpha < 0                   31

This is a much lower level of agreement than I was expecting. The literature suggests 0.80 as the cutoff for reliable data, and 0.667 as the cutoff above which you can still draw tentative conclusions. An alpha below 0 indicates less agreement than random chance, which suggests we need to re-evaluate the instructions to make them clearer (probably true).
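
For anyone curious, here's a minimal sketch of the calculation (interval metric, missing ratings simply omitted); it's not necessarily identical to what's running in Discernatron:

    from collections import Counter
    from itertools import permutations

    # Minimal sketch of Krippendorff's alpha with the interval metric
    # (squared difference). "units" is a list of lists of ratings, e.g. one
    # inner list per search result; a grader who skipped a result simply
    # contributes no value, which is how missing data is handled.
    def krippendorff_alpha(units, delta=lambda a, b: (a - b) ** 2):
        units = [u for u in units if len(u) >= 2]   # only pairable units count
        n = sum(len(u) for u in units)              # total pairable values
        if n <= 1:
            return None

        # Observed disagreement: average distance over all ordered pairs of
        # values drawn from the same unit (the coincidence matrix).
        d_o = sum(sum(delta(a, b) for a, b in permutations(u, 2)) / (len(u) - 1)
                  for u in units) / n

        # Expected disagreement: the same average if values were shuffled
        # across units, based on the overall value frequencies.
        counts = Counter(v for u in units for v in u)
        d_e = sum(counts[a] * counts[b] * delta(a, b)
                  for a in counts for b in counts) / (n * (n - 1))

        return 1.0 - d_o / d_e if d_e > 0 else 1.0

If a per-query alpha is computed this way, each unit would be one search result with whatever scores its judges gave it, e.g. krippendorff_alpha([[2, 3], [0, 0], [1, 3]]).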


On Thu, Oct 27, 2016 at 7:51 AM, Erik Bernhardson <ebernhardson@wikimedia.org> wrote:

Thanks for the links! This is exactly what I was looking for. After reviewing some of the options I'm going to do a first try with Krippendorff's Alpha. Its ability to handle missing data from some graders, as well as being applicable down to n=2, seems promising.


On Oct 26, 2016 11:37 AM, "Justin Ormont" <justin.ormont@gmail.com> wrote:

On Wed, Oct 26, 2016 at 11:31 AM, Jonathan Morgan <jmorgan@wikimedia.org> wrote:
Disclaimer: I'm not a math nerd, and I don't know the history of Discernatron very well. 

...but re: your second specialized concern, have you considered running some more sophisticated inter-rater reliability statistics to get a better sense of the degree of disagreement (controlling for random chance)? See for example: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3402032/
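
For exactly two raters, Cohen's kappa is one of the simpler chance-corrected statistics; a rough sketch for categorical labels (the names here are just for illustration, and it assumes both judges scored the same items):

    from collections import Counter

    # Rough sketch of (unweighted) Cohen's kappa for two raters who scored
    # the same items: observed agreement corrected for chance agreement.
    def cohens_kappa(ratings_a, ratings_b):
        n = len(ratings_a)
        observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
        freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
        expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
        return (observed - expected) / (1 - expected) if expected < 1 else 1.0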

- Jonathan

On Wed, Oct 26, 2016 at 11:21 AM, Erik Bernhardson <ebernhardson@wikimedia.org> wrote:
For a little backstory: in Discernatron multiple judges provide scores from 0 to 3 for search results. Typically we only request that a single query be reviewed by two judges. We would like to measure the level of disagreement between those two judges and, if it crosses some threshold, get two more scores so we can then measure disagreement within the group of four. Somehow, though, we need to define how to measure that level of disagreement and what the threshold for needing more scores is.

Some specialized concerns:
* It is probably important to include not just that the users gave different values, but also how far apart they are. The difference between a 3 and a 2 is much smaller than between a 2 and a 0.
* If the users agree that 80% of the results are 0 but disagree on the last 20%, the average disagreement is low, yet that disagreement is probably still important. It might be worthwhile to remove the agreed-upon irrelevant results before calculating disagreement (a rough sketch of both ideas follows this list). Not sure...
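
A toy sketch of the kind of per-query measure these two bullets describe (mean absolute score gap, with agreed-upon zeros optionally dropped); the names and the cutoff value are made up:

    # Toy sketch: disagreement between two judges on one query as the mean
    # absolute score gap, optionally ignoring results both judges scored 0
    # so easy agreement on irrelevant results doesn't wash out the signal.
    def disagreement(scores_a, scores_b, ignore_agreed_zeros=True):
        pairs = list(zip(scores_a, scores_b))
        if ignore_agreed_zeros:
            pairs = [(a, b) for a, b in pairs if a != 0 or b != 0]
        if not pairs:
            return 0.0
        return sum(abs(a - b) for a, b in pairs) / len(pairs)

    # Hypothetical policy: request two more judges above some cutoff.
    # needs_more_scores = disagreement(judge1_scores, judge2_scores) > 1.0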

I know we have a few math nerds here on the list, so I'm hoping someone has a few ideas.





--
Jonathan T. Morgan
Senior Design Researcher
Wikimedia Foundation





_______________________________________________
discovery mailing list
discovery@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/discovery