To follow up a little here: I implemented Krippendorff's Alpha and ran it
against all the data we currently have in Discernatron. The distribution
looks something like this:
alpha range              count
alpha >= 0.80               11
0.667 <= alpha < 0.80       18
0.500 <= alpha < 0.667      20
0.333 <= alpha < 0.500      26
0 <= alpha < 0.333          43
alpha < 0                   31
This is a much lower level of agreement than I was expecting. The
literature suggests 0.80 as the cutoff for reliable data, and 0.667 as a
cutoff from which you can draw tentative conclusions. An alpha below 0
indicates less agreement than random chance, which suggests we need to
re-evaluate the instructions to make them clearer (probably true).
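For reference, this is roughly the shape of the calculation. The sketch
below is just the textbook interval-metric formula (alpha = 1 - D_o / D_e),
not the exact code running against Discernatron, and the ratings in the
usage example at the bottom are made up: each row is one (query, result)
unit, columns are graders, None where a grader skipped that result.

    import itertools

    def krippendorff_alpha_interval(units):
        """Krippendorff's alpha for interval-scaled scores (1 - D_o / D_e).

        units: one list of scores per (query, result) pair, with None where
        a grader did not score that result. Units with fewer than two
        scores are dropped, which is how the metric copes with missing data.
        """
        # Keep only pairable units (at least two non-missing scores).
        pairable = [[v for v in u if v is not None] for u in units]
        pairable = [u for u in pairable if len(u) >= 2]

        n = sum(len(u) for u in pairable)  # total number of pairable scores
        if n <= 1:
            return None

        # Observed disagreement: squared differences between scores within
        # the same unit, each unit weighted by 1 / (m_u - 1).
        d_o = sum(
            sum((a - b) ** 2 for a, b in itertools.permutations(u, 2))
            / (len(u) - 1)
            for u in pairable
        ) / n

        # Expected disagreement: squared differences between any two scores
        # pooled across all units, i.e. what chance pairing would produce.
        pooled = [v for u in pairable for v in u]
        d_e = sum(
            (a - b) ** 2 for a, b in itertools.permutations(pooled, 2)
        ) / (n * (n - 1))

        # If every pooled score is identical, alpha is strictly undefined;
        # treat that degenerate case as perfect agreement.
        return 1.0 if d_e == 0 else 1.0 - d_o / d_e

    # Made-up example: four results for one query, scores 0-3, and a third
    # grader who only scored some of them.
    ratings = [
        [3, 2, None],
        [0, 0, 1],
        [2, None, 3],
        [1, 0, None],
    ]
    print(krippendorff_alpha_interval(ratings))

The same per-query number could eventually drive the "get two more scores"
threshold discussed below, e.g. re-queue a query whenever its alpha falls
under 0.667, but that is just one option.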
On Thu, Oct 27, 2016 at 7:51 AM, Erik Bernhardson <
ebernhardson(a)wikimedia.org> wrote:
Thanks for the links! This is exactly what I was
looking for. After
reviewing some of the options I'm going to do a first try with
Krippendorff's Alpha. Its ability to handle missing data from some graders,
as well as being applicable down to n=2, seems promising.
On Oct 26, 2016 11:37 AM, "Justin Ormont" <justin.ormont(a)gmail.com>
wrote:
> You're in the area of:
> https://en.wikipedia.org/wiki/Inter-rater_reliability
>
> --justin
>
> On Wed, Oct 26, 2016 at 11:31 AM, Jonathan Morgan <jmorgan(a)wikimedia.org>
> wrote:
>
>> Disclaimer: I'm not a math nerd, and I don't know the history of
>> Discernatron very well.
>>
>> ...but re: your second specialized concern, have you considered running
>> some more sophisticated inter-rater reliability statistics to get a better
>> sense of the degree of disagreement (controlling for random chance)? See,
>> for example:
>> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3402032/
>>
>> - Jonathan
>>
>> On Wed, Oct 26, 2016 at 11:21 AM, Erik Bernhardson <
>> ebernhardson(a)wikimedia.org> wrote:
>>
>>> For a little backstory, in Discernatron multiple judges provide scores
>>> from 0 to 3 for results. Typically we only request a single query to be
>>> reviewed by two judges. We would like to measure the level of disagreement
>>> between these two judges, and if it crosses some threshold get two more
>>> scores, so we can then measure disagreement in the group of 4. Somehow
>>> though, we need to define how to measure that level of disagreement and
>>> what the threshold for needing more scores is.
>>>
>>> Some specialized concerns:
>>> * It is probably important to include not just that the users gave
>>> different values, but also how far apart they are. The difference between a
>>> 3 and a 2 is much smaller than between a 2 and a 0.
>>> * If the users agree that 80% of the results are all 0, but disagree on
>>> the last 20%, even though the average disagreement is low, it's probably
>>> still important? Might be worthwhile to take all the agreements about
>>> irrelevant results and remove them before calculating disagreement? Not
>>> sure...
>>>
>>> I know we have a few math nerds here on the list, so hoping someone has
>>> a few ideas.
>>>
>>
>>
>> --
>> Jonathan T. Morgan
>> Senior Design Researcher
>> Wikimedia Foundation
>> User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
>>
>>