微信公众号搜"智元新知"关注
微信扫一扫可直接关注哦!

计算多标签注释的 Krippendorff Alpha

如何解决计算多标签注释的 Krippendorff Alpha

如何计算多标签注释的 Krippendorff Alpha? 在多类注释的情况下(假设 3 个编码员用 3 个标签注释了 4 个文本:a、b、c),我首先构建可靠性数据矩阵,然后是巧合,然后根据巧合我可以计算 Alpha:

enter image description here

问题是如何在多标签分类问题(如下例)的情况下准备巧合并计算 alpha?

enter image description here

Python 实现甚至 excel 将不胜感激。

解决方法

在寻找类似信息时遇到了您的问题。我们使用了以下代码,其中 nltk.agreement 表示指标,pandas_ods_reader 从 LibreOffice 电子表格中读取数据。我们的数据有两个注释者,对于某些项目可以有两个标签(例如,一个编码员只注释了一个标签,另一个编码员注释了两个标签)。

下面的电子表格屏幕截图显示了输入数据的结构。注释项的列称为annotItems,注释列称为coder1coder2。有多个标签时的分隔符是管道,与示例中的逗号不同。

代码的灵感来自这篇 SO 帖子:Low alpha for NLTK agreement using MASI distance

[Spreadsheet screencap]

from nltk import agreement
from nltk.metrics.distance import masi_distance
from nltk.metrics.distance import jaccard_distance

import pandas_ods_reader as pdreader

annotfile = "test-iaa-so.ods"

df = pdreader.read_ods(annotfile,"Sheet1")

annots = []


def create_annot(an):
    """
    Create frozensets with the unique label
    or with both labels splitting on pipe.
    Unique label has to go in a list so that
    frozenset does not split it into characters.
    """
    if "|" in str(an):
        an = frozenset(an.split("|"))
    else:
        # single label has to go in a list
        # need to cast or not depends on your data
        an = frozenset([str(int(an))])
    return an


for idx,row in df.iterrows():
    annot_id = row.annotItem + str.zfill(str(idx),3)
    annot_coder1 = ['coder1',annot_id,create_annot(row.coder1)]
    annot_coder2 = ['coder2',create_annot(row.coder2)]
    annots.append(annot_coder1)
    annots.append(annot_coder2)

# based on https://stackoverflow.com/questions/45741934/
jaccard_task = agreement.AnnotationTask(distance=jaccard_distance)
masi_task = agreement.AnnotationTask(distance=masi_distance)
tasks = [jaccard_task,masi_task]
for task in tasks:
    task.load_array(annots)
    print("Statistics for dataset using {}".format(task.distance))
    print("C: {}\nI: {}\nK: {}".format(task.C,task.I,task.K))
    print("Pi: {}".format(task.pi()))
    print("Kappa: {}".format(task.kappa()))
    print("Multi-Kappa: {}".format(task.multi_kappa()))
    print("Alpha: {}".format(task.alpha()))

对于从此答案链接的屏幕截图中的数据,将打印:

Statistics for dataset using <function jaccard_distance at 0x7fa1464b6050>
C: {'coder1','coder2'}
I: {'item3002','item1000','item6005','item5004','item2001','item4003'}
K: {frozenset({'1'}),frozenset({'0'}),frozenset({'0','1'})}
Pi: 0.1818181818181818
Kappa: 0.35714285714285715
Multi-Kappa: 0.35714285714285715
Alpha: 0.02941176470588236

Statistics for dataset using <function masi_distance at 0x7fa1464b60e0>
C: {'coder1','1'})}
Pi: 0.09181818181818181
Kappa: 0.2864285714285714
Multi-Kappa: 0.2864285714285714
Alpha: 0.017962466487935425

版权声明:本文内容由互联网用户自发贡献,该文观点与技术仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 dio@foxmail.com 举报,一经查实,本站将立刻删除。