How do I get accuracy, recall, precision, etc. for a document deduplication task when the result labels have variable length?

I am using LSH (Locality Sensitive Hashing) to solve a deduplication problem.

I know the true duplicates for each document. Let's assume we have something like:

{'group1': [text1,text2,text3],'group2': [text4],'group3': [text5,text6],}
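
To make the ground truth comparable with per-document results, the groups can be expanded into one set of expected duplicates per document. This is only a sketch: the helper name `expected_duplicates` is made up for illustration, and the texts are represented by their string labels.

```python
from typing import Dict, List, Set


def expected_duplicates(groups: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map every document to the set of its true duplicates (itself excluded)."""
    expected = {}
    for members in groups.values():
        for doc in members:
            expected[doc] = set(members) - {doc}
    return expected


groups = {
    "group1": ["text1", "text2", "text3"],
    "group2": ["text4"],
    "group3": ["text5", "text6"],
}
truth = expected_duplicates(groups)
print(truth["text1"])  # the two other members of group1
print(truth["text4"])  # empty set: text4 has no duplicates
```

A document that is alone in its group, like text4, ends up with an empty set, which is why the label length varies per document.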

From this I can easily derive the duplicates: text1 is a duplicate of text2 and text3, and vice versa. Using the Snapy Python LSH package, I get results like:

{'text1': ['text1','text5'],'text2': [],
'text4': ['text6']}

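As you can see, there are false positives (in text1) and false negatives (in text2). How can I define a metric here that gives me a single number that goes up or down, so that I can evaluate the algorithm during hyperparameter tuning?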
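
To make the false positives and false negatives concrete, here is a rough sketch that counts them per document with set operations. The function name `confusion_counts` is hypothetical, the inputs are hand-written dicts in the shape shown above, and dropping a document's match with itself is an assumption on my part.

```python
def confusion_counts(truth, predicted):
    """Return {doc: (TP, FP, FN)} for every document that has a prediction."""
    counts = {}
    for doc, pred in predicted.items():
        pred_set = set(pred) - {doc}                # assumption: ignore self-matches
        true_set = set(truth.get(doc, ())) - {doc}
        tp = len(pred_set & true_set)               # predicted and really a duplicate
        fp = len(pred_set - true_set)               # predicted but not a duplicate
        fn = len(true_set - pred_set)               # real duplicate that was missed
        counts[doc] = (tp, fp, fn)
    return counts


truth = {"text1": {"text2", "text3"}, "text2": {"text1", "text3"}, "text4": set()}
predicted = {"text1": ["text1", "text5"], "text2": [], "text4": ["text6"]}
counts = confusion_counts(truth, predicted)
print(counts)
# {'text1': (0, 1, 2), 'text2': (0, 0, 2), 'text4': (0, 1, 0)}
```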

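I can't use Top-K directly, because some instances have only 1 duplicate and others have more than N. The results also contain empty lists, false positives (FP), and false negatives (FN). I can't even use sklearn.metrics.accuracy_score or similar, because each instance can have more than one duplicate.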

Any help finding a metric for this is appreciated.

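Update: I came up with the following function. Could someone tell me whether this is a valid metric and, if so, how I can combine these 3 numbers into a single number?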

def metric(actual: set, predicted: set):
    """Compare the true and predicted duplicate sets of one document."""
    len_true = len(actual)
    len_pred = len(predicted)

    common = actual.intersection(predicted)
    TP = len(common)  # in both sets: true positives
    FP = len(predicted.difference(common))  # only in predicted: false positives
    Missing = len(actual.difference(common))  # only in actual: missed duplicates (false negatives)

    # Share of predictions that are correct (this is precision). Between 0 and 1; higher is better.
    TP = TP / len_pred if len_pred > 0 else 0
    # Share of predictions that are wrong. Between 0 and 1; lower is better.
    FP = FP / len_pred if len_pred > 0 else 0
    # Share of true duplicates that were missed (1 - recall). Between 0 and 1; lower is better.
    Missing = Missing / len_true if len_true > 0 else 0  # guard against documents with no duplicates

    return TP, FP, Missing
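
One candidate for a single number, as far as I can tell, would be a micro-averaged F1 score: keep the raw TP, FP, and FN counts per document instead of ratios, sum them over all documents, and compute precision, recall, and F1 once from the totals. The sketch below assumes the counts are already available as `(TP, FP, FN)` tuples; the function name `micro_scores` and the sample counts are invented for illustration.

```python
def micro_scores(counts):
    """Pool per-document (TP, FP, FN) tuples into (precision, recall, F1)."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts for three documents: totals are TP=3, FP=1, FN=3.
sample_counts = [(1, 1, 1), (2, 0, 0), (0, 0, 2)]
precision, recall, f1 = micro_scores(sample_counts)
print(precision, recall, round(f1, 3))  # 0.75 0.5 0.6
```

Pooling the counts before dividing means documents with empty predictions or no true duplicates do not cause a division by zero on their own, and documents with many duplicates weigh more than documents with few. Whether that weighting is what I want is still an open question.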