
Precision as a metric for information retrieval


I am working on information retrieval (i.e., a user searches for something and an algorithm returns the closest hits based on pre-generated embeddings. I have a corpus of all the papers; the idea is that the top-ranked papers it returns are the ones closest to the user's query).

At the moment I have a recall metric. Does it make sense to also compute a precision metric in this setting? I am following this tutorial very closely: https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/semantic-search/semantic_search_quora_hnswlib.py (it does not include precision as a metric either).
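For context, the search function below assumes the corpus has already been encoded and indexed, roughly as in the linked example script. A minimal sketch of that setup (the model name, the placeholder corpus, and the index parameters here are assumptions; the real script additionally caches the embeddings to disk):

from sentence_transformers import SentenceTransformer, util
import hnswlib

# assumed model; any sentence-transformers model would do
model = SentenceTransformer('all-MiniLM-L6-v2')

# placeholder corpus; in my case this is the collection of papers
corpus_sentences = ["paper one abstract", "paper two abstract", "paper three abstract",
                    "paper four abstract", "paper five abstract"]
corpus_embeddings = model.encode(corpus_sentences, convert_to_numpy=True, show_progress_bar=True)

top_k_hits = 3   # number of hits to return per query; must not exceed the corpus size

# build an hnswlib index over the corpus embeddings (cosine space, as in the example)
index = hnswlib.Index(space='cosine', dim=corpus_embeddings.shape[1])
index.init_index(max_elements=len(corpus_embeddings), ef_construction=400, M=64)
index.add_items(corpus_embeddings, list(range(len(corpus_embeddings))))
index.set_ef(50)   # ef should be at least top_k_hits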

def search_queries():
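    # relies on the globals defined above / in the linked script:
    # model, corpus_sentences, corpus_embeddings, index, top_k_hits,
    # and util from sentence_transformers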
    print("Corpus loaded with {} sentences / embeddings".format(len(corpus_sentences)))
    input_query = input("Enter a sentence: ")
    query_embedding = model.encode(input_query)

    # hnswlib's knn_query finds the top_k_hits approximate nearest neighbours;
    # it is a batch query and returns the k closest elements for each query vector
    corpus_ids, distances = index.knn_query(query_embedding, k=top_k_hits)

    # extract corpus ids and scores for the first query
    hits = [{'corpus_id': id, 'score': 1 - score} for id, score in zip(corpus_ids[0], distances[0])]
    hits = sorted(hits,key=lambda x: x['score'],reverse=True)

    print("Input query:",input_query)
    for hit in hits[0:top_k_hits]:
        print("\t{:.2f}\t{}".format(hit['score'],corpus_sentences[hit['corpus_id']]))   
    print("\n")
    

    # Approximate Nearest Neighbor (ANN) is not exact, it might miss entries with high cosine similarity
    # Here, we compute the precision and recall of the ANN results compared to the exact search results
    correct_hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=top_k_hits)[0]
    correct_hits_ids = set([hit['corpus_id'] for hit in correct_hits])  # ids of the exact top-k results (treated as the relevant documents)

    retrieved_hits_ids = set([hit['corpus_id'] for hit in hits])  # ids of the ANN top-k results
    if len(retrieved_hits_ids) != len(correct_hits_ids):
        print("Approximate Nearest Neighbor returned a different number of results than expected")
    
    
    # Precision
    # Fraction of the retrieved documents that are relevant,
    # i.e. correct results / all returned results
    precision = len(retrieved_hits_ids.intersection(correct_hits_ids)) / len(retrieved_hits_ids)
    print("\nPrecision@{}: {:.2f}".format(top_k_hits, precision * 100))
      
        
    # Recall
    # Fraction of the relevant documents that are successfully retrieved,
    # i.e. correct results / all results that should have been returned
    recall = len(retrieved_hits_ids.intersection(correct_hits_ids)) / len(correct_hits_ids)
    print("\nRecall@{}: {:.2f}".format(top_k_hits, recall * 100))

    if recall < 1:
        print("Missing relevant results:")
        for hit in correct_hits[0:top_k_hits]:
            if hit['corpus_id'] not in retrieved_hits_ids:
                print("\t{:.2f}\t{}".format(hit['score'],corpus_sentences[hit['corpus_id']]))
    print("\n\n=======\n")

With the code I wrote above for precision, the precision and recall scores always come out identical, and I don't know how to fix this. Any suggestions or insights would be much appreciated.
