How to compute precision as a metric for information retrieval
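I am working on information retrieval: the user searches for something, and the algorithm returns the closest hits based on pre-computed embeddings. I have a corpus of all the papers, and the idea is that it returns the top papers closest to the user's query.
So far I have a recall metric. Does it make sense to also compute a precision metric in this setting? I am following this tutorial very closely: https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/semantic-search/semantic_search_quora_hnswlib.py (it does not use precision as a metric either).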
from sentence_transformers import util

# Relies on globals defined as in the linked tutorial script:
# model, index, corpus_sentences, corpus_embeddings, top_k_hits
def search_queries():
    print("Corpus loaded with {} sentences / embeddings".format(len(corpus_sentences)))
    input_query = input("Enter a sentence: ")
    query_embedding = model.encode(input_query)

    # hnswlib's knn_query runs a batch query and returns the k closest
    # elements for each element of the query
    corpus_ids, distances = index.knn_query(query_embedding, k=top_k_hits)

    # Extract corpus ids and scores for the first query
    hits = [{'corpus_id': id, 'score': 1 - score} for id, score in zip(corpus_ids[0], distances[0])]
    hits = sorted(hits, key=lambda x: x['score'], reverse=True)

    print("Input query:", input_query)
    for hit in hits[0:top_k_hits]:
        print("\t{:.2f}\t{}".format(hit['score'], corpus_sentences[hit['corpus_id']]))
    print("\n")

    # Approximate Nearest Neighbor (ANN) is not exact, it might miss entries with high cosine similarity
    # Here, we compute the recall and precision of ANN compared to the exact results
    correct_hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=top_k_hits)[0]
    correct_hits_ids = set(hit['corpus_id'] for hit in correct_hits)  # ids treated as relevant (exact top-k)
    retrieved_hits_ids = set(hit['corpus_id'] for hit in hits)  # ids returned by ANN

    if len(retrieved_hits_ids) != len(correct_hits_ids):
        print("Approximate Nearest Neighbor returned a different number of results than expected")

    # Precision
    # Percentage of returned documents that are relevant:
    # correct results / all returned results
    precision = len(retrieved_hits_ids.intersection(correct_hits_ids)) / len(retrieved_hits_ids)
    print("\nPrecision@{}: {:.2f}".format(top_k_hits, precision * 100))

    # Recall
    # Percentage of relevant documents that are successfully retrieved:
    # correct results / all results that should be returned
    recall = len(retrieved_hits_ids.intersection(correct_hits_ids)) / len(correct_hits_ids)
    print("\nRecall@{}: {:.2f}".format(top_k_hits, recall * 100))

    if recall < 1:
        print("Missing relevant results:")
        for hit in correct_hits[0:top_k_hits]:
            if hit['corpus_id'] not in retrieved_hits_ids:
                print("\t{:.2f}\t{}".format(hit['score'], corpus_sentences[hit['corpus_id']]))
    print("\n\n=======\n")
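With the code above that I wrote for precision, the precision and recall scores are always the same, and I don't know how to fix this. Any suggestions or insights would be greatly appreciated.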
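One likely reason for the identical scores, sketched here under assumptions: the ANN result set and the exact top-k set both contain top_k_hits ids, so the two ratios share the same numerator and the same denominator. The small example below uses a hypothetical helper, precision_recall, and made-up document ids (none of them come from the tutorial). It compares that same-size case with a case where the relevant set comes from independent relevance labels, which is an assumption about what data might be available.

```python
def precision_recall(retrieved_ids, relevant_ids):
    """Return (precision, recall) for one query, given two collections of document ids."""
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    true_positives = len(retrieved & relevant)
    # Guard against empty sets to avoid division by zero
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Case 1: "relevant" is the exact top-k, so both sets hold k = 5 ids (made-up ids).
ann_top_k = [3, 7, 8, 12, 40]
exact_top_k = [3, 7, 8, 12, 19]
same_size = precision_recall(ann_top_k, exact_top_k)

# Case 2: hypothetical human relevance labels whose count does not depend on k.
labeled_relevant = [3, 8, 19]
labeled = precision_recall(ann_top_k, labeled_relevant)

print("same-size sets:", same_size)   # both values are 4/5
print("labeled set:   ", labeled)     # precision 2/5, recall 2/3
```

If this is what is happening, precision only carries information beyond recall when the set of relevant papers is defined independently of k, for example from labeled query/paper pairs.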