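How to find the most similar sentences in Python
I have a dataset of more than 1500 rows. Each row contains one sentence. I am trying to find the best way to identify the most similar sentences among all of them.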
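What I have tried
- I tried the K-means algorithm, which groups similar sentences together. The drawback I found is that K has to be passed in to create the clusters, and K is hard to guess. I tried the elbow method to estimate the number of clusters, but grouping everything together is not sufficient: with this approach every row is assigned to some group, whereas I only want the rows whose similarity is above 0.90, returned by ID.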
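- I tried cosine similarity, where I used TfidfVectorizer to create the matrix and then passed it to cosine similarity. Even this approach did not work properly.
What I am looking for
I am hoping for an approach where I can pass a threshold, for example 0.90, and get back all rows that are similar to each other above 0.90.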
Expected result
Rows that are similar above the threshold (0.90) should be returned by ID.
Data sample
ID | DESCRIPTION
-----------------------------
10 | Cancel ASN WMS Cancel ASN
11 | MAXPREDO Validation is corect
12 | Move to QC
13 | Cancel ASN WMS Cancel ASN
14 | MAXPREDO Validation is right
15 | Verify files are sent every hours for this interface from Optima
16 | MAXPREDO Validation are correct
17 | Move to QC
18 | Verify files are not sent
Solution
Why do cosine similarity and the TF-IDF vectorizer not work for you?
I tried it, and it works with the following code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Rows taken from the data sample in the question.
df = pd.DataFrame(columns=["ID", "DESCRIPTION"], data=[
    [10, "Cancel ASN WMS Cancel ASN"],
    [11, "MAXPREDO Validation is corect"],
    [12, "Move to QC"],
    [13, "Cancel ASN WMS Cancel ASN"],
    [14, "MAXPREDO Validation is right"],
    [15, "Verify files are sent every hours for this interface from Optima"],
    [16, "MAXPREDO Validation are correct"],
    [17, "Move to QC"],
    [18, "Verify files are not sent"],
])
corpus = list(df["DESCRIPTION"].values)

# Build the TF-IDF matrix: one row per sentence.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

threshold = 0.4
# Compare every sentence with every later sentence, so each pair is visited once.
for x in range(0, X.shape[0]):
    for y in range(x + 1, X.shape[0]):
        score = cosine_similarity(X[x], X[y])
        if score > threshold:
            print(df["ID"][x], ":", corpus[x])
            print(df["ID"][y], ":", corpus[y])
            print("Cosine similarity:", score)
            print()
The threshold can be adjusted, but a threshold of 0.9 will not produce the desired result.
The output with a threshold of 0.4 is:
10 : Cancel ASN WMS Cancel ASN
13 : Cancel ASN WMS Cancel ASN
Cosine similarity: [[1.]]
11 : MAXPREDO Validation is corect
14 : MAXPREDO Validation is right
Cosine similarity: [[0.64183024]]
12 : Move to QC
17 : Move to QC
Cosine similarity: [[1.]]
15 : Verify files are sent every hours for this interface from Optima
18 : Verify files are not sent
Cosine similarity: [[0.44897995]]
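With a threshold of 0.39, all of the expected sentence pairs appear in the output, but the additional pair with IDs [15, 18] is returned as well (as it already is at 0.4):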
10 : Cancel ASN WMS Cancel ASN
13 : Cancel ASN WMS Cancel ASN
Cosine similarity: [[1.]]
11 : MAXPREDO Validation is corect
14 : MAXPREDO Validation is right
Cosine similarity: [[0.64183024]]
11 : MAXPREDO Validation is corect
16 : MAXPREDO Validation are correct
Cosine similarity: [[0.39895808]]
12 : Move to QC
17 : Move to QC
Cosine similarity: [[1.]]
14 : MAXPREDO Validation is right
16 : MAXPREDO Validation are correct
Cosine similarity: [[0.39895808]]
15 : Verify files are sent every hours for this interface from Optima
18 : Verify files are not sent
Cosine similarity: [[0.44897995]]
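For 1500 rows, calling cosine_similarity once per pair becomes slow. The same result can be obtained from a single call that returns the full similarity matrix. The following is only a minimal sketch of that idea: the function name `similar_pairs` and its return format are hypothetical, not part of scikit-learn, and it assumes a non-negative threshold.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def similar_pairs(ids, texts, threshold):
    """Return (id_a, id_b, score) for every pair scoring above threshold.

    Hypothetical helper; assumes threshold >= 0.
    """
    X = TfidfVectorizer().fit_transform(texts)
    # A single call yields the full n x n similarity matrix.
    sim = cosine_similarity(X)
    # Keep only the strict upper triangle so each pair is reported once
    # and no sentence is compared with itself.
    rows, cols = np.where(np.triu(sim, k=1) > threshold)
    return [(ids[i], ids[j], float(sim[i, j])) for i, j in zip(rows, cols)]


ids = [10, 11, 12, 13, 14, 15, 16, 17, 18]
texts = [
    "Cancel ASN WMS Cancel ASN",
    "MAXPREDO Validation is corect",
    "Move to QC",
    "Cancel ASN WMS Cancel ASN",
    "MAXPREDO Validation is right",
    "Verify files are sent every hours for this interface from Optima",
    "MAXPREDO Validation are correct",
    "Move to QC",
    "Verify files are not sent",
]

pairs = similar_pairs(ids, texts, threshold=0.9)
for id_a, id_b, score in pairs:
    print(id_a, id_b, round(score, 4))
```

With the sample data and a threshold of 0.9, only the exact duplicates (10 with 13, and 12 with 17) are returned, which matches the outputs shown above: word-level TF-IDF treats "corect", "correct" and "right" as unrelated tokens, so the MAXPREDO rows never score that high.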
Alternatively, you can use this Python 3 library to compute sentence similarity: https://github.com/UKPLab/sentence-transformers
Code example from https://www.sbert.net/docs/usage/semantic_textual_similarity.html:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-MiniLM-L12-v2')

# Two lists of sentences
sentences1 = ['The cat sits outside', 'A man is playing guitar', 'The new movie is awesome']
sentences2 = ['The dog plays in the garden', 'A woman watches TV', 'The new movie is so great']

# Compute embeddings for both lists
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)

# Compute cosine similarities
cosine_scores = util.pytorch_cos_sim(embeddings1, embeddings2)

# Output the pairs with their score
for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[i], cosine_scores[i][i]))
The library contains state-of-the-art sentence embedding models.
See https://stackoverflow.com/a/68728666/395857 to perform sentence clustering.
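Whichever vectors are used, the threshold requirement from the question can also be turned into groups without choosing K: link every pair whose similarity exceeds the threshold and take the connected components of the resulting graph. The sketch below is an assumption-laden illustration rather than the method from the linked answer. It uses TF-IDF vectors so that it runs without downloading a model, and the function name `group_by_threshold` is hypothetical. Sentence embeddings could be substituted for the TF-IDF matrix.

```python
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def group_by_threshold(ids, texts, threshold):
    """Group IDs whose sentences are linked by similarity above threshold.

    Hypothetical helper: TF-IDF is used here only to keep the sketch
    self-contained; any sentence vectors could be used instead.
    """
    sim = cosine_similarity(TfidfVectorizer().fit_transform(texts))
    # Graph with an edge wherever two sentences exceed the threshold.
    adjacency = csr_matrix((sim > threshold).astype(int))
    _, labels = connected_components(adjacency, directed=False)
    groups = {}
    for item_id, label in zip(ids, labels):
        groups.setdefault(int(label), []).append(item_id)
    # Sentences with no similar partner form singleton groups; drop them.
    return [members for members in groups.values() if len(members) > 1]


ids = [10, 11, 12, 13, 14, 15, 16, 17, 18]
texts = [
    "Cancel ASN WMS Cancel ASN",
    "MAXPREDO Validation is corect",
    "Move to QC",
    "Cancel ASN WMS Cancel ASN",
    "MAXPREDO Validation is right",
    "Verify files are sent every hours for this interface from Optima",
    "MAXPREDO Validation are correct",
    "Move to QC",
    "Verify files are not sent",
]

groups = group_by_threshold(ids, texts, threshold=0.9)
print(groups)
```

One design caveat: connected components behave like single-linkage clustering, so two sentences can end up in the same group through a chain of intermediate sentences even if their direct similarity is below the threshold.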