Pandas: calculate overlapping words between rows only if values in another column match
import pandas as pd

# 'intent' needs six entries to match 'Sent' and 'key_words'; the fifth value
# ('order_call') is inferred from the pairings in the answer's sample output.
data = {
    'intent': ['order_food', 'order_food', 'order_taxi', 'order_call', 'order_call', 'order_taxi'],
    'Sent': ['i need hamburger', 'she wants sushi', 'i need a cab', 'call me at 6', 'she called me', 'i would like a new taxi'],
    'key_words': [['need', 'hamburger'], ['want', 'sushi'], ['need', 'cab'], ['call', '6'], ['call'], ['new', 'taxi']],
}
df = pd.DataFrame(data, columns=['intent', 'Sent', 'key_words'])
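I calculated the Jaccard similarity using the code below (not my own solution). As shown here, the function returns the set of shared items rather than a score: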
def lexical_overlap(doc1, doc2):
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)
    intersection = words_doc1.intersection(words_doc2)
    return intersection
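I then modified the code given by @Amit Amola to compare the overlapping words between every pair of rows and build a DataFrame from the results: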
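from itertools import combinations

# Assumption: data_new is not defined in the post. It is taken here to be the
# sentence and keyword columns of df (column 0 = Sent, column 1 = key_words).
data_new = df[['Sent', 'key_words']]

overlapping_word_list = []
for i, j in combinations(range(len(data_new)), 2):
    overlapping_word_list.append(f"the shared keywords between {data_new.iloc[i, 0]} and {data_new.iloc[j, 0]} sentences are: {lexical_overlap(data_new.iloc[i, 1], data_new.iloc[j, 1])}")
# creating an overlap dataframe
banking_overlapping_words_per_sent = pd.DataFrame(overlapping_word_list, columns=['overlapping_list'])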
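Because my dataset is large, running this code to compare all rows takes a very long time. So I would like to compare only sentences that have the same intent and skip pairs with different intents. I am not sure how to go about this.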
Solution
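IIUC, you only need to iterate over the unique values in the intent column and use loc to select the rows that correspond to each one. If an intent has more than two rows, you still need combinations to get the unique pairs within that intent.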
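from itertools import combinations

for intent in df.intent.unique():
    # select the Sent values of the rows that share this intent
    rows = df.loc[df.intent == intent, "Sent"].to_list()
    # every unordered pair of sentences within the intent
    for x, y in combinations(rows, 2):
        # x and y are raw strings here, so set() inside lexical_overlap
        # collects characters rather than words
        overlap = lexical_overlap(x, y)
        print(f"Overlap for ({x}) and ({y}) is {overlap}")

# Output from the answer. These figures match a character-level Jaccard
# percentage (len(intersection) / len(union) * 100), not the set returned by
# the lexical_overlap shown in the question:
# Overlap for (i need hamburger) and (she wants sushi) is 46.666666666666664
# Overlap for (i need a cab) and (i would like a new taxi) is 40.0
# Overlap for (call me at 6) and (she called me) is 54.54545454545454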
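One thing worth checking in the loop above is what gets compared: calling set() on a string yields its characters, not its words. The following is a minimal sketch of the difference, assuming whitespace tokenization is good enough for these sentences; `jaccard_percent` is a hypothetical helper name, not a function from the original post.

```python
def jaccard_percent(items1, items2):
    """Jaccard similarity of two iterables as a percentage (hypothetical helper)."""
    set1, set2 = set(items1), set(items2)
    union = set1 | set2
    if not union:
        # two empty inputs: define the similarity as 0 to avoid dividing by zero
        return 0.0
    return len(set1 & set2) / len(union) * 100


a, b = "i need hamburger", "she wants sushi"

# Passing the strings directly compares sets of characters (spaces included).
char_score = jaccard_percent(a, b)

# Splitting on whitespace first compares sets of words instead.
word_score = jaccard_percent(a.split(), b.split())

print(f"character-level: {char_score:.2f}, word-level: {word_score:.2f}")
# character-level: 46.67, word-level: 0.00
```

The two sentences share 7 of 15 distinct characters but no whole words, so if word overlap is what you want, pass token lists (such as the key_words column) rather than the raw Sent strings.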
OK, so based on @gold_cy's answer I figured out how to get the output I described in the comments:
for intent in df.intent.unique():
    # each selected row becomes a list: [intent, key_words, Sent]
    rows = df.loc[df.intent == intent, ['intent', 'key_words', 'Sent']].values.tolist()
    for x, y in combinations(rows, 2):
        # compare the keyword lists, so the overlap is word-level
        overlap = lexical_overlap(x[1], y[1])
        print(f"Overlap of intent ({x[0]}) for ({x[2]}) and ({y[2]}) is {overlap}")
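Since the question originally collected the overlaps into a DataFrame, the per-intent comparison can also be sketched with groupby and a list of records instead of print. This is only one possible arrangement: the column names `sent_1`, `sent_2`, `overlap` and the variable `overlap_df` are made up for illustration, and the fifth intent value is assumed to be `order_call` so that all three columns have six entries.

```python
from itertools import combinations

import pandas as pd

data = {
    # Assumption: the fifth intent is 'order_call' (the posted list had only five values).
    "intent": ["order_food", "order_food", "order_taxi", "order_call", "order_call", "order_taxi"],
    "Sent": ["i need hamburger", "she wants sushi", "i need a cab",
             "call me at 6", "she called me", "i would like a new taxi"],
    "key_words": [["need", "hamburger"], ["want", "sushi"], ["need", "cab"],
                  ["call", "6"], ["call"], ["new", "taxi"]],
}
df = pd.DataFrame(data)


def lexical_overlap(doc1, doc2):
    # shared items between two token lists
    return set(doc1) & set(doc2)


records = []
# sort=False keeps the intents in order of first appearance
for intent, group in df.groupby("intent", sort=False):
    # pair up rows only within the same intent
    for x, y in combinations(group.itertuples(index=False), 2):
        records.append({
            "intent": intent,
            "sent_1": x.Sent,
            "sent_2": y.Sent,
            "overlap": lexical_overlap(x.key_words, y.key_words),
        })

overlap_df = pd.DataFrame(records, columns=["intent", "sent_1", "sent_2", "overlap"])
print(overlap_df)
```

On this six-row sample the grouped loop evaluates 3 pairs, whereas comparing every row with every other row would evaluate 15. The saving grows with the number of intents, because pairs are only formed inside each group.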