
Comparing strings in two pandas columns

I am trying to determine the similarity between two columns of a pandas DataFrame:

Text1                                                                             All
Performance results achieved by the approaches submitted to this Challenge.       The six top approaches and three others outperform the strong baseline.
Accuracy is one of the basic principles of perfectionist.                             Where am I?

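I want to compare 'Performance results ...' with 'The six...', and 'Accuracy is one...' with 'Where am I?'. The first row should have a relatively high similarity between the two columns, because they share some words. The second row should score 0, because the two columns have no words in common.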

To compare the two columns I am using SequenceMatcher, as follows:

from difflib import SequenceMatcher

ratio = SequenceMatcher(None,df.Text1,df.All).ratio()

But passing df.Text1 and df.All this way seems to be wrong.

Can you tell me why?

Solution

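  • SequenceMatcher is not designed for pandas Series; it expects one pair of sequences (here, two strings) per call.
  • You can .apply the function to the DataFrame row by row instead.
  • SequenceMatcher Examples
    • With isjunk=None, nothing is treated as junk, not even spaces.
    • Using isjunk=lambda y: y == " " treats spaces as junk.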
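
To illustrate the first point, here is a small sketch in which plain Python lists stand in for the two columns (an assumption made so the example needs only the standard library; a pandas Series may differ in detail). When whole columns are passed, each cell counts as a single element, so the call returns one number for the entire column pair rather than one per row. Calling SequenceMatcher once per pair of strings gives the per-row ratios instead.

```python
from difflib import SequenceMatcher

# Plain lists standing in for df.Text1 and df.All (illustrative assumption).
text1 = [
    "Performance results achieved by the approaches submitted to this Challenge.",
    "Accuracy is one of the basic principles of perfectionist.",
]
all_col = [
    "The six top approaches and three others outperform the strong baseline.",
    "Where am I?",
]

# Whole columns: the elements being compared are entire cells.
# No cell in text1 equals a cell in all_col, so the single ratio is 0.0.
column_ratio = SequenceMatcher(None, text1, all_col).ratio()

# One pair of strings per call: the elements are characters,
# and every row gets its own ratio.
row_ratios = [SequenceMatcher(None, a, b).ratio() for a, b in zip(text1, all_col)]

print(column_ratio)
print(row_ratios)
```
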
from difflib import SequenceMatcher
import pandas as pd

data = {'Text1': ['Performance results achieved by the approaches submitted to this Challenge.','Accuracy is one of the basic principles of perfectionist.'],'All': ['The six top approaches and three others outperform the strong baseline.','Where am I?']}

df = pd.DataFrame(data)

# isjunk=lambda y: y == " " (spaces are treated as junk)
df['ratio'] = df[['Text1','All']].apply(lambda x: SequenceMatcher(lambda y: y == " ", x['Text1'], x['All']).ratio(), axis=1)

# display(df)
                                                                         Text1                                                                      All     ratio
0  Performance results achieved by the approaches submitted to this Challenge.  The six top approaches and three others outperform the strong baseline.  0.356164
1                    Accuracy is one of the basic principles of perfectionist.                                                              Where am I?  0.088235

# isjunk=None (nothing is treated as junk)
df['ratio'] = df[['Text1','All']].apply(lambda x: SequenceMatcher(None, x['Text1'], x['All']).ratio(), axis=1)

# display(df)
                                                                         Text1                                                                      All     ratio
0  Performance results achieved by the approaches submitted to this Challenge.  The six top approaches and three others outperform the strong baseline.  0.410959
1                    Accuracy is one of the basic principles of perfectionist.                                                              Where am I?  0.117647
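
The question expects a score of 0 for the second row, but the character-level ratios are non-zero because the two strings still share individual characters. Below is a minimal sketch of one possible way to get word-level behaviour. It assumes that lowercased, whitespace-separated tokens are an acceptable notion of "word" (punctuation stays attached to its token). The helper name word_ratio is hypothetical and not part of the original answer.

```python
from difflib import SequenceMatcher


def word_ratio(a: str, b: str) -> float:
    """Similarity of two strings compared word by word (hypothetical helper).

    Each string is lowercased and split on whitespace, and the resulting
    word lists are passed to SequenceMatcher, so whole words are the
    elements being matched rather than single characters.
    """
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


row1 = word_ratio(
    "Performance results achieved by the approaches submitted to this Challenge.",
    "The six top approaches and three others outperform the strong baseline.",
)
row2 = word_ratio(
    "Accuracy is one of the basic principles of perfectionist.",
    "Where am I?",
)

print(row1)  # greater than 0: "the" and "approaches" appear in both
print(row2)  # 0.0: the two strings share no words
```

With a DataFrame, the same helper could be applied per row in the same way as the character-level version, for example inside the lambda passed to .apply with axis=1.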
