Extracting words from a paragraph that are similar to words in a list


I have the following string:

"The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"

And a list of words to extract:

["town","teddy","chicken","boy went"]

Note: town and teddy are misspelled in the given sentence.

I tried the following, but I get extra words that are not part of the answer:

import difflib

sent = "The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"

list1 = ["town","teddy","chicken","boy went"]

[difflib.get_close_matches(x.lower().strip(),sent.split()) for x in list1]

I get the following result:

[['twn','to'],['tddy'],['chicken.','picked'],['went']]

Instead of:

'twn','tddy','chicken','boy went'

Solution

Note what the documentation says about difflib.get_close_matches():

difflib.get_close_matches(word,possibilities,n=3,cutoff=0.6)

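Return a list of the best "good enough" matches. word is a sequence for which close matches are desired (typically a string), and possibilities is a list of sequences against which to match word (typically a list of strings).

Optional argument n (default 3) is the maximum number of close matches to return; n must be greater than 0.

Optional argument cutoff (default 0.6) is a float in the range [0, 1]. Possibilities that don't score at least that similar to word are ignored.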

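At the moment, you are using the default n and cutoff arguments.

You can specify either one (or both) to narrow down the matches returned.

For example, you could use a cutoff score of 0.75: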

result = [difflib.get_close_matches(x.lower().strip(),sent.split(),cutoff=0.75) for x in list1]
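To see why a stricter cutoff drops the unwanted words, it can help to look at the scores directly. The sketch below assumes that get_close_matches() ranks candidates by SequenceMatcher.ratio(), which is how the difflib documentation describes it; the score() helper is a name made up for this illustration.

```python
from difflib import SequenceMatcher

def score(word, candidate):
    # Hypothetical helper: similarity in [0, 1] between a candidate from the
    # sentence and the word being searched for.
    return SequenceMatcher(None, candidate, word).ratio()

# (word from the list, candidate token from the sentence)
checks = [
    ("town", "twn"),
    ("town", "to"),
    ("chicken", "chicken."),
    ("chicken", "picked"),
    ("boy went", "went"),
]

for word, candidate in checks:
    print(f"{word!r} vs {candidate!r}: {score(word, candidate):.3f}")
# 'town' vs 'twn': 0.857
# 'town' vs 'to': 0.667
# 'chicken' vs 'chicken.': 0.933
# 'chicken' vs 'picked': 0.615
# 'boy went' vs 'went': 0.667
```

With the default cutoff of 0.6, 'to' and 'picked' pass; at 0.75 they are rejected, and so is 'went' as a match for 'boy went'.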

Or you could specify that at most 1 match is returned:

result = [difflib.get_close_matches(x.lower().strip(),sent.split(),n=1) for x in list1]

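In either case, you can use a list comprehension to flatten the list of lists (since difflib.get_close_matches() always returns a list). Skip empty lists: a word with no match above the cutoff yields [], and indexing it would raise an IndexError:

matches = [r[0] for r in result if r]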

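Since you also want to check for close matches of bigrams, you can extract pairs of adjacent "words" and pass them to difflib.get_close_matches() as part of the possibilities argument.

Here is a complete working example: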

import difflib
import re

sent = "The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"

list1 = ["town","teddy","chicken","boy went"]

# this extracts overlapping pairings of "words"
# i.e. ['The boy','boy went','went to','to twn',...
pairs = re.findall(r'(?=(\b[^ ]+ [^ ]+\b))',sent)

# we pass the sent.split() list as before
# and concatenate the new pairs list to the end of it also
result = [difflib.get_close_matches(x.lower().strip(),sent.split() + pairs,n=1) for x in list1]

matches = [r[0] for r in result]

print(matches)
# ['twn','tddy','chicken.','boy went']
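The result above still carries the trailing period in 'chicken.'. One possible variation, shown here as a sketch rather than as part of the original answer, is to tokenize on word characters with re.findall(r"\w+", ...) and build the adjacent pairs with zip(). The names words, pairs and candidates are illustrative.

```python
import difflib
import re

sent = "The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"
list1 = ["town", "teddy", "chicken", "boy went"]

# Keep only runs of word characters, so "chicken." becomes "chicken".
words = re.findall(r"\w+", sent)

# Adjacent word pairs: "The boy", "boy went", "went to", ...
pairs = [" ".join(pair) for pair in zip(words, words[1:])]

candidates = words + pairs

matches = []
for target in list1:
    found = difflib.get_close_matches(target.lower().strip(), candidates, n=1)
    if found:  # skip targets with no candidate above the cutoff
        matches.append(found[0])

print(matches)
# ['twn', 'tddy', 'chicken', 'boy went']
```

Note that this tokenization ignores sentence boundaries, so a pair such as "chicken He" is also among the candidates.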

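If you read the Python documentation for difflib.get_close_matches() (https://docs.python.org/3/library/difflib.html), you will see that it returns a list of the best matches, up to n of them. The method signature is: difflib.get_close_matches(word,possibilities,n=3,cutoff=0.6)

Here n is the maximum number of matches to return, so I think you can pass it as 1.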

>>> list1 = ["town","boy went"]
>>> [difflib.get_close_matches(x.lower().strip(),sent.split(),1)[0] for x in list1]
['twn','went']
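Indexing the result with [0] assumes every word has at least one match. A small wrapper, sketched below under the hypothetical name best_match, returns None instead when nothing clears the cutoff.

```python
import difflib

def best_match(word, possibilities, cutoff=0.6):
    # Hypothetical helper: the single closest match, or None if no
    # candidate scores at least `cutoff`.
    found = difflib.get_close_matches(word.lower().strip(), possibilities, n=1, cutoff=cutoff)
    return found[0] if found else None

sent = "The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"

print(best_match("town", sent.split()))                  # twn
print(best_match("boy went", sent.split(), cutoff=0.75)) # None
```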
