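I have to cross-validate some data based on names.
The problem I'm facing is that, depending on the source, the names vary slightly, for example: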
L & L AIR CONDITIONING vs L & L AIR CONDITIONING Service
BEST ROOFING vs ROOFING INC
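I have a few thousand records, so doing this by hand would be very time-consuming, and I'd like to automate as much of the process as possible.
Lowercasing the names is not enough, because some names contain extra words.
What would be a good algorithm for handling this?
Maybe compute a correlation that gives low weight to words like "INC" or "Service"?
Edit:
I tried the difflib library: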
difflib.SequenceMatcher(None,name_1.lower(),name_2.lower()).ratio()
I'm getting decent results with it.
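As a rough sketch of how the two ideas in the question could be combined, the snippet below strips low-information words before running the same difflib comparison. The NOISE_WORDS set and the helper names normalize and name_ratio are assumptions made for illustration; they are not from the original post, and the word list would need tuning for real data.

```python
import difflib
import re

# Hypothetical list of low-information words; adjust it for your own data.
NOISE_WORDS = {"inc", "llc", "ltd", "co", "corp", "service", "services"}


def normalize(name):
    """Lowercase the name, keep only word tokens, and drop noise words."""
    tokens = re.findall(r"\w+", name.lower())
    return " ".join(t for t in tokens if t not in NOISE_WORDS)


def name_ratio(name_1, name_2):
    """difflib.SequenceMatcher ratio computed on the normalized names."""
    matcher = difflib.SequenceMatcher(None, normalize(name_1), normalize(name_2))
    return matcher.ratio()


print(name_ratio("L & L AIR CONDITIONING", "L & L AIR CONDITIONING Service"))
print(name_ratio("BEST ROOFING", "ROOFING INC"))
```

With this list, the first pair normalizes to the same string, while the second pair still differs by the word "best", so it scores lower and may need a different rule or a manual check.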
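Solution:
I would use cosine similarity to achieve the same thing. It gives you a matching score that indicates how close the two strings are.
The code below should help (I remember taking it from Stack Overflow itself a few months ago, but I can't find the link now):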
import re, math
from collections import Counter

WORD = re.compile(r'\w+')

def get_cosine(vec1, vec2):
    # print(vec1, vec2)
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator

def text_to_vector(text):
    return Counter(WORD.findall(text))

def get_similarity(a, b):
    a = text_to_vector(a.strip().lower())
    b = text_to_vector(b.strip().lower())
    return get_cosine(a, b)

get_similarity('L & L AIR CONDITIONING', 'L & L AIR CONDITIONING Service') # returns 0.9258200997725514
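The question asks about giving words like "INC" or "Service" a lower weight. One way this might be done with the cosine approach is to scale each word count by a per-word weight before computing the score. This is only a sketch: the LOW_WEIGHT table, its values, and the function names weighted_vector and weighted_cosine are assumptions for illustration, not part of the code above.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")

# Hypothetical weights; any word not listed here keeps a weight of 1.0.
LOW_WEIGHT = {"inc": 0.1, "llc": 0.1, "service": 0.1, "services": 0.1}


def weighted_vector(text):
    """Map each word to its count scaled by the word's weight."""
    counts = Counter(WORD.findall(text.strip().lower()))
    return {word: count * LOW_WEIGHT.get(word, 1.0) for word, count in counts.items()}


def weighted_cosine(a, b):
    """Cosine similarity between the weighted word vectors of two strings."""
    v1, v2 = weighted_vector(a), weighted_vector(b)
    numerator = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    denominator = norm1 * norm2
    return numerator / denominator if denominator else 0.0


print(weighted_cosine("L & L AIR CONDITIONING", "L & L AIR CONDITIONING Service"))
```

Because "service" contributes very little to the second vector, this pair scores closer to 1 than the unweighted 0.9258 shown above.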
Another version I found useful is slightly NLP-based; I wrote this one myself.
import re, math
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.corpus import wordnet as wn

# Requires the NLTK 'stopwords' and 'wordnet' corpora to be downloaded.
stop = stopwords.words('english')

WORD = re.compile(r'\w+')
stemmer = PorterStemmer()

def get_cosine(vec1, vec2):
    # print(vec1, vec2)
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator

def text_to_vector(text):
    words = WORD.findall(text)
    a = []
    # Expand each word with the lemma names of its WordNet synsets.
    for i in words:
        for ss in wn.synsets(i):
            a.extend(ss.lemma_names())
    # Keep the original words that WordNet did not cover.
    for i in words:
        if i not in a:
            a.append(i)
    a = set(a)
    # Drop stop words and stem what is left.
    w = [stemmer.stem(i) for i in a if i not in stop]
    return Counter(w)

def get_similarity(a, b):
    a = text_to_vector(a.strip().lower())
    b = text_to_vector(b.strip().lower())
    return get_cosine(a, b)

def get_char_wise_similarity(a, b):
    a = text_to_vector(a.strip().lower())
    b = text_to_vector(b.strip().lower())
    s = []
    # Average the similarity over every pair of terms from the two vectors.
    for i in a:
        for j in b:
            s.append(get_similarity(str(i), str(j)))
    try:
        return sum(s)/float(len(s))
    except ZeroDivisionError: # len(s) == 0
        return 0

get_similarity('I am a good boy', 'I am a very disciplined guy')
# Returns 0.5491201525567068
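You can call get_similarity or get_char_wise_similarity to see which works better for your use case. I used both: the normal similarity to weed out the really close matches, then the character-wise similarity to weed out the ones that were close enough. The remaining ones had to be dealt with manually.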
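The two-pass workflow described above could be sketched roughly as follows. Everything here is an assumption for illustration: the triage function, the cutoff values, and the two stand-in scorers (token_overlap and char_ratio, used only so the snippet runs without NLTK). In practice you would pass in get_similarity and get_char_wise_similarity and pick cutoffs by inspecting your own data.

```python
import difflib


def triage(pairs, primary, secondary, primary_cutoff=0.9, secondary_cutoff=0.75):
    """Split name pairs into auto-matched pairs and pairs left for manual review."""
    matched, manual = [], []
    for a, b in pairs:
        if primary(a, b) >= primary_cutoff:
            matched.append((a, b, "primary"))      # really close
        elif secondary(a, b) >= secondary_cutoff:
            matched.append((a, b, "secondary"))    # close enough
        else:
            manual.append((a, b))                  # needs a human
    return matched, manual


def token_overlap(a, b):
    """Stand-in primary scorer: Jaccard overlap of lowercase word sets."""
    s1, s2 = set(a.lower().split()), set(b.lower().split())
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0


def char_ratio(a, b):
    """Stand-in secondary scorer: difflib ratio on the lowercase strings."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


pairs = [
    ("L & L AIR CONDITIONING", "L & L AIR CONDITIONING Service"),
    ("BEST ROOFING", "ROOFING INC"),
]
matched, manual = triage(pairs, token_overlap, char_ratio)
print(matched)
print(manual)
```

With these stand-in scorers and cutoffs, the first pair is accepted on the second pass and the second pair is left for manual review.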