How to run a fuzzy ratio on a single pandas column
I have a large sample of full names:
datafile.csv:
full_name,dob
Jerry Smith,21/01/2010
Morty Smith,18/06/2008
Rick Sanchez,27/04/1993
Jery Smith,27/12/2012
Morti Smith,13/03/2012
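I am trying to use fuzz.ratio to check whether any of the names in the 'full_name' column are similar to each other, but the code takes a very long time, mainly because of the nested for loops.
Sample code: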
import pandas as pd
from fuzzywuzzy import fuzz

dataframe = pd.read_csv('datafile.csv')
_list = []
for row1 in dataframe['full_name']:
    for row2 in dataframe['full_name']:
        x = fuzz.ratio(row1, row2)
        if x > 90:
            _list.append([row1, row2, x])
print(_list)
Is there a better way to iterate over a single pandas column to get the ratio for potential duplicates?
Thanks, Jim
Solution
You can first build a table of fuzzy scores for every pair of names:
import pandas as pd
from io import StringIO
from fuzzywuzzy import fuzz

data = StringIO("""
Jerry Smith
Morty Smith
Rick Sanchez
Jery Smith
Morti Smith
""")
df = pd.read_csv(data, names=['full_name'])
for index, row in df.iterrows():
    df[row['full_name']] = df['full_name'].apply(lambda x: fuzz.ratio(row['full_name'], x))
print(df.to_string())
Output:
full_name Jerry Smith Morty Smith Rick Sanchez Jery Smith Morti Smith
0 Jerry Smith 100 73 26 95 64
1 Morty Smith 73 100 26 76 91
2 Rick Sanchez 26 26 100 27 35
3 Jery Smith 95 76 27 100 67
4 Morti Smith 64 91 35 67 100
Then find the best matches for a selected name:
data_rows = df[df['Jerry Smith'] > 90]
print(data_rows)
Output:
full_name Jerry Smith Morty Smith Rick Sanchez Jery Smith Morti Smith
0 Jerry Smith 100 73 26 95 64
3 Jery Smith 95 76 27 100 67
import pandas as pd
from io import StringIO
from fuzzywuzzy import process

s = """full_name,dob
Jerry Smith,21/01/2010
Morty Smith,18/06/2008
Rick Sanchez,27/04/1993
Jery Smith,27/12/2012
Morti Smith,13/03/2012"""
df = pd.read_csv(StringIO(s))

# 1 - use fuzzywuzzy.process.extract with list comprehension
# 2 - You still have to iterate once but this method avoids the use of apply, which can be very slow
# 3 - convert the list comprehension results to a dataframe
# Note that I am limiting the results to one match. You can adjust the code as you see fit
df2 = pd.DataFrame(
    [
        # compare row i against every other row (row i itself is excluded)
        process.extract(df['full_name'][i], df[~df.index.isin([i])]['full_name'], limit=1)[0]
        for i in range(len(df))
    ],
    index=df.index,
    columns=['match_name', 'match_percent', 'match_index'],
)

# join the new dataframe to the original
final = df.join(df2)
print(final)
Output:
full_name dob match_name match_percent match_index
0 Jerry Smith 21/01/2010 Jery Smith 95 3
1 Morty Smith 18/06/2008 Morti Smith 91 4
2 Rick Sanchez 27/04/1993 Morti Smith 43 4
3 Jery Smith 27/12/2012 Jerry Smith 95 0
4 Morti Smith 13/03/2012 Morty Smith 91 1
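Your comparison method does double the work, because the fuzz ratio between "Jerry Smith" and "Morty Smith" is the same as the ratio between "Morty Smith" and "Jerry Smith".
If you only iterate over the sub-array of rows that come after the current one, you can finish faster.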
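import pandas as pd
from fuzzywuzzy import fuzz, process

dataframe = pd.read_csv('datafile.csv')
_list = []
for i_dataframe in range(len(dataframe) - 1):
    comparison_fullname = dataframe['full_name'][i_dataframe]
    # only look at the rows after the current one
    remaining = dataframe['full_name'][i_dataframe + 1:]
    # a Series as choices yields (match, score, index) tuples;
    # limit=None returns every candidate instead of the default top 5
    for entry_fullname, entry_score, _ in process.extract(
            comparison_fullname, remaining, scorer=fuzz.ratio, limit=None):
        if entry_score >= 90:
            _list.append((comparison_fullname, entry_fullname, entry_score))
print(_list)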
This prevents any duplicated work.
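The "compare each pair only once" idea can also be sketched with the standard library alone. This is only an illustration: `similarity` and `similar_pairs` are hypothetical helper names, and `difflib.SequenceMatcher` is used as a stand-in scorer so the snippet runs without fuzzywuzzy installed. Its scores are not guaranteed to equal `fuzz.ratio` in every case.

```python
from difflib import SequenceMatcher
from itertools import combinations

names = ["Jerry Smith", "Morty Smith", "Rick Sanchez", "Jery Smith", "Morti Smith"]


def similarity(a, b):
    """Stand-in scorer on a 0-100 scale (an assumption, not fuzz.ratio itself)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def similar_pairs(values, threshold=90):
    """Score every unordered pair once and keep those at or above threshold."""
    result = []
    # combinations() yields (A, B) but never (B, A) or (A, A)
    for a, b in combinations(values, 2):
        score = similarity(a, b)
        if score >= threshold:
            result.append((a, b, score))
    return result


pairs = similar_pairs(names)
# number of comparisons: n * (n - 1) / 2 instead of n * n
n_comparisons = len(list(combinations(names, 2)))
print(pairs)
print(n_comparisons, "comparisons instead of", len(names) ** 2)
```

For the five sample names this performs 10 comparisons instead of 25. With a real scorer you could pass `fuzz.ratio` in place of `similarity`.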
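In general, there are two things that help you improve performance:
- reduce the number of comparisons
- use a faster way to match the strings
In your implementation you perform a lot of unnecessary comparisons, because you always compare A with B and later B with A. You also compare A with A, which is generally always 100. So you can reduce the number of comparisons by more than 50%. Since you only want to add matches with a score above 90, this information can also be used to speed up each comparison. This cannot be done in FuzzyWuzzy, but it can be done in RapidFuzz (I am the author). RapidFuzz implements the same algorithms as FuzzyWuzzy with a relatively similar interface, but with many performance improvements.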
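Your code could be rewritten as follows to apply both changes, which should be a lot faster. When I tested this on my machine, your code took around 12 seconds, while this improved version needs only 1.7 seconds.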
import pandas as pd
from io import StringIO
from rapidfuzz import fuzz

# generate a bigger list of examples to show the performance benefits
s = "fullname,dob"
s += '''
Jerry Smith,13/03/2012''' * 500

dataframe = pd.read_csv(StringIO(s))

# only create the data series once
full_names = dataframe['fullname']

_list = []
# index + 1 is used as a position below, which relies on the default 0..n-1 index
for index, row1 in full_names.items():
    # skip elements that are already compared
    for row2 in full_names.iloc[index + 1:]:
        # use a score_cutoff to improve the runtime for bad matches;
        # fuzz.ratio returns 0 when the score is below the cutoff
        score = fuzz.ratio(row1, row2, score_cutoff=90)
        if score:
            _list.append([row1, row2, score])