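Python/Pandas: How do I use FuzzyWuzzy to replace misspellings in a column with country names?
I have a data frame with about 500,000 rows that contains, among others, a column named country.
My goal is to replace every misspelled variant in the country
column with the correctly spelled name.
For example: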
import pandas as pd
# Starting dataset:
d = {'country': ['Unites Sates','United state','Cnda','canada','United State','United sates of America','Mexio','mexico','Mejico','America','U.S.A.','UsA of A','cAnada','u. s. a. ','United States of America']}
df = pd.DataFrame(data=d)
df
country
0 Unites Sates #wants to replace
1 United state #wants to replace
2 Cnda #wants to replace
3 canada #wants to replace
4 United State #wants to replace
5 United sates of America #wants to replace
6 Mexio #wants to replace
7 mexico #wants to replace
8 Mejico #wants to replace
9 America #wants to replace
10 U.S.A. #wants to replace
11 UsA of A #wants to replace
12 cAnada #wants to replace
13 u. s. a. #wants to replace
14 United States of America
# Expected Outcome:
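d = {'country': ['United States of America','United States of America','Canada','Canada','United States of America','United States of America','Mexico','Mexico','Mexico','United States of America','United States of America','United States of America','Canada','United States of America','United States of America']}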
df = pd.DataFrame(data=d)
df
country
0 United States of America #replaced
1 United States of America #replaced
2 Canada #replaced
3 Canada #replaced
4 United States of America #replaced
5 United States of America #replaced
6 Mexico #replaced
7 Mexico #replaced
8 Mexico #replaced
9 United States of America #replaced
10 United States of America #replaced
11 United States of America #replaced
12 Canada #replaced
13 United States of America #replaced
14 United States of America
One thing I tried was to create a data frame called correct_countries_df
that contains the correct country names, and to use it like this:
from fuzzywuzzy import process
df['country_BestMatch'] = df['country'].map(lambda x: process.extractOne(x,correct_countries_df['country'])[0])
But I can't seem to get it to work.
Any ideas?
Thanks in advance!
Solution
If your correct_countries_df
looks like this:
>>> correct_countries_df
country
0 United States of America
1 Canada
2 Mexico
then your code is correct:
>>> df['country'].map(lambda x: process.extractOne(x,correct_countries_df['country'])[0])
0 United States of America
1 United States of America
2 Canada
3 Canada
4 United States of America
5 United States of America
6 Mexico
7 Mexico
8 Mexico
9 United States of America
10 United States of America
11 United States of America
12 Canada
13 United States of America
14 United States of America
Name: country, dtype: object