Python/Pandas: How can I use FuzzyWuzzy to replace misspellings in a column with the correct country names?

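I have a DataFrame with roughly 500,000 rows that contains, among others, a column named country. My goal is to replace every misspelled variant in the country column with the correct country name.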

For example:

import pandas as pd
# Starting dataset:
d = {'country': ['Unites Sates','United state','Cnda','canada','United State','United sates of America','Mexio','mexico','Mejico','America','U.S.A.','UsA of A','cAnada','u. s. a. ','United States of America']}
df = pd.DataFrame(data=d)
df

                     country
0               Unites Sates #wants to replace
1               United state #wants to replace
2                       Cnda #wants to replace
3                     canada #wants to replace
4               United State #wants to replace
5    United sates of America #wants to replace
6                      Mexio #wants to replace
7                     mexico #wants to replace
8                     Mejico #wants to replace
9                    America #wants to replace
10                    U.S.A. #wants to replace
11                  UsA of A #wants to replace
12                    cAnada #wants to replace
13                 u. s. a.  #wants to replace
14  United States of America


# Expected Outcome:
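d = {'country': ['United States of America','United States of America','Canada','Canada','United States of America','United States of America','Mexico','Mexico','Mexico','United States of America','United States of America','United States of America','Canada','United States of America','United States of America']}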
df = pd.DataFrame(data=d)
df

                     country
0   United States of America #replaced
1   United States of America #replaced
2                     Canada #replaced
3                     Canada #replaced
4   United States of America #replaced
5   United States of America #replaced
6                     Mexico #replaced
7                     Mexico #replaced
8                     Mexico #replaced
9   United States of America #replaced
10  United States of America #replaced
11  United States of America #replaced
12                    Canada #replaced
13  United States of America #replaced
14  United States of America

One thing I tried was to create a DataFrame called correct_countries_df that holds the correct country names and use it like this:

from fuzzywuzzy import process
df['country_BestMatch'] = df['country'].map(lambda x: process.extractOne(x,correct_countries_df['country'])[0])

But I can't seem to get it to work.

Any ideas?

Thanks in advance!

Solution

If your correct_countries_df looks like this:

>>> correct_countries_df

                    country
0  United States of America
1                    Canada
2                    Mexico

then your code is correct:

>>> df['country'].map(lambda x: process.extractOne(x,correct_countries_df['country'])[0])

0     United States of America
1     United States of America
2                       Canada
3                       Canada
4     United States of America
5     United States of America
6                       Mexico
7                       Mexico
8                       Mexico
9     United States of America
10    United States of America
11    United States of America
12                      Canada
13    United States of America
14    United States of America
Name: country, dtype: object
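
With about 500,000 rows, calling process.extractOne once per row repeats the same fuzzy comparison for every duplicate spelling. The following is a minimal sketch, not part of the original answer, that matches each distinct spelling only once and returns a plain dict that can then be passed to Series.map. The function name build_country_map and its extract_one parameter are hypothetical names introduced here for illustration; extract_one defaults to fuzzywuzzy's process.extractOne, and the sketch assumes every raw value is a string.

```python
def build_country_map(raw_values, choices, extract_one=None, score_cutoff=0):
    """Map each distinct raw spelling to its best-scoring canonical name.

    build_country_map and extract_one are hypothetical names for this sketch.
    Assumes every item in raw_values is a string.
    """
    if extract_one is None:
        # Assumption: the third-party fuzzywuzzy package is installed.
        from fuzzywuzzy import process
        extract_one = process.extractOne

    mapping = {}
    # set() collapses duplicates, so each spelling is scored only once.
    for raw in set(raw_values):
        # The matcher returns a tuple whose first item is the best match,
        # or None when no choice reaches score_cutoff.
        best = extract_one(raw, choices, score_cutoff=score_cutoff)
        # Leave the value unchanged when nothing scores high enough.
        mapping[raw] = best[0] if best is not None else raw
    return mapping


# Usage with the frames from the question (needs pandas and fuzzywuzzy):
# choices = correct_countries_df['country'].tolist()
# mapping = build_country_map(df['country'], choices)
# df['country_BestMatch'] = df['country'].map(mapping)
```

Raising score_cutoff above 0 keeps weak matches from being forced onto one of the canonical names; a suitable threshold depends on the data and would need to be checked against a sample of the real column.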
