Compare string columns of two Python pandas DataFrames to identify common strings and add the common string to a new column

I have the following two pandas DataFrames:

df1:             df2:

item_name        item_cleaned

abc xyz          Def
xuy DEF          Ghi
s GHI lsoe       Abc
p ABc ois

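I need to write a function that compares df2.item_cleaned with df1.item_name to check whether any string in df2.item_cleaned occurs in df1.item_name, ignoring case.

If a string is present (regardless of case), I want to create a new column df1.item_final and fill each row of it with the matching df2.item_cleaned value.

The output should look like this: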

df1:                                 df2:

item_name        item_final          item_cleaned

abc xyz          Abc                 Def
xuy DEF          Def                 Ghi
s GHI lsoe       Ghi                 Abc
p ABc ois        Abc

For reference, the df1 I need to clean has 12 columns and about 400,000 rows.

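Solution

Create a mapping obj_map whose keys are the lowercased item_cleaned values and whose values are the original item_cleaned strings.
Use a regular expression with the re.IGNORECASE flag to extract the item_cleaned value from item_name.
Then lowercase the extracted part and map it through obj_map to get item_final.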

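import re
import pandas as pd

# unique, non-null target strings
item_cleaned = df2['item_cleaned'].dropna().unique()
# lowercase key -> original spelling
obj_map = pd.Series(dict(zip(map(str.lower, item_cleaned), item_cleaned)))

# escape the special characters and build a single alternation group
re_pat = '(%s)' % '|'.join([re.escape(i) for i in item_cleaned])

# expand=False returns a Series: the first match per row, NaN if nothing matches
df1['item_final'] = df1['item_name'].str.extract(re_pat, flags=re.IGNORECASE, expand=False)
df1['item_final'] = df1['item_final'].str.lower().map(obj_map)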

obj_map

def    Def
ghi    Ghi
abc    Abc
dtype: object

df1

    item_name item_final
0     abc xyz        Abc
1     xuy DEF        Def
2  s GHI lsoe        Ghi
3   p ABc ois        Abc
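
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same approach wrapped in a function. The function name add_item_final and the extra "no match" row are hypothetical additions for illustration, not part of the original question. Sorting the alternatives longest first is also an assumption on my part: regex alternation takes the first alternative that matches at a position, so this keeps a longer value from being shadowed by a shorter one that is its prefix.

```python
import re

import pandas as pd


def add_item_final(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df1 with an item_final column (hypothetical helper)."""
    item_cleaned = df2["item_cleaned"].dropna().unique()

    # lowercase key -> original spelling, e.g. "abc" -> "Abc"
    obj_map = {s.lower(): s for s in item_cleaned}

    # Assumption: try longer values first so they win over shorter prefixes.
    alternatives = sorted(item_cleaned, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(s) for s in alternatives) + ")"

    # First case-insensitive match per row; NaN where nothing matches.
    extracted = df1["item_name"].str.extract(
        pattern, flags=re.IGNORECASE, expand=False
    )

    out = df1.copy()
    out["item_final"] = extracted.str.lower().map(obj_map)
    return out


# Sample data from the question, plus one illustrative row with no match.
df1 = pd.DataFrame(
    {"item_name": ["abc xyz", "xuy DEF", "s GHI lsoe", "p ABc ois", "no match"]}
)
df2 = pd.DataFrame({"item_cleaned": ["Def", "Ghi", "Abc"]})

result = add_item_final(df1, df2)
print(result)
```

Rows with no match end up as NaN in item_final. Note that this matches substrings anywhere in item_name; if only whole words should count, wrapping the group in \b word boundaries is one option, though whether that fits depends on the real data.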