DF =
id  token       argument1             argument2
1   Tza         Tuvia Tza             Moscow
2   perugia     umbria                perugia
3   associated  the associated press  Nelson
I want to compare the values in the argumentX columns with the token column and set the value of a new column ARG accordingly:
DF =
id  token       argument1             argument2  ARG
1   Tza         Tuvia Tza             Moscow     ARG1
2   perugia     umbria                perugia    ARG2
3   associated  the associated press  Nelson     ARG1
This is what I have tried:
conditions = [
    DF["token"] == DF["argument1"],
    DF["token"] == DF["argument2"]]
choices = ["ARG1", "ARG2"]
DF["ARG"] = np.select(conditions, choices, default=np.nan)
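This only compares the whole strings, so it matches only when they are identical. Constructs such as .isin or .contains, or a helper column like DF["ARG_cat"] = DF.apply(lambda row: row['token'] in row['argument2'], axis=1), did not work. Any ideas?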
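Solution:
Use str.contains with a regular expression: join all values of the token column with | (regex OR), and wrap each one in word boundaries (\b) so that only whole words are matched as substrings: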
import re
import numpy as np

# One alternation of all tokens, each wrapped in word boundaries
pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in DF["token"])
conditions = [DF["argument1"].str.contains(pat), DF["argument2"].str.contains(pat)]
choices = ["ARG1", "ARG2"]
DF["ARG"] = np.select(conditions, choices, default=np.nan)
print(DF)
   id       token             argument1  argument2   ARG
0   1         Tza             Tuvia Tza     Moscow  ARG1
1   2     perugia                umbria    perugia  ARG2
2   3  associated  the associated press     Nelson  ARG1
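To make the joined pattern easier to inspect, here is a minimal sketch using only the standard `re` module. It assumes that `str.contains` performs a regex search on each cell, so `re.search` on a single string behaves the same way; the helper name `has_token` and the test string "Tzavta" are made up for illustration.

```python
import re

tokens = ["Tza", "perugia", "associated"]

# Builds r"\bTza\b|\bperugia\b|\bassociated\b": any token, as a whole word
pat = "|".join(r"\b{}\b".format(re.escape(t)) for t in tokens)

def has_token(text):
    """Return True if any token occurs in `text` as a whole word."""
    return re.search(pat, text) is not None

print(has_token("the associated press"))  # True
print(has_token("Nelson"))                # False
print(has_token("Tzavta"))                # False: \b rejects partial words
```

Note that the pattern contains every token, so a cell matches when it holds the token of any row, not only the token of its own row.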
Edit:
If you want to compare row by row instead:
import pandas as pd

d = {'id': [1, 2, 3],
     'token': ["Tza", "perugia", "israel"],
     'argument1': ["Tuvia Tza", "umbria", "Tuvia Tza"],
     'argument2': ["israel", "perugia", "israel"]}
DF = pd.DataFrame(data=d)
print(DF)
   id    token  argument1  argument2
0   1      Tza  Tuvia Tza     israel
1   2  perugia     umbria    perugia
2   3   israel  Tuvia Tza     israel
# Substring test of each token against the argument in the same row
conditions = [[tok in arg for tok, arg in zip(DF['token'], DF['argument1'])],
              [tok in arg for tok, arg in zip(DF['token'], DF['argument2'])]]
choices = ["ARG1", "ARG2"]
DF["ARG"] = np.select(conditions, choices, default=np.nan)
print(DF)
   id    token  argument1  argument2   ARG
0   1      Tza  Tuvia Tza     israel  ARG1
1   2  perugia     umbria    perugia  ARG2
2   3   israel  Tuvia Tza     israel  ARG2
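The row-wise check above uses `in`, which is a plain substring test, so a token like "press" would also match inside "expression". If whole-word matching per row is wanted, one possible sketch combines the two ideas. The helper `pick_arg` and the fourth sample row are hypothetical and not part of the original answer.

```python
import re

def pick_arg(token, argument1, argument2):
    """Label the first argument that contains `token` as a whole word, else None."""
    word = re.compile(r"\b{}\b".format(re.escape(token)))
    if word.search(argument1):
        return "ARG1"
    if word.search(argument2):
        return "ARG2"
    return None

rows = [("Tza", "Tuvia Tza", "israel"),
        ("perugia", "umbria", "perugia"),
        ("israel", "Tuvia Tza", "israel"),
        ("press", "expression", "Nelson")]  # hypothetical row: substring but not a whole word

labels = [pick_arg(*row) for row in rows]
print(labels)  # ['ARG1', 'ARG2', 'ARG2', None]
```

With the DataFrame from the edit, the same helper could be applied as DF["ARG"] = [pick_arg(*r) for r in zip(DF["token"], DF["argument1"], DF["argument2"])]. As with np.select, argument1 takes precedence when both arguments match.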