Storing duplicate rows when comparing two DataFrames in pandas

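Hi everyone (I'm new to Python). Problem: I have two DataFrames, df1 and df2. I want to check for duplicates based on the same (url, price, pourcent) and store them in a new DataFrame. I also want to check for URLs that appear in both but whose price has changed, and store those in another new DataFrame.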

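import pandas as pd

df1 = pd.DataFrame([['www.sercos.com.tn/corps-bains/','23.450','12'],['www.sercos.com.tn/after/','11.000','5'],['www.sercos.com.tn/new/','34.000','0'],['www.sercos.com.tn/Now/','14.750','11']],columns=['url','price','pourcent'])

# The source listing for df2 is truncated: only the first row is intact. The surviving
# fragments are a row ending in '10' and a row starting with 'www.sercos.com.tn/before/'.
df2 = pd.DataFrame([['www.sercos.com.tn/corps-bains/','13.890','18']],columns=['url','price','pourcent'])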

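Solution

Here is some code to get you started. It creates two sample DataFrames, finds the URLs that match between them, collects the matching rows into a new DataFrame, and finally checks whether the rows match exactly.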

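import pandas as pd

# Sample df 1
# (the source listed only three 'percent' values; the third one here is a placeholder)
df1 = pd.DataFrame({'url': ["urlone","urltwo","urlthree","urlfour"],
                    'price': [1,2,3,4],
                    'percent': [0.5,1,1,8]})

# Sample df 2
# ('urlthree' is inferred from the match output below; 'price' and 'percent' were lost
# in the source, so these values are placeholders)
df2 = pd.DataFrame({'url': ["urlone","urlthree","urlfive","urlsix"],
                    'price': [1,5,6,7],
                    'percent': [0.5,1,1,8]})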


## pd.match has been removed from pandas; Index.get_indexer returns the same information.
## For each url in df2 it gives the position of that url in df1 (df1['url'] must be unique).
match = pd.Index(df1['url']).get_indexer(df2['url'])

print(match)
# [ 0  2 -1 -1]
## The position in the array tells you which row of df2 was looked up
## The number tells you where the corresponding match is in df1
## A value of -1 means no match
## You can copy both rows over to df3

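## df3 for storing the duplicated rows
matched_rows = []

# Iterate through match and collect the paired rows
for n, i in enumerate(match):
    if i >= 0:  # negative numbers are not matches
        matched_rows.append(df1.iloc[[i]])  # row from df1
        matched_rows.append(df2.iloc[[n]])  # corresponding row from df2

# DataFrame.append was removed in pandas 2.0, so build df3 with pd.concat instead
df3 = pd.concat(matched_rows) if matched_rows else pd.DataFrame(columns=df1.columns)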


# df3.duplicated() flags a row as True when an identical row appears earlier in df3,
# i.e. when the df2 row is exactly the same as its df1 counterpart.
print(df3.duplicated())
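
Since `duplicated()` by default only flags the second and later copies of a row, a small self-contained sketch may help show what the flags mean. The rows below are made-up values for illustration, not the asker's data; `keep=False` and `subset` are standard options of `DataFrame.duplicated`.

```python
import pandas as pd

# Hypothetical rows: the two "urlone" rows agree, the two "urlthree" rows differ in price.
df3 = pd.DataFrame(
    {
        "url": ["urlone", "urlone", "urlthree", "urlthree"],
        "price": [1, 1, 3, 5],
        "percent": [0.5, 0.5, 1.0, 1.0],
    }
)

# Default keep="first": only the later copy of an identical row is flagged.
later_copies = df3.duplicated()

# keep=False: every row that has an identical twin is flagged.
all_copies = df3.duplicated(keep=False)

# Rows that share a url with another row but are not identical overall,
# which here means the price changed.
changed = df3[df3.duplicated(subset="url", keep=False) & ~all_copies]

print(later_copies.tolist())
print(all_copies.tolist())
print(changed)
```

Note that `duplicated()` compares column values only, so rows with different index labels can still count as duplicates.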

P.S. It helps if you include your code as text so that others can run it easily :)

Updated variant that uses your DataFrames and a set instead of pd.match


df1 = pd.DataFrame([['www.sercos.com.tn/corps-bains/','23.450','12'],['www.sercos.com.tn/after/','11.000','5'],['www.sercos.com.tn/new/','34.000','0'],['www.sercos.com.tn/now/','14.750','11']],columns=['url','price','pourcent'])

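# The source listing for df2 is truncated: only the first row is intact. The surviving
# fragments are a row ending in '10' and a row starting with 'www.sercos.com.tn/before/'.
df2 = pd.DataFrame([['www.sercos.com.tn/corps-bains/','13.890','18']],columns=['url','price','pourcent'])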


##This tells you all of the matches between the two columns and stores it in a variable called match_set
match_set = set(df2['url']).intersection(df1['url'])

print(match_set)
#List of urls that match

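## df3 for storing the duplicated rows
matched_rows = []

# Iterate through match_set and collect the matching rows from both DataFrames
for item in match_set:
    matched_rows.append(df1.loc[df1['url'] == item])
    matched_rows.append(df2.loc[df2['url'] == item])

# DataFrame.append was removed in pandas 2.0, so build df3 with pd.concat instead
df3 = pd.concat(matched_rows) if matched_rows else pd.DataFrame(columns=df1.columns)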


# df3.duplicated() will then tell you if the rows are exactly the same or not.
print(df3)
print(df3.duplicated())
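
As an alternative to looping, the two result DataFrames the question asks for could probably be built directly with `DataFrame.merge`. This is only a sketch under stated assumptions: the URLs and values below are hypothetical stand-ins rather than the asker's data, and the names `exact_duplicates`, `same_url` and `price_changed` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; prices are kept as strings, as in the question.
df1 = pd.DataFrame(
    [["site/a/", "23.450", "12"], ["site/b/", "11.000", "5"], ["site/c/", "34.000", "0"]],
    columns=["url", "price", "pourcent"],
)
df2 = pd.DataFrame(
    [["site/a/", "13.890", "18"], ["site/b/", "11.000", "5"], ["site/d/", "14.750", "11"]],
    columns=["url", "price", "pourcent"],
)

# Rows identical in url, price and pourcent: inner merge on all three columns.
exact_duplicates = df1.merge(df2, on=["url", "price", "pourcent"], how="inner")

# Rows sharing a url: inner merge on url only, keeping both versions of the other columns.
same_url = df1.merge(df2, on="url", how="inner", suffixes=("_df1", "_df2"))

# Of those, keep the rows whose price differs between the two DataFrames.
price_changed = same_url[same_url["price_df1"] != same_url["price_df2"]]

print(exact_duplicates)
print(price_changed)
```

Because the prices here are strings, "11.000" and "11.0" would count as different; converting the column with `pd.to_numeric` first may be safer if the formatting can vary.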
