Storing duplicate rows when comparing two DataFrames in pandas

How to store duplicate rows when comparing two DataFrames in pandas

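Hi everyone (I'm new to Python). Question: I have two DataFrames, df1 and df2. I want to check whether they contain duplicate rows, meaning rows with the same (url, price, pourcent), and store those rows in a new DataFrame. I also want to find rows where the URL is the same but the price has changed, and store those in another new DataFrame.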

df1 = pd.DataFrame([['www.sercos.com.tn/corps-bains/','23.450','12'],['www.sercos.com.tn/after/','11.000','5'],['www.sercos.com.tn/new/','34.000','0'],['www.sercos.com.tn/Now/','14.750','11']],columns=['url','price','pourcent'])

df2 = pd.DataFrame([['www.sercos.com.tn/corps-bains/','13.890','18'],'10'],['www.sercos.com.tn/before/','pourcent'])

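Solution

Here is some code to help you get started. It creates two sample DataFrames, collects the rows whose URLs match into a new DataFrame, and finally checks whether those rows match exactly.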

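import pandas as pd

#NOTE: the second 0.5 in df1 and the price/percent columns of df2 are assumed values;
#the df2 urls follow from the match output shown further down
#Sample df 1
df1 = pd.DataFrame({'url': ["urlone","urltwo","urlthree","urlfour"],
                    'price': [1,2,3,4],
                    'percent': [0.5,0.5,1,8]
                   })

#sample df 2
df2 = pd.DataFrame({'url': ["urlone","urlthree","urlfive","urlsix"],
                    'price': [1,2,3,4],
                    'percent': [0.5,0.5,1,8]
                   })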


##This tells you all of the matches between the two columns and stores it in a variable called match
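##pd.match no longer exists in current pandas; Index.get_indexer returns the same positions
##(it requires the urls in df1 to be unique)
match = pd.Index(df1['url']).get_indexer(df2['url'])

print(match)
#[ 0  2 -1 -1]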
##The index tells you where the matches are in df2
##The number tells you where the corresponding match is in df1
##A value of -1 means no match
##You can copy both over to df3

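##df3 for storing duplicated
##DataFrame.append was removed in pandas 2.0, so collect the rows and concatenate once
rows = []

#Iterate through match and collect the matching rows from both frames
for n,i in enumerate(match):
    if i >= 0: # negative numbers are not matches
        rows.append(df1.iloc[[i]]) # double brackets keep the row as a one-row DataFrame
        rows.append(df2.iloc[[n]])

df3 = pd.concat(rows) if rows else pd.DataFrame(columns=df1.columns)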


#df3.duplicated will then tell you if the rows are exactly the same or not. 
df3.duplicated()
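
If you only need the rows that are identical in both frames, an inner merge on every column should get you there without a loop. This is just a sketch: the sample values are illustrative assumptions in the spirit of the sample frames above, and exact_duplicates is a name I made up.

```python
import pandas as pd

# Toy frames in the spirit of the sample above (all values are illustrative assumptions)
df1 = pd.DataFrame({'url': ["urlone", "urltwo", "urlthree", "urlfour"],
                    'price': [1, 2, 3, 4],
                    'percent': [0.5, 0.5, 1.0, 8.0]})
df2 = pd.DataFrame({'url': ["urlone", "urlthree", "urlfive", "urlsix"],
                    'price': [1, 2, 3, 4],
                    'percent': [0.5, 0.5, 1.0, 8.0]})

# An inner merge on every column keeps only the rows that appear in both frames
exact_duplicates = df1.merge(df2, on=['url', 'price', 'percent'], how='inner')
print(exact_duplicates)
```

Here only the urlone row survives, because urlthree appears in both frames with a different price and percent.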

P.S. It would be helpful if you included your code as text so that others can run it easily :)

Updated variant using your DataFrames and a set instead of pd.match


df1 = pd.DataFrame([['www.sercos.com.tn/corps-bains/','23.450','12'],['www.sercos.com.tn/after/','11.000','5'],['www.sercos.com.tn/new/','34.000','0'],['www.sercos.com.tn/now/','14.750','11']],columns=['url','price','pourcent'])

df2 = pd.DataFrame([['www.sercos.com.tn/corps-bains/','13.890','18'],'10'],['www.sercos.com.tn/before/','pourcent'])


##This tells you all of the matches between the two columns and stores it in a variable called match_set
match_set = set(df2['url']).intersection(df1['url'])

print(match_set)
#List of urls that match

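##df3 for storing duplicated
##DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate once
pieces = []

#Iterate through match_set and collect the matching rows from both frames
for item in match_set:
    pieces.append(df1.loc[df1['url'] == item])
    pieces.append(df2.loc[df2['url'] == item])

df3 = pd.concat(pieces) if pieces else pd.DataFrame(columns=df1.columns)
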

#df3.duplicated will then tell you if the rows are exactly the same or not. 
df3.duplicated()
print(df3)
print(df3.duplicated())
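
For the second part of the question (same URL, but the price has changed), one option is to merge the two frames on url and compare the price columns side by side. A minimal sketch, assuming df2 has the same three columns as df1. The df2 rows marked hypothetical are made up for illustration, and the names both, price_changed and unchanged are mine.

```python
import pandas as pd

# df1 as in the question
df1 = pd.DataFrame(
    [['www.sercos.com.tn/corps-bains/', '23.450', '12'],
     ['www.sercos.com.tn/after/', '11.000', '5'],
     ['www.sercos.com.tn/new/', '34.000', '0'],
     ['www.sercos.com.tn/now/', '14.750', '11']],
    columns=['url', 'price', 'pourcent'])

# Only the first row of df2 comes from the question; the others are hypothetical
df2 = pd.DataFrame(
    [['www.sercos.com.tn/corps-bains/', '13.890', '18'],
     ['www.sercos.com.tn/new/', '34.000', '0'],        # hypothetical row
     ['www.sercos.com.tn/before/', '10.000', '10']],   # hypothetical row
    columns=['url', 'price', 'pourcent'])

# Pair up rows that share a url; the suffixes mark which frame each value came from
both = df1.merge(df2, on='url', suffixes=('_df1', '_df2'))

# Prices are stored as strings in the question, so compare them as numbers
price_changed = both[both['price_df1'].astype(float) != both['price_df2'].astype(float)]

# Rows where url, price and pourcent are all identical in both frames
same = (both['price_df1'] == both['price_df2']) & (both['pourcent_df1'] == both['pourcent_df2'])
unchanged = both[same]

print(price_changed)
print(unchanged)
```

With this data, price_changed holds the corps-bains row and unchanged holds the new row. URLs that appear in only one frame (after, now, before) are dropped by the inner merge.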