How to find rows that exist in two DataFrames but have different values in selected columns

Let's say we have two DataFrames, df1 and df2, with the same columns:

df1                          df2
  | Name | Value1 | Value2     | Name | Value1 | Value2
--------------------------   --------------------------
1 | John |    1   |   2      1 | John |    4   |   2
--------------------------   --------------------------
2 | Sue  |    1   |   2      2 | Sue  |    1   |   3
--------------------------   --------------------------
3 | Bob  |    1   |   2      3 | Bob  |    5   |   6

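We can see that the only differences are these: in the row with Name 'John', column Value1 changed from 1 to 4; for 'Sue', column Value2 changed from 2 to 3; and for 'Bob', both columns changed.

My question is: what is the most idiomatic way to extract a (Name, Column(s)) pair for each such change? The actual values do not matter, only the rows and columns in which they changed.

I would like to write a function like this: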

updated = check_for_updates(df1, df2)
print(updated)
# [
#   ("John", ("Value1",)),
#   ("Sue", ("Value2",)),
#   ("Bob", ("Value1", "Value2")),
# ]

Solution

pandas 1.1 added the DataFrame.compare method for comparing two DataFrames; you can combine it with a defaultdict to get to the final result:

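from collections import defaultdict

updated = defaultdict(list)
# compare() yields columns of the form (column, "self"/"other");
# stack(0) moves the column names into the index, so each index entry is a
# (Name, column) pair. This relies on stack dropping all-NaN (unchanged) rows,
# which is the default behavior in pandas 1.x and 2.x.
for key, value in (df1
                   .set_index("Name")
                   .compare(df2.set_index("Name"), keep_shape=True)
                   .stack(0).index):
    updated[key].append(value)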

print(updated)

defaultdict(<class 'list'>, {'John': ['Value1'], 'Sue': ['Value2'], 'Bob': ['Value1', 'Value2']})
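
For the list-of-tuples shape asked for in the question, here is a minimal sketch that uses a boolean mask instead of compare. It assumes the Name values are unique and present in both frames; the function body is illustrative and not part of the answer above. Note that NaN compares as different from NaN with this approach.

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["John", "Sue", "Bob"], "Value1": [1, 1, 1], "Value2": [2, 2, 2]})
df2 = pd.DataFrame({"Name": ["John", "Sue", "Bob"], "Value1": [4, 1, 5], "Value2": [2, 3, 6]})


def check_for_updates(df1, df2, index="Name"):
    # Align both frames on the key column (assumed unique in each frame).
    left = df1.set_index(index)
    right = df2.set_index(index).reindex(index=left.index, columns=left.columns)
    # Boolean mask: True where a cell differs between the two frames.
    changed = left.ne(right)
    # Keep only rows with at least one change, and list the changed columns.
    return [
        (name, tuple(changed.columns[row.to_numpy()]))
        for name, row in changed.iterrows()
        if row.any()
    ]


updated = check_for_updates(df1, df2)
print(updated)
```

Rows with no changes are skipped, so the result only names the rows and columns that differ.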

I think you can do what you want if you try df1.set_index('Name') and df2.set_index('Name'). I mean, you can then use their names to extract

OK, I figured it out, and I am happy with this solution:
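
One way to read this suggestion, as a rough sketch only: after set_index("Name"), a row can be looked up by name with .loc and compared across the two frames. The variable names below are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["John", "Sue", "Bob"], "Value1": [1, 1, 1], "Value2": [2, 2, 2]})
df2 = pd.DataFrame({"Name": ["John", "Sue", "Bob"], "Value1": [4, 1, 5], "Value2": [2, 3, 6]})

# Index both frames by Name so rows can be addressed by name.
a = df1.set_index("Name")
b = df2.set_index("Name")

# Compare one row by name; the result is a boolean Series indexed by column.
diff = a.loc["John"] != b.loc["John"]
changed_cols = list(diff[diff].index)
print(changed_cols)
```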

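from pandas import DataFrame

df1 = DataFrame(data={"Name": ["John", "Sue", "Bob"], "Value1": [1, 1, 1], "Value2": [2, 2, 2]})
df2 = DataFrame(data={"Name": ["John", "Sue", "Bob"], "Value1": [4, 1, 5], "Value2": [2, 3, 6]})

def check_for_updates(df1, df2, columns, index):
  # Keep df2's values only where they differ from df1; everything else becomes NaN
  result = df2[df1[columns] != df2[columns]].dropna(how="all") # unchanged rows do not interest me
  # The mask also blanked the index column, so restore it from df1
  result[index] = df1[index]

  # For each remaining row, the non-NaN columns are the ones that changed
  return [(_id, tuple(cols.dropna().index)) for _id, cols in result.set_index(index).iterrows()]

updated = check_for_updates(df1, df2, columns=["Value1", "Value2"], index="Name")
print(updated)
# [
#   ('John', ('Value1',)),
#   ('Sue', ('Value2',)),
#   ('Bob', ('Value1', 'Value2'))
# ]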

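But I have a feeling (I don't know pandas very well) that there is a better way to do this, so feel free to correct me.

Edit: while I was writing this answer, @sammywemmy posted an alternative that I think is somewhat more idiomatic.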