
Aggregating non-duplicated data in a Python Pandas Dataframe based on duplicated attributes

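I have a DataFrame in which some attributes ['Stage1', 'Stage2', 'Stage3'] contain duplicated data, while other attributes ['Key', 'Agg1', 'Agg2'] contain non-duplicated data. I want to match rows on the duplicated attributes ['Stage1', 'Stage2', 'Stage3'] and aggregate the non-duplicated attributes. The aggregated values should be comma-separated. Attributes I do not want to aggregate, such as ['Title'], should be ignored. Here is a sample of the data.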

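I have tried several options, including duplicated and groupby, but none of them produced these results. I am fairly new to Python, so please excuse the hacky code.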

code

import pandas as pd
import numpy as np

DataInput = pd.DataFrame(
np.array([["Key.1","Group 1","A","One","G_One,S_One",S_Two",S_Three"],["Key.2","B","Two",["Key.3","C","Three",["Key.4","Group 2","G_Two,["Key.5",["Key.6","Group 3","G_Three,["Key.7","Group 4",S_Two Different",S_Three"]]),columns=["Key","Title","Agg1","Agg2","Stage1","Stage2","Stage3"]
)

DataOutput = pd.DataFrame(
np.array([["Key.1,Key.2,Key.3","A,B,C","One,Two,Three",["Key.4,Key.5","Two,"Stage3"]
)

# Input
print(DataInput)
# Expected Output
print(DataOutput)


# Columns whose combined values identify a duplicated group
ColumnNames = ['Stage1', 'Stage2', 'Stage3']
all_duplicates = DataInput.duplicated(subset=ColumnNames,keep=False)
unique_duplicates = DataInput.duplicated(subset=ColumnNames)

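# Compare the two masks side by side: 'self' holds all_duplicates, 'other' holds unique_duplicates
duplicate_compare = all_duplicates.compare(unique_duplicates, keep_shape=True, keep_equal=True)
# Rows to keep: unique rows plus the first occurrence of each duplicated group
keeplist_bool = duplicate_compare['other'][duplicate_compare.other == False]
# Rows to drop: every later occurrence within a duplicated group
droplist_bool = duplicate_compare['other'][duplicate_compare.other == True]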
# Listing of all the unique items that should be updated

DataInput_Keep = DataInput.loc[keeplist_bool.index]
DataInput_Drop = DataInput.loc[droplist_bool.index]
print(DataInput_Keep)
print(DataInput_Drop)
# drop_duplicates defaults to keep='first'
DataInput_unique = DataInput.drop_duplicates(subset=ColumnNames)
print(DataInput_unique)
# Ideas on using this to hack through the solution
# Create new DF with all the duplicates
# Drop all but the first duplicates from the original
# Iterate through original and update with the aggregates.

# Kind of works, but data is missing for Key.7
#grouped = DataInput.groupby(ColumnNames)['Key'].agg(','.join)#['Agg2'].agg(','.join)
#grouped = DataInput.groupby(ColumnNames)['Key'].apply(','.join)
#print(grouped)
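
The commented-out groupby attempt above is close to what the question asks for. As a minimal sketch, grouping on the three Stage columns and joining each of the remaining columns with a comma could look like the following. The sample rows are hypothetical stand-ins, since the original data did not survive intact, and the names df, stage_cols, agg_cols and result are introduced here only for illustration. Passing dropna=False (available from pandas 1.1) keeps groups whose key contains NaN, which groupby drops by default; whether that is what happened to Key.7 cannot be confirmed from the data shown.

```python
import pandas as pd

# Hypothetical sample data: these values are illustrative, not the original dataset.
df = pd.DataFrame(
    [
        ["Key.1", "Group 1", "A", "One",   "G_One", "S_One", "S_Two"],
        ["Key.2", "Group 1", "B", "Two",   "G_One", "S_One", "S_Two"],
        ["Key.3", "Group 2", "C", "Three", "G_Two", "S_One", "S_Two"],
        ["Key.4", "Group 3", "D", "Four",  "G_Two", "S_One", "S_Other"],
    ],
    columns=["Key", "Title", "Agg1", "Agg2", "Stage1", "Stage2", "Stage3"],
)

stage_cols = ["Stage1", "Stage2", "Stage3"]  # columns that define a duplicated group
agg_cols = ["Key", "Agg1", "Agg2"]           # columns to join; 'Title' is left out on purpose

result = (
    # sort=False keeps groups in order of first appearance;
    # dropna=False keeps rows whose group key contains NaN.
    df.groupby(stage_cols, sort=False, dropna=False)[agg_cols]
      .agg(",".join)     # join the string values of each column within a group
      .reset_index()     # turn the group keys back into ordinary columns
)
print(result)
```

Selecting agg_cols before aggregating is what excludes Title from the result. The join assumes every aggregated column holds strings; columns of another dtype would need converting first, for example with astype(str).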
