Python Pandas: group by one column, aggregate only on another column, but keep the corresponding data

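I have seen a number of other related SO questions, such as this and this, but they don't seem to be quite what I want. Suppose I have a dataframe like this: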

import pandas as pd
df = pd.DataFrame(columns=['patient','parent csn','child csn','days'])
df.loc[0] = [0,0,10,5]
df.loc[1] = [0,0,11,3]
df.loc[2] = [0,1,12,6]
df.loc[3] = [0,1,13,4]
df.loc[4] = [1,2,20,4]
df
Out[9]: 
  patient parent csn child csn days
0       0          0        10    5
1       0          0        11    3
2       0          1        12    6
3       0          1        13    4
4       1          2        20    4

Now what I want to do is something like this:

grp_df = df.groupby(['parent csn']).min()

The problem is that the result computes the minimum independently in every column other than parent csn, and produces:

grp_df
            patient  child csn  days
parent csn                          
0                 0         10     3
1                 0         12     4
2                 1         20     4

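You can see that in the first row, the days value and the child csn value no longer come from the same row, as they did before grouping. Here is the output I want: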

grp_df
            patient  child csn  days
parent csn                          
0                 0         11     3
1                 0         13     4
2                 1         20     4

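How do I do this? I have code that iterates through the dataframe, and I think it works, but it is slow, even with Cython. I feel like this should be obvious, but I'm not finding it.

I also looked at this question, but putting child csn in the groupby list won't work, because child csn varies with days.

This question seems more likely to apply, but I don't find the solutions very intuitive.

This question also seems likely to apply, but again, the answers aren't very intuitive, and I do want only one row for each parent csn.

One other detail: the row containing the minimum days value may not be unique. In that case, I just want one row; I don't care which.

Thank you so much for your time!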

Solution

For your desired output, you need sort_values followed by groupby and first:

# Sort by group, then patient, then days; first() then takes the top row of each group.
df_final = (df.sort_values(['parent csn', 'patient', 'days'])
              .groupby('parent csn').first())

Out[813]:
            patient  child csn  days
parent csn
0                 0         11     3
1                 0         13     4
2                 1         20     4
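One caveat about the approach above: GroupBy.first() returns the first non-null entry of each column, so if the data contained missing values it could combine values from different rows. The following is a minimal sketch of a variant that swaps in drop_duplicates for groupby().first(), which always keeps whole rows. The variable name one_per_group is only illustrative, and the data is rebuilt from the table in the question.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 0, 10, 5], [0, 0, 11, 3], [0, 1, 12, 6], [0, 1, 13, 4], [1, 2, 20, 4]],
    columns=["patient", "parent csn", "child csn", "days"],
)

# Sort by days so the smallest value comes first, then keep the first row
# seen for each 'parent csn'. drop_duplicates keeps entire rows, so
# 'child csn' stays paired with its own 'days' value.
one_per_group = (
    df.sort_values("days", kind="stable")
      .drop_duplicates("parent csn")
      .sort_values("parent csn")
      .set_index("parent csn")
)
print(one_per_group)
```

On data without missing values, this should give the same rows as the sort_values plus first() version.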

By using .idxmin() instead of .min(), you can get the index (row label) at which "days" is at its minimum for each group:

Data creation:

import pandas as pd

data = [[0,0,10,5],[0,0,11,3],[0,1,12,6],[0,1,13,4],[1,2,20,4]]
df = pd.DataFrame(data, columns=['patient','parent csn','child csn','days'])

print(df)
   patient  parent csn  child csn  days
0        0           0         10     5
1        0           0         11     3
2        0           1         12     6
3        0           1         13     4
4        1           2         20     4

day_minimum_row_indices = df.groupby("parent csn")["days"].idxmin()

print(day_minimum_row_indices)
parent csn
0    1
1    3
2    4
Name: days, dtype: int64

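From this you can see that the group parent csn 0 has its minimum number of days at row 1. Looking back at the original dataframe, row 1 has days == 3, which is indeed where the minimum days falls for parent csn == 0. Parent csn == 1 has its minimum days at row 3, and so on.

We can use these row labels to subset the original dataframe: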

new_df = df.loc[day_minimum_row_indices]

print(new_df)
   patient  parent csn  child csn  days
1        0           0         11     3
3        0           1         13     4
4        1           2         20     4

Edit (tl;dr):

df.loc[df.groupby("parent csn")["days"].idxmin()]
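The question mentions that the row holding the minimum days may not be unique. As a small sketch of how the idxmin approach behaves in that case: idxmin returns the first row label at which the minimum occurs, so each group still yields exactly one row. The dataframe tied below is hypothetical data made up for illustration, not taken from the question.

```python
import pandas as pd

# Hypothetical data: parent csn 0 has two rows tied at days == 3.
tied = pd.DataFrame(
    {
        "patient": [0, 0, 0],
        "parent csn": [0, 0, 1],
        "child csn": [10, 11, 12],
        "days": [3, 3, 6],
    }
)

# idxmin returns one label per group: the first occurrence of the minimum.
idx = tied.groupby("parent csn")["days"].idxmin()

# .loc selects by label, which is what idxmin returns.
result = tied.loc[idx]
print(result)
```

Here the tie in parent csn 0 is resolved in favour of the earlier row (child csn 10), which fits the "I don't care which" requirement.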

You can use groupby to create a filter, rather than just using .groupby on its own:

s = df.groupby('parent csn')['days'].transform('min') == df['days']
df = df[s]
df

Out[1]: 
   patient  parent csn  child csn  days
1        0           0         11     3
3        0           1         13     4
4        1           2         20     4

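For example, this is what it looks like when you put s into the dataframe. You then simply filter for the True rows, which are the rows whose days value equals the minimum days of their group.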

Out[2]: 
   patient  parent csn  child csn  days      s
0        0           0         10     5  False
1        0           0         11     3   True
2        0           1         12     6  False
3        0           1         13     4   True
4        1           2         20     4   True
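Because this filter compares every row against its group minimum, it keeps all rows that tie for the minimum, whereas the question asks for only one row per parent csn. A minimal sketch of one way to handle that, assuming any of the tied rows is acceptable, is to follow the filter with drop_duplicates. The dataframe tied is hypothetical data made up for illustration.

```python
import pandas as pd

# Hypothetical data: parent csn 0 has two rows tied at days == 3.
tied = pd.DataFrame(
    {
        "patient": [0, 0, 0],
        "parent csn": [0, 0, 1],
        "child csn": [10, 11, 12],
        "days": [3, 3, 6],
    }
)

# Boolean mask: True where a row's days equals its group's minimum days.
mask = tied.groupby("parent csn")["days"].transform("min") == tied["days"]

all_minima = tied[mask]                               # both tied rows survive
one_each = all_minima.drop_duplicates("parent csn")   # keep the first per group
print(one_each)
```

If all tied rows are actually wanted, the drop_duplicates step can simply be left out.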

For some reason I can't explain, the columns of your dataframe have dtype object. This solution only works with numeric columns:

df.days = df.days.astype(int)
# idxmin returns index labels, so select with .loc rather than .iloc
df.loc[df.groupby('parent csn').days.idxmin()]

Out:

  patient parent csn child csn  days
1       0          0        11     3
3       0          1        13     4
4       1          2        20     4
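A likely explanation for the object columns is that the dataframe in the question was created empty with pd.DataFrame(columns=...), which gives object-dtype columns, and was then filled row by row. The sketch below assumes that is the cause: it mimics the situation by passing dtype=object explicitly, then converts every column to integers before applying idxmin. The name numeric is only illustrative.

```python
import pandas as pd

# Mimic the question's dataframe, whose columns ended up with dtype object.
df = pd.DataFrame(
    [[0, 0, 10, 5], [0, 0, 11, 3], [0, 1, 12, 6], [0, 1, 13, 4], [1, 2, 20, 4]],
    columns=["patient", "parent csn", "child csn", "days"],
    dtype=object,
)

# Convert all columns to a numeric dtype in one step.
numeric = df.astype("int64")
print(numeric.dtypes)

# With numeric columns, idxmin works and .loc picks the matching rows.
result = numeric.loc[numeric.groupby("parent csn")["days"].idxmin()]
print(result)
```

Building the dataframe from a list of rows in a single call, as in the data-creation step shown earlier, avoids the object dtype in the first place.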
