python – Getting the most recent value from a group in pandas

I have a dataframe with the following structure:

Debtor ID    | Accountrating    | AccountratingDate   | AmountOutstanding    |AmountPastDue
John SNow      Closed             2017-03-01            0                     0
John SNow      Delayed            2017-04-22            2000                  500
John SNow      Closed             2017-05-23            0                     0
John SNow      Delayed            2017-07-15            6000                  300
Sarah Parker   Closed             2017-02-01            0                     0
Edward Hall    Closed             2017-05-01            0                     0
Douglas Core   Delayed            2017-01-01            1000                  200
Douglas Core   Delayed            2017-06-01            1000                  400

What I want to achieve is:

Debtor ID    | Incidents of delay    | TheMostRecentOutstanding    | TheMostRecentPastDue
John SNow      2                       6000                          300
Sarah Parker   0                       0                             0
Edward Hall    0                       0                             0
Douglas Core   2                       1000                          400

Counting the incidents of delay is straightforward:

df_account["pastDuebool"] = df_account['AmountPastDue'] > 0
new_df = pd.DataFrame(index = df_account.groupby("Debtor ID").groups.keys())
new_df['Incidents of delay'] = df_account.groupby("Debtor ID")["pastDuebool"].sum()

I am struggling to extract the most recent outstanding and past-due amounts. My code looks like this:

new_df["TheMostRecentOutstanding"] = df_account.loc[df_account[df_account["Accountrating"]=='Delayed'].groupby('Debtor ID')["AccountratingDate"].idxmax(),"AmountOutstanding"]
new_df["TheMostRecentPastDue"] = df_account.loc[df_account[df_account["Accountrating"]=='Delayed'].groupby('Debtor ID')["AccountratingDate"].idxmax(),"AmountPastDue"]

But both statements return a Series with all NaN values. Please help: what am I doing wrong here?

Solution

You can try this:

df.sort_values('AccountratingDate')\
  .query('Accountrating == "Delayed"')\
  .groupby('Debtor ID')[['Accountrating','AmountOutstanding','AmountPastDue']]\
  .agg({'Accountrating':'count','AmountOutstanding':'last','AmountPastDue':'last'})\
  .reindex(df['Debtor ID'].unique(),fill_value=0)\
  .reset_index()

Output

      Debtor ID  Accountrating  AmountOutstanding  AmountPastDue
0     John SNow              2               6000            300
1  Sarah Parker              0                  0              0
2   Edward Hall              0                  0              0
3  Douglas Core              2               1000            400

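Details:

- First, sort the dataframe by AccountratingDate so that the last record in each group is the one with the latest date.
- Filter the dataframe down to the rows where Accountrating equals 'Delayed'.
- Group by Debtor ID, select the columns to aggregate, then call agg with a dictionary that says how to aggregate each column.
- Reindex with the unique values of Debtor ID, filling in zeros for debtors that have no delays.
- Finally, reset the index.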
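The steps above can be sketched end to end against the sample data from the question. This is a minimal sketch, not a definitive implementation: the dataframe is rebuilt by hand from the table, the dates are parsed with pd.to_datetime (an assumption, since the question does not state the column's dtype), and the intermediate names delayed and summary are introduced here only for illustration.

```python
import pandas as pd

# Sample data copied by hand from the question's table.
df = pd.DataFrame({
    "Debtor ID": ["John SNow"] * 4 + ["Sarah Parker", "Edward Hall"] + ["Douglas Core"] * 2,
    "Accountrating": ["Closed", "Delayed", "Closed", "Delayed",
                      "Closed", "Closed", "Delayed", "Delayed"],
    # Assumption: parse the dates so that sorting is chronological.
    "AccountratingDate": pd.to_datetime([
        "2017-03-01", "2017-04-22", "2017-05-23", "2017-07-15",
        "2017-02-01", "2017-05-01", "2017-01-01", "2017-06-01",
    ]),
    "AmountOutstanding": [0, 2000, 0, 6000, 0, 0, 1000, 1000],
    "AmountPastDue": [0, 500, 0, 300, 0, 0, 200, 400],
})

# Steps 1 and 2: sort by date, then keep only the 'Delayed' rows.
delayed = df.sort_values("AccountratingDate").query('Accountrating == "Delayed"')

summary = (
    # Step 3: per debtor, count the delayed rows and take the last (latest) amounts.
    delayed.groupby("Debtor ID")[["Accountrating", "AmountOutstanding", "AmountPastDue"]]
    .agg({"Accountrating": "count", "AmountOutstanding": "last", "AmountPastDue": "last"})
    # Step 4: bring back debtors with no delayed rows, filled with zeros.
    .reindex(df["Debtor ID"].unique(), fill_value=0)
    # Step 5: turn the Debtor ID index back into a regular column.
    .reset_index()
)
print(summary)
```

Because the rows are sorted before grouping, 'last' picks the row with the latest date inside each group, which is what makes the sort step necessary.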

You can also rename the columns by passing a dictionary to rename:

df.sort_values('AccountratingDate')\
  .query('Accountrating == "Delayed"')\
  .groupby('Debtor ID')[['Accountrating','AmountOutstanding','AmountPastDue']]\
  .agg({'Accountrating':'count','AmountOutstanding':'last','AmountPastDue':'last'})\
  .reindex(df['Debtor ID'].unique(),fill_value=0)\
  .rename(columns={'Accountrating':'Incidents of delay','AmountOutstanding':'TheMostRecentOutstanding','AmountPastDue':'TheMostRecentPastDue'})\
  .reset_index()

Output

      Debtor ID  Incidents of delay  TheMostRecentOutstanding  TheMostRecentPastDue
0     John SNow                   2                      6000                   300
1  Sarah Parker                   0                         0                     0
2   Edward Hall                   0                         0                     0
3  Douglas Core                   2                      1000                   400
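
As for why the attempt in the question produced only NaN: the likely cause is index alignment. df_account.loc[...] returns a Series indexed by the original row labels, while new_df is indexed by Debtor ID, so the assignment finds no matching labels. The sketch below reproduces this on the sample data and shows one possible way to keep the idxmax approach. The names latest_rows, misaligned and recent are hypothetical, and parsing the dates with pd.to_datetime is an assumption.

```python
import pandas as pd

# Sample data copied by hand from the question's table.
df_account = pd.DataFrame({
    "Debtor ID": ["John SNow"] * 4 + ["Sarah Parker", "Edward Hall"] + ["Douglas Core"] * 2,
    "Accountrating": ["Closed", "Delayed", "Closed", "Delayed",
                      "Closed", "Closed", "Delayed", "Delayed"],
    # Assumption: real datetimes, so idxmax compares dates rather than strings.
    "AccountratingDate": pd.to_datetime([
        "2017-03-01", "2017-04-22", "2017-05-23", "2017-07-15",
        "2017-02-01", "2017-05-01", "2017-01-01", "2017-06-01",
    ]),
    "AmountOutstanding": [0, 2000, 0, 6000, 0, 0, 1000, 1000],
    "AmountPastDue": [0, 500, 0, 300, 0, 0, 200, 400],
})

# One row per debtor, indexed by debtor name, as in the question.
new_df = pd.DataFrame(index=df_account["Debtor ID"].unique())

delayed = df_account[df_account["Accountrating"] == "Delayed"]
# Row label of the latest 'Delayed' record for each debtor that has one.
latest_rows = delayed.groupby("Debtor ID")["AccountratingDate"].idxmax()

# Indexed by row label, not by debtor name, so nothing lines up with new_df.
misaligned = df_account.loc[latest_rows, "AmountOutstanding"]
new_df["Misaligned"] = misaligned  # every value becomes NaN

# Re-index the selected rows by debtor, then fill in debtors with no delays.
recent = df_account.loc[latest_rows].set_index("Debtor ID")
new_df["TheMostRecentOutstanding"] = recent["AmountOutstanding"].reindex(new_df.index, fill_value=0)
new_df["TheMostRecentPastDue"] = recent["AmountPastDue"].reindex(new_df.index, fill_value=0)
print(new_df)
```

The Misaligned column is kept only to show the failure; in practice it would be dropped.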
