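How to average unequal monthly values and distribute them based on certain conditions
I'm currently struggling to turn my data into a useful dataset. I need to spread payments evenly from the first month to the last month. The problem is that the payments are irregular and unequal. In addition, some agreements have been paid in full; those payments should be spread from the first payment over the term given in the agreement data frame.
My tables are as follows:
Table 1: payments
cust_id | agreement_id | date | payment |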
---|---|---|---|
1 | A | 12/1/20 | 200 |
1 | A | 2/2/21 | 200 |
1 | A | 2/3/21 | 100 |
1 | A | 5/1/21 | 200 |
1 | B | 1/2/21 | 50 |
1 | B | 1/9/21 | 20 |
1 | B | 3/1/21 | 80 |
1 | B | 4/23/21 | 90 |
2 | C | 1/21/21 | 600 |
3 | D | 3/4/21 | 150 |
3 | D | 5/3/21 | 150 |
Here is the code for the payments DataFrame:
import pandas as pd
payments = pd.DataFrame.from_dict({'cust_id': {0: 1,1: 1,2: 1,3: 1,4: 1,5: 1,6: 1,7: 1,8: 2,9: 3,10: 3},'agreement_id': {0: 'A',1: 'A',2: 'A',3: 'A',4: 'B',5: 'B',6: 'B',7: 'B',8: 'C',9: 'D',10: 'D'},'date': {0: '12/1/20',1: '2/2/21',2: '2/3/21',3: '5/1/21',4: '1/2/21',5: '1/9/21',6: '3/1/21',7: '4/23/21',8: '1/21/21',9: '3/4/21',10: '5/3/21'},'payment': {0: 200,1: 200,2: 100,3: 200,4: 50,5: 20,6: 80,7: 90,8: 600,9: 150,10: 150}})
Table 2: agreement
agreement_id | activation | term_months | total_fee |
---|---|---|---|
A | 12/1/20 | 24 | 4800 |
B | 1/21/21 | 6 | 600 |
C | 1/21/21 | 6 | 600 |
D | 3/4/21 | 6 | 300 |
Here is the code for the agreement DataFrame:
agreement = pd.DataFrame.from_dict({'agreement_id': {0: 'A',1: 'B',2: 'C',3: 'D'},'activation': {0: '12/1/20',1: '1/2/21',2: '1/21/21',3: '3/4/21'},'term_months': {0: 24,1: 6,2: 6,3: 6},'total_fee': {0: 4800,1: 300,2: 600,3: 300}})
The result I want is as follows:
cust_id | agreement_id | date | payment |
---|---|---|---|
1 | A | 12/1/20 | 116.67 |
1 | A | 1/1/21 | 116.67 |
1 | A | 2/1/21 | 116.67 |
1 | A | 3/1/21 | 116.67 |
1 | A | 4/1/21 | 116.67 |
1 | A | 5/1/21 | 116.67 |
1 | B | 1/1/21 | 60 |
1 | B | 2/1/21 | 60 |
1 | B | 3/1/21 | 60 |
1 | B | 4/1/21 | 60 |
2 | C | 1/1/21 | 100 |
2 | C | 2/1/21 | 100 |
2 | C | 3/1/21 | 100 |
2 | C | 4/1/21 | 100 |
2 | C | 5/1/21 | 100 |
2 | C | 6/1/21 | 100 |
3 | D | 3/1/21 | 50 |
3 | D | 4/1/21 | 50 |
3 | D | 5/1/21 | 50 |
3 | D | 6/1/21 | 50 |
3 | D | 7/1/21 | 50 |
3 | D | 8/1/21 | 50 |
Or, as printed DataFrame output:
cust_id agreement_id date payment
0 1 A 12/1/20 116.67
1 1 A 1/1/21 116.67
2 1 A 2/1/21 116.67
3 1 A 3/1/21 116.67
4 1 A 4/1/21 116.67
5 1 A 5/1/21 116.67
6 1 B 1/1/21 60.00
7 1 B 2/1/21 60.00
8 1 B 3/1/21 60.00
9 1 B 4/1/21 60.00
10 2 C 1/1/21 100.00
11 2 C 2/1/21 100.00
12 2 C 3/1/21 100.00
13 2 C 4/1/21 100.00
14 2 C 5/1/21 100.00
15 2 C 6/1/21 100.00
16 3 D 3/1/21 50.00
17 3 D 4/1/21 50.00
18 3 D 5/1/21 50.00
19 3 D 6/1/21 50.00
20 3 D 7/1/21 50.00
21 3 D 8/1/21 50.00
The activation date is the same as the date of the first payment.
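The statement above can be checked against the data before relying on it. The following is a minimal sketch, assuming the sample values from the question's DataFrame code and dates in month/day/two-digit-year format; the names `first_payment`, `activation`, and `mismatch` are made up for illustration.

```python
import pandas as pd

# Sample values follow the question's DataFrame code (an assumption of this sketch).
payments = pd.DataFrame({
    "agreement_id": list("AAAABBBBCDD"),
    "date": ["12/1/20", "2/2/21", "2/3/21", "5/1/21", "1/2/21", "1/9/21",
             "3/1/21", "4/23/21", "1/21/21", "3/4/21", "5/3/21"],
})
agreement = pd.DataFrame({
    "agreement_id": ["A", "B", "C", "D"],
    "activation": ["12/1/20", "1/2/21", "1/21/21", "3/4/21"],
})

# Earliest payment date per agreement.
pay_dates = pd.to_datetime(payments["date"], format="%m/%d/%y")
first_payment = pay_dates.groupby(payments["agreement_id"]).min()

# Activation date per agreement, indexed the same way.
activation = pd.to_datetime(
    agreement.set_index("agreement_id")["activation"], format="%m/%d/%y"
)

# Agreements whose activation differs from their first payment (empty if none).
mismatch = activation[activation != first_payment.reindex(activation.index)]
print(mismatch)
```

An empty `mismatch` means every agreement's activation date equals its first payment date, so either column can serve as the start of the schedule.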
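I tried the following code (suggested by AlexK) to create another column, but it only works when the total payments are less than the total fee. When the total payments equal the total fee, I need the payments to be spread from the first payment month to the end of the term (start plus term_months).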
payments['date'] = pd.to_datetime(payments['date'])
resampled_payments = (payments
    .set_index('date')
    .groupby(['cust_id','agreement_id'])
    .resample('MS')
    .agg({'payment': sum})
    .reset_index()
)
resampled_payments['avg_monthly_payment'] = (resampled_payments
    .groupby(['cust_id','agreement_id'])['payment']
    .transform('mean')
)
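The rule described above (use the full term when the agreement is paid in full, otherwise use the span from the first to the last payment month) can be expressed as a per-agreement month count. This is a minimal sketch, assuming the sample values from the question's DataFrame code; the column names `paid`, `first_month`, `last_month`, `n_months`, and `monthly` are made up for illustration.

```python
import pandas as pd

# Sample values follow the question's DataFrame code (an assumption of this sketch).
payments = pd.DataFrame({
    "cust_id": [1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 3],
    "agreement_id": list("AAAABBBBCDD"),
    "date": ["12/1/20", "2/2/21", "2/3/21", "5/1/21", "1/2/21", "1/9/21",
             "3/1/21", "4/23/21", "1/21/21", "3/4/21", "5/3/21"],
    "payment": [200, 200, 100, 200, 50, 20, 80, 90, 600, 150, 150],
})
agreement = pd.DataFrame({
    "agreement_id": ["A", "B", "C", "D"],
    "term_months": [24, 6, 6, 6],
    "total_fee": [4800, 300, 600, 300],
})

# Snap each payment date to the first day of its month.
payments["month"] = (pd.to_datetime(payments["date"], format="%m/%d/%y")
                     .dt.to_period("M").dt.to_timestamp())

# One row per agreement: total paid plus first and last payment month.
summary = (payments.groupby(["cust_id", "agreement_id"])
           .agg(paid=("payment", "sum"),
                first_month=("month", "min"),
                last_month=("month", "max"))
           .reset_index()
           .merge(agreement, on="agreement_id", how="left"))

# Months covered by the payments, counting both the first and the last month.
observed = ((summary["last_month"].dt.year - summary["first_month"].dt.year) * 12
            + summary["last_month"].dt.month - summary["first_month"].dt.month + 1)

# Fully paid agreements use the whole term; the rest use the observed span.
fully_paid = summary["paid"] == summary["total_fee"]
summary["n_months"] = observed.where(~fully_paid, summary["term_months"])
summary["monthly"] = (summary["paid"] / summary["n_months"]).round(2)
print(summary[["cust_id", "agreement_id", "paid", "n_months", "monthly"]])
```

With this sample data the monthly amounts agree with the desired result in the question; expanding each agreement into one row per month is a separate step.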
Solution
Here is an R solution (since you also tagged the question with R):
#load libraries
library(tidyverse)
library(lubridate)
pymts <- read.table(text = "cust_id agreement_id date payment
1 A 12/1/20 200
1 A 2/2/21 200
1 A 2/3/21 100
1 A 5/1/21 200
1 B 1/2/21 50
1 B 1/9/21 20
1 B 3/1/21 80
1 B 4/23/21 90
2 C 1/21/21 600
3 D 3/4/21 150
3 D 5/3/21 150",header = T)
agmt <- read.table(text = "agreement_id activation term_months total_fee
A 12/1/20 24 4800
B 1/21/21 6 600
C 1/21/21 6 600
D 3/4/21 6 300",header = T)
#final code
final <- pymts %>%
  mutate(date = as.Date(date, "%m/%d/%y")) %>%
  left_join(agmt %>% mutate(activation = as.Date(activation, "%m/%d/%y")), by = "agreement_id") %>%
  group_by(cust_id, agreement_id) %>%
  mutate(d = n(), date = floor_date(date, "month")) %>%
  # fully paid: one row per month of the term; otherwise first to last payment month
  complete(date = seq.Date(from = min(date), by = "month",
                           length.out = ifelse(sum(payment) == first(total_fee),
                                               first(term_months),
                                               (year(max(date)) - year(min(date))) * 12 +
                                                 month(max(date)) - month(min(date)) + 1))) %>%
  mutate(payment = sum(payment, na.rm = T)) %>%
  filter(!duplicated(date)) %>%
  mutate(payment = payment / n()) %>%
  select(1:4) %>%
  ungroup()
final
# A tibble: 22 x 4
cust_id agreement_id date payment
<int> <chr> <date> <dbl>
1 1 A 2020-12-01 117.
2 1 A 2021-01-01 117.
3 1 A 2021-02-01 117.
4 1 A 2021-03-01 117.
5 1 A 2021-04-01 117.
6 1 A 2021-05-01 117.
7 1 B 2021-01-01 60
8 1 B 2021-02-01 60
9 1 B 2021-03-01 60
10 1 B 2021-04-01 60
# ... with 12 more rows
Given your data frames, this should work:
import pandas as pd

# Transform columns to datetime
payments['date'] = pd.to_datetime(payments['date'])
agreement['activation'] = pd.to_datetime(agreement['activation'])
final = pd.merge(payments, agreement, on='agreement_id', how='left')
# Set date to beginning of month
final['date'] = final['date'].dt.to_period('M').dt.to_timestamp()

def set_date_range(df):
    # Fully paid: one month-start date per month of the term
    if df['payment'].sum() == df['total_fee'].iloc[0]:
        return pd.date_range(df['date'].min(), periods=df['term_months'].iloc[0], freq='MS')
    # Otherwise: month-start dates from the first to the last payment month
    return pd.date_range(df['date'].min(), df['date'].max(), freq='MS')

# Create dataframe with one row per month for each agreement
seq_df = pd.concat([
    pd.DataFrame({'cust_id': cust, 'agreement_id': agr, 'date': set_date_range(g)})
    for (cust, agr), g in final.groupby(['cust_id', 'agreement_id'])
])
final = (pd.concat([final, seq_df], sort=True)
         .sort_values(['cust_id', 'agreement_id', 'date'])
         .reset_index(drop=True)
         .reindex(columns=final.columns))
# Total paid per agreement (the NaN payments from seq_df are skipped by sum)
final['payment'] = final.groupby(['cust_id', 'agreement_id'])['payment'].transform('sum')
# Keep one row per agreement and month; agreement_id is needed in the key
# because customer 1 has two agreements with overlapping months
final = final.drop_duplicates(['cust_id', 'agreement_id', 'date']).copy()
final['n'] = final.groupby(['cust_id', 'agreement_id'])['cust_id'].transform('count')
final['payment_due'] = final['payment'] / final['n']
final[['cust_id', 'agreement_id', 'date', 'payment_due']]
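I couldn't fully replicate the tidyverse pipeline style, but the output should match. The hardest part is building seq_df, but it should be fine (double-check it against more general use cases).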
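As one way to double-check the approach on other inputs, the whole transformation can be wrapped in a function and its output compared with the desired result. This is a minimal sketch rather than the answer's exact code: the function name `spread_payments` is made up, the sample values follow the question's DataFrame code, and dates are assumed to be in month/day/two-digit-year format.

```python
import pandas as pd

def spread_payments(payments: pd.DataFrame, agreement: pd.DataFrame) -> pd.DataFrame:
    """Spread each agreement's total payments evenly over monthly rows."""
    p = payments.copy()
    # Snap every payment date to the first day of its month.
    p["date"] = (pd.to_datetime(p["date"], format="%m/%d/%y")
                 .dt.to_period("M").dt.to_timestamp())
    rows = []
    for (cust, agr), g in p.groupby(["cust_id", "agreement_id"]):
        terms = agreement.loc[agreement["agreement_id"] == agr].iloc[0]
        total = g["payment"].sum()
        if total == terms["total_fee"]:
            # Fully paid: one row per month of the agreed term.
            months = pd.date_range(g["date"].min(),
                                   periods=int(terms["term_months"]), freq="MS")
        else:
            # Partially paid: first payment month through last payment month.
            months = pd.date_range(g["date"].min(), g["date"].max(), freq="MS")
        rows.append(pd.DataFrame({
            "cust_id": cust,
            "agreement_id": agr,
            "date": months,
            "payment": round(total / len(months), 2),
        }))
    return pd.concat(rows, ignore_index=True)

# Sample values follow the question's DataFrame code (an assumption of this sketch).
payments = pd.DataFrame({
    "cust_id": [1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 3],
    "agreement_id": list("AAAABBBBCDD"),
    "date": ["12/1/20", "2/2/21", "2/3/21", "5/1/21", "1/2/21", "1/9/21",
             "3/1/21", "4/23/21", "1/21/21", "3/4/21", "5/3/21"],
    "payment": [200, 200, 100, 200, 50, 20, 80, 90, 600, 150, 150],
})
agreement = pd.DataFrame({
    "agreement_id": ["A", "B", "C", "D"],
    "term_months": [24, 6, 6, 6],
    "total_fee": [4800, 300, 600, 300],
})

result = spread_payments(payments, agreement)
print(result)
```

Because the amounts are rounded to two decimals, the monthly rows of an agreement can sum to a cent more or less than the total paid; drop the `round` call if exact totals matter more than display.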