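Pandas: shift by one year within groups

How to shift a Pandas column by one year within groups

I want to create a lagged yearly return variable in Pandas.

So far I have tried: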

df_ret_lagged = df_ret.set_index(['year','cusip'])
df_ret_lagged['yearly_ret_lag'] = df_ret_lagged['year_ret'].shift(12)
df_ret_lagged.reset_index(inplace = True)

However, this simply moves the yearly return down 12 rows instead of lagging it by calendar year. The dataframe below shows what the code does.

    year    cusip        date       year_ret    yearly_ret_lag
0   1983    000165100   1983-09-01  0.183673    NaN
1   1983    000165100   1983-10-01  0.183673    NaN
2   1983    000165100   1983-11-01  0.183673    NaN
3   1983    000165100   1983-12-01  0.183673    NaN
4   1984    000165100   1984-01-01  -0.482758   NaN
5   1984    000165100   1984-02-01  -0.482758   NaN
6   1984    000165100   1984-03-01  -0.482758   NaN
7   1984    000165100   1984-04-01  -0.482758   NaN
8   1984    000165100   1984-05-01  -0.482758   NaN
9   1984    000165100   1984-06-01  -0.482758   NaN
10  1984    000165100   1984-07-01  -0.482758   NaN
11  1984    000165100   1984-08-01  -0.482758   NaN
12  1984    000165100   1984-09-01  -0.482758   0.183673
13  1984    000165100   1984-10-01  -0.482758   0.183673
14  1984    000165100   1984-11-01  -0.482758   0.183673
15  1984    000165100   1984-12-01  -0.482758   0.183673
16  1985    000165100   1985-01-01  1.700000    -0.482758
17  1985    000165100   1985-02-01  1.700000    -0.482758
18  1985    000165100   1985-03-01  1.700000    -0.482758
19  1985    000165100   1985-04-01  1.700000    -0.482758

Ideally, I would like the 1983 year_ret to fill every date in 1984, and so on. In addition, this has to be done per cusip (the company identifier).

Thanks!

Solution

I used a for loop:

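# Loop over every year except the first (assumes the years are consecutive)
for year in df['year'].unique()[1:]:
    # Takes the first year_ret found for the previous year, so this assumes a single cusip
    df.loc[df['year'] == year, 'year_ret_lag'] = df.loc[df['year'] == year - 1, 'year_ret'].iloc[0]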

    year    cusip   date        year_ret    year_ret_lag
0   1983    165100  01/09/1983  0.183673    NaN
1   1983    165100  01/10/1983  0.183673    NaN
2   1983    165100  01/11/1983  0.183673    NaN
3   1983    165100  01/12/1983  0.183673    NaN
4   1984    165100  01/01/1984  -0.482758   0.183673
5   1984    165100  01/02/1984  -0.482758   0.183673
6   1984    165100  01/03/1984  -0.482758   0.183673
7   1984    165100  01/04/1984  -0.482758   0.183673
8   1984    165100  01/05/1984  -0.482758   0.183673
9   1984    165100  01/06/1984  -0.482758   0.183673
10  1984    165100  01/07/1984  -0.482758   0.183673
11  1984    165100  01/08/1984  -0.482758   0.183673
12  1984    165100  01/09/1984  -0.482758   0.183673
13  1984    165100  01/10/1984  -0.482758   0.183673
14  1984    165100  01/11/1984  -0.482758   0.183673
15  1984    165100  01/12/1984  -0.482758   0.183673
16  1985    165100  01/01/1985  1.700000    -0.482758
17  1985    165100  01/02/1985  1.700000    -0.482758
18  1985    165100  01/03/1985  1.700000    -0.482758
19  1985    165100  01/04/1985  1.700000    -0.482758
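The loop above picks one value per year regardless of cusip, so it only fits a frame with a single company. A minimal sketch of one way to extend the idea to several cusips, assuming each (cusip, year) pair carries a single year_ret value: build a yearly table, relabel each year as the following year, and merge it back. The toy data and the names yearly and result are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical): two cusips, a few rows per year, one year_ret per (cusip, year).
df = pd.DataFrame({
    "cusip": ["A"] * 6 + ["B"] * 4,
    "year": [1983, 1983, 1984, 1984, 1985, 1985, 1984, 1984, 1985, 1985],
    "year_ret": [0.10, 0.10, -0.20, -0.20, 0.30, 0.30, 0.50, 0.50, -0.40, -0.40],
})

# One row per (cusip, year), relabelled so that year Y's return is keyed under Y + 1.
yearly = df.drop_duplicates(["cusip", "year"])[["cusip", "year", "year_ret"]].copy()
yearly["year"] = yearly["year"] + 1
yearly = yearly.rename(columns={"year_ret": "year_ret_lag"})

# A left merge keeps every original row; years with no prior-year return get NaN.
result = df.merge(yearly, on=["cusip", "year"], how="left")
```

Because the match is on the year label rather than on row position, this does not depend on every month being present, and a skipped year simply yields NaN.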

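I think this may be what you are after. Note that it relies on the dataframe being properly sorted and structured beforehand (for example, every month has an entry).

It sorts everything by cusip and date before shifting, then wipes any lagged value that came from a different cusip by overwriting it with NaN. The remaining gaps are then back-filled (`.bfill()`, or `.fillna(method='bfill')` in older pandas) from the next valid value. Be aware that the back-fill also assigns a value to each cusip's first 12 rows, which have no observation 12 months earlier.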

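import numpy as np

df_new = df_ret.sort_values(['cusip', 'date'])
df_new['yearly_ret_lag'] = df_new['year_ret'].shift(12)
# Blank out lags that were pulled across a cusip boundary
df_new.loc[df_new['cusip'] != df_new['cusip'].shift(12), 'yearly_ret_lag'] = np.nan
# Back-fill the remaining gaps (replaces the deprecated fillna(method='bfill'))
df_new['yearly_ret_lag'] = df_new['yearly_ret_lag'].bfill()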
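As a possible variation on the snippet above, the shift can be done inside a groupby on cusip, which makes the manual boundary check unnecessary. This is only a sketch on made-up data, and it still assumes one row per month with no gaps; unlike the code above it does not back-fill, so rows without an observation 12 months earlier stay NaN.

```python
import pandas as pd

# Toy data (hypothetical): two cusips with 14 consecutive monthly rows each.
dates = pd.date_range("1983-01-01", periods=14, freq="MS")
df = pd.concat(
    [
        pd.DataFrame({"cusip": c, "date": dates, "year_ret": [r0] * 12 + [r1] * 2})
        for c, r0, r1 in [("A", 0.10, -0.20), ("B", 0.50, -0.40)]
    ],
    ignore_index=True,
)

df = df.sort_values(["cusip", "date"])
# Shift within each cusip, so values never cross from one company to the next.
df["yearly_ret_lag"] = df.groupby("cusip")["year_ret"].shift(12)
```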

Another solution without a loop, which tolerates missing months and can be combined with a groupby on cusip:

Building the df:

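import numpy as np
import pandas as pd

# Month-start dates from 1983-09 to 1985-12 (28 rows); "MS" replaces the deprecated "1M" alias
dates = pd.date_range("1983-09-01", "1985-12-31", freq="MS")
df = pd.DataFrame(index=dates, columns=["Year", "cusip", "year_ret"])
df['Year'] = df.index.strftime('%Y')
df['cusip'] = '01234'
# One return per year, repeated per month (counts assumed from the date range: 4, 12, 12)
df['year_ret'] = [0.183673] * 4 + [-0.482758] * 12 + [1.700000] * 12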

And the code:

# First, flag the rows where the year changes
_condition_1 = df.Year != df.Year.shift(1)

# Where the year changes, take the previous row's return (last year's) as the lag
df['lag'] = np.where(_condition_1, df['year_ret'].shift(1), np.nan)

# Forward-fill the NaNs through the rest of each year
df = df.ffill()
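The code above handles a single cusip. A rough sketch of how the same year-change idea might be applied per cusip follows; the toy data and the column names year and lag are assumptions for illustration, and it presumes that each cusip's years are consecutive (after a skipped year, the lag would come from the last available year instead).

```python
import pandas as pd

# Toy data (hypothetical): two cusips; "A" is missing some months.
df = pd.DataFrame({
    "cusip": ["A", "A", "A", "A", "A", "B", "B", "B"],
    "date": pd.to_datetime([
        "1983-09-01", "1983-12-01", "1984-01-01", "1984-06-01", "1985-01-01",
        "1984-11-01", "1984-12-01", "1985-01-01",
    ]),
    "year_ret": [0.10, 0.10, -0.20, -0.20, 0.30, 0.50, 0.50, -0.40],
})
df["year"] = df["date"].dt.year
df = df.sort_values(["cusip", "date"])

grouped = df.groupby("cusip")
# True on the first row of each cusip and wherever the year differs from the previous row.
year_changed = df["year"] != grouped["year"].shift(1)
# On those rows take the previous row's return (NaN for a cusip's first row) ...
df["lag"] = grouped["year_ret"].shift(1).where(year_changed)
# ... then carry it forward through the rest of the year, staying inside each cusip.
df["lag"] = df.groupby("cusip")["lag"].ffill()
```

Doing both the shift and the forward fill inside the groupby keeps one company's returns from leaking into the first rows of the next.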