How to find the next row after a matched slice of a DataFrame
I have a DataFrame like the one shown below, and a list of names:
['Amelia','Elijah','Amelia']
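I want to know who comes next when a section of the DataFrame matches the given names (the list of names is a fixed, ordered sequence). Here the expected answer is 1990-09-01 00:00:00, James.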
import pandas as pd
from io import StringIO
to_find_list = ['Amelia', 'Elijah', 'Amelia']
short_frame = 3
# The two columns are tab-separated; the tabs are written as \t so they survive copy and paste
csvfile = StringIO(
"""Date\tStaff
1990-05-01 00:00:00\tMason
1990-06-01 00:00:00\tAmelia
1990-07-01 00:00:00\tElijah
1990-08-01 00:00:00\tAmelia
1990-09-01 00:00:00\tJames
1990-10-01 00:00:00\tBenjamin
1990-11-01 00:00:00\tIsabella
1990-12-01 00:00:00\tLucas
1991-01-01 00:00:00\tMason""")
df = pd.read_csv(csvfile,sep = '\t',engine='python')
# split the df into small overlapping frames of short_frame rows each
list_of_dfs = [
    df.loc[i:i + short_frame - 1, :].reset_index(drop=True)
    for i in range(0, len(df), short_frame - 2)
    if i < len(df) - 2
]
for son_df in list_of_dfs:
    first_cell = son_df.iloc[0]['Date']
    last_cell = son_df.iloc[-1]['Date']
    if son_df['Staff'].to_list() == to_find_list:
        found_date = son_df['Date'].iloc[-1]  # 1990-08-01 00:00:00
        who = df['Staff'].loc[df['Date'] == found_date]  # Amelia
I tried using shift() to print the date and staff member in the row after 'Amelia', but it did not work.
What is the right way to do this? Thanks.
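Solution
You can try str.extract() to get the index of the rows where any of the names occurs (each name is matched on its own, not as an ordered sequence):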
idx=df['Staff'].str.extract(f'({"|".join(to_find_list)})',expand=False).dropna().index
Finally, offset that index and pass it to loc:
out = df.loc[[x + 3 for x in idx if x + 3 < len(df)]]
# the offset is the 3 in x + 3
# if you add 1 instead, you get the row right after each matched name
Output of out:
                  Date     Staff
4  1990-09-01 00:00:00     James
5  1990-10-01 00:00:00  Benjamin
6  1990-11-01 00:00:00  Isabella
Or, to get only the Staff column:
out = df.loc[[x + 3 for x in idx if x + 3 < len(df)], 'Staff']
# the offset is the 3 in x + 3
# if you add 1 instead, you get the row right after each matched name
Output of out:
4       James
5    Benjamin
6    Isabella
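As a hedged sketch of an ordered match, the Staff column can be compared window by window, so that only a run of Amelia, Elijah, Amelia in that exact order counts. This assumes NumPy 1.20 or later for sliding_window_view; the names `names`, `starts` and `next_pos` are illustrative and not part of the original answer, and the DataFrame is rebuilt inline so the block runs on its own.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

# Rebuild the sample data from the question (assumption: monthly dates kept as strings).
df = pd.DataFrame({
    "Date": pd.date_range("1990-05-01", periods=9, freq="MS").strftime("%Y-%m-%d %H:%M:%S"),
    "Staff": ["Mason", "Amelia", "Elijah", "Amelia", "James",
              "Benjamin", "Isabella", "Lucas", "Mason"],
})
names = ["Amelia", "Elijah", "Amelia"]

# One row per window of len(names) consecutive staff members.
staff = df["Staff"].to_numpy(dtype=str)
windows = sliding_window_view(staff, len(names))

# Start positions of the windows that equal `names` element by element.
starts = np.flatnonzero((windows == np.array(names)).all(axis=1))

# The row right after each matching window; drop matches that end on the last row.
next_pos = starts + len(names)
next_pos = next_pos[next_pos < len(df)]

result = df.iloc[next_pos]
print(result)
```

Because the comparison is positional, this sketch uses iloc rather than loc and does not depend on the index labels.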
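Alternatively, you can use the shift() method to create new columns, then use a list comprehension to compare to_find_list with the shifted values of each row collected into a list.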
>>> df['Staff_prev'] = df['Staff'].shift(1)
>>> df['Staff_prev2'] = df['Staff'].shift(2)
>>> df['Staff_prev3'] = df['Staff'].shift(3)
>>> # list the shifted values oldest first so their order matches to_find_list
>>> df['my_row'] = [to_find_list == [row['Staff_prev3'], row['Staff_prev2'], row['Staff_prev']] for index, row in df.iterrows()]
>>> df.head()
                  Date   Staff Staff_prev Staff_prev2 Staff_prev3  my_row
0  1990-05-01 00:00:00   Mason        NaN         NaN         NaN   False
1  1990-06-01 00:00:00  Amelia      Mason         NaN         NaN   False
2  1990-07-01 00:00:00  Elijah     Amelia       Mason         NaN   False
3  1990-08-01 00:00:00  Amelia     Elijah      Amelia       Mason   False
4  1990-09-01 00:00:00   James     Amelia      Elijah      Amelia    True
>>> df.loc[df['my_row'], 'Date']
4    1990-09-01 00:00:00
Alternatively, let's combine shift() with concat():
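names = ['Amelia', 'Elijah', 'Amelia']
# shift(0) must equal the last name, shift(1) the one before it, and so on
m = pd.concat([df['Staff'].shift(x) == y for x, y in zip(range(len(names)), reversed(names))]).groupby(level=0).all()
idx = m.index[m] + 1
idx
Index([4], dtype='int64')
# pandas versions before 2.0 print this as Int64Index([4], dtype='int64')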
df.loc[idx]
                  Date  Staff
4  1990-09-01 00:00:00  James
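The shift-based idea can be wrapped in a small reusable function. The following is a minimal sketch under stated assumptions, not the answerers' code: `rows_after_sequence` and the `demo` frame are hypothetical names made up for illustration. It works by position, so it also handles a non-default index, and it skips a match that ends on the last row, where no next row exists.

```python
import pandas as pd


def rows_after_sequence(df, column, names):
    """Return the rows that directly follow each ordered run of `names`.

    Hypothetical helper for illustration; it works by position, so the
    DataFrame index does not have to be 0..n-1.
    """
    col = df[column].reset_index(drop=True)
    mask = pd.Series(True, index=col.index)
    # After the loop, mask[i] is True when a run of `names` ends at position i.
    for offset, name in enumerate(reversed(names)):
        mask &= col.shift(offset) == name
    next_pos = mask.to_numpy().nonzero()[0] + 1
    # Drop matches that end on the last row: nothing follows them.
    next_pos = next_pos[next_pos < len(df)]
    return df.iloc[next_pos]


# Made-up demo data: the sequence occurs twice, the second time at the very end.
demo = pd.DataFrame(
    {"Staff": ["Mason", "Amelia", "Elijah", "Amelia", "James",
               "Amelia", "Elijah", "Amelia"]},
    index=list("abcdefgh"),
)
found = rows_after_sequence(demo, "Staff", ["Amelia", "Elijah", "Amelia"])
print(found)
```

Only the first occurrence yields a row here, because the second one ends on the final row of the frame.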