What is the best way to apply a function to subsets of a dataframe?

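I'm currently working with an NFL dataset. The dataset is highly granular but quite simple: each row represents a player's position (x, y coordinates relative to the field) for a given frame, within a given play, within a given game. You can think of a frame as a snapshot in time: at that point we record the coordinates of all players on the field and enter them into the dataframe as one row per player. There are roughly 70 frames per play, roughly 80 plays per game, and we have 250+ games.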

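What I want is, for certain offensive players (specifically wide receivers), to determine who their closest defender is and how far away that defender is. Ideally I would apply a function that takes in one frame and outputs two columns, populated only for wide receivers: the closest defender and that defender's distance to the WR. I would then apply that function across all frames, all plays, and all games.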

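I'm struggling to come up with an efficient solution. I'm using the distance_matrix function from scipy to compute both metrics, but the hardest part right now is iterating over the 1MM+ combinations of game, play, and frame.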

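I was thinking of using an apply function to get the result, but it would still involve iterating over the various combinations of game, play, and frame. I suspect there may even be a vectorized solution, but I couldn't come up with anything.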

Any advice here would be very helpful. I've pasted my current working code below; it just uses for loops and takes a very long time:

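import pandas as pd
from scipy.spatial import distance_matrix

# `test` is the full tracking dataframe described above
temp = pd.DataFrame()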
# For each game,and each play within the games,and frames within the play
for game_id in test.gameId.unique():
    for play_id in test[test.gameId==game_id].playId.unique():
        for frame_id in test[(test.playId==play_id)&(test.gameId==game_id)].frameId.unique():
            print("Game: {} | Play: {} | Frame: {}".format(game_id,play_id,frame_id))
            
            # Filter the dataframe on a given frame,within a given play,within a given game
            df = test[(test.gameId==game_id)&
                    (test.playId==play_id)&
                    (test.frameId==frame_id)
                   ]
            
            # Isolate the wide receivers
            df_wr = df[(df["inPoss"]==1)&(df['position']=="WR")]
            # Isolate the defenders
            df_d =  df[df["inPoss"]==0]
            # Calculate the distance matrix between each WR and defenders
            dm = distance_matrix(df_wr[['x','y']].values,df_d[['x','y']].values)
            
            # use argmin and min to record the closest defender,and their distance
            closest_defender = dm.argmin(axis=1)
            closest_defender_distance = dm.min(axis=1)
            
            # Create a dataframe to record  the information
            for i,j in enumerate(closest_defender):
                temp_df = pd.DataFrame({
                    'gameId': [game_id],
                    'playId': [play_id],
                    'frameId': [frame_id],
                    'displayName': [df_wr.displayName.iloc[i]],
                    'closestDefender': [df_d.displayName.iloc[j]],
                    'closestDefenderdistance': [closest_defender_distance[i]]
                })

                temp = pd.concat([temp,temp_df])

Solution

Obviously I don't have any of the data, so I can't robustly test my code. But I can lay out a few guiding principles.

You don't want to do this much subsetting. To avoid it, you can group by game, play, and frame:

for g,grouped_df in test.groupby(['gameId','playId','frameId']):
    ...  # do your isolation stuff here
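
As a rough sketch of what that loop can look like, the example below runs on a tiny made-up frame (the rows and names such as `WR A` are hypothetical; only the column names come from the question). It shows that the group key arrives as a tuple and that each group already contains exactly one frame, so the isolation step needs no game/play/frame masks:

```python
import pandas as pd

# Hypothetical miniature of the tracking data: one game, one play, two frames
test = pd.DataFrame({
    "gameId": [1, 1, 1, 1, 1, 1],
    "playId": [10, 10, 10, 10, 10, 10],
    "frameId": [1, 1, 1, 2, 2, 2],
    "displayName": ["WR A", "CB B", "S C", "WR A", "CB B", "S C"],
    "position": ["WR", "CB", "SS", "WR", "CB", "SS"],
    "inPoss": [1, 0, 0, 1, 0, 0],
    "x": [0.0, 3.0, 0.0, 1.0, 5.0, 1.0],
    "y": [0.0, 4.0, 10.0, 1.0, 4.0, 3.0],
})

group_sizes = {}
for (game_id, play_id, frame_id), frame_df in test.groupby(["gameId", "playId", "frameId"]):
    # frame_df holds only the rows of this one frame
    df_wr = frame_df[(frame_df["inPoss"] == 1) & (frame_df["position"] == "WR")]
    df_d = frame_df[frame_df["inPoss"] == 0]
    key = (int(game_id), int(play_id), int(frame_id))
    group_sizes[key] = (len(df_wr), len(df_d))  # (number of WRs, number of defenders)

print(group_sizes)
```

The grouping is computed once, instead of re-scanning the full dataframe with boolean masks for every frame.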

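This also means you can keep using the code you have now without doing the subsetting yourself. If you do that, you shouldn't repeatedly concatenate onto your existing dataframe, because every pd.concat call copies everything accumulated so far. Instead, create a list of dataframes and concatenate once at the end, i.e.: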

temp = []
for ... in ...:
    result_df = ...  # how you produce the result
    temp.append(result_df)

final = pd.concat(temp, axis=0, ignore_index=True)  # single concat at the end, with a fresh 0..n-1 index
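
Filled in with the distance calculation from the question, the pattern might look like the sketch below. The sample rows are invented for illustration, and `pieces` is just a hypothetical name for the list; this is untested against the real dataset:

```python
import pandas as pd
from scipy.spatial import distance_matrix

# Hypothetical miniature of the tracking data: one game, one play, two frames
test = pd.DataFrame({
    "gameId": [1, 1, 1, 1, 1, 1],
    "playId": [10, 10, 10, 10, 10, 10],
    "frameId": [1, 1, 1, 2, 2, 2],
    "displayName": ["WR A", "CB B", "S C", "WR A", "CB B", "S C"],
    "position": ["WR", "CB", "SS", "WR", "CB", "SS"],
    "inPoss": [1, 0, 0, 1, 0, 0],
    "x": [0.0, 3.0, 0.0, 1.0, 5.0, 1.0],
    "y": [0.0, 4.0, 10.0, 1.0, 4.0, 3.0],
})

pieces = []  # one small result frame per group
for (game_id, play_id, frame_id), frame_df in test.groupby(["gameId", "playId", "frameId"]):
    df_wr = frame_df[(frame_df["inPoss"] == 1) & (frame_df["position"] == "WR")]
    df_d = frame_df[frame_df["inPoss"] == 0]
    if df_wr.empty or df_d.empty:
        continue  # nothing to measure in this frame

    # Rows are WRs, columns are defenders
    dm = distance_matrix(df_wr[["x", "y"]].to_numpy(), df_d[["x", "y"]].to_numpy())
    nearest = dm.argmin(axis=1)

    # Build all WR rows for this frame at once; the scalar ids are broadcast
    pieces.append(pd.DataFrame({
        "gameId": game_id,
        "playId": play_id,
        "frameId": frame_id,
        "displayName": df_wr["displayName"].to_numpy(),
        "closestDefender": df_d["displayName"].to_numpy()[nearest],
        "closestDefenderDistance": dm.min(axis=1),
    }))

# Concatenate once, after the loop
final = pd.concat(pieces, ignore_index=True)
print(final)
```

Building one frame per group, rather than one per receiver, also removes the inner `enumerate` loop from the question.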

You could also reduce the whole thing to a function and then apply it to the groupby. The function would have the signature:

def complex_function(df):
    ...  # it can return multiple columns and rows as well

# group on all three keys; frameId values repeat across plays
result = test.groupby(['gameId','playId','frameId']).apply(complex_function)

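Returning dataframes from groupby.apply is somewhat tricky. The group keys are prepended to the index of each returned dataframe, so the result ends up with a MultiIndex that may need resetting or flattening. The columns, however, carry over correctly.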
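
To make the index behaviour concrete, here is a minimal sketch on invented sample rows. The function name `closest_defenders` is hypothetical, and the sketch assumes every frame contains at least one defender (an empty defender set would make `argmin` fail). The non-key columns are selected explicitly so that the function receives the same columns regardless of how a given pandas version treats grouping columns inside `apply`:

```python
import pandas as pd
from scipy.spatial import distance_matrix

# Hypothetical miniature of the tracking data: one game, one play, two frames
test = pd.DataFrame({
    "gameId": [1, 1, 1, 1, 1, 1],
    "playId": [10, 10, 10, 10, 10, 10],
    "frameId": [1, 1, 1, 2, 2, 2],
    "displayName": ["WR A", "CB B", "S C", "WR A", "CB B", "S C"],
    "position": ["WR", "CB", "SS", "WR", "CB", "SS"],
    "inPoss": [1, 0, 0, 1, 0, 0],
    "x": [0.0, 3.0, 0.0, 1.0, 5.0, 1.0],
    "y": [0.0, 4.0, 10.0, 1.0, 4.0, 3.0],
})

def closest_defenders(frame_df):
    """Return one row per WR in a single frame (assumes >= 1 defender)."""
    df_wr = frame_df[(frame_df["inPoss"] == 1) & (frame_df["position"] == "WR")]
    df_d = frame_df[frame_df["inPoss"] == 0]
    dm = distance_matrix(df_wr[["x", "y"]].to_numpy(), df_d[["x", "y"]].to_numpy())
    nearest = dm.argmin(axis=1)
    return pd.DataFrame({
        "displayName": df_wr["displayName"].to_numpy(),
        "closestDefender": df_d["displayName"].to_numpy()[nearest],
        "closestDefenderDistance": dm.min(axis=1),
    })

keys = ["gameId", "playId", "frameId"]
value_cols = ["displayName", "position", "inPoss", "x", "y"]

result = test.groupby(keys)[value_cols].apply(closest_defenders)
# result has a 4-level index: the three group keys plus the 0..n-1 index
# of each dataframe returned by closest_defenders

# Drop the innermost level, then turn the group keys back into columns
flat = result.droplevel(-1).reset_index()
print(flat)
```

Whether this is faster than the explicit loop depends on the data; `apply` still calls the function once per group, so the main gain over the original code is avoiding the repeated subsetting and concatenation.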