How do I write a nested list comprehension with a break?

I have a large DataFrame of distances that I want to classify.

import pandas as pd

df_norms = pd.DataFrame([[0,100,200,4000000]],columns=['mode','min','medium','max'])

df_afst = pd.DataFrame([[0,50,-1],[0,150,-1],[0,0,-1],[0,250,-1]],columns = ['train','station','bbh'])

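The norms DataFrame says that, for mode 0, each distance is assigned the index of the interval it falls in: from 0 up to (but not including) 100 is class 0, from 100 up to 200 is class 1, and from 200 up to 4000000 is class 2.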

This is easy to do with a for loop. Example:

for i in [0]: # 1 element list just for the example
    bbh_id = i + 2
    mode = df_afst.iloc[0,i]
    for iy,y in enumerate(df_afst[df_afst.columns[i+1]].values):
        for ix,x in enumerate(df_norms.iloc[mode]):
            if x > y:
                df_afst.loc[iy,df_afst.columns[bbh_id]] = ix - 1
                break

Before:

   train  station  bbh
0      0       50   -1
1      0      150   -1
2      0        0   -1
3      0      250   -1

After:

   train  station  bbh
0      0       50    0
1      0      150    1
2      0        0    0
3      0      250    2

I would like to do this in a list comprehension, but I don't know how: the break makes it hard. The best I could come up with is:

for i in [0]:
    bbh_id = i + 2
    mode = df_afst.iloc[0,i]
    r = [ix - 1 for iy,y in enumerate(df_afst[df_afst.columns[i+1]].values)
                     for ix,x in enumerate(df_norms.iloc[mode])
                         if x > y]

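 # results in: [0, 1, 2, 1, 2, 0, 1, 2, 2]

As you can see, the result is correct once you split it into one sublist per distance:

[0, 1, 2 | 1, 2 | 0, 1, 2 | 2]

I only need the first element of each sublist, and I don't know how to get it. I cannot simulate the break. I tried min, any and next but could not get it right. Does anyone have an idea?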

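Update

@chepner correctly pointed out that my example was inconsistent. Sorry about that. @Thierry Lathuille correctly pointed out that list comprehensions are not always the right tool. He is quite right; since I don't know when they are the right tool, I wanted to understand how one would work in this case.

The two answers I got to this question taught me a lot. I had never heard of pandas cut, and I had never bothered with numpy argwhere.

Out of curiosity, I ran a small benchmark.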

import time
import numpy as np

print('\n*** pd.cut')
cpu = time.time()
cuts = df_norms.iloc[0].tolist()
bbh3 = pd.cut(df_afst['station'],cuts,labels=False,include_lowest=True)
df_afst['bbh'] = bbh3
print('cpu {:.4f} seconds'.format(time.time() - cpu))
    
print('\n*** Using numpy and its functions')
cpu = time.time()
bbh2 = [np.min(np.argwhere(np.less(td,df_norms.values.ravel()))-1) for td in df_afst.station.values]
df_afst['bbh'] = bbh2
print('cpu {:.4f} seconds'.format(time.time() - cpu))

print('\n*** Simple loop')                
cpu = time.time()
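for i in [0]:
    bbh_id = i + 2
    mode = df_afst.iloc[0,i]
    for iy,y in enumerate(df_afst[df_afst.columns[i+1]].values):
        for ix,x in enumerate(df_norms.iloc[mode]):
            if x > y:
                df_afst.loc[iy,df_afst.columns[bbh_id]] = ix - 1
                break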

print('cpu {:.4f} seconds'.format(time.time() - cpu))

print('\n*** Wrong approach')                
cpu = time.time()
for i in [0]:
    bbh_id = i + 2
    mode = df_afst.iloc[0,i]
    r = [ix - 1 for iy,y in enumerate(df_afst[df_afst.columns[i+1]].values)
                     for ix,x in enumerate(df_norms.iloc[mode])
                         if x > y]
print('cpu {:.4f} seconds'.format(time.time() - cpu))

I enlarged the dataset from the 4 rows in the example to 2,000,000, close to my 10,000 dataset. The results are interesting:

*** pd.cut
cpu 0.0131 seconds

*** Using numpy and its functions
cpu 29.4257 seconds

*** Simple loop
cpu 214.5378 seconds

*** Wrong approach
cpu 103.5768 seconds

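The speedup from the pandas cut function is incredible. I double-checked the results, and they do look correct.

Both answers are correct and very insightful. I decided to mark @carlos melus's answer as accepted because it comes closest to the list comprehension I asked for.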

Solution

You can use numpy to vectorize the inner comparison:

[np.min(np.argwhere(np.less(td,df_norms.values.ravel()))-1) for td in df_afst.station.values]

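np.less compares each distance td from df_afst.station with all the values in df_norms and returns a boolean array that is True wherever td is less than the corresponding value in df_norms.

For example, np.less(50,[0,100,200,4000000]) returns: array([False, True, True, True])

With np.argwhere we extract the indices of the True values in that array. Because the first boundary (0) occupies index 0, those indices are one higher than the class numbers, so we subtract 1 to make the classes start at 0. From there, np.min takes the smallest index that is True, which is the value you are looking for.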
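The three steps above can be traced for a single distance. This is only an illustrative sketch: the names bounds, mask, hits and first are made up for the example, and bounds simply copies the row of df_norms from the question.

```python
import numpy as np

# Class boundaries copied from the question's df_norms row.
bounds = np.array([0, 100, 200, 4000000])
td = 50  # one example distance

mask = np.less(td, bounds)     # True where td < boundary
hits = np.argwhere(mask)       # indices of the True entries, one per row
first = int(np.min(hits - 1))  # shift down by one, keep the smallest

print(mask.tolist())          # [False, True, True, True]
print(hits.ravel().tolist())  # [1, 2, 3]
print(first)                  # 0
```

Taking the minimum is what stands in for the break: every matching boundary is still computed, and only the first one is kept.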

You can run all of this in a list comprehension, and the result will be: [0, 1, 0, 2]
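For comparison, the break itself can be emulated in plain Python by wrapping the inner loop in a generator expression and calling next() on it. The following is a minimal sketch, not the method from either answer; the name bounds and the fallback value of -1 for distances beyond the last boundary are assumptions made for the example.

```python
import pandas as pd

df_norms = pd.DataFrame([[0, 100, 200, 4000000]],
                        columns=['mode', 'min', 'medium', 'max'])
df_afst = pd.DataFrame([[0, 50, -1], [0, 150, -1], [0, 0, -1], [0, 250, -1]],
                       columns=['train', 'station', 'bbh'])

bounds = df_norms.iloc[0].tolist()  # [0, 100, 200, 4000000]

# next() takes the first item the inner generator yields and then stops
# iterating, which plays the role of `break` in the explicit loop.
# The default -1 (an assumption) is used when no boundary exceeds y.
r = [next((ix - 1 for ix, x in enumerate(bounds) if x > y), -1)
     for y in df_afst['station'].values]

print(r)  # [0, 1, 0, 2]
```

Because the generator is lazy, the boundaries after the first match are never compared, unlike the nested comprehension in the question, which collects every match.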

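You would benefit greatly from using pd.cut():

Assuming you want to bin the values in df_afst['station'] (this is not entirely clear from the question, but I am guessing from the example), you can do: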

cuts = df_norms.iloc[0].tolist()
bbh = pd.cut(df_afst['station'],cuts,labels=False,include_lowest=True)

Or more directly:

bbh = pd.cut(df_afst['station'],[-1,100,200,float('inf')],labels=False)

Result:

>>> bbh
0    0
1    1
2    0
3    2
Name: station, dtype: int64

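Of course, you can also assign it to a column.

This will be orders of magnitude faster than a Python loop (whether an explicit loop or a comprehension).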
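One detail worth checking is how values that sit exactly on a boundary are binned. By default pd.cut uses right-closed intervals, so a distance of exactly 100 lands in class 0, whereas the loop in the question tests x > y and would put it in class 1. The sketch below assumes the loop's behaviour is the one wanted and passes right=False to match it; the names edge_right and edge_left are made up for the example.

```python
import pandas as pd

df_norms = pd.DataFrame([[0, 100, 200, 4000000]],
                        columns=['mode', 'min', 'medium', 'max'])
df_afst = pd.DataFrame([[0, 50, -1], [0, 150, -1], [0, 0, -1], [0, 250, -1]],
                       columns=['train', 'station', 'bbh'])

cuts = df_norms.iloc[0].tolist()

# right=False gives the bins [0, 100), [100, 200), [200, 4000000),
# mirroring the strict `x > y` test in the question's loop.
df_afst['bbh'] = pd.cut(df_afst['station'], cuts, right=False, labels=False)
print(df_afst['bbh'].tolist())  # [0, 1, 0, 2]

# A distance exactly on a boundary shows the difference.
edge_right = pd.cut(pd.Series([100]), cuts, labels=False, include_lowest=True)
edge_left = pd.cut(pd.Series([100]), cuts, right=False, labels=False)
print(int(edge_right.iloc[0]), int(edge_left.iloc[0]))  # 0 1
```

For the four example rows both variants give the same classes, so the choice only matters when distances can coincide with a boundary.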
