
Failed attempt to speed up a scipy statistics calculation


I have a data frame with 30,000 rows, and I am trying to compute the following:


from scipy import stats

def calc_exp(x,y,z):
    return stats.lognorm(scale=x,s=y).expect(lb=0,ub=z)

A plain for loop is far too slow in principle, but it turns out that even using apply:

   df['expect'] = df.apply(lambda row: calc_exp(row['sev_est'],row['sev_obs'],row['_exp_up_b']),axis=1)

still takes about 30 minutes to run on a data frame this large.

I also tried swifter (https://pypi.org/project/swifter/), but unfortunately it just times out, so I don't know whether it actually works.

When I try this:

   df['expect'] = df.swifter.apply(lambda row: calc_exp(row['sev_est'],row['sev_obs'],row['_exp_up_b']),axis=1)

it gives me:

ValueError: The truth value of a Series is ambiguous. Use a.empty,a.bool(),a.item(),a.any() or a.all().

TimeoutError: [Errno 60] Operation timed out

Do you have any suggestions or resources that could help me speed up this code?

Thanks for looking at my question.

Here is a sample of the data:

sample_data = [[47752.06433069426,1.0357302794658065,1002500.0],
               [57777.829574251584,1.0312698002906302,505000.0],
               [69703.2113095638,1.0299427372756402,2010000.0],
               [59132.70813622853,1.0309407576584755,1005000.0],
               [750446.5937632995,1.2874160597603732,2002500.0],
               [51162.12793486834,1.0337556400084722,3025000.0],
               [722119.4508688038,1.2884596680530922,2025000.0],
               [57238.01599926183,1.0314162657284809,5010000.0],
               [58322.385654926955,1.0311310363507877,2005000.0],
               [42604.59804488991,1.039884727964432,1010000.0],
               [92437.93072760757,1.0336932004002017,1015000.0],
               [47915.02339498793,1.0356232022658125,252500.0],
               [44569.2886493234,1.0381108112654438,502500.0],
               [44310.501871381675,1.0383302791240467,5015000.0],
               [131930.89613964988,1.0481212391831674,1035000.0],
               [127047.67786919193,1.0460887972573536,2100000.0],
               [799382.0622031174,1.2859240295491776,3010000.0],
               [45064.43172099421,1.0377023255444393,1005000.0]]

df = pd.DataFrame(sample_data,columns=['sev_est','sev_obs','_exp_up_b'])

Solution

One option is to use multiprocessing:


import pandas as pd
from scipy import stats
from time import time
from multiprocessing import Pool

def calc_exp(x,y,z):
    return stats.lognorm(scale=x,s=y).expect(lb=0,ub=z)

if __name__ == '__main__':
    # I have used the full sample_data but I copy here the first 3 rows only for readability
    sample_data = [[47752.06433069426,1.0357302794658065,1002500.0],
                   [57777.829574251584,1.0312698002906302,505000.0],
                   [69703.2113095638,1.0299427372756402,2010000.0]]

    df = pd.DataFrame(sample_data,columns=['sev_est','sev_obs','_exp_up_b'])
    df = pd.concat([df] * 10)  # artificially increasing the size of the data frame

    start = time()
    expect = df.apply(lambda row: calc_exp(row['sev_est'],row['sev_obs'],row['_exp_up_b']),axis=1)
    end = time()
    print(f'Single-threaded: {end - start} sec.')

    pool = Pool(processes=5)

    start = time()
    # zip the columns together into per-row (x, y, z) argument tuples for starmap
    data = zip(*[df[col] for col in df])
    result = pool.starmap(calc_exp,data)
    pool.close()
    pool.join()
    end = time()
    print(f'Multiprocessing: {end - start} sec.')
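
Beyond multiprocessing, the per-row numerical integration can be avoided altogether: for a lognormal with mu = ln(scale) and sigma = s, the partial expectation E[X * 1{X <= ub}] that expect(lb=0, ub=z) computes has the closed form exp(mu + sigma^2/2) * Phi((ln(ub) - mu)/sigma - sigma), where Phi is the standard normal CDF. Below is a minimal vectorized sketch of that identity; it is not part of the answer above, and calc_exp_closed_form is a name introduced here for illustration:

import numpy as np
import pandas as pd
from scipy.stats import norm

def calc_exp_closed_form(scale, s, ub):
    # Partial expectation of a lognormal: E[X * 1{X <= ub}]
    # = exp(mu + s**2/2) * Phi((ln(ub) - mu)/s - s), with mu = ln(scale).
    # The lb=0 end contributes nothing since Phi(-inf) = 0.
    mu = np.log(scale)
    return np.exp(mu + s ** 2 / 2) * norm.cdf((np.log(ub) - mu) / s - s)

# Two rows from the sample data above, to cross-check against calc_exp
df = pd.DataFrame([[47752.06433069426,1.0357302794658065,1002500.0],
                   [57777.829574251584,1.0312698002906302,505000.0]],
                  columns=['sev_est','sev_obs','_exp_up_b'])

df['expect'] = calc_exp_closed_form(df['sev_est'].to_numpy(),
                                    df['sev_obs'].to_numpy(),
                                    df['_exp_up_b'].to_numpy())

If the identity applies, this turns 30,000 numerical integrations into a handful of vectorized NumPy and norm.cdf calls, which should run in milliseconds; the timing harness above can be reused to confirm that both approaches agree.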
