How to speed up a slow scipy.stats computation after failed attempts
I have a DataFrame with 30,000 rows, and I am trying to compute the following:
from scipy import stats

def calc_exp(x, y, z):
    return stats.lognorm(scale=x, s=y).expect(lb=0, ub=z)
A plain for loop is obviously too slow, but it turns out that apply is as well:
df['expect'] = df.apply(lambda row: calc_exp(row['sev_est'], row['sev_obs'], row['_exp_up_b']), axis=1)
For a data frame this size, it takes about 30 minutes to run.
I also tried swifter (https://pypi.org/project/swifter/), but unfortunately it just times out, so I don't know whether it actually works.
When I try this:
df['expect'] = df.swifter.apply(lambda row: calc_exp(row['sev_est'], row['sev_obs'], row['_exp_up_b']), axis=1)
it tells me:
ValueError: The truth value of a Series is ambiguous. Use a.empty,a.bool(),a.item(),a.any() or a.all().
and
TimeoutError: [Errno 60] Operation timed out
Do you have any suggestions or resources that could help me speed this code up?
Thanks for taking a look at my question.
Here is a sample of the data:
sample_data = [[47752.06433069426, 1.0357302794658065, 1002500.0],
               [57777.829574251584, 1.0312698002906302, 505000.0],
               [69703.2113095638, 1.0299427372756402, 2010000.0],
               [59132.70813622853, 1.0309407576584755, 1005000.0],
               [750446.5937632995, 1.2874160597603732, 2002500.0],
               [51162.12793486834, 1.0337556400084722, 3025000.0],
               [722119.4508688038, 1.2884596680530922, 2025000.0],
               [57238.01599926183, 1.0314162657284809, 5010000.0],
               [58322.385654926955, 1.0311310363507877, 2005000.0],
               [42604.59804488991, 1.039884727964432, 1010000.0],
               [92437.93072760757, 1.0336932004002017, 1015000.0],
               [47915.02339498793, 1.0356232022658125, 252500.0],
               [44569.2886493234, 1.0381108112654438, 502500.0],
               [44310.501871381675, 1.0383302791240467, 5015000.0],
               [131930.89613964988, 1.0481212391831674, 1035000.0],
               [127047.67786919193, 1.0460887972573536, 2100000.0],
               [799382.0622031174, 1.2859240295491776, 3010000.0],
               [45064.43172099421, 1.0377023255444393, 1005000.0]]
df = pd.DataFrame(sample_data, columns=['sev_est', 'sev_obs', '_exp_up_b'])
Solution
One option is to use multiprocessing:
import pandas as pd
from scipy import stats
from time import time
from multiprocessing import Pool

def calc_exp(x, y, z):
    return stats.lognorm(scale=x, s=y).expect(lb=0, ub=z)

if __name__ == '__main__':
    # I have used the full sample_data, but I copy here the first 3 rows only for readability
    sample_data = [[47752.06433069426, 1.0357302794658065, 1002500.0],
                   [57777.829574251584, 1.0312698002906302, 505000.0],
                   [69703.2113095638, 1.0299427372756402, 2010000.0]]
    df = pd.DataFrame(sample_data, columns=['sev_est', 'sev_obs', '_exp_up_b'])
    df = pd.concat([df] * 10)  # artificially increasing the size of the data frame

    start = time()
    expect = df.apply(lambda row: calc_exp(row['sev_est'], row['sev_obs'], row['_exp_up_b']), axis=1)
    end = time()
    print(f'Single process: {end - start} sec.')

    pool = Pool(processes=5)
    start = time()
    data = zip(*[df[col] for col in df])  # iterate over rows as (x, y, z) tuples
    result = pool.starmap(calc_exp, data)
    pool.close()
    pool.join()
    end = time()
    print(f'Multiprocessing: {end - start} sec.')
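Another angle worth considering, beyond parallelizing the per-row calls: for the lognormal distribution, the truncated mean that `expect(lb=0, ub=z)` evaluates by numerical integration has a closed form, so the whole column can be computed in one vectorized numpy pass with no integration at all. A minimal sketch (the name `calc_exp_vectorized` is mine, not from the original post; the formula is the standard partial expectation of a lognormal, e^(μ+σ²/2)·Φ((ln z − μ − σ²)/σ) with μ = ln(scale), σ = s):

```python
import numpy as np
from scipy import stats
from scipy.special import ndtr  # standard normal CDF, vectorized

def calc_exp_vectorized(x, y, z):
    """Closed-form partial expectation E[X; 0 < X < z] for X ~ lognorm(scale=x, s=y).

    Matches stats.lognorm(scale=x, s=y).expect(lb=0, ub=z) but accepts
    numpy arrays, so the whole column is computed without per-row quadrature.
    """
    mu = np.log(x)          # mean of the underlying normal
    sigma2 = y ** 2         # variance of the underlying normal
    return np.exp(mu + sigma2 / 2) * ndtr((np.log(z) - mu - sigma2) / y)
```

If this holds for your parametrization, the 30,000-row computation reduces to a single call, e.g. `df['expect'] = calc_exp_vectorized(df['sev_est'].to_numpy(), df['sev_obs'].to_numpy(), df['_exp_up_b'].to_numpy())`; it is worth spot-checking a few rows against `expect` before relying on it.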