Using map across processes with multiprocessing

How to fix: using map across processes with multiprocessing

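I want to solve a differential equation N times, each time with a different set of parameters. multiprocessing therefore sounds like the right tool for the job: let's define a solve function that does the actual work of computing a solution, and use multiprocessing.Pool together with map to distribute the work across several processes.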

Here is the code, with N=8 and 2 processes:

from multiprocessing import Pool
import time
import numpy as np

from bench import init_model,get_initial_solution

model = init_model()
sol_init = get_initial_solution(model,np.linspace(0,1,2),{"Current": 0.67})

Nsteps = 10
step_solver = model.default_solver

def solve(ind):
    st = time.time()
    step_solution = sol_init
    for step in range(0,Nsteps):
        step_solution = step_solver.step(
            step_solution,model,dt=1,npts=2,inputs={"Current": 2.0},save=False
        )
    return f"Task {ind} took {time.time() - st:.2f}s"


if __name__ == "__main__":
    with Pool(processes=2) as p:
        times = p.map(solve,np.arange(1,9))
        print("\n".join(times))

For debugging purposes, solve does not return the solution; instead it returns the time spent in the function within that process.

Running the above (my computer has 4 cores), I get:

Task 1 took 4.41s
Task 2 took 5.59s
Task 3 took 1.67s
Task 4 took 0.62s
Task 5 took 0.61s
Task 6 took 0.72s
Task 7 took 0.68s
Task 8 took 0.53s

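As you can see, the time spent in the function solve varies a lot across the pool of processes, and over a wide range of values. Note that these results are not deterministic: if I ran the script above again, quite different timings would be observed. Yet there is no obvious reason for this randomness, since the work to be done is identical across processes and across runs.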

Let's profile the execution of the processes to get more information about what is happening there.

import cProfile
import pstats

def profile(ind):
    cProfile.runctx("solve(ind)",globals(),locals(),"report_"+str(ind)+".txt")

with Pool(processes=2) as p:
    times = p.map(profile,range(1,9))

for ind in range(1,9):
    stats = pstats.Stats("report_"+str(ind)+".txt").strip_dirs()
    stats.sort_stats("cumulative")
    stats.print_stats(11)

For instance, if we look at the reports for tasks 1 and 7:

Thu Oct 29 18:07:18 2020    report_1.txt

         75858 function calls (75426 primitive calls) in 0.895 seconds

   Ordered by: cumulative time
   List reduced from 249 to 11 due to restriction <11>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.895    0.895 {built-in method builtins.exec}
        1    0.000    0.000    0.895    0.895 <string>:1(<module>)
        1    0.000    0.000    0.895    0.895 python-1mtcbY:15(solve)
       10    0.001    0.000    0.895    0.090 base_solver.py:712(step)
       10    0.001    0.000    0.892    0.089 scipy_solver.py:35(_integrate)
       10    0.003    0.000    0.889    0.089 ivp.py:156(solve_ivp)
       88    0.001    0.000    0.738    0.008 base.py:159(step)
       88    0.010    0.000    0.738    0.008 bdf.py:296(_step_impl)
       45    0.000    0.000    0.668    0.015 bdf.py:216(lu)
       45    0.666    0.015    0.668    0.015 decomp_lu.py:15(lu_factor)
     1284    0.003    0.000    0.139    0.000 base_solver.py:906(__call__)


Thu Oct 29 18:07:27 2020    report_7.txt

         75831 function calls (75399 primitive calls) in 6.773 seconds

   Ordered by: cumulative time
   List reduced from 244 to 11 due to restriction <11>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    6.773    6.773 {built-in method builtins.exec}
        1    0.000    0.000    6.773    6.773 <string>:1(<module>)
        1    0.000    0.000    6.773    6.773 python-1mtcbY:15(solve)
       10    0.001    0.000    6.773    0.677 base_solver.py:712(step)
       10    0.000    0.000    6.770    0.677 scipy_solver.py:35(_integrate)
       10    0.002    0.000    6.769    0.677 ivp.py:156(solve_ivp)
       88    0.001    0.000    6.612    0.075 base.py:159(step)
       88    0.011    0.000    6.612    0.075 bdf.py:296(_step_impl)
       45    0.000    0.000    6.520    0.145 bdf.py:216(lu)
       45    6.519    0.145    6.520    0.145 decomp_lu.py:15(lu_factor)
     1284    0.003    0.000    0.146    0.000 base_solver.py:906(__call__)

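The above tells us that in both cases (and this holds for all tasks), the process spends most of its time inside SciPy's lu_factor. What is surprising, however (at least to me), is that the time spent in lu_factor varies so much from task to task. This is what I would like to understand.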

Solution

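This is presumably due to global mutable state; the first two tasks are the "slow" ones, which matches the number of processes in the pool.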

I'd suggest making things more deterministic by moving

model = init_model()
sol_init = get_initial_solution(model,np.linspace(0,1,2),{"Current": 0.67})

inside solve.
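To illustrate the suggestion, here is a minimal sketch of what "moving the setup inside solve" could look like. The bench module from the question is not available here, so init_model below is a hypothetical stand-in that just returns a fresh dict, and the loop body is a placeholder for the real step_solver.step call. Starting the timer after the setup is also an assumption, made so that the reported time still covers only the stepping loop.

```python
from multiprocessing import Pool
import time


def init_model():
    # Hypothetical stand-in for bench.init_model: returns fresh state
    # on every call instead of sharing one module-level object.
    return {"current": 0.67, "value": 0.0}


def solve(ind):
    # Build the state inside the task, so no task reuses objects that
    # were created at import time in the parent process.
    model = init_model()
    st = time.time()  # timer starts after setup (an assumption, see above)
    for _ in range(10):
        # Placeholder for step_solver.step(...) in the question.
        model["value"] += model["current"]
    return f"Task {ind} took {time.time() - st:.2f}s"


if __name__ == "__main__":
    with Pool(processes=2) as p:
        print("\n".join(p.map(solve, range(1, 9))))
```

With this layout every call to solve starts from its own freshly built state, which makes it easier to tell whether leftover state from an earlier task plays any role in the timings.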

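lu_factor is a wrapper around the LAPACK getrf routine, which itself relies on BLAS. In my case, the underlying BLAS implementation is OpenBLAS, which is shipped inside the scipy wheel when scipy is installed from PyPI with pip.
As suggested by Sam Mason, running lu_factor makes use of all four cores on my machine, because the underlying OpenBLAS is multithreaded. Disabling multithreading in OpenBLAS with export OPENBLAS_NUM_THREADS=1 largely solves the performance problem described above. Indeed, running the same script again with 2 processes: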

Task 1 took 0.19s
Task 2 took 0.19s
Task 3 took 0.18s
Task 4 took 0.18s
Task 5 took 0.18s
Task 6 took 0.18s
Task 7 took 0.18s
Task 8 took 0.18s

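I note that the first two tasks are consistently slower (by about 10%) than the others, which may be related to Sam Mason's original answer about the effect of global mutable state, although I don't fully understand why.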
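The OPENBLAS_NUM_THREADS setting above can also be applied from within the script instead of the shell. The sketch below assumes the variable is set before numpy or scipy is imported anywhere in the program, since, as far as I know, OpenBLAS reads it when the library is loaded. The helper blas_thread_setting is hypothetical and only reports the value each worker sees; it does no numerical work.

```python
import os

# Set this before numpy/scipy are first imported: OpenBLAS is assumed
# to read the variable once, when the library is loaded.
os.environ["OPENBLAS_NUM_THREADS"] = "1"

from multiprocessing import Pool


def blas_thread_setting(ind):
    # Hypothetical helper: returns the value visible in the worker.
    # Worker processes inherit the parent's environment.
    return ind, os.environ.get("OPENBLAS_NUM_THREADS")


if __name__ == "__main__":
    with Pool(processes=2) as p:
        for ind, value in p.map(blas_thread_setting, range(1, 9)):
            print(f"Task {ind}: OPENBLAS_NUM_THREADS={value}")
```

Keeping BLAS single-threaded in each worker avoids having 2 processes each trying to use all 4 cores at once, which is the oversubscription the timings above point to.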