How can I get this Python 3 code to run as fast as it does on Python 2.7?

Here is a simple linear congruential generator in Python:

def prng(n):
    # https://en.wikipedia.org/wiki/Lehmer_random_number_generator
    while True:
        n = n * 48271 % 0x7fffffff
        yield n

g = prng(123)
for i in range(10**8):
    next(g)

print(next(g))

Python 2.7 is noticeably faster here. By comparison, the runtime on Python 3.9 is 110-115% longer (Homebrew CPython builds on a MacBook Air). Generating 100 million terms:

$ python2 -V
Python 2.7.16
$ python3 -V
Python 3.9.1
$ time python2 g.py
1062172093
python2 g.py  11.31s user 0.43s system 99% cpu 11.759 total
$ time python3 g.py
1062172093
python3 g.py  24.48s user 0.04s system 99% cpu 24.549 total

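Why is the CPython 3.x interpreter so slow at executing this code? Is there any way to bring it on par with the 2.7 runtime?

I'm not looking for answers that involve compilation: JIT, PyPy, Cython, numba and the like are out of scope. Using numpy is fine, as is any way of convincing CPython to use a fixed-size uint (if the stdlib big int is the source of the inefficiency).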

Answer

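I don't have py2 to play with, so the benchmarks below only compare different implementation details within py3. All benchmarks were run in IPython 7.22.0 with a Python 3.8.8 kernel, timed with time.process_time. For each case I took the middle of three runs. The results are meaningful to within about 1 second, or roughly 3%.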
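For anyone who wants to repeat the measurements, here is a minimal sketch of the "middle of three runs" procedure. The helper names (time_once, median_of_runs, workload) are my own and not part of the original benchmarks, and the iteration count is deliberately far smaller than the 10**8 used in this answer so that it finishes quickly.

```python
from statistics import median
from time import process_time


def time_once(func):
    """Return the CPU time, in seconds, taken by a single call to func()."""
    start = process_time()
    func()
    return process_time() - start


def median_of_runs(func, runs=3):
    """Time func() `runs` times and return the median (the middle of three by default)."""
    return median(time_once(func) for _ in range(runs))


def prng(n):
    # https://en.wikipedia.org/wiki/Lehmer_random_number_generator
    while True:
        n = n * 48271 % 0x7fffffff
        yield n


def workload(count=10**5):
    # Hypothetical reduced workload; the text uses 10**8 iterations.
    g = prng(123)
    for _ in range(count):
        next(g)


elapsed = median_of_runs(workload)
print(f"median of 3 runs: {elapsed:.4f} s")
```

process_time counts CPU time of the current process only, so time spent sleeping or waiting on other processes does not show up in the result.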

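With the original code, the loop takes 35.36 seconds.

You can convert all the numbers to an appropriate fixed-width numpy type. That way you avoid the arbitrary-precision integers that Python 3 uses in place of Python 2's fixed-width ones: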

import numpy as np
from time import process_time

def prng(n):
    # https://en.wikipedia.org/wiki/Lehmer_random_number_generator
    a = np.uint64(48271)
    b = np.uint64(0x7fffffff)
    n = np.uint64(n)
    while True:
        n = n * a % b
        yield n

g = prng(123)
p = process_time()
for i in range(10**8):
    next(g)
q = process_time()
print(q - p,':',next(g))

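The runtime drops to 28.05 seconds, a reduction of about 21%. Incidentally, making a and b globals while keeping everything else as in the original reduces the time by only about 5%, to 33.55 seconds.

As @Andrej Kesely suggested, a better way to emulate py2's fixed-width integers in py3 is to use float, rather than going through numpy's dispatch machinery on every operation: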

def prng(n):
    # https://en.wikipedia.org/wiki/Lehmer_random_number_generator
    while True:
        n = n * 48271.0 % 2147483647.0
        yield n

g = prng(123.0)
p = process_time()
for i in range(10**8):
    next(g)
q = process_time()
print(q - p,next(g))

Indeed, the runtime is now 23.63 seconds, a 33% reduction from the original.
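Switching to float is only safe if the arithmetic stays exact. A double represents every integer up to 2**53 exactly, and here the largest possible product is below (2**31 - 1) * 48271, which is under 2**47. The following sketch checks that claim empirically on the first 10**5 terms; the names prng_int and prng_float are mine, used only to tell the two variants apart.

```python
from itertools import islice


def prng_int(n):
    # Integer version, as in the question.
    while True:
        n = n * 48271 % 0x7fffffff
        yield n


def prng_float(n):
    # Float version, as suggested above.
    while True:
        n = n * 48271.0 % 2147483647.0
        yield n


# Upper bound on any intermediate product, compared with the float exactness limit.
max_product = (0x7fffffff - 1) * 48271
print(max_product < 2**53)

ints = list(islice(prng_int(123), 10**5))
floats = list(islice(prng_float(123.0), 10**5))

# Every float term should be integer-valued and equal to the int term.
same = all(f.is_integer() and i == f for i, f in zip(ints, floats))
print(same)
```

With a larger multiplier or modulus the product could exceed 2**53, at which point the float version would silently start producing different values.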

To bypass the generator API, let's rewrite the loop without a generator:

n = 123
p = process_time()
for i in range(10**8):
    n = n * 48271 % 0x7fffffff
q = process_time()
print(q - p,n * 48271 % 0x7fffffff)

This runs in only 26.28 seconds, an improvement of about 26%.

Doing the same thing with a function call instead saves only about 3% (34.33 seconds):

def prng(n):
    return n * 48271 % 0x7fffffff

n = 123
p = process_time()
for i in range(10**8):
    n = prng(n)
q = process_time()
print(q - p,prng(n))

Using float speeds up the function version just as it did the generator:

def prng(n):
    return n * 48271.0 % 2147483647.0

n = 123.0
p = process_time()
for i in range(10**8):
    n = prng(n)
q = process_time()
print(q - p,prng(n))

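The runtime of 22.97 seconds is a further 33% drop relative to the int function version, just as we saw with the generator.

Using float in the loop-only solution also helps a lot: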

n = 123.0
p = process_time()
for i in range(10**8):
    n = n * 48271.0 % 2147483647.0
q = process_time()
print(q - p,n * 48271.0 % 2147483647.0)

The runtime is 12.72 seconds, a 64% drop from the original and a 52% drop from the int loop version.

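Clearly, the data type is a major source of the slowness here, but it also looks likely that python 3's generator machinery adds another 20% or so to the runtime. Eliminating both sources of slowness gets us to less than half the runtime of the original code.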
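To make the percentages easier to check, here is a small sketch that recomputes them from the timings quoted so far. The labels and the percent_drop helper are my own shorthand; the numbers are the ones reported above.

```python
# Timings in seconds, copied from the measurements reported above.
timings = {
    "generator, int (original)": 35.36,
    "generator, numpy uint64": 28.05,
    "generator, float": 23.63,
    "plain loop, int": 26.28,
    "function call, int": 34.33,
    "function call, float": 22.97,
    "plain loop, float": 12.72,
}


def percent_drop(before, after):
    """Relative reduction in runtime going from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


baseline = timings["generator, int (original)"]
for label, seconds in timings.items():
    drop = percent_drop(baseline, seconds)
    print(f"{label:28s} {seconds:6.2f} s  {drop:5.1f}% below original")

# Float against int within the same loop structure.
print(round(percent_drop(timings["plain loop, int"], timings["plain loop, float"])))
```

Keep in mind that the stated accuracy is about 1 second, so differences of a few percent between variants are within the noise.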

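It is not clear how much of what remains after removing the infinite-precision type is due to the generator and how much to the for loop machinery. So let's get rid of the for loop and see what happens: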

from itertools import islice
from collections import deque

def prng(n):
    # https://en.wikipedia.org/wiki/Lehmer_random_number_generator
    while True:
        n = n * 48271 % 0x7fffffff
        yield n

g = prng(123)
p = process_time()
deque(islice(g,10**8),maxlen=0)
q = process_time()
print(q - p,next(g))

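The runtime is 21.32 seconds, 40% faster than the original code, which suggests that the for implementation has probably become more robust, and therefore more cumbersome, in py3 as well.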

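It gets even better when prng uses float (exactly as in the float generator shown earlier). The runtime is now 10.09 seconds, a drop of 71%, or about 3.5 times faster than the original code.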
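For completeness, this is roughly what the float generator combined with deque and islice looks like. It is a sketch assembled from the two snippets above rather than the exact code that was timed, and the count variable is set to a hypothetical 10**6 instead of 10**8 so it finishes quickly.

```python
from collections import deque
from itertools import islice
from time import process_time


def prng(n):
    # https://en.wikipedia.org/wiki/Lehmer_random_number_generator
    while True:
        n = n * 48271.0 % 2147483647.0
        yield n


count = 10**6  # hypothetical reduced count; the timings in the text use 10**8

g = prng(123.0)
p = process_time()
# A deque with maxlen=0 discards every item, so the generator is
# driven entirely from C code without a Python-level for loop.
deque(islice(g, count), maxlen=0)
q = process_time()
result = next(g)
print(q - p, result)
```

The deque(..., maxlen=0) idiom is the same one used by the consume recipe in the itertools documentation.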

Another testable difference, suggested by @chepner, is that range(10**8) in py2 is equivalent to list(range(10**8)) in py3. This matters because generators seem to be slower in py3.

def prng(n):
    # https://en.wikipedia.org/wiki/Lehmer_random_number_generator
    while True:
        n = n * 48271.0 % 2147483647.0
        yield n

g = prng(123.0)
r = list(range(10**8))
p = process_time()
for i in r:
    next(g)
q = process_time()
print(q - p,next(g))

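This version takes 20.62 seconds, about 13% faster than the same code with a lazily generated range, and 42% faster than the original. So the generator machinery is clearly an important factor as well.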
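The difference between the two kinds of range is easy to see directly. In py3, range returns a lazy sequence object that computes each item on demand, whereas py2's range built the whole list up front (py2's xrange was the lazy one). A quick sketch, using a hypothetical count of 10**6 to keep memory use modest:

```python
import sys

count = 10**6  # hypothetical size, smaller than the 10**8 used in the benchmark

lazy = range(count)         # py3 behaviour: items are computed on demand
eager = list(range(count))  # py2-like behaviour: every item exists up front

# getsizeof reports only the container itself; for the list this is the
# array of pointers, not the int objects it refers to.
print(type(lazy).__name__, sys.getsizeof(lazy))
print(type(eager).__name__, sys.getsizeof(eager))

# Both yield the same values when iterated or indexed.
print(lazy[10] == eager[10], len(lazy) == len(eager))
```

Building the list costs time and memory before the timed loop starts, which the benchmark above leaves out of the measurement on purpose.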