Memory error when using a numpy array in Python

How to resolve a memory error when using a numpy array in Python

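My original list_ has over 2 million rows, and I get a memory error when I run the calculation below. Is there any way to get around it? The list_ shown below is a portion of the actual numpy array.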

Pandas data:

import pandas as pd
import math
import numpy as np
bigdata = 'input.csv'
data = pd.read_csv(bigdata,low_memory=False)
#reverses all the table data values
data1 = data.iloc[::-1].reset_index(drop=True)
list_= np.array(data1['Close'])

Code:

number = 5
list_= np.array([457.334015,424.440002,394.795990,408.903992,398.821014,402.152008,435.790985,423.204987,411.574005,404.424988,399.519989,377.181000,375.467010,386.944000,383.614990,375.071991,359.511993,328.865997,320.510010,330.079010,336.187012,352.940002,365.026001,361.562012,362.299011,378.549011,390.414001,400.869995,394.773010,382.556000])

def rolling_window(a,window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1,window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a,shape=shape,strides=strides)

std = np.std(rolling_window(list_,number),axis=1)

Error message: MemoryError: Unable to allocate 198. GiB for an array with shape (2659448,10000) and data type float64

Full traceback:

MemoryError                               Traceback (most recent call last)
<ipython-input-7-df0ab5649b16> in <module>
      5     return np.lib.stride_tricks.as_strided(a,strides=strides)
      6 
----> 7 std1 = np.std(rolling_window(PC_list,axis=1)

<__array_function__ internals> in std(*args,**kwargs)

C:\python3.7\lib\site-packages\numpy\core\fromnumeric.py in std(a,axis,dtype,out,ddof,keepdims)
   3495 
   3496     return _methods._std(a,axis=axis,dtype=dtype,out=out,ddof=ddof,
-> 3497                          **kwargs)
   3498 
   3499 

C:\python3.7\lib\site-packages\numpy\core\_methods.py in _std(a,keepdims)
    232 def _std(a,axis=None,dtype=None,out=None,ddof=0,keepdims=False):
    233     ret = _var(a,
--> 234                keepdims=keepdims)
    235 
    236     if isinstance(ret,mu.ndarray):

C:\python3.7\lib\site-packages\numpy\core\_methods.py in _var(a,keepdims)
    200     # Note that x may not be inexact and that we need it to be an array,
    201     # not a scalar.
--> 202     x = asanyarray(arr - arrmean)
    203 
    204     if issubclass(arr.dtype.type,(nt.floating,nt.integer)):

MemoryError: Unable to allocate 198. GiB for an array with shape (2659448,10000) and data type float64

Solution

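Please point us to your previous related questions (there are at least 2). I happened to remember seeing something similar, so I looked up your earlier questions.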

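Also, when asking about an error, show the full traceback if possible. We (and you) should be able to identify where the problem occurs and narrow down the possible causes and fixes.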

With the sample list_ of only (30,) shape (why such a bad name for a numpy array?), the rolling_window array isn't that large. Plus it is a view:

In [90]: x =rolling_window(list_,number)
In [91]: x.shape
Out[91]: (26,5)

However, operations on that array may produce a copy, which increases memory use.

In [96]: np.std(x,axis=1)
Out[96]: array([22.67653383,10.3940773,14.60076482,13.82801944,13.68038469,12.54834004,... 8.07511323])
In [97]: _.shape
Out[97]: (26,)

np.std computes:

std = sqrt(mean(abs(x - x.mean())**2))

x.mean(axis=1) is one value per row, but

In [102]: x.mean(axis=1).shape
Out[102]: (26,)
In [103]: (x-x.mean(axis=1,keepdims=True)).shape
Out[103]: (26,5)
In [106]: (abs(x-x.mean(axis=1,keepdims=True))**2).shape
Out[106]: (26,5)

produces an array as large as x, and it is a full copy, not a strided virtual copy.
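To make the view-versus-copy distinction concrete, here is a small check on made-up data of the same length as the sample. It applies the formula above row by row and uses np.shares_memory to show which arrays own their data. The names a, centered, and manual_std are illustrative only.

```python
import numpy as np

def rolling_window(a, window):
    # Same helper as in the question: a strided view, no data is copied.
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

a = np.arange(30, dtype=np.float64)   # made-up data, same length as the sample
x = rolling_window(a, 5)

# Row-wise version of the std formula.
centered = x - x.mean(axis=1, keepdims=True)   # full (26, 5) temporary
manual_std = np.sqrt(np.mean(np.abs(centered) ** 2, axis=1))

print(np.allclose(manual_std, np.std(x, axis=1)))  # True
print(np.shares_memory(a, x))                      # True: x is a view of a
print(np.shares_memory(a, centered))               # False: a real copy
print(a.nbytes, centered.nbytes)                   # 240 1040
```

The temporary is roughly window times larger than the input, which is what turns a few megabytes of prices into hundreds of gigabytes at window 10000.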

Does the shape in the error message make sense? It is (2659448,10000): is your window size 10000, and is the other value the expected number of windows?

198. GiB is a reasonable number given those dimensions:

In [94]: 2659448*10000*8/1e9
Out[94]: 212.75584
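The 212.76 figure is in decimal gigabytes (1e9 bytes), while numpy's message uses GiB (2**30 bytes). A quick conversion reconciles the two. The last line also derives the implied input length, assuming the first dimension is len(list_) - window + 1 as in rolling_window.

```python
rows, window, itemsize = 2659448, 10000, 8   # shape and float64 size from the error
n_bytes = rows * window * itemsize

print(n_bytes / 1e9)               # 212.75584 (decimal GB)
print(round(n_bytes / 2**30, 1))   # 198.1 (GiB, the unit in the message)
print(rows + window - 1)           # 2669447 input values
```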

I'm not going to test your code with an array large enough to produce a memory error.

as_strided is a nice way of generating moving windows, and it is fast, but it can easily blow up memory use.
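One way to keep the speed of as_strided while bounding the temporary that np.std creates is to reduce one block of windows at a time. The following is a minimal sketch, not a tested fix for the full data set. rolling_std_blocked and block_rows are hypothetical names, and the block size is an arbitrary value to tune.

```python
import numpy as np

def rolling_std_blocked(a, window, block_rows=1024):
    """Rolling population std (ddof=0), computed block_rows windows at a time."""
    a = np.ascontiguousarray(a, dtype=np.float64)
    n_windows = a.shape[0] - window + 1
    if n_windows < 1:
        raise ValueError("window is longer than the data")
    out = np.empty(n_windows, dtype=np.float64)
    for start in range(0, n_windows, block_rows):
        stop = min(start + block_rows, n_windows)
        # Slice that covers exactly the windows start .. stop-1.
        piece = a[start:stop + window - 1]
        view = np.lib.stride_tricks.as_strided(
            piece,
            shape=(stop - start, window),
            strides=(piece.strides[0], piece.strides[0]),
            writeable=False,
        )
        # np.std only materializes a (stop - start, window) temporary here.
        out[start:stop] = view.std(axis=1)
    return out

prices = np.linspace(300.0, 460.0, 30)   # made-up stand-in for list_
print(rolling_std_blocked(prices, 5, block_rows=8).shape)  # (26,)
```

With window 10000 and block_rows 1024, each float64 temporary is about 82 MB instead of 198 GiB, at the cost of a Python-level loop over blocks.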

There are generally two ways to deal with "unable to allocate 198 GiB of memory":

Process the data in chunks or row by row.

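Your algorithm appears to be amenable to this. Rather than reading all the data at once, rewrite the rolling_window function so that it loads the initial window (the first n rows of the file), then repeatedly drops one row and reads one more row from the file. That way you never hold more than n rows in memory, and everything works.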

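If it is a local file, the simplest approach is to keep it open for the whole computation. If it is a remote object, you may find that the connection times out; if so, you may need to copy the data to a local file, or reopen the file with the relevant seek/offset parameters for each additional row (or for each additional chunk that you buffer locally).

Alternatively, buy (or rent) a machine with more than 200 GiB of RAM; machines with more than 1 TiB are available off the shelf on AWS (and presumably on GCP and Azure, or you can buy one outright).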

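This is particularly appropriate if you have reason to believe that your needs will not grow further and you just need to get this one job done. It saves you from rewriting your code to deal with the problem, but it is not a sustainable solution in the long run.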
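The row-by-row option described above can be sketched with a fixed-length deque, which drops the oldest value each time a new one is appended. This is only an illustration under assumptions: stream_rolling_std is a hypothetical helper, the file is assumed to be a local CSV with a header row containing a Close column, and the demo values are made up.

```python
import csv
import os
import tempfile
from collections import deque

import numpy as np

def stream_rolling_std(path, column, window):
    """Yield the population std of each full window, reading one row at a time."""
    buf = deque(maxlen=window)  # appending to a full deque drops the oldest value
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):
            buf.append(float(row[column]))
            if len(buf) == window:
                yield float(np.std(np.fromiter(buf, dtype=np.float64, count=window)))

# Tiny demonstration file (made-up values) standing in for the real CSV.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["Close"])
        writer.writerows([[v] for v in (1.0, 2.0, 4.0, 7.0, 11.0)])
    demo_result = list(stream_rolling_std(demo_path, "Close", 3))

print(len(demo_result))  # 3
```

Memory stays at one window of values, but recomputing np.std for every window costs time proportional to rows times window, so this trades speed for memory. The question reverses the rows before windowing; since the standard deviation does not depend on order within a window, reading the file in its stored order should give the same values in reverse sequence.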