An efficient way to reshape an ndarray with sliding windows without using too much memory

I need to reshape an ndarray of shape [17205,21] into [17011,96,100,21] by applying two sliding windows.

In: arr
Out: [[ 8.  0.  0. -0.  0.  0.  8.  8.  0.  0.  0.  0.  8.  7.  6.  9.  9.  1.
   1.  1.  2.]
 [ 8.  0.  0. -0.  0.  0.  8.  8.  0.  0.  0.  0.  8.  7.  5.  9.  8.  2.
   1.  1.  2.]
.
.
.
 [ 8.  0.  0. -0.  0.  0.  8.  8.  0.  0.  0.  0.  8.  7.  5.  9.  8.  3.
   1.  1.  2.]]

My solution is to apply a sliding window to it twice, by calling the following function two times:

import numpy as np

def separate_multi(sequences, n_steps):
    X = list()
    for i in range(len(sequences)):
        # find the end of this window
        end_ix = i + n_steps
        # stop once the window would run past the end of the data
        if end_ix > len(sequences):
            break
        # take the rows that belong to this window
        seq_x = sequences[i:end_ix, :]
        X.append(seq_x)
    # return after the loop, so that every window is collected
    return np.array(X)

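The first call, with n_steps=100, gives an array of shape [17106,100,21]; calling it again with n_steps=96 gives [17011,96,100,21].

The drawback of this function is that it holds all of the windowed data in memory, which raises an error: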

MemoryError: Unable to allocate 24.3 GiB for an array with shape (17011,20) and data type float64
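
As a rough check on the size reported in that error, here is a small sketch that computes how much memory the fully materialised, doubly windowed array would need. The helper name `windowed_nbytes` is made up for illustration; the window sizes (100 and 96) and the 8-byte float64 item size are taken from the question. With 21 columns the result is about 25.6 GiB, and with 20 columns it is about 24.3 GiB, which is the figure in the error message.

```python
from math import prod

def windowed_nbytes(n_rows, n_cols, window_sizes, itemsize=8):
    """Shape and byte count if every sliding window is stored as a copy."""
    remaining = n_rows
    window_axes = []
    for w in window_sizes:
        # A window of size w with step 1 leaves (rows - w + 1) positions.
        remaining = remaining - w + 1
        window_axes.append(w)
    # The most recently applied window becomes the outermost window axis.
    full_shape = (remaining, *reversed(window_axes), n_cols)
    return full_shape, prod(full_shape) * itemsize

shape, nbytes = windowed_nbytes(17205, 21, [100, 96])
print(shape, round(nbytes / 2**30, 1), "GiB")
```

Almost all of that memory is redundant: the 17205 source rows are repeated in thousands of overlapping windows.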

One possible solution:

import tensorflow as tf
df = tf.data.Dataset.from_tensor_slices(df)
df = df.window(100,shift=1,stride=1,drop_remainder=True)
df = df.window(96,drop_remainder=True)

However, it does not give me the output I want, because "it produces a dataset of nested windows", as stated here.

Any ideas? Thanks.

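Solution

I found a solution to the problem:

The main problem was not reshaping the data in two steps, but the size of the object produced by the reshaping. The solution is therefore to split the input array into several parts. For that I wrote the following function: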

def split_chunks(sequence, chunk=3000):
    list_seq = []
    # step through the rows in strides of `chunk`
    for start in range(0, len(sequence), chunk):
        # slicing stops at the end of the array, so the last chunk may be
        # shorter, and the final row is kept (a stop index of -1 would drop it)
        seq = sequence[start:start + chunk, :]
        list_seq.append(seq)
    return list_seq

Then reshape each array in list_seq. Another option is the NumPy method np.split(); however, my function is 9 times faster than that method.
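
For comparison, here is a minimal sketch of a different technique from the chunking above: building the windows as a strided view with `numpy.lib.stride_tricks.sliding_window_view`, which assumes NumPy 1.20 or later. The view has the full [17011,96,100,21] shape but shares memory with the original array, so nothing large is allocated until a slice of it is copied. The names `double_window_view`, `iter_batches` and `batch_size` are hypothetical and chosen only for this example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def double_window_view(arr, inner=100, outer=96):
    """Read-only view of shape (n, outer, inner, n_cols) over a 2-D array."""
    # First pass: the window axis is appended last -> (n1, n_cols, inner)
    v = sliding_window_view(arr, inner, axis=0)
    # Second pass over the first axis -> (n2, n_cols, inner, outer)
    v = sliding_window_view(v, outer, axis=0)
    # Reorder the axes to (n2, outer, inner, n_cols); this is still a view.
    return v.transpose(0, 3, 2, 1)

def iter_batches(view, batch_size=32):
    """Yield contiguous copies of at most `batch_size` windows at a time."""
    for start in range(0, len(view), batch_size):
        yield np.ascontiguousarray(view[start:start + batch_size])

arr = np.random.rand(17205, 21)
view = double_window_view(arr)
print(view.shape)

first = next(iter_batches(view))
print(first.shape, first.nbytes // 2**20, "MiB")
```

Only one batch is materialised at a time, so peak memory is set by `batch_size` rather than by the full window count. Because the view is built over the whole array, it also includes the windows that would straddle chunk boundaries if each chunk were windowed separately.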