PyTorch: What happens to memory when moving a tensor to the GPU?

I'm trying to understand what happens to both RAM and GPU memory when a tensor is sent to the GPU.

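In the code example below, I create two tensors: a large tensor arr = torch.ones((10000, 10000)) and a small tensor c = torch.ones(1). Tensor c is sent to the GPU inside the target function step, which is called by multiprocessing.Pool. When this happens, each child process uses 487 MB on the GPU and RAM usage rises to 5 GB. Note that the large tensor arr is created only once, before Pool is called, and is not passed as an argument to the target function. RAM usage does not explode when everything stays on the CPU.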

I have the following questions about this example:

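I'm sending torch.ones(1) to the GPU, yet it consumes 487 MB of GPU memory. Does CUDA allocate a minimum amount of memory on the GPU even when the underlying tensor is very small? GPU memory is not a problem for me; I just want to understand how the allocation is done.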

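The real problem is RAM usage. Even though I'm sending only a small tensor to the GPU, it looks as if everything in memory (including the large tensor arr) is copied for every child process, possibly into pinned memory. So when a tensor is sent to the GPU, which objects are copied to pinned memory? I must be missing something here, because preparing everything to be sent to the GPU makes no sense when I'm sending only one specific object.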

Thanks!

from multiprocessing import get_context
import time
import torch

dim = 10000
sleep_time = 2
npe = 4  # number of parallel executions

# use the first GPU if CUDA is available, otherwise fall back to the CPU
if torch.cuda.is_available():
    dev = 'cuda:0'
else:
    dev = 'cpu'
device = torch.device(dev)


def step(i):
    c = torch.ones(1)
    # comment the line below to see no memory increase
    c = c.to(device)
    time.sleep(sleep_time)


if __name__ == '__main__':
    # large tensor, created once in the parent process only
    arr = torch.ones((dim, dim))

    # create list of inputs to be executed in parallel
    inp = list(range(npe))

    # sleep added before and after launching multiprocessing to monitor the memory consumption
    print('before pool')  # to check memory with top or htop
    time.sleep(sleep_time)

    context = get_context('spawn')
    with context.Pool(npe) as pool:
        print('after pool')  # to check memory with top or htop
        time.sleep(sleep_time)

        pool.map(step, inp)

    time.sleep(sleep_time)

Answer

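I'm sending torch.ones(1) to the GPU, yet it consumes 487 MB of GPU memory. Does CUDA allocate a minimum amount of memory on the GPU even when the underlying tensor is very small?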

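The CUDA device runtime reserves memory for all sorts of things when a context is established. Some of these reservations are fixed in size, and some are variable and can be controlled by API calls (see here for more information). It is completely normal for the first API call that establishes a context on the device, whether explicitly or lazily, to produce a jump in GPU memory consumption. In this case, I imagine the first tensor creation is what triggers this overhead allocation. This is a property of the CUDA runtime, not of PyTorch or of the tensor.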
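One way to see this split is to compare the tensor's own payload with what the device reports after the first CUDA call. The sketch below is only illustrative: the helper names tensor_nbytes and report are made up for this example, and it assumes a PyTorch version that provides torch.cuda.mem_get_info. The device-wide figure covers every process using the GPU, so treat it as an upper bound on the context overhead rather than an exact measurement.

```python
import torch


def tensor_nbytes(t):
    """Payload of a tensor in bytes: element count times bytes per element."""
    return t.nelement() * t.element_size()


def report(device_index=0):
    """Return memory figures after moving a one-element tensor to the GPU.

    Returns None when CUDA is not available. (Hypothetical helper.)
    """
    if not torch.cuda.is_available():
        return None
    dev = torch.device("cuda", device_index)
    # If this is the first CUDA call in the process, it also creates the context.
    c = torch.ones(1).to(dev)
    torch.cuda.synchronize(dev)
    free, total = torch.cuda.mem_get_info(dev)
    return {
        "tensor_payload_bytes": tensor_nbytes(c),
        # memory currently occupied by tensors, as tracked by PyTorch's allocator
        "allocated_by_tensors": torch.cuda.memory_allocated(dev),
        # memory the caching allocator has requested from the driver
        "reserved_by_allocator": torch.cuda.memory_reserved(dev),
        # device-wide usage: all processes, including context overhead
        "device_used_bytes": total - free,
    }


if __name__ == "__main__":
    print(tensor_nbytes(torch.ones(1)))  # one float32 element: 4 bytes
    print(report())
```

If the allocator figures stay tiny while device_used_bytes is in the hundreds of megabytes, the gap is consistent with context overhead rather than tensor data. Since every worker in a spawn pool is a separate process, each one would be expected to pay that overhead on its own first CUDA call.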