PyTorch CUDA memory allocation: why does multiplication use more memory than addition?

It looks like the multiplication keeps a copy of its input, whereas the addition behaves as if it were done in place.
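One way to probe this on a small CPU example is to look at which tensors each operation's autograd node holds on to. This is only a rough sketch: it assumes the extra memory comes from autograd saving inputs for the backward pass, which I have not confirmed. The helper `saved_tensor_names` and the small shapes are made up for illustration. The `_saved_*` attributes on `grad_fn` are internal PyTorch details and may differ between versions.

```python
import torch

def saved_tensor_names(fn):
    """List the `_saved_*` attributes of an autograd node that hold tensors."""
    names = []
    for name in dir(fn):
        if name.startswith("_saved_"):
            value = getattr(fn, name)
            if isinstance(value, torch.Tensor):
                names.append(name)
    return names

# Small stand-ins for the activation and the per-channel scale/shift.
x = torch.rand(4, 8, 5, 5, requires_grad=True)
scale = torch.rand(4, 8, 1, 1, requires_grad=True)

added = x + scale
multiplied = x * scale

print("add saves:", saved_tensor_names(added.grad_fn))
print("mul saves:", saved_tensor_names(multiplied.grad_fn))
```

If the multiplication node lists its inputs while the addition node lists no tensors, that would be consistent with the memory difference described here.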

Can someone help me understand why?

To reproduce:

import torch
import torch.nn as nn

# Large activation tensor and a small conditioning vector, both on the GPU
x = torch.rand(32,256,100,100).cuda()
z = torch.rand(32,40).cuda()

batchnorm = nn.BatchNorm2d(256).cuda()
# Per-channel scale (gamma) and shift (beta) predicted from z
gamma = nn.Linear(in_features=40,out_features=256,bias=False).cuda()
beta = nn.Linear(in_features=40,out_features=256,bias=False).cuda()

print(f"Allocated mem: {torch.cuda.memory_stats('cuda')['allocated_bytes.all.current']/1e9:.3f} GB")
x = batchnorm(x)
print(f"Allocated mem: {torch.cuda.memory_stats('cuda')['allocated_bytes.all.current']/1e9:.3f} GB")
# Broadcast the shift over the spatial dimensions
x = x + beta(z).view((-1,x.shape[1],1,1))
print(f"Allocated mem: {torch.cuda.memory_stats('cuda')['allocated_bytes.all.current']/1e9:.3f} GB")
# Broadcast the scale over the spatial dimensions
x = x * gamma(z).view((-1,x.shape[1],1,1))
print(f"Allocated mem: {torch.cuda.memory_stats('cuda')['allocated_bytes.all.current']/1e9:.3f} GB")

The order of the two operations is only for clarity. Swapping them gives the same outcome.

Output:

Allocated mem: 0.328 GB
Allocated mem: 0.655 GB
Allocated mem: 0.655 GB
Allocated mem: 0.983 GB
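
For reference, these figures line up with whole multiples of the size of x. The quick check below assumes the default float32 dtype of torch.rand (4 bytes per element) and ignores the much smaller z, layer weights, and linear outputs. The variable names are just for illustration.

```python
# Size of one tensor shaped (32, 256, 100, 100).
num_elements = 32 * 256 * 100 * 100
bytes_per_element = 4  # assumption: default float32 dtype
tensor_gb = num_elements * bytes_per_element / 1e9

# Compare one, two, and three such tensors against the reported numbers.
for copies in (1, 2, 3):
    print(f"{copies} x tensor: {copies * tensor_gb:.3f} GB")
```

So the batchnorm step and the multiplication each add roughly one more tensor the size of x, while the addition leaves the total unchanged.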

Thanks