
Memory leak persists after a Colab cell finishes executing

How can I fix a memory leak that persists after a Colab cell has finished executing?

I have a subtle memory leak and cannot pin down its source with tracemalloc. I am running the code below in Google Colab; it is meant to tune the hyperparameters of a custom PPO agent. The leak also shows up at varying speed: sometimes it appears within 10-20 minutes of runtime / 5-10 iterations, while other times it can take as many as 50 iterations / several hours. Here is a Colab notebook with the full code.

import os
import tracemalloc  # needed for the snapshots taken at the end of optimize_agent
from time import perf_counter

import numpy as np
import optuna
import pandas as pd
import xagents
from tensorflow.keras.optimizers import Adam
from xagents import PPO
from xagents.utils.common import ModelReader, create_envs, write_from_dict

tracemalloc.start()  # tracing must be started before take_snapshot() can be called


def get_hparams(trial):
    # Sample one hyperparameter set per optuna trial.
    # NOTE: the bounds for beta_2, advantage_epsilon and the lower clip_norm
    # bound are placeholders (guessed, not confirmed).
    return {
        'n_steps': int(
            trial.suggest_categorical('n_steps', [2 ** i for i in range(2, 11)])
        ),
        'gamma': trial.suggest_loguniform('gamma', 0.9, 0.9999),
        'learning_rate': trial.suggest_loguniform('learning_rate', 1e-5, 1e-2),
        'epsilon': trial.suggest_loguniform('epsilon', 1e-7, 1e-1),
        'beta_1': trial.suggest_loguniform('beta_1', 0.01, 0.999),
        'beta_2': trial.suggest_loguniform('beta_2', 0.01, 0.999),
        'entropy_coef': trial.suggest_loguniform('entropy_coef', 1e-8, 2e-1),
        'n_envs': int(
            trial.suggest_categorical('n_envs', [2 ** i for i in range(4, 7)])
        ),
        'grad_norm': trial.suggest_uniform('grad_norm', 0.1, 10.0),
        'lam': trial.suggest_loguniform('lam', 0.65, 0.99),
        'advantage_epsilon': trial.suggest_loguniform('advantage_epsilon', 1e-8, 1e-5),
        'clip_norm': trial.suggest_loguniform('clip_norm', 0.01, 10),
    }


def optimize_agent(trial, seed=55):
    hparams = get_hparams(trial)
    envs = create_envs('BreakoutNoFrameskip-v4', hparams['n_envs'])
    model_cfg = xagents.agents['ppo']['model']['cnn'][0]
    optimizer = Adam(
        hparams['learning_rate'],
        epsilon=hparams['epsilon'],
        beta_1=hparams['beta_1'],
        beta_2=hparams['beta_2'],
    )
    model = ModelReader(
        model_cfg,
        [envs[0].action_space.n, 1],
        envs[0].observation_space.shape,
        optimizer,
        seed=seed,
    ).build_model()
    agent = PPO(
        envs,
        model,
        entropy_coef=hparams['entropy_coef'],
        grad_norm=hparams['grad_norm'],
        gamma=hparams['gamma'],
        n_steps=hparams['n_steps'],
        lam=hparams['lam'],
        advantage_epsilon=hparams['advantage_epsilon'],
        clip_norm=hparams['clip_norm'],
    )
    steps = 150000
    agent.fit(max_steps=steps)
    current_rewards = np.around(np.mean(agent.total_rewards), 2)
    if not np.isfinite(current_rewards):
        current_rewards = 0
    hparams[f'mean_after_{steps}_steps'] = current_rewards
    write_from_dict(
        {key: [val] for (key, val) in hparams.items()}, 'ppo-optuna.parquet'
    )
    trial.report(current_rewards, 1000)
    # Dump a tracemalloc snapshot after every trial for offline inspection.
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics('lineno')
    frame = pd.DataFrame(
        [
            {'traceback': item.traceback, 'size': item.size, 'count': item.count}
            for item in top_stats
        ]
    )
    frame.to_csv(f'memory-snapshots/snapshot-{perf_counter()}.csv', index=False)
    return current_rewards

This is what I run:

os.mkdir('memory-snapshots')
study = optuna.create_study(
    study_name='ppo-trials150',
    load_if_exists=True,
    storage="sqlite:///example-ppo.db",
    direction='maximize',
)
study.optimize(
    optimize_agent, n_trials=1000, show_progress_bar=True, gc_after_trial=True
)

This causes memory to balloon rapidly until the session crashes.

[Screenshot: memory usage before the crash]
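
To quantify how fast the memory grows from one trial to the next, a per-trial memory log can be bolted on without touching the agent code, via optuna's callbacks argument. This is only a sketch: it assumes psutil is available in the runtime (it normally ships with Colab), and log_rss is just an illustrative name.

import psutil

def log_rss(study, trial):
    # resident set size of the notebook process after each finished trial
    rss_mb = psutil.Process().memory_info().rss / 1e6
    print(f'trial {trial.number}: RSS = {rss_mb:.0f} MB')

study.optimize(
    optimize_agent,
    n_trials=1000,
    show_progress_bar=True,
    gc_after_trial=True,
    callbacks=[log_rss],
)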

If I stop the cell before it crashes, the memory problem persists until the runtime is restarted.

[Screenshot: inflated memory persists]
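
To see what is actually still alive after interrupting the cell, a quick check that can be run in a fresh cell is to count the surviving objects by type (standard library only, nothing xagents-specific):

import gc
from collections import Counter

gc.collect()  # drop objects that are only kept alive by reference cycles
counts = Counter(type(obj).__name__ for obj in gc.get_objects())
for name, n in counts.most_common(20):
    print(f'{name}: {n}')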

Here are the memory snapshots of the 15 consecutive iterations leading up to the crash; they do not show any particular red flags. Also, summing the size column of the most recent snapshot gives a total of 967249710 bytes ≈ 1 GB, which is strange because the available memory on Colab is ≈ 12 GB. Here are the top 23 tracebacks:

[Screenshot: tracebacks]
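
As a side note on the ~1 GB vs ~12 GB discrepancy above: tracemalloc only traces blocks that go through Python's own memory allocator, so the tcmalloc "large alloc" blocks and TensorFlow's tensor buffers that appear in the crash log below are invisible to it, which would explain why the snapshot totals stay so far below the actual memory use. Diffing two consecutive in-memory snapshots with compare_to (part of the standard tracemalloc API) at least shows where the Python-level allocations grow between trials; a minimal sketch:

import tracemalloc

tracemalloc.start()
first = tracemalloc.take_snapshot()
# ... run a single trial here, e.g. study.optimize(optimize_agent, n_trials=1) ...
second = tracemalloc.take_snapshot()
for stat in second.compare_to(first, 'lineno')[:10]:
    print(stat)  # lines with the largest allocation growth between the snapshots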

Here is one of the crash logs:

Timestamp                   Level   Message
Jul 13, 2021, 9:58:11 AM    WARNING WARNING:root:kernel 834cf1f9-a397-4369-b54a-0cf6da2e980f restarted
Jul 13, 2021, 9:58:11 AM    INFO    KernelRestarter: restarting kernel (1/5), keep random ports
Jul 13, 2021, 9:22:48 AM    WARNING tcmalloc: large alloc 1849688064 bytes == 0x55831c4f8000 @ 0x7fdc571471e7 0x7fdc4e5bd46e 0x7fdc4e60dc7b 0x7fdc4e610e83 0x7fdc4e61107b 0x7fdc4e6b2761 0x558257855d54 0x558257855a50 0x5582578ca105 0x5582578c44ae 0x5582578573ea 0x5582578c97f0 0x5582578c44ae 0x5582578573ea 0x5582578c53b5 0x5582578c47ad 0x558257857a81 0x558257857ea1 0x5582578c6bb5 0x5582578c47ad 0x558257796eb1 0x5582578c6bb5 0x5582578c47ad 0x558257857a81 0x55825789afd9 0x558257857ea1 0x7fdbefb83954 0x7fdbefb873ba 0x7fdbeaf6eeb4 0x7fdbeaf631fe 0x7fdbf042b441
Jul 13, 2021, 9:21:30 AM    WARNING tcmalloc: large alloc 1849688064 bytes == 0x5583a976a000 @ 0x7fdc571471e7 0x7fdc4e5bd46e 0x7fdc4e60dc7b 0x7fdc4e60dd97 0x7fdc4e6074a5 0x7fdc4e6d829c 0x7fdc4e6a5dd1 0x558257797338 0x5582578cb1ba 0x5582578c44ae 0x5582578573ea 0x5582578c97f0 0x5582578c47ad 0x558257857a81 0x558257857ea1 0x5582578c6bb5 0x5582578c47ad 0x558257796eb1 0x5582578c6bb5 0x5582578c47ad 0x558257857a81 0x55825789afd9 0x558257857ea1 0x7fdbefb83954 0x7fdbefb873ba 0x7fdbeaf6eeb4 0x7fdbeaf631fe 0x7fdbf042b441 0x7fdbf0428133 0x7fdbeb6aad75 0x7fdc56efc6db
Jul 13, 2021, 9:21:27 AM    WARNING tcmalloc: large alloc 1849688064 bytes == 0x55831c4f8000 @ 0x7fdc571471e7 0x7fdc4e5bd46e 0x7fdc4e60dc7b 0x7fdc4e610e83 0x7fdc4e61107b 0x7fdc4e6b2761 0x558257855d54 0x558257855a50 0x5582578ca105 0x5582578c44ae 0x5582578573ea 0x5582578c97f0 0x5582578c44ae 0x5582578573ea 0x5582578c53b5 0x5582578c47ad 0x558257857a81 0x558257857ea1 0x5582578c6bb5 0x5582578c47ad 0x558257796eb1 0x5582578c6bb5 0x5582578c47ad 0x558257857a81 0x55825789afd9 0x558257857ea1 0x7fdbefb83954 0x7fdbefb873ba 0x7fdbeaf6eeb4 0x7fdbeaf631fe 0x7fdbf042b441
Jul 13, 2021, 9:20:08 AM    WARNING 2021-07-13 07:20:08.307941: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.55GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Jul 13, 2021, 9:20:08 AM    WARNING 2021-07-13 07:20:08.307851: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.55GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Jul 13, 2021, 9:20:08 AM    WARNING 2021-07-13 07:20:08.221051: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Jul 13, 2021, 9:20:08 AM    WARNING 2021-07-13 07:20:08.220943: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Jul 13, 2021, 9:20:06 AM    WARNING 2021-07-13 07:20:06.646798: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Jul 13, 2021, 9:20:06 AM    WARNING 2021-07-13 07:20:06.646709: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Jul 13, 2021, 9:20:06 AM    WARNING 2021-07-13 07:20:06.122847: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.55GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Jul 13, 2021, 9:20:06 AM    WARNING 2021-07-13 07:20:06.122724: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.55GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Jul 13, 2021, 9:20:03 AM    WARNING tcmalloc: large alloc 1849688064 bytes == 0x5583a976a000 @ 0x7fdc571471e7 0x7fdc4e5bd46e 0x7fdc4e60dc7b 0x7fdc4e60dd97 0x7fdc4e6074a5 0x7fdc4e6d829c 0x7fdc4e6a5dd1 0x558257797338 0x5582578cb1ba 0x5582578c44ae 0x5582578573ea 0x5582578c97f0 0x5582578c47ad 0x558257857a81 0x558257857ea1 0x5582578c6bb5 0x5582578c47ad 0x558257796eb1 0x5582578c6bb5 0x5582578c47ad 0x558257857a81 0x55825789afd9 0x558257857ea1 0x7fdbefb83954 0x7fdbefb873ba 0x7fdbeaf6eeb4 0x7fdbeaf631fe 0x7fdbf042b441 0x7fdbf0428133 0x7fdbeb6aad75 0x7fdc56efc6db
Jul 13, 2021, 9:20:01 AM    WARNING tcmalloc: large alloc 1849688064 bytes == 0x55833a870000 @ 0x7fdc571471e7 0x7fdc4e5bd46e 0x7fdc4e60dc7b 0x7fdc4e610e83 0x7fdc4e61107b 0x7fdc4e6b2761 0x558257855d54 0x558257855a50 0x5582578ca105 0x5582578c44ae 0x5582578573ea 0x5582578c97f0 0x5582578c44ae 0x5582578573ea 0x5582578c53b5 0x5582578c47ad 0x558257857a81 0x558257857ea1 0x5582578c6bb5 0x5582578c47ad 0x558257796eb1 0x5582578c6bb5 0x5582578c47ad 0x558257857a81 0x55825789afd9 0x558257857ea1 0x7fdbefb83954 0x7fdbefb873ba 0x7fdbeaf6eeb4 0x7fdbeaf631fe 0x7fdbf042b441
Jul 13, 2021, 9:18:49 AM    WARNING 2021-07-13 07:18:49.736991: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
Jul 13, 2021, 9:18:47 AM    WARNING 2021-07-13 07:18:47.153482: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
Jul 13, 2021, 9:18:23 AM    WARNING 2021-07-13 07:18:23.372605: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8004
Jul 13, 2021, 9:18:21 AM    WARNING 2021-07-13 07:18:21.045353: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
Jul 13, 2021, 9:18:18 AM    WARNING 2021-07-13 07:18:18.540742: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2199995000 Hz
Jul 13, 2021, 9:18:18 AM    WARNING 2021-07-13 07:18:18.470990: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
Jul 13, 2021, 9:17:53 AM    WARNING 2021-07-13 07:17:53.026567: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13837 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5)
Jul 13, 2021, 9:17:53 AM    WARNING 2021-07-13 07:17:53.026490: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
Jul 13, 2021, 9:17:53 AM    WARNING 2021-07-13 07:17:53.025754: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jul 13, 2021, 9:17:53 AM    WARNING 2021-07-13 07:17:53.024324: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jul 13, 2021, 9:17:53 AM    WARNING 2021-07-13 07:17:53.022506: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jul 13, 2021, 9:17:53 AM    WARNING 2021-07-13 07:17:53.021712: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
Jul 13, 2021, 9:17:53 AM    WARNING 2021-07-13 07:17:53.021041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
Jul 13, 2021, 9:17:53 AM    WARNING 2021-07-13 07:17:53.019732: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.878647: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.875397: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.874569: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.873445: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jul 13, 2021, 9:17:47 AM    WARNING coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
Jul 13, 2021, 9:17:47 AM    WARNING pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.873293: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.872460: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.871015: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.867034: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.865768: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.865559: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.860081: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.843825: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.10
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.552444: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.509098: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.374421: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.374270: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.232497: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.232421: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.231418: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.168074: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
Jul 13, 2021, 9:17:31 AM    WARNING 2021-07-13 07:17:31.798739: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Jul 13, 2021, 9:13:58 AM    INFO    Uploading file to /content/xagents.tar.gz
Jul 13, 2021, 9:13:33 AM    INFO    Adapting to protocol v5.1 for kernel 834cf1f9-a397-4369-b54a-0cf6da2e980f
Jul 13, 2021, 9:13:32 AM    INFO    Kernel started: 834cf1f9-a397-4369-b54a-0cf6da2e980f
Jul 13, 2021, 9:13:24 AM    INFO    Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
Jul 13, 2021, 9:13:24 AM    INFO    http://172.28.0.12:9000/
Jul 13, 2021, 9:13:24 AM    INFO    The Jupyter Notebook is running at:
Jul 13, 2021, 9:13:24 AM    INFO    0 active kernels
Jul 13, 2021, 9:13:24 AM    INFO    Serving notebooks from local directory: /
Jul 13, 2021, 9:13:24 AM    INFO    google.colab serverextension initialized.
Jul 13, 2021, 9:13:24 AM    INFO    http://172.28.0.2:9000/
Jul 13, 2021, 9:13:24 AM    INFO    Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
Jul 13, 2021, 9:13:24 AM    INFO    Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
