How do I train tensorflow.keras models in parallel on a GPU? TensorFlow 2.5.0


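I have the following code, which runs a custom model that lives in a separate module and takes several parameters (learning rate, convolution kernel size, etc.) as input.

custom_model is a function that compiles a tensorflow.keras.models.Model and returns it.

LOW is the training dataset
HIGH is the target dataset

I load both from an HDF5 file, but the datasets are quite large, around 10 GB.

I normally run this in jupyter-lab without any problems, and the model does not use up the GPU's resources. At the end I save the weights for the different parameters.

Now my question is:

How can I turn this into a script and run it in parallel for different values of k1 and k2? I imagine something like a bash loop would work, but I want to avoid re-reading the dataset. My operating system is Windows 10.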

import h5py
import tensorflow as tf

from model_custom import custom_model

# Let TensorFlow allocate GPU memory on demand instead of reserving it all up front.
physical_devices = tf.config.list_physical_devices('GPU')
for gpu_instance in physical_devices:
    tf.config.experimental.set_memory_growth(gpu_instance, True)

winx = 100
winz = 10
k1 = 9
k2 = 5

# Read both datasets fully into memory.
with h5py.File('MYFILE', 'r') as hf:
    LOW = hf['LOW'][:]
    HIGH = hf['HIGH'][:]

# Build, train and save one model for this (k1, k2) pair on the GPU with index 1.
with tf.device("/gpu:1"):
    mymodel = custom_model(winx, winz, lrate=0.001, usebias=True, kz1=k1, kz2=k2)
    myhistory = mymodel.fit(LOW, HIGH, batch_size=1, epochs=1)
    mymodel.save_weights('zkernel_{}_kz1_{}_kz2_{}.hdf5'.format(winz, k1, k2))

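Solution

I found a solution that works well for me. It uses MPI, through mpi4py, to train several models in parallel on the GPUs. There is one problem: every process holds its own copy of the loaded data, so when I load a large file and run many processes at once, the total (number of processes times the size of the data) exceeds my RAM.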

from mpi4py import MPI
import tensorflow as tf
import h5py
from model_custom import custom_model

# Let TensorFlow allocate GPU memory on demand, so several processes can share a GPU.
physical_devices = tf.config.list_physical_devices('GPU')
for gpu_instance in physical_devices:
    tf.config.experimental.set_memory_growth(gpu_instance, True)

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

winx = 100
winy = 100
winz = 10

# Only rank 10 reads the file; the other ranks receive the arrays via broadcast.
if rank == 10:
    with h5py.File('mifile.hdf5', 'r') as hf:
        LOW = hf['LOW'][:]
        HIGH = hf['HIGH'][:]
else:
    HIGH = None
    LOW = None
# Lowercase bcast pickles the arrays, so every rank ends up with its own full copy.
HIGH = comm.bcast(HIGH, root=10)
LOW = comm.bcast(LOW, root=10)

if rank < 5:
    # Ranks 0-4 train on the GPU with index 1, with k = 9 and q = 1..5.
    with tf.device("/gpu:1"):
        k = 9
        q = rank + 1
        mymodel1 = custom_model(winx, winz, lrate=0.001, usebias=True, kz1=k, kz2=q)
        mymodel1._name = '{}_{}_{}'.format(winz, k, q)
        myhistory1 = mymodel1.fit(LOW, HIGH, batch_size=1, epochs=1)
        mymodel1.save_weights(mymodel1.name + 'winz_{}_k_{}_q_{}.hdf5'.format(winz, k, q))

elif 5 <= rank < 10:
    # Ranks 5-9 train on the GPU with index 2, with k = 8 and q = 1..5.
    with tf.device("/gpu:2"):
        k = 8
        q = rank + 1 - 5
        mymodel2 = custom_model(winx, winz, lrate=0.001, usebias=True, kz1=k, kz2=q)
        mymodel2._name = '{}_{}_{}'.format(winz, k, q)
        myhistory2 = mymodel2.fit(LOW, HIGH, batch_size=1, epochs=1)
        mymodel2.save_weights(mymodel2.name + 'winz_{}_k_{}_q_{}.hdf5'.format(winz, k, q))
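
The if/elif branches above encode a fixed mapping from MPI rank to GPU and kernel sizes. As a minimal sketch, the same mapping can be written as a plain function and checked without starting MPI or TensorFlow. The name assign_job is hypothetical and is not part of the script above.

```python
def assign_job(rank):
    """Return (device, k, q) for a training rank, or None for a non-training rank.

    Mirrors the branches of the MPI script: ranks 0-4 use "/gpu:1" with k = 9,
    ranks 5-9 use "/gpu:2" with k = 8, and q runs from 1 to 5 in both groups.
    assign_job is a hypothetical helper introduced only for illustration.
    """
    if 0 <= rank < 5:
        return "/gpu:1", 9, rank + 1
    if 5 <= rank < 10:
        return "/gpu:2", 8, rank + 1 - 5
    # Rank 10 only loads and broadcasts the data; it trains nothing.
    return None


if __name__ == "__main__":
    # Print the full assignment for the 11 ranks used with "mpiexec -n 11".
    for rank in range(11):
        print(rank, assign_job(rank))
```

Listing the assignments this way makes it easy to confirm that each (k, q) pair is trained exactly once before launching the real run.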

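I saved the script as a Python module named mycode.py and ran it from the console with 11 processes (ranks 0-9 train, rank 10 loads and broadcasts the data):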

mpiexec -n 11 python ./mycode.py
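
The RAM problem mentioned in this answer follows from simple arithmetic: after the broadcast, each of the 11 processes keeps a full copy of LOW and HIGH. The sketch below is a rough lower bound only, assuming the roughly 10 GB figure from the question is the combined size of both arrays. The function names and the 64 GB machine are hypothetical.

```python
def replicated_ram_gb(dataset_gb, n_processes):
    """Lower bound on host RAM when every MPI rank keeps a full copy of the data."""
    return dataset_gb * n_processes


def max_processes(dataset_gb, ram_gb):
    """Largest number of processes whose data copies still fit in ram_gb."""
    return int(ram_gb // dataset_gb)


if __name__ == "__main__":
    # 11 ranks, each holding about 10 GB of data.
    print(replicated_ram_gb(10, 11))  # 110
    # On a hypothetical machine with 64 GB of RAM.
    print(max_processes(10, 64))      # 6
```

The real footprint is higher than this estimate, because the pickle-based bcast creates temporary buffers and TensorFlow needs host memory of its own.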
