Is my batch (gradient) accumulation implementation correct?
I would like to know whether my training code that uses batch accumulation is correct, especially the loss-calculation part, since I am not sure it is the right approach. Here is my code:
def train(start_epochs, n_epochs, best_acc, train_generator, val_generator, model, optimizer, criterion, checkpoint_path, best_model_path):
    #num_epochs = 25
    since = time.time()
    #best_model_wts = copy.deepcopy(model.state_dict())
    #best_acc = 0.0
    train_loss = []
    val_loss = []
    train_acc = []
    val_acc = []
    batch_accumulation = 8
    for epoch in tqdm(range(start_epochs, n_epochs + 1)):
        running_train_loss = 0.0
        running_val_loss = 0.0
        running_train_corrects = 0
        running_val_corrects = 0
        optimizer.zero_grad
        # Training
        model.train()
        for i, (faces, labels) in tqdm(enumerate(train_generator)):
            faces = faces.to(device)
            labels = labels.to(device)
            # forward
            outputs = model(faces)
            # predictions of the model determined using the torch.max() function,
            # which returns the index of the maximum value in a tensor
            _, preds = torch.max(outputs[1], 1)
            # pass the model outputs and the true image labels to the loss function
            loss = criterion(outputs[1], labels)
            #loss = loss / batch_accumulation
            running_train_loss += loss.item()
            # Backprop and Adam optimisation
            loss.backward()
            # Track the accuracy and loss
            running_train_corrects += torch.sum(preds == labels.data)
            if (i + 1) % batch_accumulation == 0:
                optimizer.step()
                optimizer.zero_grad  # zero the gradient buffers
        # calculate average losses and accuracy
        epoch_train_loss = running_train_loss / len(train_generator.dataset)
        epoch_train_acc = (running_train_corrects.double() / len(train_generator.dataset)) * 100
        train_loss.append(epoch_train_loss)
        train_acc.append(epoch_train_acc)
        print('Train Loss: {:.4f} Train Acc: {:.2f}%'.format(epoch_train_loss, epoch_train_acc))
        # Validation
        with torch.set_grad_enabled(False):
            model.eval()
            for i, (faces_val, labels_val) in tqdm(enumerate(val_generator)):
                faces_val = faces_val.to(device)
                labels_val = labels_val.to(device)
                if (i + 1) % batch_accumulation == 0:
                    outputs_val = model(faces_val)
                    _, preds_val = torch.max(outputs_val[1], 1)
                    loss_val = criterion(outputs_val[1], labels_val)
                    running_val_loss += loss_val.item()
                    #running_val_loss = running_val_loss + ((1 / (i + 1)) * (loss.item() - running_val_loss))
                    running_val_corrects += torch.sum(preds_val == labels_val.data)
        # calculate average losses and accuracy
        epoch_val_loss = running_val_loss / len(validation_generator.dataset)
        epoch_val_acc = (running_val_corrects.double() / len(validation_generator.dataset)) * 100
        val_loss.append(epoch_val_loss)
        val_acc.append(epoch_val_acc)
        print('Validation Loss: {:.4f} Validation Acc: {:.2f}%'.format(epoch_val_loss, epoch_val_acc))
I am getting strange training loss values per epoch (e.g. 456.890), and I have also noticed something odd about the if statement in the validation part.
Answer
You are probably missing the parentheses here:
optimizer.zero_grad # zero the gradient buffers
The correct way to call it is:
optimizer.zero_grad()
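Without the parentheses, the line is just an attribute access that returns the bound method and silently does nothing, so the gradients keep accumulating forever. A minimal pure-Python illustration (using a toy stand-in class, not torch itself):

```python
class Optimizer:
    """Toy stand-in for torch.optim.Optimizer (illustration only)."""
    def __init__(self):
        self.grads = [1.0, 2.0]

    def zero_grad(self):
        self.grads = [0.0 for _ in self.grads]

opt = Optimizer()
opt.zero_grad                    # attribute access: returns the bound method, does nothing
assert opt.grads == [1.0, 2.0]   # gradients untouched
opt.zero_grad()                  # actually calls the method
assert opt.grads == [0.0, 0.0]   # gradients cleared
```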
There is also no need for gradient accumulation during the validation phase, so this part:
if (i+1) % batch_accumulation == 0:
    outputs_val = model(faces_val)
makes no sense (the if is not needed); as written, it silently skips most of your validation batches. The technique is only used during training, to make the gradient estimate from small batches more accurate, so that is where we should focus.
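For reference, the validation loop can simply evaluate every batch under torch.no_grad(). Here is a minimal sketch, with a hypothetical tiny model and random data standing in for the asker's faces/labels setup:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the asker's model, criterion, and data loader.
model = nn.Linear(4, 2)
criterion = nn.CrossEntropyLoss()
val_generator = [(torch.randn(3, 4), torch.randint(0, 2, (3,))) for _ in range(5)]

model.eval()
running_val_loss = 0.0
running_val_corrects = 0
n_samples = 0
with torch.no_grad():  # no gradients needed; every batch is evaluated, no `if`
    for faces_val, labels_val in val_generator:
        outputs_val = model(faces_val)
        _, preds_val = torch.max(outputs_val, 1)
        # weight the mean batch loss by batch size so the epoch average is exact
        running_val_loss += criterion(outputs_val, labels_val).item() * faces_val.size(0)
        running_val_corrects += (preds_val == labels_val).sum().item()
        n_samples += faces_val.size(0)

epoch_val_loss = running_val_loss / n_samples
epoch_val_acc = 100.0 * running_val_corrects / n_samples
```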
Gradient accumulation
Each time backward() is run, the computed gradients are added to the leaves of the graph. Usually the loss is a mean over the whole batch (the sum divided by the number of elements in the batch). Here you accumulate gradients across several batches, so you should divide the loss by the number of accumulation steps, giving (what you actually have commented out):
loss = criterion(outputs[1], labels)
loss = loss / batch_accumulation
Otherwise the effective loss may be too large (which is probably what is happening here), destabilizing the network even with a small learning rate.
You could also update
running_train_loss += loss.item()
once per accumulation step rather than on every batch.
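Putting these points together, here is a minimal sketch of one training epoch with gradient accumulation. The model, data, and hyperparameters are hypothetical stand-ins for the asker's setup:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the asker's model, criterion, and data loader.
batch_accumulation = 4
model = nn.Linear(4, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
train_generator = [(torch.randn(2, 4), torch.randint(0, 2, (2,))) for _ in range(8)]

model.train()
running_train_loss = 0.0
optimizer.zero_grad()                       # note the parentheses
for i, (faces, labels) in enumerate(train_generator):
    loss = criterion(model(faces), labels)
    running_train_loss += loss.item()       # track the unscaled loss for logging
    (loss / batch_accumulation).backward()  # scale so accumulated grads average out
    if (i + 1) % batch_accumulation == 0:
        optimizer.step()
        optimizer.zero_grad()               # zero the gradient buffers (called!)
```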
Finally, as @Dishin H Goyani pointed out, you should call zero_grad as a function:
optimizer.zero_grad()