How to solve: PyTorch dropout (?) causes different model convergence for training+validation vs. training-only
We are facing a very strange problem. We tested the exact same model in two different "execution" settings. In the first case, given a certain number of epochs, we train for one epoch on mini-batches and then test on the validation set following the same criteria, then move on to the next epoch. Naturally, before every training epoch we call model.train(), and before validation we enable model.eval().
Then we take the exact same model (same init, same dataset, same epochs, etc.) and simply train it without any validation after each epoch.
Looking only at the performance on the training set, we observe that, even though we fixed all the seeds, the two training processes evolve differently and produce quite different metrics (loss, accuracy, and so on). Specifically, the training-only run performs worse.
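In pseudocode, the two settings look like this (the helper names train_one_epoch and evaluate are illustrative; the full training code is further below):

# Setting 1: train, then validate, at every epoch
for epoch in range(EPOCHS):
    model.train()
    train_one_epoch(model)   # mini-batch training
    model.eval()
    with torch.no_grad():
        evaluate(model)      # same criteria as training, no weight updates

# Setting 2: identical model, init and seeds, training only
for epoch in range(EPOCHS):
    model.train()
    train_one_epoch(model)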
We have also observed the following:
- This is not a reproducibility issue, because multiple executions of the same process produce exactly the same results (as expected);
- If we remove the dropout, the problem seems to disappear;
- The BatchNorm1d layers, which also behave differently between training and evaluation, appear to work correctly;
- The problem still occurs if we move the training from TPU to CPU. We have tried PyTorch 1.6, PyTorch nightly, and XLA 1.6.
We have lost an entire day on this issue (and no, we cannot avoid using dropout). Does anyone have an idea of how to solve it?
Thank you very much!
P.S. Here is the code used for training (on CPU).
import time
import random

import numpy as np
import torch
import torch.nn as nn

# BATCH_SIZE and Net are defined elsewhere in our code

def sigmoid(x):
    return 1 / (1 + torch.exp(-x))
def _run(model, EPOCHS, training_data_in, validation_data_in=None):

    def train_fn(train_dataloader, model, optimizer, criterion):
        running_loss = 0.
        running_accuracy = 0.
        running_tp = 0.
        running_tn = 0.
        running_fp = 0.
        running_fn = 0.

        model.train()
        for batch_idx, (ecg, spo2, labels) in enumerate(train_dataloader, 1):
            optimizer.zero_grad()
            outputs = model(ecg)
            loss = criterion(outputs, labels)
            loss.backward()  # calculate the gradients
            torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
            optimizer.step()  # update the network weights

            running_loss += loss.item()
            predicted = torch.round(sigmoid(outputs.data))  # here determining the sigmoid, not included in the model
            running_accuracy += (predicted == labels).sum().item() / labels.size(0)

            fp = ((predicted - labels) == 1.).sum().item()
            fn = ((predicted - labels) == -1.).sum().item()
            tp = ((predicted + labels) == 2.).sum().item()
            tn = ((predicted + labels) == 0.).sum().item()
            running_tp += tp
            running_fp += fp
            running_tn += tn
            running_fn += fn

        retval = {
            'loss': running_loss / batch_idx,
            'accuracy': running_accuracy / batch_idx,
            'tp': running_tp,
            'tn': running_tn,
            'fp': running_fp,
            'fn': running_fn,
        }
        return retval
    def valid_fn(valid_dataloader, criterion):
        running_loss = 0.
        running_accuracy = 0.
        running_tp = 0.
        running_tn = 0.
        running_fp = 0.
        running_fn = 0.

        model.eval()
        for batch_idx, (ecg, spo2, labels) in enumerate(valid_dataloader, 1):
            outputs = model(ecg)
            loss = criterion(outputs, labels)

            running_loss += loss.item()
            predicted = torch.round(sigmoid(outputs.data))  # here determining the sigmoid, not included in the model
            running_accuracy += (predicted == labels).sum().item() / labels.size(0)

            fp = ((predicted - labels) == 1.).sum().item()
            fn = ((predicted - labels) == -1.).sum().item()
            tp = ((predicted + labels) == 2.).sum().item()
            tn = ((predicted + labels) == 0.).sum().item()
            running_tp += tp
            running_fp += fp
            running_tn += tn
            running_fn += fn

        retval = {
            'loss': running_loss / batch_idx,
            'accuracy': running_accuracy / batch_idx,
            'tp': running_tp,
            'tn': running_tn,
            'fp': running_fp,
            'fn': running_fn,
        }
        return retval
    # Defining data loaders
    train_dataloader = torch.utils.data.DataLoader(training_data_in, batch_size=BATCH_SIZE, shuffle=True, num_workers=1)
    if validation_data_in != None:
        validation_dataloader = torch.utils.data.DataLoader(validation_data_in, shuffle=False, num_workers=1)

    # Defining the loss function
    criterion = nn.BCEWithLogitsLoss()

    # Defining the optimizer
    import torch.optim as optim
    optimizer = optim.AdamW(model.parameters(), lr=3e-4, amsgrad=False, eps=1e-07)

    # Training code
    metrics_history = {"loss": [], "accuracy": [], "precision": [], "recall": [], "f1": [], "specificity": [], "accuracy_bis": [], "tp": [], "tn": [], "fp": [], "fn": [],
                       "val_loss": [], "val_accuracy": [], "val_precision": [], "val_recall": [], "val_f1": [], "val_specificity": [], "val_accuracy_bis": [], "val_tp": [], "val_tn": [], "val_fp": [], "val_fn": []}

    train_begin = time.time()
    for epoch in range(EPOCHS):
        start = time.time()
        print("EPOCH:", epoch + 1)

        train_metrics = train_fn(train_dataloader=train_dataloader, model=model, optimizer=optimizer, criterion=criterion)

        metrics_history["loss"].append(train_metrics["loss"])
        metrics_history["accuracy"].append(train_metrics["accuracy"])
        metrics_history["tp"].append(train_metrics["tp"])
        metrics_history["tn"].append(train_metrics["tn"])
        metrics_history["fp"].append(train_metrics["fp"])
        metrics_history["fn"].append(train_metrics["fn"])

        precision = train_metrics["tp"] / (train_metrics["tp"] + train_metrics["fp"]) if train_metrics["tp"] > 0 else 0
        recall = train_metrics["tp"] / (train_metrics["tp"] + train_metrics["fn"]) if train_metrics["tp"] > 0 else 0
        specificity = train_metrics["tn"] / (train_metrics["tn"] + train_metrics["fp"]) if train_metrics["tn"] > 0 else 0
        f1 = 2 * precision * recall / (precision + recall) if precision * recall > 0 else 0

        metrics_history["precision"].append(precision)
        metrics_history["recall"].append(recall)
        metrics_history["f1"].append(f1)
        metrics_history["specificity"].append(specificity)
        if validation_data_in != None:
            # Calculate the metrics on the validation data, in the same way as done for training
            with torch.no_grad():  # don't keep track of the info necessary to calculate the gradients
                val_metrics = valid_fn(valid_dataloader=validation_dataloader, criterion=criterion)

            metrics_history["val_loss"].append(val_metrics["loss"])
            metrics_history["val_accuracy"].append(val_metrics["accuracy"])
            metrics_history["val_tp"].append(val_metrics["tp"])
            metrics_history["val_tn"].append(val_metrics["tn"])
            metrics_history["val_fp"].append(val_metrics["fp"])
            metrics_history["val_fn"].append(val_metrics["fn"])

            val_precision = val_metrics["tp"] / (val_metrics["tp"] + val_metrics["fp"]) if val_metrics["tp"] > 0 else 0
            val_recall = val_metrics["tp"] / (val_metrics["tp"] + val_metrics["fn"]) if val_metrics["tp"] > 0 else 0
            val_specificity = val_metrics["tn"] / (val_metrics["tn"] + val_metrics["fp"]) if val_metrics["tn"] > 0 else 0
            val_f1 = 2 * val_precision * val_recall / (val_precision + val_recall) if val_precision * val_recall > 0 else 0

            metrics_history["val_precision"].append(val_precision)
            metrics_history["val_recall"].append(val_recall)
            metrics_history["val_f1"].append(val_f1)
            metrics_history["val_specificity"].append(val_specificity)
print(" > Training/validation loss:",round(train_metrics['loss'],4),round(val_metrics['loss'],4))
print(" > Training/validation accuracy:",round(train_metrics['accuracy'],round(val_metrics['accuracy'],4))
print(" > Training/validation precision:",round(precision,round(val_precision,4))
print(" > Training/validation recall:",round(recall,round(val_recall,4))
print(" > Training/validation f1:",round(f1,round(val_f1,4))
print(" > Training/validation specificity:",round(specificity,round(val_specificity,4))
else:
print(" > Training loss:",4))
print(" > Training accuracy:",4))
print(" > Training precision:",4))
print(" > Training recall:",4))
print(" > Training f1:",4))
print(" > Training specificity:",4))
print("Completed in:",round(time.time() - start,1),"seconds \n")
print("Training completed in:",round((time.time()- train_begin)/60,"minutes")
    # Save the model weights
    torch.save(model.state_dict(), './nnet_model.pt')
    # Save the metrics history
    torch.save(metrics_history, 'training_history')
Below is the function that initializes the model and sets the seeds; it is called before every execution of the `_run` code:
def reinit_model():
    torch.manual_seed(42)
    np.random.seed(42)
    random.seed(42)
    net = Net()  # the model
    return net
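For completeness, this is roughly how the two runs are launched (the dataset variables train_dataset and val_dataset are placeholders for our actual data):

# Run 1: training + validation at every epoch
model = reinit_model()
_run(model, EPOCHS, training_data_in=train_dataset, validation_data_in=val_dataset)

# Run 2: same init and seeds, training only
model = reinit_model()
_run(model, EPOCHS, training_data_in=train_dataset)  # validation_data_in stays None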
Solution
OK, I found the problem. The issue comes down to the fact that, apparently, running the evaluation changes the state of some random seed, and this affects the training phase.
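The effect can be demonstrated in isolation: dropout draws its masks from the global RNG, so anything that consumes that RNG between training steps changes every subsequent mask, even with identical seeding. A minimal sketch:

import torch

torch.manual_seed(42)
drop = torch.nn.Dropout(p=0.5)  # training mode by default
x = torch.ones(8)

out_a = drop(x)                 # dropout mask drawn from the global RNG

torch.manual_seed(42)           # re-seed exactly as before
_ = torch.rand(100)             # but now something else consumes the RNG first
out_b = drop(x)                 # a different mask, despite the identical seed

print(torch.equal(out_a, out_b))  # almost surely False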
The solution is as follows:
- At the beginning of the function `_run()`, set all the seed states to the desired value, e.g. 42, then save the seed state to disk.
- At the beginning of the function `train_fn()`, read the seed state back from disk and set it.
- At the end of the function `train_fn()`, save the seed state to disk again.
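On plain CPU PyTorch (no XLA), a minimal sketch of this save/restore pattern could look like the following; the file name 'rng_state.pt' is just an example:

import torch

RNG_FILE = 'rng_state.pt'  # illustrative file name

def save_rng_state():
    # persist the global CPU RNG state to disk
    torch.save(torch.get_rng_state(), RNG_FILE)

def restore_rng_state():
    # restore the global CPU RNG state saved previously
    torch.set_rng_state(torch.load(RNG_FILE))

# at the beginning of _run():
torch.manual_seed(42)
save_rng_state()

# at the beginning of train_fn():
restore_rng_state()

# ... the training loop runs here, consuming the RNG ...

# at the end of train_fn():
save_rng_state()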
For example, when running on a TPU with XLA, the following instructions have to be used:
- At the beginning of the function `_run()`: xm.set_rng_state(42) and xm.save(xm.get_rng_state(), 'xm_seed')
- At the beginning of the function `train_fn()`: xm.set_rng_state(torch.load('xm_seed'), device=device) (you can also print the seed state here for verification: xm.master_print(xm.get_rng_state()))
- At the end of the function `train_fn()`: xm.save(xm.get_rng_state(), 'xm_seed')
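Put together, the XLA variant looks roughly like this (assuming the usual torch_xla import and that device was obtained via xm.xla_device()):

import torch
import torch_xla.core.xla_model as xm

# at the beginning of _run():
xm.set_rng_state(42)
xm.save(xm.get_rng_state(), 'xm_seed')

# at the beginning of train_fn():
xm.set_rng_state(torch.load('xm_seed'), device=device)
xm.master_print(xm.get_rng_state())  # optional: print the state for verification

# at the end of train_fn():
xm.save(xm.get_rng_state(), 'xm_seed')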