PyTorch: Dropout (?) causes different model convergence for training+validation vs. training only


We are facing a very strange problem. We have tested exactly the same model in two different "execution" settings. In the first case, given a certain number of epochs, we train on mini-batches for one epoch and then evaluate on the validation set following the same procedure; then we move on to the next epoch. Naturally, before each training epoch we call model.train(), and before validation we switch to model.eval().

We then take exactly the same model (same initialization, same dataset, same number of epochs, etc.) and simply train it, without running validation after each epoch.

Looking only at the performance on the training set, we find that, even though we fixed all the seeds, the two training procedures evolve differently and produce quite different metrics (loss, accuracy, and so on). Specifically, the training-only procedure performs worse.

We also observed the following:

  • It is not a reproducibility issue: running the same procedure multiple times produces exactly the same results (as expected);
  • If we remove the dropout, the problem seems to disappear;
  • The BatchNorm1d layers, which also behave differently between training and evaluation, appear to work correctly;
  • The issue still occurs if we move the training from TPU to CPU. We have tried PyTorch 1.6, PyTorch nightly, and XLA 1.6.

We have lost an entire day trying to solve this issue (and no, we cannot avoid using dropout). Does anyone have any idea how to explain or fix this behaviour?

Thank you very much!

P.S. Here is the code used for training (on CPU):

# Assumed imports (BATCH_SIZE and the Net class are defined elsewhere in our code)
import time
import random
import numpy as np
import torch
import torch.nn as nn


def sigmoid(x):
    return 1 / (1 + torch.exp(-x))


def _run(model,EPOCHS,training_data_in,validation_data_in=None):
    
    def train_fn(train_dataloader,model,optimizer,criterion):

        running_loss = 0.
        running_accuracy = 0.
        running_tp = 0.
        running_tn = 0.
        running_fp = 0.
        running_fn = 0.
        
        model.train()

        for batch_idx,(ecg,spo2,labels) in enumerate(train_dataloader,1):

            optimizer.zero_grad() 
                
            outputs = model(ecg)

            loss = criterion(outputs,labels)
                        
            loss.backward() # calculate the gradients
            torch.nn.utils.clip_grad_norm_(model.parameters(),0.5)
            optimizer.step() # update the network weights
                                                
            running_loss += loss.item()
            predicted = torch.round(sigmoid(outputs.data)) # here determining the sigmoid,not included in the model
            
            running_accuracy += (predicted == labels).sum().item() / labels.size(0)   
            
            fp = ((predicted - labels) == 1.).sum().item() 
            fn = ((predicted - labels) == -1.).sum().item()
            tp = ((predicted + labels) == 2.).sum().item()
            tn = ((predicted + labels) == 0.).sum().item()
            running_tp += tp
            running_fp += fp
            running_tn += tn
            running_fn += fn
            
        retval = {'loss':running_loss / batch_idx,'accuracy':running_accuracy / batch_idx,
                  'tp':running_tp,'tn':running_tn,'fp':running_fp,'fn':running_fn}
            
        return retval
            

        
    def valid_fn(valid_dataloader,criterion):

        running_loss = 0.
        running_accuracy = 0.
        running_tp = 0.
        running_tn = 0.
        running_fp = 0.
        running_fn = 0.

        model.eval()
        
        for batch_idx,(ecg,spo2,labels) in enumerate(valid_dataloader,1):

            outputs = model(ecg)

            loss = criterion(outputs,labels)
            
            running_loss += loss.item()
            predicted = torch.round(sigmoid(outputs.data)) # here determining the sigmoid,not included in the model

            running_accuracy += (predicted == labels).sum().item() / labels.size(0)  
            
            fp = ((predicted - labels) == 1.).sum().item()
            fn = ((predicted - labels) == -1.).sum().item()
            tp = ((predicted + labels) == 2.).sum().item()
            tn = ((predicted + labels) == 0.).sum().item()
            running_tp += tp
            running_fp += fp
            running_tn += tn
            running_fn += fn
            
        retval = {'loss':running_loss / batch_idx,'accuracy':running_accuracy / batch_idx,
                  'tp':running_tp,'tn':running_tn,'fp':running_fp,'fn':running_fn}
            
        return retval
    
    
    
    # Defining data loaders

    train_dataloader = torch.utils.data.DataLoader(training_data_in,batch_size=BATCH_SIZE,shuffle=True,num_workers=1)
    
    if validation_data_in != None:
        validation_dataloader = torch.utils.data.DataLoader(validation_data_in,batch_size=BATCH_SIZE,shuffle=False,num_workers=1)


    # Defining the loss function
    criterion = nn.BCEWithLogitsLoss()
    
    
    # Defining the optimizer
    import torch.optim as optim
    optimizer = optim.AdamW(model.parameters(),lr=3e-4,amsgrad=False,eps=1e-07) 


    # Training code
    
    metrics_history = {"loss":[],"accuracy":[],"precision":[],"recall":[],"f1":[],"specificity":[],"accuracy_bis":[],
                       "tp":[],"tn":[],"fp":[],"fn":[],
                       "val_loss":[],"val_accuracy":[],"val_precision":[],"val_recall":[],"val_f1":[],"val_specificity":[],"val_accuracy_bis":[],
                       "val_tp":[],"val_tn":[],"val_fp":[],"val_fn":[]}
    
    train_begin = time.time()
    for epoch in range(EPOCHS):
        start = time.time()

        print("EPOCH:",epoch+1)

        train_metrics = train_fn(train_dataloader=train_dataloader,model=model,optimizer=optimizer,criterion=criterion)
        
        metrics_history["loss"].append(train_metrics["loss"])
        metrics_history["accuracy"].append(train_metrics["accuracy"])
        metrics_history["tp"].append(train_metrics["tp"])
        metrics_history["tn"].append(train_metrics["tn"])
        metrics_history["fp"].append(train_metrics["fp"])
        metrics_history["fn"].append(train_metrics["fn"])
        
        precision = train_metrics["tp"] / (train_metrics["tp"] + train_metrics["fp"]) if train_metrics["tp"] > 0 else 0
        recall = train_metrics["tp"] / (train_metrics["tp"] + train_metrics["fn"]) if train_metrics["tp"] > 0 else 0
        specificity = train_metrics["tn"] / (train_metrics["tn"] + train_metrics["fp"]) if train_metrics["tn"] > 0 else 0
        f1 = 2*precision*recall / (precision + recall) if precision*recall > 0 else 0
        metrics_history["precision"].append(precision)
        metrics_history["recall"].append(recall)
        metrics_history["f1"].append(f1)
        metrics_history["specificity"].append(specificity)
        
        
        
        if validation_data_in != None:    
            # Calculate the metrics on the validation data,in the same way as done for training
            with torch.no_grad(): # don't keep track of the info necessary to calculate the gradients

                val_metrics = valid_fn(valid_dataloader=validation_dataloader,criterion=criterion)

                metrics_history["val_loss"].append(val_metrics["loss"])
                metrics_history["val_accuracy"].append(val_metrics["accuracy"])
                metrics_history["val_tp"].append(val_metrics["tp"])
                metrics_history["val_tn"].append(val_metrics["tn"])
                metrics_history["val_fp"].append(val_metrics["fp"])
                metrics_history["val_fn"].append(val_metrics["fn"])

                val_precision = val_metrics["tp"] / (val_metrics["tp"] + val_metrics["fp"]) if val_metrics["tp"] > 0 else 0
                val_recall = val_metrics["tp"] / (val_metrics["tp"] + val_metrics["fn"]) if val_metrics["tp"] > 0 else 0
                val_specificity = val_metrics["tn"] / (val_metrics["tn"] + val_metrics["fp"]) if val_metrics["tn"] > 0 else 0
                val_f1 = 2*val_precision*val_recall / (val_precision + val_recall) if val_precision*val_recall > 0 else 0
                metrics_history["val_precision"].append(val_precision)
                metrics_history["val_recall"].append(val_recall)
                metrics_history["val_f1"].append(val_f1)
                metrics_history["val_specificity"].append(val_specificity)


            print("  > Training/validation loss:",round(train_metrics['loss'],4),round(val_metrics['loss'],4))
            print("  > Training/validation accuracy:",round(train_metrics['accuracy'],round(val_metrics['accuracy'],4))
            print("  > Training/validation precision:",round(precision,round(val_precision,4))
            print("  > Training/validation recall:",round(recall,round(val_recall,4))
            print("  > Training/validation f1:",round(f1,round(val_f1,4))
            print("  > Training/validation specificity:",round(specificity,round(val_specificity,4))
        else:
            print("  > Training loss:",4))
            print("  > Training accuracy:",4))
            print("  > Training precision:",4))
            print("  > Training recall:",4))
            print("  > Training f1:",4))
            print("  > Training specificity:",4))


        print("Completed in:",round(time.time() - start,1),"seconds \n")

    print("Training completed in:",round((time.time()- train_begin)/60,"minutes")    

    
    
    # Save the model weights
    torch.save(model.state_dict(),'./nnet_model.pt')
    
    
    # Save the metrics history
    torch.save(metrics_history,'training_history')

Below is the function that initializes the model and sets the seeds; it is called before every execution of the "_run" code:

def reinit_model():
    torch.manual_seed(42)
    np.random.seed(42)
    random.seed(42)
    net = Net() # the model
    return net
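
For reference, the two experiments described above are driven roughly as follows (a hypothetical driver sketch; train_ds, val_ds and EPOCHS stand in for our actual dataset objects and epoch count):

# Experiment 1: training with validation after each epoch
model = reinit_model()
_run(model,EPOCHS,training_data_in=train_ds,validation_data_in=val_ds)

# Experiment 2: exactly the same model and seeds,but training only
model = reinit_model()
_run(model,EPOCHS,training_data_in=train_ds,validation_data_in=None)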

Solution

OK, I found the problem. The issue comes from the fact that, apparently, running the evaluation changes some random seed/RNG state, and this affects the training phase.
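
A minimal sketch of the mechanism (not our model, just the global PyTorch RNG): any extra draw that happens between two training steps, such as whatever the evaluation pass consumes, shifts every subsequent random number and therefore every subsequent dropout mask.

import torch

torch.manual_seed(42)
a = torch.rand(3)   # first draws after seeding

torch.manual_seed(42)
_ = torch.rand(1)   # an extra draw in between (stand-in for whatever evaluation consumes)
b = torch.rand(3)   # b differs from a: the intervening draw advanced the RNG state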

The solution is therefore as follows:

  • At the beginning of the function "_run()", set all the seed states to the desired value, e.g. 42, and then save those seed states to disk.
  • At the beginning of the function "train_fn()", read the seed states back from disk and set them.
  • At the end of the function "train_fn()", save the seed states to disk.

For example, when running on TPU with XLA, the following instructions have to be used:

  • At the beginning of "_run()": xm.set_rng_state(42); xm.save(xm.get_rng_state(),'xm_seed')
  • At the beginning of "train_fn()": xm.set_rng_state(torch.load('xm_seed'),device=device) (you can also print the seed here for verification: xm.master_print(xm.get_rng_state()))
  • At the end of "train_fn()": xm.save(xm.get_rng_state(),'xm_seed')
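
When running on CPU/GPU without XLA, the same save/restore idea can presumably be written with plain PyTorch RNG-state calls (a sketch under that assumption, not part of the original fix; 'rng_state.pt' is an arbitrary file name):

# beginning of _run():
torch.manual_seed(42)
torch.save(torch.get_rng_state(),'rng_state.pt')

# beginning of train_fn():
torch.set_rng_state(torch.load('rng_state.pt'))

# end of train_fn():
torch.save(torch.get_rng_state(),'rng_state.pt')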

