Is my batch accumulation implementation correct?


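I would like to know whether my code for training a model with batch accumulation is correct. I am especially unsure about the loss computation. Here is my code: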

def train (start_epochs,n_epochs,best_acc,train_generator,val_generator,model,optimizer,criterion,checkpoint_path,best_model_path):


#num_epochs = 25
  since = time.time()

  #best_model_wts = copy.deepcopy(model.state_dict())
  #best_acc = 0.0
  train_loss = []
  val_loss = []
  train_acc = []
  val_acc = []

  batch_accumulation = 8

  for epoch in tqdm(range(start_epochs,n_epochs+1)):

    running_train_loss = 0.0
    running_val_loss = 0.0

    running_train_corrects = 0
    running_val_corrects = 0

    optimizer.zero_grad
    #Training
    model.train()
    for i,(faces,labels) in tqdm(enumerate(train_generator)):
      
      faces = faces.to(device)
      labels = labels.to(device)

      #forward
      outputs = model(faces)

      #predictions of the model determined using the torch.max() function,which returns the index of the maximum value in a tensor.
      _,preds = torch.max(outputs[1],1)

      #pass the model outputs and the true image labels to the loss function
      loss = criterion(outputs[1],labels)
      #loss = loss / batch_accumulation
      running_train_loss += loss.item()
      # Backprop and Adam optimisation
      loss.backward()
      # Track the accuracy and loss
      running_train_corrects += torch.sum(preds == labels.data)

      if (i+1)% batch_accumulation == 0:
        optimizer.step()
        optimizer.zero_grad # zero the gradient buffers 
       
    # calculate average losses and accuracy  
    epoch_train_loss = running_train_loss / len(train_generator.dataset)
    epoch_train_acc = ((running_train_corrects.double() / len(train_generator.dataset)) * 100)
    train_loss.append(epoch_train_loss)
    train_acc.append(epoch_train_acc)

    print('Train Loss: {:.4f} Train Acc: {:.2f}%'.format(epoch_train_loss,epoch_train_acc))

    #Validation
    with torch.set_grad_enabled(False):
      model.eval()
      for i,(faces_val,labels_val) in tqdm(enumerate(val_generator)):

        faces_val = faces_val.to(device)
        labels_val = labels_val.to(device)
        
        if (i+1)% batch_accumulation == 0:

          outputs_val = model(faces_val)

          _,preds_val = torch.max(outputs_val[1],1)
          loss_val = criterion(outputs_val[1],labels_val)

          running_val_loss += loss_val.item() 
          #running_val_loss = running_val_loss +((1 /(i+1)) * (loss.item() - running_val_loss))
          running_val_corrects += torch.sum(preds_val == labels_val.data)

    # calculate average losses and accuracy 
    epoch_val_loss = running_val_loss / len(validation_generator.dataset)
    epoch_val_acc = (running_val_corrects.double() / len(validation_generator.dataset)) * 100
    val_loss.append(epoch_val_loss)
    val_acc.append(epoch_val_acc)

    print('Validation Loss: {:.4f} Validation Acc: {:.2f}%'.format(epoch_val_loss,epoch_val_acc))

I get strange per-epoch training results (for example 456.890), and I am also unsure about the if statement in the validation part.

Solution

You appear to be missing the parentheses here, so the method is only referenced and never called:

optimizer.zero_grad # zero the gradient buffers 

The correct call is

optimizer.zero_grad()
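To make the effect of the missing parentheses concrete, here is a minimal pure-Python sketch. FakeOptimizer is a hypothetical stand-in invented for this illustration, not the torch.optim API; it only mimics a gradient buffer that accumulates until it is reset.

```python
class FakeOptimizer:
    """Hypothetical stand-in for an optimizer; NOT the torch.optim API."""

    def __init__(self):
        self.grad = 0.0

    def accumulate(self, g):
        # Mimics backward(): new gradients are added to the stored ones.
        self.grad += g

    def zero_grad(self):
        self.grad = 0.0


def grad_after_two_batches(call_zero_grad):
    opt = FakeOptimizer()
    opt.accumulate(1.5)      # first batch
    if call_zero_grad:
        opt.zero_grad()      # method is called: buffer is reset
    else:
        opt.zero_grad        # method is only referenced: nothing happens
    opt.accumulate(2.0)      # second batch
    return opt.grad


print(grad_after_two_batches(True))   # gradient of the second batch only
print(grad_after_two_batches(False))  # gradients of both batches added up
```

Python accepts the bare attribute access as a valid statement, which is why no error is raised in the original code.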
There is also no need to use gradient accumulation (the usual name for this technique) during validation, so this part:

if (i+1)% batch_accumulation == 0:    
    outputs_val = model(faces_val)

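makes no sense. The if is not needed, and with it only every batch_accumulation-th validation batch is evaluated at all. Gradient accumulation is used only in training, to make the gradient estimate from small batches more accurate, so that is what we should focus on.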
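As a quick illustration of what that if does to validation, the sketch below lists which batch indices would actually be evaluated. The helper name evaluated_batches and the batch count of 20 are assumptions made up for the example.

```python
def evaluated_batches(num_batches, batch_accumulation):
    """Indices of the batches that pass the (i + 1) % batch_accumulation == 0 check."""
    return [i for i in range(num_batches) if (i + 1) % batch_accumulation == 0]


kept = evaluated_batches(num_batches=20, batch_accumulation=8)
print(kept)        # indices of the batches that reach the model
print(len(kept))   # how many of the 20 batches are evaluated
```

The remaining batches are skipped, while the epoch averages are still divided by the size of the full validation set.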

Gradient accumulation

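Each time backward() is run, the newly computed gradients are added to the ones already stored in the leaves of the graph. Normally the loss is a mean over the whole batch (the sum divided by the number of elements in the batch). Here the gradients of several batches are accumulated before each optimizer step, so the loss should also be divided by the number of accumulation steps, which gives the following (you had in fact commented this line out):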

loss = criterion(outputs[1],labels)
loss = loss / batch_accumulation

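Otherwise the accumulated gradient is the sum of batch_accumulation batch gradients rather than their mean, so it can become too large (which is probably what happens here) and make the network unstable even with a small learning rate.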
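The effect of the division can be checked by hand on a toy problem. The following is a sketch under stated assumptions: a one-parameter model w * x with a mean squared error loss, made-up data, and equally sized micro-batches. The functions mse_grad and accumulated_grad are illustrative helpers, not part of any library.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch, scale):
    """Sum the micro-batch gradients, optionally dividing each by the step count."""
    steps = len(xs) // micro_batch
    total = 0.0
    for s in range(steps):
        lo, hi = s * micro_batch, (s + 1) * micro_batch
        g = mse_grad(w, xs[lo:hi], ys[lo:hi])
        total += g / steps if scale else g
    return total


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
w = 0.5

full = mse_grad(w, xs, ys)                                        # one batch of 8
scaled = accumulated_grad(w, xs, ys, micro_batch=2, scale=True)   # 4 steps, loss / 4
unscaled = accumulated_grad(w, xs, ys, micro_batch=2, scale=False)

print(full, scaled, unscaled)
```

With the division, the accumulated gradient matches the gradient of the full batch; without it, the result is larger by a factor equal to the number of accumulation steps (4 here).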

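You can also keep the following update, which then tracks the loss per accumulation cycle:

running_train_loss += loss.item()

Once the loss is divided by batch_accumulation, the values added over one cycle sum to the average of that cycle's batch losses.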

Finally, as @Dishin H Goyani pointed out, zero_grad must be called as a function:

optimizer.zero_grad()
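Putting the points above together, a training epoch with gradient accumulation might look like the sketch below. It is not the asker's code: the nn.Linear model, the random data, the plain list used in place of a DataLoader, and the SGD optimizer are all assumptions chosen to keep the example small, and the model here returns a single tensor rather than the tuple indexed as outputs[1] in the question.

```python
import torch
from torch import nn


def train_one_epoch(model, loader, optimizer, criterion, accumulation_steps):
    """Run one epoch with gradient accumulation; return the mean loss per batch."""
    model.train()
    optimizer.zero_grad()  # note the parentheses
    running_loss = 0.0
    num_batches = 0
    for i, (inputs, labels) in enumerate(loader):
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        running_loss += loss.item()  # log the unscaled per-batch mean loss
        num_batches += 1
        # Scale before backward() so the accumulated gradient is a mean, not a sum.
        (loss / accumulation_steps).backward()
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
    return running_loss / num_batches


torch.manual_seed(0)
# Synthetic stand-in for a DataLoader: 8 batches of 4 samples, 3 features, 2 classes.
data = [(torch.randn(4, 3), torch.randint(0, 2, (4,))) for _ in range(8)]
model = nn.Linear(3, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
mean_loss = train_one_epoch(model, data, optimizer, nn.CrossEntropyLoss(),
                            accumulation_steps=4)
print(mean_loss)
```

Because the criterion already returns a mean over each batch, the running loss is averaged over the number of batches here rather than over the number of samples. If the number of batches is not a multiple of accumulation_steps, the gradients of the trailing batches are never applied in this sketch; an extra optimizer.step() after the loop would be one way to handle that.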
