How do I run model.fit correctly on the GPU? (Unexpected behavior)

I am currently taking a Udemy course on Python for data science. It contains the following example for training a model in TensorFlow:

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Dropout

model = Sequential()

# Choose whatever number of layers/neurons you want.
model.add(Dense(units=78,activation='relu'))
model.add(Dense(units=39,activation='relu'))
model.add(Dense(units=19,activation='relu'))
model.add(Dense(units=1,activation='sigmoid'))

# https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw

model.compile(loss='binary_crossentropy',optimizer='adam')

model.fit(x=X_train,y=y_train,epochs=3,validation_data=(X_test,y_test),verbose=1
          )

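My goal now is to run this on my GPU. To do so, I changed the last part as follows (the number of epochs is deliberately low; I only want to see how long each epoch takes before scaling up):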

with tf.device("/gpu:0"):
    model.fit(x=X_train,verbose=1
              )

For comparison, I also ran it as follows:

with tf.device("/cpu:0"):
    model.fit(x=X_train,verbose=1
              )

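However, the results are very unexpected. Either both versions occupy all of the GPU's memory but apparently perform no computation on it, with each epoch taking exactly the same amount of time, or the GPU version simply crashes with the following error: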

C:\Program Files (x86)\Microsoft Visual Studio\Shared\Anaconda3_64\envs\gpu\lib\site-packages\six.py in raise_from(value,from_value)
 
InternalError:  Blas GEMM launch failed : a.shape=(32,78),b.shape=(78,m=32,n=78,k=78
     [[node sequential/dense/MatMul (defined at <ipython-input-115-79c9a84ee89a>:8) ]] [Op:__inference_distributed_function_874]
 
Function call stack:
distributed_function

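Sometimes it crashes; sometimes it works but takes just as long as on the CPU. Sometimes even the CPU version takes 20 seconds per epoch, and at other times 40 seconds. The code is unchanged; the only difference is that I restarted the kernel in between. I really do not understand this.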
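To compare runs like these more systematically, the per-epoch time can be recorded instead of read off the progress bar. The following is only a minimal sketch and not part of the course code: `EpochTimer` and `as_keras_callback` are hypothetical names, and the wrapper assumes that `tf.keras.callbacks.LambdaCallback` is available in the installed TensorFlow version.

```python
import time


class EpochTimer:
    """Records the wall-clock duration of each epoch."""

    def __init__(self):
        self.durations = []  # seconds per finished epoch
        self._start = None

    def begin(self, epoch, logs=None):
        # Signature matches the Keras on_epoch_begin(epoch, logs) hook.
        self._start = time.perf_counter()

    def end(self, epoch, logs=None):
        # Signature matches the Keras on_epoch_end(epoch, logs) hook.
        self.durations.append(time.perf_counter() - self._start)


def as_keras_callback(timer):
    """Wrap the timer in a Keras LambdaCallback (TensorFlow is imported lazily)."""
    import tensorflow as tf

    return tf.keras.callbacks.LambdaCallback(
        on_epoch_begin=timer.begin,
        on_epoch_end=timer.end,
    )
```

It would be used by creating `timer = EpochTimer()`, passing `callbacks=[as_keras_callback(timer)]` to `model.fit`, and printing `timer.durations` afterwards, once per kernel restart.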

When I test my GPU and conda environment with the following code, everything seems to work reproducibly, and the GPU is roughly 20 times faster than the CPU:

# https://gist.github.com/ikarus-999/1a845437b454cdfcc1eb5455d373fe63
import sys
import numpy as np
import tensorflow.compat.v1 as tf # compatibility for TF 1 code
from datetime import datetime
 
def test_device (device_name: str):
    shape = (int(10000),int(10000))
    startTime = datetime.now()
    with tf.device(device_name):
        random_matrix = tf.random.uniform(shape=shape,minval=0,maxval=1)
        dot_operation = tf.matmul(random_matrix,tf.transpose(random_matrix))
        sum_operation = tf.reduce_sum(dot_operation)
 
    result = sum_operation
 
    print("Shape:",shape,"Device:",device_name)
    print("—"*50)
    print(result)
    print("Time taken:",datetime.now() - startTime)
    print("\n" * 2)
    
test_device("/cpu:0") # 6 sec
test_device("/gpu:0") # 0.3 sec

So I am sure I am doing something wrong.
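One way to narrow this down might be to ask TensorFlow where it actually places each operation. The helper below is a minimal sketch under that assumption: `enable_placement_logging` is a hypothetical name, and I am assuming that `tf.debugging.set_log_device_placement` also reports the operations executed during `model.fit`.

```python
def enable_placement_logging():
    """Turn on device-placement logging and return the GPUs TensorFlow can see."""
    import tensorflow as tf  # imported lazily so the helper can be defined without TF

    # Call this right after starting the kernel, before any ops are created,
    # so that placement decisions are logged from the start.
    tf.debugging.set_log_device_placement(True)
    return tf.config.experimental.list_physical_devices("GPU")
```

If the returned list is empty, TensorFlow does not see the GPU at all; otherwise the log lines should show whether ops such as MatMul land on the GPU or the CPU.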

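TL;DR:

What is the correct way to call model.fit on the GPU?

How can different runs, without any change to the code, produce such different results (crashes, vastly different computation times)?

Thank you very much for your help!

Solution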

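After a lot of trial and error, I finally found a working way to force CPU-only or "mixed" use. GPU-only use, however, does not seem to work. The with tf.device() approach from my original post appears to have no effect at all in this case. To use only the CPU, I have to hide the GPU (TensorFlow 2.1.0):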

CPU only:

# force CPU (make CPU visible)
cpus = tf.config.experimental.list_physical_devices('CPU')
print(cpus)
tf.config.set_visible_devices([], 'GPU')  # hide the GPU
tf.config.set_visible_devices(cpus[0], 'CPU')  # unhide potentially hidden CPU
tf.config.get_visible_devices()

model.fit(x=X_train, y=y_train, epochs=25, batch_size=256, validation_data=(X_test, y_test), verbose=1)

This results in 3-4 seconds per epoch and puts no load on the GPU.

Restart the kernel, then:

GPU only:

# force GPU (make GPU visible)
# note: does not work without restarting the kernel, otherwise:
# "Visible devices cannot be modified after being initialized"
gpus = tf.config.experimental.list_physical_devices('GPU')
print(gpus)
tf.config.set_visible_devices([], 'CPU')  # hide the CPU
tf.config.set_visible_devices(gpus[0], 'GPU')  # unhide potentially hidden GPU
tf.config.get_visible_devices()

model.fit(x=X_train,verbose=1
          )

This obviously does not work, since this model apparently needs the CPU:

"NotFoundError: No CPU devices are available in this process"

Default (CPU and GPU mixed):

Restart the kernel, then call model.fit without changing the visible devices.

This results in 5-6 seconds per epoch, consumes all of the GPU's RAM, and uses only a small fraction of the GPU's processing power.

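If the default mode (CPU and GPU) raises the following error, the GPU is already occupied by another process, and restarting Windows helps: "InternalError: Blas GEMM launch failed"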
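By default TensorFlow reserves nearly all GPU memory as soon as the GPU is initialized, which matches the full GPU RAM usage described above. A commonly suggested mitigation, which I have not verified for this model, is to enable memory growth so that memory is allocated on demand. The sketch below assumes TensorFlow 2.1 or later; `enable_memory_growth` is a hypothetical helper name, and whether this prevents the Blas GEMM error in this setup is an assumption.

```python
def enable_memory_growth():
    """Ask TensorFlow to allocate GPU memory on demand instead of all at once."""
    import tensorflow as tf  # imported lazily so the helper can be defined without TF

    gpus = tf.config.experimental.list_physical_devices("GPU")
    for gpu in gpus:
        # Must run before the GPU is initialized, i.e. right after a kernel restart.
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)  # number of GPUs that were configured
```

Like the visible-device settings, this has to be called right after restarting the kernel and before the model is built.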

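Several mysteries remain for me:

  • Why is the "mixed" mode slower than CPU only?
  • Is it possible to change the visible devices without restarting the kernel, to avoid the following error? "Visible devices cannot be modified after being initialized"
  • Why does the with tf.device() approach have no effect for this model, while it works for the test_device() code?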
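On the second point, I am not aware of a way around the restart. An alternative to set_visible_devices for hiding the GPU, assuming a CUDA-based TensorFlow build, is the CUDA_VISIBLE_DEVICES environment variable. The sketch below uses a hypothetical helper name, `hide_gpus_from_cuda`, and it also has to run before TensorFlow is imported, so it does not remove the need for a fresh kernel.

```python
import os


def hide_gpus_from_cuda():
    """Hide all GPUs from CUDA-based libraries loaded later in this process."""
    # "-1" is not a valid device index, so CUDA exposes no GPU.
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
    return os.environ["CUDA_VISIBLE_DEVICES"]
```

Calling `hide_gpus_from_cuda()` in the first notebook cell, before `import tensorflow`, should leave only the CPU visible.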

If anyone can offer some insight, I would much appreciate it :)
