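How do I run model.fit correctly on the GPU? (Unexpected behavior)
I am currently taking a Udemy course on Python for data science. It contains the following example for training a model in TensorFlow: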
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Dropout
model = Sequential()
# Choose whatever number of layers/neurons you want.
model.add(Dense(units=78,activation='relu'))
model.add(Dense(units=39,activation='relu'))
model.add(Dense(units=19,activation='relu'))
model.add(Dense(units=1,activation='sigmoid'))
# https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw
model.compile(loss='binary_crossentropy',optimizer='adam')
model.fit(x=X_train,y=y_train,epochs=3,validation_data=(X_test,y_test),verbose=1
)
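My goal now is to run this on my GPU. To do so, I changed the last part as follows (the number of epochs is deliberately small; I only want to see how long each epoch takes before scaling up):
with tf.device("/gpu:0"):
    model.fit(x=X_train,y=y_train,epochs=3,validation_data=(X_test,y_test),verbose=1
    )
For comparison, I also ran:
with tf.device("/cpu:0"):
    model.fit(x=X_train,y=y_train,epochs=3,validation_data=(X_test,y_test),verbose=1
    )
However, the results are very unexpected: both versions occupy all of the GPU's memory but do not appear to compute anything on it, and each epoch takes exactly the same time. Alternatively, the GPU version simply crashes with the following error: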
C:\Program Files (x86)\Microsoft Visual Studio\Shared\Anaconda3_64\envs\gpu\lib\site-packages\six.py in raise_from(value,from_value)
InternalError: Blas GEMM launch failed : a.shape=(32, 78), b.shape=(78, 78), m=32, n=78, k=78
[[node sequential/dense/MatMul (defined at <ipython-input-115-79c9a84ee89a>:8) ]] [Op:__inference_distributed_function_874]
Function call stack:
distributed_function
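Sometimes it crashes; sometimes it works but takes just as long as on the CPU. Sometimes even the CPU version takes 20 seconds per epoch, and at other times 40 seconds. The code is unchanged between these runs; the only difference is that I restarted the kernel in between. I really do not understand it.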
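Because the per-epoch times quoted here vary so much between runs, it may help to record them explicitly rather than read them off the progress bar. The following is only a minimal sketch and not part of the original post: `EpochTimer` is a hypothetical helper, and the commented lines show how it could presumably be attached to `model.fit` through `tf.keras.callbacks.LambdaCallback`. The loop at the end merely simulates three epochs so that the snippet runs without TensorFlow.

```python
import time


class EpochTimer:
    """Hypothetical helper: collects wall-clock seconds for each epoch."""

    def __init__(self):
        self.durations = []
        self._start = None

    def begin(self, epoch, logs=None):
        # Same (epoch, logs) arguments that Keras passes to on_epoch_begin.
        self._start = time.perf_counter()

    def end(self, epoch, logs=None):
        # Same (epoch, logs) arguments that Keras passes to on_epoch_end.
        self.durations.append(time.perf_counter() - self._start)


timer = EpochTimer()

# Assumed hookup in the training notebook (not executed here):
# cb = tf.keras.callbacks.LambdaCallback(on_epoch_begin=timer.begin,
#                                        on_epoch_end=timer.end)
# model.fit(x=X_train, y=y_train, epochs=3, callbacks=[cb])

# Stand-in for training: three simulated epochs.
for epoch in range(3):
    timer.begin(epoch)
    time.sleep(0.01)
    timer.end(epoch)

print(["%.3f s" % d for d in timer.durations])
```

Collecting the durations in a list makes it easy to compare runs after each kernel restart instead of relying on memory of what the progress bar showed.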
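When I test the GPU and the conda environment with the following code, everything seems to work fine and reproducibly, and the GPU is roughly 20 times faster than the CPU: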
# https://gist.github.com/ikarus-999/1a845437b454cdfcc1eb5455d373fe63
import sys
import numpy as np
import tensorflow.compat.v1 as tf # compatibility for TF 1 code
from datetime import datetime
def test_device(device_name: str):
    shape = (int(10000),int(10000))
    startTime = datetime.now()
    with tf.device(device_name):
        random_matrix = tf.random.uniform(shape=shape,minval=0,maxval=1)
        dot_operation = tf.matmul(random_matrix,tf.transpose(random_matrix))
        sum_operation = tf.reduce_sum(dot_operation)
        result = sum_operation
    print("Shape:",shape,"Device:",device_name)
    print("-"*50)
    print(result)
    print("Time taken:",datetime.now() - startTime)
    print("\n" * 2)
test_device("/cpu:0") # 6 sec
test_device("/gpu:0") # 0.3 sec
So I am sure that I am doing something wrong.
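Timing alone only hints at where the work ran. As a complementary check, the sketch below asks TensorFlow directly which devices it sees and where a small result tensor ended up. It assumes TensorFlow 2.1 or later in eager mode and is not taken from the original post; the matrix size and the variable names are arbitrary.

```python
import tensorflow as tf

# Log the device of every executed op (verbose). Best set before any op runs.
tf.debugging.set_log_device_placement(True)

# Devices this process can place ops on.
print(tf.config.list_logical_devices())

# Request the CPU explicitly, as in the with tf.device(...) blocks above.
with tf.device("/CPU:0"):
    a = tf.random.uniform((256, 256))
    product = tf.matmul(a, a)

# Eager tensors report the device that holds them.
print(product.device)
```

Swapping `"/CPU:0"` for `"/GPU:0"` on a machine with a working GPU should change the reported device accordingly. Whether `model.fit` honors the same context manager is exactly the open question of this post, so this only verifies the simple case.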
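TL;DR:
- What is the correct way to call model.fit on the GPU?
- How can different runs of unchanged code produce such different results (crashes, widely varying computation times)?
Many thanks for your help!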
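Solution
After much trial and error, I finally found a working way to force either CPU-only or "mixed" operation. GPU-only operation, however, does not seem to work. The with tf.device() approach from my original post appears to have no effect in this case. To use only the CPU, I have to hide the GPU (TensorFlow 2.1.0):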
CPU only
# force CPU (make CPU visible)
cpus = tf.config.experimental.list_physical_devices('CPU')
print(cpus)
tf.config.set_visible_devices([],'GPU') # hide the GPU
tf.config.set_visible_devices(cpus[0],'CPU') # unhide potentially hidden CPU
tf.config.get_visible_devices()
model.fit(x=X_train,y=y_train,epochs=25,batch_size=256,validation_data=(X_test,y_test),verbose=1
)
This results in 3-4 seconds per epoch and puts no load on the GPU.
GPU only
Restart the kernel, then:
# force GPU (make GPU visible)
# note: does not work without restarting the kernel, otherwise:
# "Visible devices cannot be modified after being initialized"
gpus = tf.config.experimental.list_physical_devices('GPU')
print(gpus)
tf.config.set_visible_devices([],'CPU') # hide the CPU
tf.config.set_visible_devices(gpus[0],'GPU') # unhide potentially hidden GPU
tf.config.get_visible_devices()
model.fit(x=X_train,y=y_train,epochs=25,batch_size=256,validation_data=(X_test,y_test),verbose=1
)
This obviously does not work, because this model apparently needs the CPU:
"NotFoundError: No CPU devices are available in this process"
Default (mixed CPU and GPU):
Restart the kernel, then:
model.fit(x=X_train,y=y_train,epochs=25,batch_size=256,validation_data=(X_test,y_test),verbose=1
)
This results in 5-6 seconds per epoch, consumes all of the GPU's RAM, and uses only a small fraction of the GPU's processing power.
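The full GPU RAM usage is consistent with TensorFlow's default behavior of reserving nearly all memory on every visible GPU as soon as the GPU is initialized, regardless of how much the model needs. The sketch below, assuming TensorFlow 2.x, asks for on-demand allocation instead. It has to run right after a kernel restart, before any op has touched the GPU. Whether it changes the epoch times reported above was not tested in this post.

```python
import tensorflow as tf

# Must run before the GPU is initialized, i.e. before any tensor or model
# has been created on it in this process.
gpus = tf.config.experimental.list_physical_devices('GPU')
try:
    for gpu in gpus:
        # Allocate GPU memory as needed instead of reserving it all up front.
        tf.config.experimental.set_memory_growth(gpu, True)
    print("Memory growth enabled for:", gpus)
except RuntimeError as err:
    # Raised when the GPUs were already initialized in this process.
    print("Could not change memory growth:", err)
```

On a machine without a GPU the list is simply empty and nothing is changed.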
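If the default mode (CPU and GPU) raises the following error, it indicates that the GPU is already occupied by another process; restarting Windows helps: "InternalError: Blas GEMM launch failed"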
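Before rebooting, it may be worth checking which process is holding the GPU. The sketch below is an addition, not part of the original workflow: `gpu_processes` is a hypothetical helper that calls the `nvidia-smi` tool shipped with the NVIDIA driver, assuming it is on the PATH (on some Windows installations it is not, and on some setups the memory column may show N/A).

```python
import shutil
import subprocess


def gpu_processes():
    """Return nvidia-smi's CSV list of compute processes, or None if unavailable."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None
    result = subprocess.run(
        [exe,
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv"],
        capture_output=True,
        text=True,
    )
    # A non-zero exit code usually means the driver could not be queried.
    return result.stdout if result.returncode == 0 else None


report = gpu_processes()
print(report if report is not None else "nvidia-smi not available")
```

If a stale Python or Jupyter kernel shows up in the list, ending that process might free the GPU without a full restart of Windows.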
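Several mysteries remain for me:
- Why is "mixed" mode slower than CPU only?
- Is it possible to change the visible devices without restarting the kernel, to avoid the following error? "Visible devices cannot be modified after being initialized"
- Why does the with tf.device() approach have no effect for this model, while it works for the test_device() code?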
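On the second question, the restriction applies per process. One possible workaround, which is an assumption here and was not tested in this post, is to leave the kernel alone and run each configuration in a fresh Python process, selecting the GPUs through the `CUDA_VISIBLE_DEVICES` environment variable that CUDA reads at initialization (`"-1"` hides all GPUs). In the sketch below `run_with_gpus` is a hypothetical helper, and the child code is a placeholder that only echoes the variable; in practice it would be the training script.

```python
import os
import subprocess
import sys


def run_with_gpus(code, visible_gpus):
    """Run Python code in a new process with CUDA_VISIBLE_DEVICES set."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=visible_gpus)
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Placeholder child: a real run would import tensorflow and call model.fit.
child = 'import os; print(os.environ["CUDA_VISIBLE_DEVICES"])'

cpu_only = run_with_gpus(child, "-1")   # hide every GPU
first_gpu = run_with_gpus(child, "0")   # expose only GPU 0

print("CPU-only run saw:", cpu_only.strip())
print("GPU run saw:", first_gpu.strip())
```

Each child starts with an uninitialized TensorFlow runtime, so the "Visible devices cannot be modified" error should not arise. The cost is that the model and data have to be loaded again in every child process.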
If anyone can offer some insight, many thanks :)