How to resolve NVIDIA A100 devices not being found on EC2
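I am having trouble accessing the NVIDIA A100 GPUs on an AWS EC2 instance (p4d.24xlarge). However, I can use V100 GPUs (p3.16xlarge) without any problem.
On the p4d, I rebuilt everything from source, just as I did on the p3 instance, including the NVIDIA driver from https://developer.download.nvidia.com/compute/cuda/11.1.1/local_installers/cuda_11.1.1_455.32.00_linux.run.
Any ideas what the problem might be?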
When I run nvidia-smi, it shows 8 A100 GPUs available (as expected). I wrote some simple code to query the number of GPUs in the system (code below) and got the following error:
Obtaining devices...
GPUassert: system not yet initialized DevInfo.cu 20
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define USE_CUDA

// Wrap a CUDA runtime call and report the call site if it fails.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }

inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
    if (code != cudaSuccess)
    {
        // Print the runtime's error description plus the file and line of the failing call.
        printf("GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

int main(int argc, char **argv)
{
    int numDevs = 0;

    printf("Obtaining devices...\n");
    // Query how many CUDA-capable devices the runtime can see.
    gpuErrchk(cudaGetDeviceCount(&numDevs));
    printf("Number of devices: %d\n", numDevs);

    return 0;
}
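For anyone trying to reproduce this, a slightly more verbose variant of the check above may give more to go on. This is only a sketch, not part of my original test: it prints the symbolic error name via cudaGetErrorName alongside the message, plus the versions reported by cudaDriverGetVersion and cudaRuntimeGetVersion. It assumes the same toolkit install as the program above, and the file name in the build command is hypothetical.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Print a failed CUDA runtime call with its symbolic name, numeric code and message.
static void report(const char *what, cudaError_t err)
{
    printf("%s failed: %s (%d): %s\n",
           what, cudaGetErrorName(err), (int)err, cudaGetErrorString(err));
}

int main(void)
{
    int driverVer = 0, runtimeVer = 0, numDevs = 0;
    cudaError_t err;

    // Version supported by the installed driver (stays 0 if the query fails).
    err = cudaDriverGetVersion(&driverVer);
    if (err != cudaSuccess) report("cudaDriverGetVersion", err);

    // Version of the CUDA runtime this binary was built against.
    err = cudaRuntimeGetVersion(&runtimeVer);
    if (err != cudaSuccess) report("cudaRuntimeGetVersion", err);

    printf("Driver version: %d, runtime version: %d\n", driverVer, runtimeVer);

    // Same query as in the original program, with more detail on failure.
    err = cudaGetDeviceCount(&numDevs);
    if (err != cudaSuccess)
    {
        report("cudaGetDeviceCount", err);
        return 1;
    }

    printf("Number of devices: %d\n", numDevs);
    return 0;
}
```

It can be built the same way as the original, for example with nvcc -o devinfo_verbose DevInfoVerbose.cu. The symbolic name it prints is easier to look up in the CUDA Runtime API error list than the message string alone.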
nvidia-smi output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00    Driver Version: 455.32.00    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      Off  | 00000000:10:1C.0 Off |                    0 |
| N/A   32C    P0    48W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-SXM4-40GB      Off  | 00000000:10:1D.0 Off |                    0 |
| N/A   31C    P0    47W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-SXM4-40GB      Off  | 00000000:20:1C.0 Off |                    0 |
| N/A   31C    P0    48W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-SXM4-40GB      Off  | 00000000:20:1D.0 Off |                    0 |
| N/A   32C    P0    49W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  A100-SXM4-40GB      Off  | 00000000:90:1C.0 Off |                    0 |
| N/A   32C    P0    48W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  A100-SXM4-40GB      Off  | 00000000:90:1D.0 Off |                    0 |
| N/A   31C    P0    48W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  A100-SXM4-40GB      Off  | 00000000:A0:1C.0 Off |                    0 |
| N/A   33C    P0    54W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+