CUDA平铺矩阵乘法讲解

如何解决CUDA平铺矩阵乘法讲解

我正在尝试了解 CUDA SDK 8.0 中的 this sample 代码是如何工作的：

// refresh if the client does not find the first css file
var qs = document.querySelector("link[rel='stylesheet']");
var MrChecker = new XMLHttpRequest(),CheckThisUrl = qs.getAttribute('href');
  
MrChecker.open('get',CheckThisUrl,true);
MrChecker.onreadystatechange = checkReadyState;


function checkReadyState() {
  if (MrChecker.readyState === 4) {
    if ((MrChecker.status == 200) || (MrChecker.status == 0)) {
      //alert(CheckThisUrl + ' page is exixts');
    }
    else {
      //alert(CheckThisUrl + ' not exists');
      setTimeout(function(){document.location.reload();},2000);
      return;
    }
  }
}

MrChecker.send(null);

内核的这部分对我来说相当棘手。我知道矩阵 A 和 B 表示为数组 (*float)，而且我也知道使用共享内存来计算点积的概念，这要归功于共享内存块。

我的问题是我不明白代码的开头，尤其是 3 个特定变量（template <int BLOCK_SIZE> __global__ void matrixMulCUDA(float *C,float *A,float *B,int wA,int wB) { // Block index int bx = blockIdx.x; int by = blockIdx.y; // Thread index int tx = threadIdx.x; int ty = threadIdx.y; // Index of the first sub-matrix of A processed by the block int aBegin = wA * BLOCK_SIZE * by; // Index of the last sub-matrix of A processed by the block int aEnd = aBegin + wA - 1; // Step size used to iterate through the sub-matrices of A int aStep = BLOCK_SIZE; // Index of the first sub-matrix of B processed by the block int bBegin = BLOCK_SIZE * bx; // Step size used to iterate through the sub-matrices of B int bStep = BLOCK_SIZE * wB; .... ....、aBegin 和 aEnd）。有人可以为我制作一个可能执行的示例图，以帮助我了解索引在这种特定情况下的工作方式吗？谢谢

解决方法

这是一张图，用于理解为 CUDA 内核的第一个变量设置的值以及执行的整体计算：

矩阵使用行优先顺序存储。 CUDA 代码假设矩阵大小可以除以 BLOCK_SIZE。

根据内核 CUDA 网格，矩阵 A、B 和 C 虚拟地分成块。 C 的所有块都可以并行计算。对于给定的 C 深灰色块，主循环遍历 A 和 B 的几个浅灰色块（同步）。每个块都使用 BLOCK_SIZE * BLOCK_SIZE 线程并行计算。

bx 和 by 是当前块在 CUDA 网格中的基于块的位置。 tx 和 ty 是当前线程在 CUDA 网格的当前计算块内计算的基于单元格的位置。

这里是对aBegin变量的详细分析： aBegin 指的是矩阵 A 的第一个计算块的第一个单元格的内存位置。设置为 wA * BLOCK_SIZE * by 是因为每个块包含 BLOCK_SIZE * BLOCK_SIZE 个单元格，并且在 wA / BLOCK_SIZE 的当前计算块上方有 by 个水平块和 A 个块。因此，(BLOCK_SIZE * BLOCK_SIZE) * (wA / BLOCK_SIZE) * by = BLOCK_SIZE * wA * by。

同样的逻辑适用于 bBegin：它被设置为 BLOCK_SIZE * bx，因为在矩阵 bx 的第一个计算块的第一个单元之前，内存中有 BLOCK_SIZE 个大小为 B 的块。

a 在循环中增加 aStep = BLOCK_SIZE 以便下一个计算块位于当前计算块 A 的右侧（在绘图上）。 b 在同一循环中增加 bStep = BLOCK_SIZE * wB，以便下一个计算块是 B 的当前计算块的底部（在绘图上）的后面。