CUDA 减少不连续的子数组

如何解决CUDA 减少不连续的子数组

我正在为一个库编写一个函数，该函数接受一个包含 2 次幂元素的大型数组（在 GPU 内存中）。此函数必须对不连续的子数组（长度相等，也是 2 的幂）求和，以生成更小（或很少，大小相等）的数组。这是 here 描述的 OpenMP 函数的 GPU 版本。

例如

in[8] = { ... }
out = { in[0]+in[2],in[1]+in[3],in[4]+in[6],in[5]+in[7] }

缩减子数组的元素由用户给定的位索引列表 bitInds 确定。它们告知如何将 in 元素的索引（想象为位串）映射到 out 索引。

例如，如果

bits = { 0,2 }

然后 in index 0 = 000 映射到 out[0]，并且 in index 4 = 1 00 映射到 out[2]。

位的长度可以从 1 到 log2(length(in))。 out 的长度为 pow(2,length(bits))。这意味着 out 可以和 in 一样大，或者是一半，或者是四分之一，等等。

为此功能准备设备内存，并在 CUDA 内核中执行结构良好的缩减似乎具有挑战性，我不确定如何开始。由于 in 保证非常大，因此重要的线程本地和顺序访问 in。如何有效地对 pow(2,length(bits)) 的可能不连续的子数组执行 in 减少？

解决方法

我尝试了三种实现并比较了它们的性能。

A：根据@RobertCrovella 的 example，在每块独占全局内存上使用 atomics，然后在块之间减少。
B：让每个线程直接以原子方式写入全局内存。
C：让线程写入块本地共享内存，最后以原子方式将它们减少到全局内存。

性能和尺寸限制方面的最佳解决方案出人意料地是最简单的，B。我展示了伪代码，并按照这些解决方案的复杂性顺序描述了它们的权衡。

B

这个方法让每个线程原子地写入全局内存，直接写入最终输出。除了在全局内存中拟合 out 和 in 之外，它对它们的大小没有施加额外的限制。

__global__ void myfunc_kernel(double* in,inLen,double* out,...) {
    
    // each thread handles one element of 'in'
    int inInd = blockIdx.x*blockDim.x + threadIdx.x;
    if (inInd >= inLen)
        return;

    // determine 'outInd' from other args
    int outInd = ...
    
    // atomically add directly to out global memory (vs all threads)
    atomicAdd(&out[outInd],in[inInd]);
}

void myfunc(double* in,int inLen,...) {

    // copy unspecified additional args into device memory (shared,if fits)
    double d_in = ...

    // create global memory for final output (initialised to zero)
    double d_out = ...
    cudaMemset(d_out,...);
    
    // allocate one thread per element of 'in'
    int numThreadsPerBlock = 128;
    int numBlocks = ceil(inLen / (double) numThreadsPerBlock);

    myfunc_kernel<<numBlocks,numThreadsPerBlock>>(d_in,d_out,...);

    // copy output to host,and clean up
    cudaMemcpy(out,...);
    cudaFree(...);
}

C

此方法通过原子分配每个块共享内存，块中的线程本地减少到这些共享内存中。此后，共享内存在块之间通过原子减少到输出全局内存中。必须将本地 out 装入共享内存将 outLen 限制为最多（例如）4096（取决于计算能力）。

__global__ void myfunc_kernel(double* in,...) {
    
    // each thread handles one element of 'in'
    int inInd = blockIdx.x*blockDim.x + threadIdx.x;
    if (inInd >= inLen)
        return;

    // each block has a local copy of 'out',in shared memory
    extern __shared__ double outBlock[];
    
    // efficiently initialise shared memory to zero
    // (looped in case fewer threads than outLen)
    for (int i=threadIdx.x; i<outLen; i+=blockDim.x)
        outBlock[i] = 0;
    __syncthreads();

    // determine 'outInd' from other args
    int outInd = ...
    
    // atomically contribute thread's 'in' value to block shared memory
    // (vs other threads in block)
    atomicAdd(&outBlock[outInd],in[inInd]);
    __syncthreads();

    // keep one (or fewer) thread(s) per output element alive in each block
    if (threadIdx.x >= outLen)
        return;

    // remaining threads atomically reduced shared memory into global
    // (looped in-case fewer threads than 'outLen')
    for (int i=threadIdx.x; i<outLen; i+=blockDim.x)
        // (vs all surviving threads)
        atomicAdd(&out[i],outBlock[i]);

    /* This final reduction may be possible in parallel/tiers,* though documentation & examples about reducing shared memory
     * between blocks is disappointingly scarce
     */
}

void myFunc(double* in,int outLen,...) {
    
    // if 'out' cannot fit in each block's shared memory,error and quit
    if (outLen > 4096)
        ...
       
    // copy unspecified additional args into device memory
    double d_in = ...

    // create global memory for final output (initialised to zero)
    double d_out = ...
    cudaMemset(d_out,...);
    
    // allocate one thread per element of 'in'
    int numThreadsPerBlock = 128;
    int numBlocks = ceil(inLen / (double) numThreadsPerBlock);

    // allocate shared memory to fit 'out' in each block
    size_t mem = outLen * sizeof(double);
    myfunc_kernel<<numBlocks,numThreadsPerBlock,mem>>(d_in,...);
    cudaFree(...);
}

请注意，所有缩减最终都是以原子方式执行的。这应该不是必需的，因为在共享内存完成后，减少到全局内存具有严格的结构。然而，我无法在网上找到有关如何将共享内存数组共同减少为全局内存的文档或单个演示。

A

此方法使用与此 histogram example 相同的范例。它在全局内存中创建 #numBlocks 的 out 个副本，以便每个块都写入独占内存。块内的线程原子地写入其块的全局内存分区。此后，第二个内核将这些分区缩减为最终的 out 数组。这显着限制了 outLen，使其小于 deviceGlobalMem/numBlocks。

__global__ void myfunc_populate_kernel(double* in,double* outs,...) {
    
    // each thread handles one element of 'in'
    int inInd = blockIdx.x*blockDim.x + threadIdx.x;
    if (inInd >= inLen)
        return;

    // determine 'outInd' from other args
    int outInd = ...

    // map to this block's exclusive subarray in global memory
    int blockOutInd = blockIdx.x*outLen + outInd;
    
    // atomically contribute thread's 'in' value to block's global memory
    // (vs threads in same block)
    atomicAdd(&outs[outInd],in[inInd]);
}

__global__ void myfunc_reduce_kernel(double* out,int numOuts) {

    // each thread handles one element of 'out'
    int outInd = blockIdx.x*blockDim.x + threadIdx.x;
    if (outInd >= outLen)
        return;
  
    // and determines it by reducing previous per-block partitions
    double total = 0;
    for (int n=0; n<numOuts; n++)
        total += outs[outInd + n*outLen];

    out[outInd] = total;
}

void myfunc(double* in,...) {
    
    int numThreadsPerBlock = 128;
    int numPopBlocks = ceil(inLen / (double) numThreadsPerBlock);
 
    // if #numPopBlocks copies of 'out' cannot fit in global memory,error and exit
    if (outLen * numPopBlocks > ...)
        ...

    // copy unspecified additional args into device memory (shared,if fits)
    double d_in = ...

    // create a working 'out' for each block,in global memory (init to zero)
    size_t mem = outLen * numPopBlocks * sizeof(double);
    double d_outs = ...
    cudaMalloc(&d_outs,mem);
    cudaMemset(d_out,mem);
    
    // populate each blocks partition of global memory 'd_outs'
    myfunc_populate_kernel<<numPopBlocks,d_outs,outLen,...);

    // create global memory for final output
    double d_out = ...

    // reduce each blocks partition of global memory into output 'd_out'
    int numReduceBlocks = ceil(outLen / (double) numThreadsPerBlock);
    myfunc_reduce_kernel<<numReduceBlocks,numThreadsPerBlock>>(d_out,numPopBolocks);

    //copy output to host,...);
    cudaFree(...);
}

比较

我使用 Quadro P6000（24GB，6.1 CC）测试了这些方法，带有 inLen = 2^5,2^10,2^15,2^20,2^25 和（对于所有值）outLen = m2^5,2^15 (outLen <= inLen)。这是运行时表：

╔══════════╦═════════╦══════════╦══════════╦══════════╗
║  inLen   ║ outLen  ║  A       ║  B       ║  C       ║
╠══════════╬═════════╬══════════╬══════════╬══════════╣
║       32 ║      32 ║ 0.000067 ║ 0.000071 ║ 0.000039 ║
║     1024 ║      32 ║ 0.000244 ║ 0.000044 ║ 0.000071 ║
║     1024 ║    1024 ║ 0.000265 ║ 0.000043 ║ 0.000064 ║
║    32768 ║      32 ║ 0.000290 ║ 0.000077 ║ 0.000080 ║
║    32768 ║    1024 ║ 0.000287 ║ 0.000059 ║ 0.000096 ║
║    32768 ║   32768 ║ 0.000762 ║ 0.000111 ║ N/A      ║
║  1048576 ║      32 ║ 0.000984 ║ 0.000793 ║ 0.000254 ║
║  1048576 ║    1024 ║ 0.001381 ║ 0.000343 ║ 0.002302 ║
║  1048576 ║   32768 ║ 0.017547 ║ 0.000899 ║ N/A      ║
║  1048576 ║ 1048576 ║ N/A      ║ 0.006092 ║ N/A      ║
║ 33554432 ║      32 ║ 0.021068 ║ 0.022684 ║ 0.008079 ║
║ 33554432 ║    1024 ║ 0.032839 ║ 0.008190 ║ 0.014071 ║
║ 33554432 ║   32768 ║ 0.013734 ║ 0.011915 ║ N/A      ║
╚══════════╩═════════╩══════════╩══════════╩══════════╝

即使限制较少，方法 B 仍然是明显的赢家，如下所示：

╔══════════════╦══════╦═════╗
║ outLen/inLen ║ A/B  ║ C/B ║
╠══════════════╬══════╬═════╣
║            1 ║ 0.9  ║ 0.5 ║
║           32 ║ 5.5  ║ 1.6 ║
║            1 ║ 6.2  ║ 1.5 ║
║         1024 ║ 3.8  ║ 1.0 ║
║           32 ║ 4.9  ║ 1.6 ║
║            1 ║ 6.9  ║     ║
║        32768 ║ 1.2  ║ 0.3 ║
║         1024 ║ 4.0  ║ 6.7 ║
║           32 ║ 19.5 ║     ║
║            1 ║      ║     ║
║      1048576 ║ 0.9  ║ 0.4 ║
║        32768 ║ 4.0  ║ 1.7 ║
║         1024 ║ 1.2  ║     ║
╚══════════════╩══════╩═════╝

陷阱

很难对 CUDA 文档和在线示例的状态保持礼貌。所以我不会发表评论，而是向困惑的读者提供这些陷阱警告：

请注意，虽然共享内存似乎在内核第一次执行时初始化为零，但在后续执行中不会重新初始化。因此，您必须在内核中re-initialise manually（没有方便的 CUDA 函数）。
atomicAdd 缺少 double 类型的重载。 doc 声称这仅适用于小于 60 的计算能力，但我的代码无法在没有手动过载的情况下使用 62 CC GPU 进行编译。此外，#if __CUDA_ARCH__ < 600 的 doc's 代码片段是错误的，因为较新的 Arch 预编译器没有定义 __CUDA_ARCH__。查看正确的宏 here
请注意，您在线查看的流行的并行归约范式会假设所有数据都在全局内存中，或者在单个块的共享内存中。将共享内存减少为全局结构似乎很深奥。

CUDA 减少不连续的子数组

如何解决CUDA 减少不连续的子数组

解决方法

B

C

A

比较

陷阱

相关推荐