DPCPP: in SYCL + OneAPI

I am trying to run my slice_matrix function on the GPU. The actual function is:

// Slice an r x c sub-matrix out of mat, with its top-left corner at (i, j).
template<class T>
std::vector<std::vector<T>> slice_matrix(const std::vector<std::vector<T>> &mat, int i, int j, int r, int c) {

    std::vector<std::vector<T>> out(r, std::vector<T>(c, 0));

    for (int k = 0; k < r; k++) {
        // Copy columns [j, j + c) of row i + k
        std::vector<T> temp(mat[i + k].begin() + j, mat[i + k].begin() + j + c);
        out[k] = temp;
    }

    return out;
}

The SYCL part of my code is:

auto event = gpuQueue.submit(
        [&](sycl::handler &h) {
            // Local copy of fun, so the kernel captures it by value
            auto f = fun;
            // Every accessor must be bound to the handler h
            sycl::accessor img_accessor(img_buffer, h, sycl::read_only);
            sycl::accessor ker_accessor(ker_buffer, h, sycl::read_only);
            sycl::accessor out_accessor(out_buffer, h, sycl::write_only);
            h.parallel_for(sycl::range<2>(img_row, filt_col), [=](sycl::id<2> index) {
                int row = index[0];
                int col = index[1];
                out_accessor[index] = f(slice_matrix_gpu(img_accessor, row, col, filt_row, ker_accessor));
            });
        });

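I know that a vector of vectors does not create a contiguous block of memory. So I used flat vectors and tried to interpret them as 2D data blocks. I defined: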

/* Change the 2D matrices to 1D linear arrays
 * and operate on them as contiguous blocks. */
int M = img_row * img_col;
int N = filt_row * filt_col;
int H = out_row * out_col;

// Define the buffers
sycl::buffer<Tin,1> img_buffer(&img[0], sycl::range<1>(M));
sycl::buffer<Tin,1> ker_buffer(&ker[0], sycl::range<1>(N));
sycl::buffer<Tin,2> out_buffer(&out[0], sycl::range<2>(out_row, out_col));

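But I don't know what to do next. Should I pass the accessors as 2D, or should I change slice_matrix so that it treats the data as a 2D matrix? I should point out that slice_matrix may also be called by other functions, in which case it runs on the CPU. In other words, this function is not used only on the GPU; it is also used on the CPU, for example: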

if (use_tbb) {
    uTimer *timer = new uTimer("Executing Code On cpu");
    tbb::parallel_for(
            // 2D range: rows [0, out_row), columns [0, out_col)
            tbb::blocked_range2d<int,int>(0, out_row, 0, out_col),
            [&](const tbb::blocked_range2d<int,int> &t) {
                for (int n = t.rows().begin(); n < t.rows().end(); ++n) {
                    for (int m = t.cols().begin(); m < t.cols().end(); ++m) {
                        out[n][m] = fun(slice_matrix_cpu(img, n, m, ker));
                    }
                }
            });
    // delete runs the destructor and also frees the timer
    delete timer;
    return out;
}

Solution

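I'm not sure I understand your question, but maybe this will help. If you have further questions, let me know.

Your approach does not look well suited to offloading. That immediately makes me think "refactor the code": in other words, take a different approach to get better performance.

The hard part is that I really don't know why you chose your approach. So, for now, I'll assume that refactoring is an option (because if it isn't, I don't know what advice to give you).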

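In general, if data will be shared with an accelerator, laying it out in contiguous memory is a very good idea. It makes the code easier to understand and the data transfers more efficient. So I suggest you do that. Having lots of small data elements (such as short vectors) generally does not yield interesting speedups from an offload device.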
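To make the contiguous layout concrete, here is a minimal sketch in plain C++. It assumes the input is a rectangular std::vector<std::vector<T>>; the helper name flatten is hypothetical and does not appear in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): copy a rectangular
// vector-of-vectors into one contiguous, row-major std::vector.
template <class T>
std::vector<T> flatten(const std::vector<std::vector<T>> &mat) {
    const std::size_t rows = mat.size();
    const std::size_t cols = rows ? mat[0].size() : 0;

    std::vector<T> flat;
    flat.reserve(rows * cols);
    for (const auto &row : mat) {
        assert(row.size() == cols);  // every row must have the same length
        flat.insert(flat.end(), row.begin(), row.end());
    }
    return flat;
}
```

The result stores element (row, col) at offset row * cols + col, so its data() pointer can back a single SYCL buffer instead of one allocation per row.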

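Once you have done that, SYCL is happy to let you declare an accessor over the data as a 1-, 2-, or 3-dimensional array. Accessors assume a linear collection of data; all that changes is the number of indices you use to select a data element. Do whatever is most natural.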
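The index arithmetic behind that can be sketched as follows, assuming row-major storage (which is how SYCL linearizes multi-dimensional buffers). Both linear_index and window_at are hypothetical names, not taken from the question. Rather than copying an r x c slice, window_at reads a single element of the window directly. It relies only on operator[], so the same function should work with a std::vector on the CPU and, as far as I can tell, with a raw pointer or a one-dimensional accessor inside a kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Linear offset of element (row, col) in a row-major matrix with `cols` columns.
inline std::size_t linear_index(std::size_t row, std::size_t col, std::size_t cols) {
    return row * cols + col;
}

// Hypothetical alternative to slice_matrix: return element (k, l) of the
// window whose top-left corner sits at (i, j), without copying the window.
// Data is anything indexable with operator[] over flat, row-major storage.
template <class Data>
auto window_at(const Data &data, std::size_t cols,
               std::size_t i, std::size_t j,
               std::size_t k, std::size_t l) {
    return data[linear_index(i + k, j + l, cols)];
}
```

For a 3 x 4 matrix holding 1 through 12 in row-major order, window_at(flat, 4, 1, 1, 0, 0) reads the element at row 1, column 1, which is 6. Avoiding the temporary slice also avoids allocating a std::vector inside the kernel, which device code cannot do.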

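Those are my thoughts. If this approach does not work for you, I don't think you will find the GPU to be a good solution. If you really insist on keeping your current approach, USM may be a cleaner way to code it, but I don't think you would get good performance. That said, I am guessing, because I don't know your code well.

Good luck. I hope this helps... if not, let me know.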