Sequential dot_product in an OpenACC Fortran loop

In a Fortran program, I have a large loop that makes several dot_product calls on small vectors generated inside the loop:

program test
        implicit none

        real :: array1(2,2),array2(2,2),res(2)
        real :: subarray1(2),subarray2(2)
        integer :: i

        array1 = 1
        array2 = 2

        !$acc data copyin(array1,array2) copyout(res)
        !$acc kernels
        !$acc loop independent private(subarray1,subarray2)
        do i = 1,2
                subarray1(:) = array1(:,i)
                subarray2(:) = array2(:,i)
                res(i) = dot_product(subarray1,subarray2)
        enddo
        !$acc end kernels
        !$acc end data

        print "(2(g0,x))",res
endprogram

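When compiling with the PGI compiler, the accelerated implementation of dot_product appears to use an acc loop of its own, which prevents the main loop from being parallelized across both gang and vector: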

test:
     11,Generating copyin(array1(:,:)) [if not already present]
         Generating copyout(res(:)) [if not already present]
         Generating copyin(array2(:,:)) [if not already present]
     14,Loop is parallelizable
         Generating Tesla code
         14,!$acc loop gang ! blockidx%x
         15,!$acc loop vector(32) ! threadidx%x
         17,!$acc loop vector(32) ! threadidx%x
             Generating implicit reduction(+:subarray1$r)
     14,CUDA shared memory used for subarray2,subarray1
     15,Loop is parallelizable
     17,Loop is parallelizable

As the log shows, the compiler uses an implicit reduction and CUDA shared memory for the loop-private vectors.
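
One way to read this feedback is to write out by hand the loops that the compiler appears to generate for the array syntax and for dot_product. The sketch below is only an illustration under that assumption, not the compiler's actual output. The names n, m, k and total are hypothetical, it uses parallel loop instead of kernels, and the two array copies are merged into a single loop for brevity.

```fortran
program test_explicit
        implicit none

        ! n, m, k and total are hypothetical names introduced for this sketch
        integer, parameter :: n = 2, m = 2
        real :: array1(n,m), array2(n,m), res(m)
        real :: subarray1(n), subarray2(n)
        real :: total
        integer :: i, k

        array1 = 1
        array2 = 2

        !$acc data copyin(array1,array2) copyout(res)
        ! Outer loop gets gang parallelism only, as in the first log
        !$acc parallel loop gang private(subarray1,subarray2,total)
        do i = 1, m
                ! Array-syntax copies written out as a vector loop
                !$acc loop vector
                do k = 1, n
                        subarray1(k) = array1(k,i)
                        subarray2(k) = array2(k,i)
                enddo
                ! dot_product written out as a vector loop with a reduction
                total = 0.0
                !$acc loop vector reduction(+:total)
                do k = 1, n
                        total = total + subarray1(k)*subarray2(k)
                enddo
                res(i) = total
        enddo
        !$acc end data

        print "(2(g0,1x))", res
end program test_explicit
```

With only two elements per vector, spreading the inner loops over 32 vector lanes leaves most lanes idle, which is why moving the vector parallelism to the outer loop is attractive here. Each element of res should equal 4 (1*2 summed over two elements).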

Is there a way to force dot_product to run sequentially?

Solution

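As long as you don't mind the array syntax being run sequentially as well, just add "gang vector" to the loop directive. The outer loop is then scheduled across both gangs and vectors, and the implicit inner loops (the array assignments and dot_product) are run sequentially, as the "!$acc loop seq" lines in the compiler feedback show: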

% cat test.f90
program test
        implicit none

        real :: array1(2,2),array2(2,2),res(2)
        real :: subarray1(2),subarray2(2)
        integer :: i

        array1 = 1
        array2 = 2

        !$acc data copyin(array1,array2) copyout(res)
        !$acc kernels loop gang vector private(subarray1,subarray2)
        do i = 1,2
                subarray1(:) = array1(:,i)
                subarray2(:) = array2(:,i)
                res(i) = dot_product(subarray1,subarray2)
        enddo
        !$acc end data

        print "(2(g0,x))",res
endprogram
% nvfortran -acc -Minfo=accel test.f90
test:
     11,Generating copyin(array1(:,:)) [if not already present]
         Generating copyout(res(:)) [if not already present]
         Generating copyin(array2(:,:)) [if not already present]
     13,Loop is parallelizable
         Generating Tesla code
         13,!$acc loop gang,vector(32) ! blockidx%x threadidx%x
         14,!$acc loop seq
         16,!$acc loop seq
     13,Local memory used for subarray2,subarray1
     14,Loop is parallelizable
     16,Loop is parallelizable
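
An alternative that does not depend on how the compiler schedules the intrinsic is to write the dot product as an explicit inner loop marked with "!$acc loop seq". The following is a minimal sketch under a few assumptions: it uses parallel loop instead of kernels loop, it reads array1 and array2 directly rather than copying into private temporaries, and the names n, m, k and total are hypothetical.

```fortran
program test_seq
        implicit none

        ! n, m, k and total are hypothetical names introduced for this sketch
        integer, parameter :: n = 2, m = 2
        real :: array1(n,m), array2(n,m), res(m)
        real :: total
        integer :: i, k

        array1 = 1
        array2 = 2

        !$acc data copyin(array1,array2) copyout(res)
        ! Outer loop is spread over gangs and vector lanes
        !$acc parallel loop gang vector private(total)
        do i = 1, m
                total = 0.0
                ! Inner dot product is explicitly sequential within each thread
                !$acc loop seq
                do k = 1, n
                        total = total + array1(k,i)*array2(k,i)
                enddo
                res(i) = total
        enddo
        !$acc end data

        print "(2(g0,1x))", res
end program test_seq
```

Each element of res should equal 4, the same values the original program computes. Because the directives are comments to a compiler without OpenACC enabled, the sketch also builds and runs as plain Fortran, which is a convenient way to check the arithmetic.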
