I don't understand why changing how a variable is accessed/stored in a pthread subroutine improves performance so dramatically

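I'm new to multithreaded programming. I knew going in that there can be strange side effects if you aren't careful, but I didn't expect to be this puzzled by code I wrote myself. I'm writing what I thought would be an obvious first test of threads: summing the numbers from 0 to x, inclusive (of course, the formula at https://www.reddit.com/r/mathmemes/comments/gq36wb/nn12/ is far more efficient, but my goal is to practice using threads, not to make this program as fast as possible). One function creates the threads, with the thread count based on a hard-coded number of cores on the system and a "boolean" that says whether the processor supports multithreading. I split the work more or less equally between the threads, so each thread sums over its own range. In theory, if all the threads manage to run together, I get numcores*normal_computation of throughput, which is exciting, and to my surprise it worked more or less as I expected; until I made some tweaks.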

Before continuing, I think some code will help:

These are the preprocessor defines I use in the base code:

#define NUM_CORES 4
#define MULTI_THREADED 1 //1 for true,0 for false
#define BIGVALUE 1000000000UL

I use this struct to pass arguments to my thread function:

typedef struct sum_args
{
    int64_t start;
    int64_t end;
    int64_t return_total;
} sum_args;

This is the function that creates the threads:

int64_t SumUpTo_WithThreads(int64_t limit)
{   //start counting from zero
    const int numthreads = NUM_CORES + (int)(NUM_CORES*MULTI_THREADED*0.25);
    pthread_t threads[numthreads];
    sum_args listofargs[numthreads];
    int64_t offset = limit/numthreads; //integer division truncates; the last thread picks up the remainder
    int64_t total = 0;

    //i < numthread-1 since offset is not assured to be exactly limit/numthreads due to integer division
    for (int i = 0; i < numthreads-1; i++)
    {
        listofargs[i] = (sum_args){.start = offset*i,.end = offset*(i+1)};
        pthread_create(&threads[i],NULL,SumBetween,(void *)(&listofargs[i]));
    }
    //edge case catch
    //limit + 1,since SumBetween() is not inclusive of .end aka stops at .end -1 for each loop
    listofargs[numthreads-1] = (sum_args){.start = offset*(numthreads-1),.end = limit+1};
    pthread_create(&threads[numthreads-1],NULL,SumBetween,(void *)(&listofargs[numthreads-1]));

    //finishing
    for (int i = 0; i < numthreads; i++)
    {
        pthread_join(threads[i],NULL); //used to ensure thread is done before adding .return_total
        total += listofargs[i].return_total;
    }

    return total;
}
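
To make the split concrete, here is a minimal sketch of the range arithmetic on its own, without any threads. The names range, thread_range and covered_count are hypothetical helpers for illustration and are not in countV2.c. With the defines above (NUM_CORES 4, MULTI_THREADED 1) numthreads works out to 5, so for BIGVALUE each thread gets a block of 200000000 values and the last one also covers limit itself.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type (not in countV2.c): a half-open range [start, end). */
typedef struct range
{
    int64_t start;
    int64_t end;
} range;

/* Range assigned to thread i out of numthreads, mirroring the split
   used in SumUpTo_WithThreads(). */
range thread_range(int64_t limit, int numthreads, int i)
{
    assert(numthreads > 0 && i >= 0 && i < numthreads);
    int64_t offset = limit / numthreads; /* integer division truncates */
    range r;
    r.start = offset * i;
    /* The last thread absorbs the remainder and includes limit itself. */
    r.end = (i == numthreads - 1) ? limit + 1 : offset * (i + 1);
    return r;
}

/* Total number of values covered by all ranges; this should equal
   limit + 1, since the sum runs over 0..limit inclusive. */
int64_t covered_count(int64_t limit, int numthreads)
{
    int64_t count = 0;
    for (int i = 0; i < numthreads; i++)
    {
        range r = thread_range(limit, numthreads, i);
        count += r.end - r.start;
    }
    return count;
}
```

Calling covered_count(limit, numthreads) and comparing it against limit + 1 is a quick way to confirm that no value is skipped or counted twice, even when limit is not divisible by numthreads.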

Here is the "normal" implementation of the sum, just for comparison:

int64_t SumUpTo(int64_t limit)
{
    uint64_t total = 0;
    for (uint64_t i = 0; i <= limit; i++)
        total += i;
    return total;
}
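
As a cross-check on the expected answer, the closed form n(n+1)/2 can be sketched as below. The function names sum_closed_form and sum_loop are hypothetical, and the bound in the assert is a conservative assumption chosen so the result fits in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Closed form for 0 + 1 + ... + n. The even factor is halved first so
   the intermediate product stays as small as possible. */
uint64_t sum_closed_form(uint64_t n)
{
    /* Conservative bound (an assumption of this sketch): for n up to
       6e9 the result still fits in an unsigned 64-bit integer. */
    assert(n <= 6000000000ULL);
    return (n % 2 == 0) ? (n / 2) * (n + 1) : n * ((n + 1) / 2);
}

/* Naive loop, same idea as SumUpTo(), for checking small inputs. */
uint64_t sum_loop(uint64_t n)
{
    uint64_t total = 0;
    for (uint64_t i = 0; i <= n; i++)
        total += i;
    return total;
}
```

For n = 1000000000 the closed form gives 500000000500000000, which is the value both binaries print below.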

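This is the function the threads run. It has "two implementations": one that is fast for some reason, and one that is SLOW for some reason (this is what confuses me). Side note: I use preprocessor directives only to make the SLOWER and FASTER versions easier to compile.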

void* SumBetween(void *arg)
{
    #ifdef SLOWER
    ((sum_args *)arg)->return_total = 0;
    for (int64_t i = ((sum_args *)arg)->start; i < ((sum_args *)arg)->end; i++)
        ((sum_args *)arg)->return_total += i;
    #endif

    #ifdef FASTER
    uint64_t total = 0;
    for (int64_t i = ((sum_args *)arg)->start; i < ((sum_args *)arg)->end; i++)
        total += i;
    ((sum_args *)arg)->return_total = total;
    #endif
    
    return NULL;
}
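
The only difference between the two versions is where the running total lives: SLOWER writes to return_total inside listofargs on every iteration, while FASTER writes to a local and stores once at the end. One thing that may be worth inspecting, offered only as a sketch, is where each thread's return_total sits within the array. The code below assumes 64-byte cache lines and an array that starts on a line boundary; both are assumptions, not measurements from this machine, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: a typical x86-64 cache line size, not measured here. */
#define ASSUMED_CACHE_LINE 64

typedef struct sum_args
{
    int64_t start;
    int64_t end;
    int64_t return_total;
} sum_args;

/* Sanity check for the assumption that the struct has no padding. */
void check_no_padding(void)
{
    assert(sizeof(sum_args) == 3 * sizeof(int64_t));
}

/* Byte offset of listofargs[i].return_total from the start of the array. */
size_t return_total_offset(size_t i)
{
    return i * sizeof(sum_args) + offsetof(sum_args, return_total);
}

/* Index of the assumed cache line that field falls in, if the array
   happened to start exactly on a line boundary. */
size_t return_total_line(size_t i)
{
    return return_total_offset(i) / ASSUMED_CACHE_LINE;
}
```

Under these assumptions neighbouring elements can land in the same line, for example elements 0 and 1. Whether that has anything to do with the timing difference is not established here.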

Here is my main:

int main(void)
{
    #ifdef THREADS
    printf("%ld\n",SumUpTo_WithThreads(BIGVALUE));
    #endif

    #ifdef NORMAL
    printf("%ld\n",SumUpTo(BIGVALUE));
    #endif 
    return 0;
}

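Here is how I compile (I set the optimization level to 0 so the compiler doesn't completely optimize away the silly summing program; after all, I want to learn how to use threads!):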

make faster
clang countV2.c -ansi -std=c99 -Wall -O0 -pthread -DTHREADS -DFASTER -o faster.exe

make slower
clang countV2.c -ansi -std=c99 -Wall -O0 -pthread -DTHREADS -DSLOWER -o slower.exe

clang --version
clang version 10.0.0-4ubuntu1 
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

And here are the results/differences (note that code generated with GCC shows the same effect):

slower:
sudo time ./slower.exe 
500000000500000000
14.63user 0.00system 0:03.22elapsed 453%cpu (0avgtext+0avgdata 1828maxresident)k
0inputs+0outputs (0major+97minor)pagefaults 0swaps

faster:
sudo time ./faster.exe 
500000000500000000
2.97user 0.00system 0:00.67elapsed 440%cpu (0avgtext+0avgdata 1708maxresident)k
0inputs+0outputs (0major+83minor)pagefaults 0swaps

Why is using an extra stack-defined variable so much faster than just dereferencing the passed-in struct pointer?

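I tried to find the answer to this question myself. I ended up running some tests that implement the same basic/naive summing algorithm as my SumUpTo() function, where the only difference is the data indirection involved.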

Here are the results:

Choose a function to execute!

int64_t sum(void) took: 2.207833 (s) //new stack defined variable,basically a copy of SumUpTo() func
void sumpoint(int64_t *total) took: 2.467067 (s)
void sumvoidpoint(void *total) took: 2.471592 (s)
int64_t sumstruct(void) took: 2.742239 (s)
void sumstructpoint(numbers *p) took: 2.488190 (s)
void sumstructvoidpoint(void *p) took: 2.486247 (s)
int64_t sumregister(void) took: 2.161722 (s)
int64_t sumregisterV2(void) took: 2.157944 (s)

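The tests produced more or less the values I expected: the extra indirection costs at most about 25% here, nowhere near the roughly 5x gap between slower.exe and faster.exe. I therefore deduce that something else must be going on, on top of the indirection.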
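
For readers who want to reproduce this kind of comparison, here is a minimal sketch of what two of the test functions might look like. These are hypothetical reconstructions, since the real test functions are not shown in this post: the names sum_local, sum_through_pointer and time_sum_local are made up, and the limit is a parameter here. Timing uses the standard clock() function, which reports CPU time, not wall-clock time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Accumulates into a local variable, like the FASTER variant. */
int64_t sum_local(int64_t limit)
{
    int64_t total = 0;
    for (int64_t i = 0; i <= limit; i++)
        total += i;
    return total;
}

/* Accumulates through a pointer on every iteration, like the SLOWER variant. */
void sum_through_pointer(int64_t *total, int64_t limit)
{
    assert(total != NULL);
    *total = 0;
    for (int64_t i = 0; i <= limit; i++)
        *total += i;
}

/* Runs sum_local(), stores its result in *result, and returns the CPU
   time it took in seconds as measured by clock(). */
double time_sum_local(int64_t limit, int64_t *result)
{
    assert(result != NULL);
    clock_t begin = clock();
    *result = sum_local(limit);
    clock_t end = clock();
    return (double)(end - begin) / CLOCKS_PER_SEC;
}
```

Both functions should be compiled with the same flags (for example -O0, as above) so that the only thing being compared is where the accumulator lives.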

Just to add more information: I'm running Linux, specifically the Mint distribution.

My processor info is as follows:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   36 bits physical, 48 bits virtual
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              2
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           42
Model name:                      Intel(R) Core(TM) i7-2760QM CPU @ 2.40GHz
Stepping:                        7
CPU MHz:                         813.451
CPU max MHz:                     3500.0000
CPU min MHz:                     800.0000
BogoMIPS:                        4784.41
Virtualization:                  VT-x
L1d cache:                       128 KiB
L1i cache:                       128 KiB
L2 cache:                        1 MiB
L3 cache:                        6 MiB
NUMA node0 CPU(s):               0-7
Vulnerability Itlb multihit:     KVM: Mitigation: Split huge pages
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cach
                                 e flushes,SMT vulnerable
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled v
                                 ia prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user
                                  pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline,IBPB condit
                                 ional,IBRS_FW,STIBP conditional,RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtr
                                 r pge mca cmov pat pse36 clflush dts acpi mmx f
                                 xsr sse sse2 ht tm pbe syscall nx rdtscp lm con
                                 stant_tsc arch_perfmon pebs bts nopl xtopology 
                                 nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes
                                 64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xt
                                 pr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_de
                                 adline_timer aes xsave avx lahf_lm epb pti ssbd
                                  ibrs ibpb stibp tpr_shadow vnmi flexpriority e
                                 pt vpid xsaveopt dtherm ida arat pln pts md_cle
                                 ar flush_l1d

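If you want to compile the code yourself, or look at the generated assembly for my specific case, take a look at: https://github.com/spaceface102/Weird_Threads The main source file is "countV2.c", just in case you get lost. Thanks for the help!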

/*EOPost*/