Why does a NOP as the 5th uop speed up a 4-uop loop on Ice Lake?


All benchmarks were run on: Icelake: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz (ark)

Edit: I was unable to reproduce this on Broadwell, and @PeterCordes was unable to reproduce it on Skylake.

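I was trying to benchmark different ways of computing an integer min(a,b) and ran into some behavior I cannot explain, which I have boiled down to the following benchmark: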

#define BENCH_FUNC_ATTR __attribute__((aligned(64),noinline,noclone))

#define SIX_BYTES_COMPUTATION 1
#define WITH_NOP_BEFORE_DECL  0
#define BREAK_DEPENDENCY      0
void BENCH_FUNC_ATTR
bench() {
    uint64_t       start,end;
    const uint64_t N = 1000000;
    start            = _rdtsc();
    uint64_t v0,v1,dst,loop_cnt;
    asm volatile(
        "xorl %k[v0],%k[v0]\n\t"
        "movl $1,%k[v1]\n\t"
        "movl %[N],%k[loop_cnt]\n\t"
        ".p2align 6\n\t"
        "1:\n\t"
#if SIX_BYTES_COMPUTATION
        "xorl %k[loop_cnt],%k[v0]\n\t"
        "xorl %k[loop_cnt],%k[v1]\n\t"
        "movl %k[v0],%k[dst]\n\t"
#else
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
#endif
        ".p2align 4\n\t"
#if WITH_NOP_BEFORE_DECL
        "nop\n\t"
#endif
#if BREAK_DEPENDENCY
        "xorl %k[v0],%k[v0]\n\t"
        "xorl %k[v1],%k[v1]\n\t"
#endif
        // macro-fusion is NOT broken
        "decl %k[loop_cnt]\n\t"
        "jnz 1b\n\t"
        : [ v0 ] "=&r"(v0),[ v1 ] "=&r"(v1),[ dst ] "=&r"(dst),[ loop_cnt ] "=&r"(loop_cnt)
        : [ N ] "i"(N)
        : "cc","memory");
    end = _rdtsc();

    double dif = end - start;
    dif /= N;
    printf(
        "SIX_BYTES_COMPUTATION - [%s],WITH_NOP_BEFORE_DECL - [%s],"
        "BREAK_DEPENDENCY - [%s]\n\t",SIX_BYTES_COMPUTATION ? "ON" : "OFF",WITH_NOP_BEFORE_DECL ? "ON" : "OFF",BREAK_DEPENDENCY ? "ON" : "OFF");

    printf("%.3lf \"Cycles\"\n",dif);
}

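Turning on WITH_NOP_BEFORE_DECL, so that there is a nop before the decl + jnz, causes a measurable performance improvement when SIX_BYTES_COMPUTATION is on, but a measurable performance degradation when SIX_BYTES_COMPUTATION is off.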

Here are the numbers:

SIX_BYTES_COMPUTATION - [OFF],WITH_NOP_BEFORE_DECL - [OFF],BREAK_DEPENDENCY - [OFF]
    2.080 "Cycles" <--- Just 6 nops

SIX_BYTES_COMPUTATION - [OFF],WITH_NOP_BEFORE_DECL - [ON],BREAK_DEPENDENCY - [OFF]
    2.363 "Cycles" <--- Performance degradation from previous

SIX_BYTES_COMPUTATION - [ON],WITH_NOP_BEFORE_DECL - [OFF],BREAK_DEPENDENCY - [OFF]
    2.185 "Cycles" <--- Computation then decl + jnz

SIX_BYTES_COMPUTATION - [ON],WITH_NOP_BEFORE_DECL - [ON],BREAK_DEPENDENCY - [OFF]
    1.945 "Cycles" <--- Performance improvement from previous

SIX_BYTES_COMPUTATION - [ON],WITH_NOP_BEFORE_DECL - [OFF],BREAK_DEPENDENCY - [ON]
    1.919 "Cycles" <--- Breaking dependencies has best performance

SIX_BYTES_COMPUTATION - [ON],WITH_NOP_BEFORE_DECL - [ON],BREAK_DEPENDENCY - [ON]
    2.046 "Cycles" <--- nop hurts performance when breaking dependencies

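Could this have something to do with the register file filling up? I found one potentially interesting metric, uops_issued.stall_cycles [Cycles when RAT does not issue Uops to RS for the thread], which gives the following output: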

SIX_BYTES_COMPUTATION - [OFF],BREAK_DEPENDENCY - [OFF]
    473,647      uops_issued.stall_cycles

SIX_BYTES_COMPUTATION - [OFF],BREAK_DEPENDENCY - [OFF]
    495,380      uops_issued.stall_cycles

SIX_BYTES_COMPUTATION - [ON],BREAK_DEPENDENCY - [OFF]
    1,406,244      uops_issued.stall_cycles

SIX_BYTES_COMPUTATION - [ON],BREAK_DEPENDENCY - [OFF]
    875,364      uops_issued.stall_cycles

SIX_BYTES_COMPUTATION - [ON],BREAK_DEPENDENCY - [ON]
    647,297      uops_issued.stall_cycles

SIX_BYTES_COMPUTATION - [ON],BREAK_DEPENDENCY - [ON]
    501,015      uops_issued.stall_cycles

With SIX_BYTES_COMPUTATION on, it does seem to track whether WITH_NOP_BEFORE_DECL is on or off, but I am not sure 1) why a nop would save space in the register file.
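For comparing counters like these across builds, a small parser helps. The following is only a minimal sketch: `parse_perf_counters` is a hypothetical helper name, and which run is "with nop" versus "without nop" is my assumption based on the ordering of the listings above, since the labels do not say.

```python
import re


def parse_perf_counters(text):
    """Map event name -> integer count for perf-stat style lines such as
    '    1,406,244      uops_issued.stall_cycles'. Other lines are ignored."""
    counters = {}
    for line in text.splitlines():
        m = re.match(r"\s*([\d,]+)\s+([\w.\-:]+)", line)
        if m:
            counters[m.group(2)] = int(m.group(1).replace(",", ""))
    return counters


# Assumption: these are the SIX_BYTES_COMPUTATION-on runs without and with the nop.
without_nop = parse_perf_counters("    1,406,244      uops_issued.stall_cycles")
with_nop = parse_perf_counters("    875,364      uops_issued.stall_cycles")

event = "uops_issued.stall_cycles"
ratio = with_nop[event] / without_nop[event]
print(f"{event}: with nop / without nop = {ratio:.2f}")
```

The same function can be fed the full stderr text of a `perf stat` run, since lines that do not start with a count followed by an event name are skipped.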

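I am fairly sure this is not an alignment issue: because of the .p2align 4 between the first 6 bytes of the loop body and decl + jnz, the decl + jnz (with or without the nop) sits in a different 16-byte-aligned region, and the performance difference depends on what is in the loop body (if it were an alignment effect, it should not matter whether the loop body is nops or computation).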

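I think it might have to do with some dependency-chain issue, because if I break the dependencies on v0 and v1 at the end of the loop, then turning WITH_NOP_BEFORE_DECL on causes a performance degradation. I could very well be wrong, as I have no idea why a nop before the end of the loop would affect any dependency issue.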

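It is almost certainly not related to port scheduling. I was thinking something weird might be happening where the nop coincidentally leads to better scheduling, but there is no distinct difference in uops on ports 0, 1, 5, 6 with or without the nop: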

Per-port uop counts with WITH_NOP_BEFORE_DECL off and SIX_BYTES_COMPUTATION on:

SIX_BYTES_COMPUTATION - [ON],WITH_NOP_BEFORE_DECL - [OFF]
    1,147,196      uops_dispatched.port_0
    1,114,665      uops_dispatched.port_1
    1,138,238      uops_dispatched.port_5
    1,266,212      uops_dispatched.port_6

Per-port uop counts with WITH_NOP_BEFORE_DECL on and SIX_BYTES_COMPUTATION on:

SIX_BYTES_COMPUTATION - [ON],WITH_NOP_BEFORE_DECL - [ON]
    1,177,092      uops_dispatched.port_0
    1,081,734      uops_dispatched.port_1
    1,103,314      uops_dispatched.port_5
    1,296,546      uops_dispatched.port_6

My leading theory is that there is some inefficiency in the register renaming process that is a performance bottleneck without WITH_NOP_BEFORE_DECL, and that by luck the nop hides the issue, but I am not confident in that.
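To put a number on "no distinct difference" between the per-port counts, one can compute the relative change per port and the change in the total. This is just a sketch using the uops_dispatched.port_* counts quoted in this question; the helper name `relative_change` is mine.

```python
# uops_dispatched.port_* counts with SIX_BYTES_COMPUTATION on,
# without and with the nop before decl.
ports_nop_off = {"port_0": 1_147_196, "port_1": 1_114_665,
                 "port_5": 1_138_238, "port_6": 1_266_212}
ports_nop_on = {"port_0": 1_177_092, "port_1": 1_081_734,
                "port_5": 1_103_314, "port_6": 1_296_546}


def relative_change(before, after):
    """Per-key (after - before) / before."""
    return {key: (after[key] - before[key]) / before[key] for key in before}


changes = relative_change(ports_nop_off, ports_nop_on)
for port, change in changes.items():
    print(f"{port}: {change:+.1%}")

total_off = sum(ports_nop_off.values())
total_on = sum(ports_nop_on.values())
print(f"total dispatched uops: {(total_on - total_off) / total_off:+.2%}")
```

Every port moves by only a few percent and the total is nearly unchanged, which is why the port distribution does not look like the explanation.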

Can anyone help me understand this behavior?

Edit: Full C++ code and new times, with a warm-up and an lfence before rdtsc.

New code:

#include <assert.h>
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <x86intrin.h>
#include <type_traits>

#define BENCH_FUNC_ATTR __attribute__((aligned(64),noclone))

#ifndef SIX_BYTES_COMPUTATION
#define SIX_BYTES_COMPUTATION 0
#endif
#ifndef WITH_NOP_BEFORE_DECL
#define WITH_NOP_BEFORE_DECL 0
#endif
#ifndef BREAK_DEPENDENCY
#define BREAK_DEPENDENCY 0
#endif
void BENCH_FUNC_ATTR
bench() {
    uint64_t       start,end;
    const uint64_t N        = (1UL << 24);
    const uint64_t WARMUP_N = N << 3;
    uint64_t       v0,loop_cnt;


    asm volatile(
        "xorl %k[v0],%k[loop_cnt]\n\t"
        ".p2align 6\n\t"
        "1:\n\t"
        "xorl %k[loop_cnt],%k[dst]\n\t"
        ".p2align 4\n\t"
        "decl %k[loop_cnt]\n\t"
        "jnz 1b\n\t"
        : [ v0 ] "=&r"(v0),[ loop_cnt ] "=&r"(loop_cnt)
        : [ N ] "i"(WARMUP_N)
        : "cc","memory");


    asm volatile("lfence\n\t" : : : "memory");
    start = _rdtsc();
    asm volatile(
        "xorl %k[v0],%k[loop_cnt]\n\t"
        "lfence\n\t"
        ".p2align 6\n\t"
        "1:\n\t"
#if SIX_BYTES_COMPUTATION
        "xorl %k[loop_cnt],%k[v1]\n\t"
#endif
        "decl %k[loop_cnt]\n\t"
        "jnz 1b\n\t"
        "lfence\n\t"
        : [ v0 ] "=&r"(v0),dif);
}


int
main(int argc,char ** argv) {
    bench();
}

The new times follow the same trend as before, just all a good deal faster:

SIX_BYTES_COMPUTATION - [OFF],BREAK_DEPENDENCY - [OFF]
    0.674 "Cycles"
SIX_BYTES_COMPUTATION - [OFF],BREAK_DEPENDENCY - [OFF]
    0.799 "Cycles"
SIX_BYTES_COMPUTATION - [ON],BREAK_DEPENDENCY - [OFF]
    0.747 "Cycles"
SIX_BYTES_COMPUTATION - [ON],BREAK_DEPENDENCY - [OFF]
    0.650 "Cycles"
SIX_BYTES_COMPUTATION - [ON],BREAK_DEPENDENCY - [OFF]
    0.727 "Cycles"
SIX_BYTES_COMPUTATION - [ON],BREAK_DEPENDENCY - [ON]
    0.645 "Cycles"

Edit: Icelake perf data

SIX_BYTES_COMPUTATION - [OFF],BREAK_DEPENDENCY - [OFF]
    0.681 "Cycles"
    18,522,385,353      lsd.uops
         1,038,665      idq.dsb_uops
     4,270,402,172      cpu-cycles


SIX_BYTES_COMPUTATION - [OFF],BREAK_DEPENDENCY - [OFF]
    0.778 "Cycles"
    20,669,567,680      lsd.uops
         1,049,193      idq.dsb_uops
     4,807,261,565      cpu-cycles


SIX_BYTES_COMPUTATION - [ON],BREAK_DEPENDENCY - [OFF]
    0.734 "Cycles"
    12,080,048,840      lsd.uops
         1,035,128      idq.dsb_uops
     4,552,666,461      cpu-cycles


SIX_BYTES_COMPUTATION - [ON],BREAK_DEPENDENCY - [OFF]
    0.659 "Cycles"
    14,232,154,418      lsd.uops
         1,150,777      idq.dsb_uops
     4,134,501      cpu-cycles


SIX_BYTES_COMPUTATION - [ON],BREAK_DEPENDENCY - [OFF]
    0.735 "Cycles"
    12,166,963      lsd.uops
           982,311      idq.dsb_uops
     4,553,457,015      cpu-cycles


SIX_BYTES_COMPUTATION - [ON],BREAK_DEPENDENCY - [ON]
    0.644 "Cycles"
    16,374,872,770      lsd.uops
         1,022,379      idq.dsb_uops
     4,055,306,960      cpu-cycles

Edit: I am sure it has nothing to do with dependency chains or bytes. Adding a nop (a non-back-end uop) in certain places really does help performance. Here is a benchmark that I think demonstrates this quite clearly.

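In short, the results show that if the extra uop is one that will not execute in the back end, performance is ~.39 ref-cycles per iteration of a 5-uop loop (the ICL front-end width). Otherwise, with no NOP or xor-zeroing padding, it is ~.54 ref-cycles per iteration of the 4-uop loop. The benchmark: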

#include <assert.h>
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <x86intrin.h>
#include <type_traits>


#ifndef UOP
#define UOP 0
#endif
#ifndef BYTE
#define BYTE 0
#endif
#ifndef NOP
#define NOP 0
#endif
#ifndef BREAK_DEP
#define BREAK_DEP 0
#endif
#ifndef COMPUTE_UOP
#define COMPUTE_UOP 0
#endif

#if BREAK_DEP && NOP
#error "Either define NOP or BREAK_DEP"
#endif

#define BENCH_FUNC_ATTR __attribute__((aligned(64),noclone))
void BENCH_FUNC_ATTR
bench() {
    uint64_t          start,end;
    const uint64_t    N        = (1UL << 31);
    const uint64_t    WARMUP_N = N >> 3;
    register uint64_t v0 asm("rdi");
    register uint64_t v1 asm("rsi");
    register uint64_t v2 asm("rdx");
#if COMPUTE_UOP
    register uint64_t v3 asm("rax");
#endif
    register uint64_t loop_cnt asm("rcx");


    asm volatile(
        "xorl %k[v0],%k[v0]\n\t"
        "xorl %k[v1],%k[v1]\n\t"
        "xorl %k[v2],%k[v2]\n\t"
#if COMPUTE_UOP
        "xorl %k[v3],%k[v3]\n\t"
#endif
        "movl %[N],%k[loop_cnt]\n\t"
        "lfence\n\t"
        ".p2align 6\n\t"
        "1:\n\t"

#if UOP == 1 && BYTE == 1 && NOP == 1
        "nop\n\t"
#elif UOP == 1 && BYTE == 2 && NOP == 1
        "xchg   %%ax,%%ax\n\t"
#elif UOP == 1 && BYTE == 4 && NOP == 1
        "nopl   0x0(%%rax)\n\t"
#elif UOP == 2 && BYTE == 2 && NOP == 1
        "nop\n\t"
        "nop\n\t"
#elif UOP == 2 && BYTE == 4 && NOP == 1
        "xchg   %%ax,%%ax\n\t"
        "xchg   %%ax,%%ax\n\t"
#elif UOP == 4 && BYTE == 4 && NOP == 1
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
#elif UOP == 2 && BYTE == 4 && BREAK_DEP == 1
        "xorl %k[v0],%k[v0]\n\t"
        "xorl %k[v1],%k[v1]\n\t"
#elif UOP == 1 && BYTE == 2 && BREAK_DEP == 1
        "xorl %k[v0],%k[v0]\n\t"
#elif COMPUTE_UOP
        "incl %k[v3]\n\t"
#endif

        "incl %k[v0]\n\t"
        "incl %k[v1]\n\t"
        "incl %k[v2]\n\t"

        "decl %k[loop_cnt]\n\t"
        "jnz 1b\n\t"
        "lfence\n\t"
        : [ v0 ] "=&r"(v0),[ v1 ] "=&r"(v1),[ v2 ] "=&r"(v2),
#if COMPUTE_UOP
          [ v3 ] "=&r"(v3),
#endif
          [ loop_cnt ] "=&r"(loop_cnt)
        : [ N ] "i"(WARMUP_N)
        : "cc","memory");


    start = _rdtsc();
    asm volatile(
        "xorl %k[v0],#endif
          [ loop_cnt ] "=&r"(loop_cnt)
        : [ N ] "i"(N)
        : "cc","memory");
    end = _rdtsc();

    double dif = end - start;
    dif /= N;
    printf("UOP         -> %d\n",UOP);
    printf("BYTE        -> %d\n",BYTE);
    printf("NOP         -> %d\n",NOP);
    printf("BREAK_DEP   -> %d\n",BREAK_DEP);
    printf("COMPUTE_UOP -> %d\n",COMPUTE_UOP);
    printf("%.3lf \"Cycles\"\n",dif);
}

int
main(int argc,char ** argv) {
    bench();
}

Run script output:

UOP         -> 1
BYTE        -> 1
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.391 "Cycles"
2,420,617,801      idq_uops_not_delivered.cycles_fe_was_ok                                   
    5,840,894      uops_issued.stall_cycles                                    
--------------------------------------------------------------

UOP         -> 1
BYTE        -> 2
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.389 "Cycles"
2,419,599,257      idq_uops_not_delivered.cycles_fe_was_ok                                   
    4,791,034      uops_issued.stall_cycles                                    
--------------------------------------------------------------

UOP         -> 1
BYTE        -> 4
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.391 "Cycles"
2,411,711      idq_uops_not_delivered.cycles_fe_was_ok                                   
    5,915,776      uops_issued.stall_cycles                                    
--------------------------------------------------------------

UOP         -> 2
BYTE        -> 2
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.554 "Cycles"
3,032,334      idq_uops_not_delivered.cycles_fe_was_ok                                   
  215,743      uops_issued.stall_cycles                                    
--------------------------------------------------------------

UOP         -> 2
BYTE        -> 4
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.555 "Cycles"
3,735      idq_uops_not_delivered.cycles_fe_was_ok                                   
  214,953,593      uops_issued.stall_cycles                                    
--------------------------------------------------------------

UOP         -> 4
BYTE        -> 4
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.683 "Cycles"
3,629,685,924      idq_uops_not_delivered.cycles_fe_was_ok                                   
    7,883,534      uops_issued.stall_cycles                                    
--------------------------------------------------------------

UOP         -> 1
BYTE        -> 2
NOP         -> 0
BREAK_DEP   -> 1
COMPUTE_UOP -> 0
0.395 "Cycles"
2,440,570      idq_uops_not_delivered.cycles_fe_was_ok                                   
   26,095,530      uops_issued.stall_cycles                                    
--------------------------------------------------------------

UOP         -> 2
BYTE        -> 4
NOP         -> 0
BREAK_DEP   -> 1
COMPUTE_UOP -> 0
0.520 "Cycles"
2,821,992,876      idq_uops_not_delivered.cycles_fe_was_ok                                   
    4,762,782      uops_issued.stall_cycles                                    
--------------------------------------------------------------

UOP         -> 1
BYTE        -> 2
NOP         -> 0
BREAK_DEP   -> 0
COMPUTE_UOP -> 1
0.624 "Cycles"
3,864,366,562      idq_uops_not_delivered.cycles_fe_was_ok                                   
1,450,508,248      uops_issued.stall_cycles                                    
--------------------------------------------------------------

UOP         -> 0
BYTE        -> 0
NOP         -> 0
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.539 "Cycles"
2,947,391,859      idq_uops_not_delivered.cycles_fe_was_ok                                   
1,341,303,591      uops_issued.stall_cycles
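Because the "Cycles" above come from rdtsc, they are reference cycles, not core clock cycles, which makes it harder to compare them to the front-end width directly. Below is a rough sketch of the conversion. The two clock frequencies are placeholders I picked for illustration (they are assumptions, not measured values), so substitute the actual TSC and core frequencies for the machine under test.

```python
def core_cycles_per_iter(ref_cycles_per_iter, core_ghz, tsc_ghz):
    """Convert TSC (reference) cycles per iteration into core clock cycles."""
    return ref_cycles_per_iter * core_ghz / tsc_ghz


def uops_per_core_cycle(uops_per_iter, core_cycles):
    """Front-end throughput implied by a loop of uops_per_iter uops."""
    return uops_per_iter / core_cycles


# Assumed clocks for illustration only; not taken from the measurements.
CORE_GHZ = 3.9
TSC_GHZ = 1.5

runs = [
    ("5-uop loop (1-byte nop first)", 5, 0.391),
    ("4-uop loop (no padding)", 4, 0.539),
]
for label, uops, ref_cycles in runs:
    cycles = core_cycles_per_iter(ref_cycles, CORE_GHZ, TSC_GHZ)
    rate = uops_per_core_cycle(uops, cycles)
    print(f"{label}: {cycles:.2f} core cycles/iter, {rate:.2f} uops/cycle")
```

Under those assumed clocks the 5-uop loop would come out close to one iteration per core cycle, while the 4-uop loop would be well short of that. Comparing against the cpu-cycles counter from perf is a way to check the conversion without assuming any frequencies.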

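Run script:

import os
import sys

fname = "test-nop"
if (len(sys.argv) > 1):
    fname = sys.argv[1]

build_cmd = "g++ -DUOP={} -DBYTE={} -DNOP={} -DBREAK_DEP={} -DCOMPUTE_UOP={} -O3 -std=c++17 -march=native -mtune=native " + fname + ".cc -o " + fname
run_cmd = "perf stat -e idq_uops_not_delivered.cycles_fe_was_ok -e uops_issued.stall_cycles ./{}"

zero_one = [0, 1]
uop = [1, 2, 4]
byte = [1, 2, 4]
nop = [1]
break_dep = [1]
compute_uop = [1]

# NOP padding: every (UOP, BYTE) combination with BYTE >= UOP
for n in nop:
    for u in uop:
        for b in byte:
            if b < u:
                continue
            os.system(build_cmd.format(u, b, n, 0, 0))
            os.system(run_cmd.format(fname))

# xor-zeroing padding: each xorl is 2 bytes, so BYTE == 2 * UOP
for bd in break_dep:
    for u in uop:
        for b in byte:
            if b != 2 * u:
                continue
            if b < u:
                continue
            os.system(build_cmd.format(u, b, 0, bd, 0))
            os.system(run_cmd.format(fname))

# One extra computational uop (incl)
os.system(build_cmd.format(1, 2, 0, 0, 1))
os.system(run_cmd.format(fname))

# Baseline: plain 4-uop loop
os.system(build_cmd.format(0, 0, 0, 0, 0))
os.system(run_cmd.format(fname))

Edit: Something interesting. It appears that for the nop to make the 5-uop loop perform better than the 4-uop loop, placement matters. The zeroing-idiom xorl, however, always improves performance. Below are the numbers for the 4 cases of the 5-uop loop, with the nop / xorl interleaved at different points. The nop version only shows an improvement when it is the first instruction, whereas the xorl version always has a performance improvement. This is a bit odd, given that the nop does help in the first case. The only thing I can think of is that placement might affect where things end up in the uop cache or the LSD buffer?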

Numbers:

################################################################
<nop,xorl,etc...>
incl
incl
incl
decl
jnz

UOP         -> 1
BYTE        -> 1
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.389 "Cycles"
2,418,941,957      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 2
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.389 "Cycles"
2,490,126      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 4
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.390 "Cycles"
2,125,302      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 2
BYTE        -> 2
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.553 "Cycles"
3,033,520,044      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 2
NOP         -> 0
BREAK_DEP   -> 1
COMPUTE_UOP -> 0
0.394 "Cycles"
2,442,515,834      idq_uops_not_delivered.cycles_fe_was_ok

################################################################
incl
<nop,etc...>
incl
incl
decl
jnz

UOP         -> 1
BYTE        -> 1
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.566 "Cycles"
3,390,955,219      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 2
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.563 "Cycles"
3,373,556,409      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 4
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.565 "Cycles"
3,380,145,525      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 2
NOP         -> 0
BREAK_DEP   -> 1
COMPUTE_UOP -> 0
0.391 "Cycles"
2,428,978,799      idq_uops_not_delivered.cycles_fe_was_ok

################################################################
incl
incl
<nop,etc...>
incl
decl
jnz

UOP         -> 1
BYTE        -> 1
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.564 "Cycles"
3,377,709,071      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 2
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.563 "Cycles"
3,494,813      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 4
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.564 "Cycles"
3,019,951      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 2
NOP         -> 0
BREAK_DEP   -> 1
COMPUTE_UOP -> 0
0.389 "Cycles"
2,319,618      idq_uops_not_delivered.cycles_fe_was_ok

################################################################
incl
incl
incl
<nop,etc...>
decl
jnz

UOP         -> 1
BYTE        -> 1
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.556 "Cycles"
3,329,607,623      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 2
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.559 "Cycles"
3,340,246,297      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 4
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.553 "Cycles"
3,254,092      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 2
NOP         -> 0
BREAK_DEP   -> 1
COMPUTE_UOP -> 0
0.543 "Cycles"
3,279,214,443      idq_uops_not_delivered.cycles_fe_was_ok

Edit: Trial data with 4 independent incl instructions in the loop, which makes it a 6-uop loop with the nop or a 5-uop loop without it. I was able to see a measurable and reproducible (though more modest) performance improvement when adding a 6th uop in the following cases: if the 6th uop is a nop (1, 2, or 4 bytes), it must be between the first and second incl. If the 6th uop is a zeroing-idiom xor, it can be anywhere. Here are the results when the 6th instruction is between the 1st and 2nd incl:

The loop looks like:

incl
<6th instruction>
incl
incl
incl
decl
jnz

Times:

UOP         -> 1
BYTE        -> 1
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.603 "Cycles"
3,242,400,541      idq_uops_not_delivered.cycles_fe_was_ok                                   

UOP         -> 1
BYTE        -> 2
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.604 "Cycles"
3,244,473,075      idq_uops_not_delivered.cycles_fe_was_ok                                   

UOP         -> 1
BYTE        -> 4
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.601 "Cycles"
3,239,305,874      idq_uops_not_delivered.cycles_fe_was_ok                                   

UOP         -> 2
BYTE        -> 2
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.641 "Cycles"
3,330,250      idq_uops_not_delivered.cycles_fe_was_ok                                   

UOP         -> 2
BYTE        -> 4
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.649 "Cycles"
3,334,019      idq_uops_not_delivered.cycles_fe_was_ok                                   

UOP         -> 4
BYTE        -> 4
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.788 "Cycles"
3,989,749,825      idq_uops_not_delivered.cycles_fe_was_ok                                   

UOP         -> 1
BYTE        -> 2
NOP         -> 0
BREAK_DEP   -> 1
COMPUTE_UOP -> 0
0.551 "Cycles"
2,893,829,059      idq_uops_not_delivered.cycles_fe_was_ok                                   

UOP         -> 2
BYTE        -> 4
NOP         -> 0
BREAK_DEP   -> 1
COMPUTE_UOP -> 0
0.604 "Cycles"
3,007,481,786      idq_uops_not_delivered.cycles_fe_was_ok                                   

UOP         -> 0
BYTE        -> 0
NOP         -> 0
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.620 "Cycles"
3,755,030,033      idq_uops_not_delivered.cycles_fe_was_ok
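Since placement of the extra instruction turned out to matter, it may be easier to generate each loop variant than to edit the asm by hand. The following is a minimal sketch of that idea; `BASE_LOOP`, `loop_with_extra` and `as_asm_lines` are hypothetical names, and the base loop is the 3-incl version from the benchmark above.

```python
# Loop body of the benchmark, one instruction per entry (decl/jnz last).
BASE_LOOP = [
    "incl %k[v0]",
    "incl %k[v1]",
    "incl %k[v2]",
    "decl %k[loop_cnt]",
    "jnz 1b",
]


def loop_with_extra(extra, position, base=BASE_LOOP):
    """Return a copy of the loop with `extra` inserted before base[position].

    Positions that would land between decl and jnz are rejected so the
    pair stays adjacent and can still macro-fuse."""
    if not 0 <= position <= len(base) - 2:
        raise ValueError("position would split decl/jnz or is out of range")
    return base[:position] + [extra] + base[position:]


def as_asm_lines(instructions):
    """Format instructions as string literals for a GNU C asm statement."""
    return "\n".join(f'        "{insn}\\n\\t"' for insn in instructions)


for pos in range(len(BASE_LOOP) - 1):
    print(f"// extra instruction at position {pos}")
    print(as_asm_lines(loop_with_extra("nop", pos)))
```

Passing `"xorl %k[v0],%k[v0]"` instead of `"nop"` would give the zeroing-idiom variants, and a `base` list with four incl instructions would give the 6-uop case.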
