Why does a NOP as the 5th uop speed up a 4-uop loop on Ice Lake?


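All benchmarks were run on Icelake: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz (ark)

Edit: I was unable to reproduce this on Broadwell, and @PeterCordes was unable to reproduce it on Skylake.

I was trying to benchmark different ways of computing integer min(a,b) and ran into some behavior I can't explain, which I have boiled down to the following benchmark: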

#define BENCH_FUNC_ATTR __attribute__((aligned(64),noinline,noclone))

#define SIX_BYTES_COMPUTATION 1
#define WITH_NOP_BEFORE_DECL  0
#define BREAK_DEPENDENCY      0
void BENCH_FUNC_ATTR
bench() {
    uint64_t       start,end;
    const uint64_t N = 1000000;
    start            = _rdtsc();
    uint64_t v0,v1,dst,loop_cnt;
    asm volatile(
        "xorl %k[v0],%k[v0]\n\t"
        "movl $1,%k[v1]\n\t"
        "movl %[N],%k[loop_cnt]\n\t"
        ".p2align 6\n\t"
        "1:\n\t"
#if SIX_BYTES_COMPUTATION
        "xorl %k[loop_cnt],%k[v0]\n\t"
        "xorl %k[loop_cnt],%k[v1]\n\t"
        "movl %k[v0],%k[dst]\n\t"
#else
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
#endif
        ".p2align 4\n\t"
#if WITH_NOP_BEFORE_DECL
        "nop\n\t"
#endif
#if BREAK_DEPENDENCY
        "xorl %k[v0],%k[v0]\n\t"
        "xorl %k[v1],%k[v1]\n\t"
#endif
        // macro-fusion is NOT broken
        "decl %k[loop_cnt]\n\t"
        "jnz 1b\n\t"
        : [ v0 ] "=&r"(v0),[ v1 ] "=&r"(v1),[ dst ] "=&r"(dst),[ loop_cnt ] "=&r"(loop_cnt)
        : [ N ] "i"(N)
        : "cc","memory");
    end = _rdtsc();

    double dif = end - start;
    dif /= N;
    printf(
        "SIX_BYTES_COMPUTATION - [%s],WITH_NOP_BEFORE_DECL - [%s],"
        "BREAK_DEPENDENCY - [%s]\n\t",SIX_BYTES_COMPUTATION ? "ON" : "OFF",WITH_NOP_BEFORE_DECL ? "ON" : "OFF",BREAK_DEPENDENCY ? "ON" : "OFF");

    printf("%.3lf \"Cycles\"\n",dif);
}


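Turning on WITH_NOP_BEFORE_DECL, so that there is a nop before the decl + jnz, causes a measurable performance improvement when SIX_BYTES_COMPUTATION is on, but a measurable performance degradation when SIX_BYTES_COMPUTATION is off.

Here are the numbers: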

SIX_BYTES_COMPUTATION - [OFF],WITH_NOP_BEFORE_DECL - [OFF],BREAK_DEPENDENCY - [OFF]
    2.080 "Cycles" <--- Just 6 nops

SIX_BYTES_COMPUTATION - [OFF],WITH_NOP_BEFORE_DECL - [ON],BREAK_DEPENDENCY - [OFF]
    2.363 "Cycles" <--- Performance degradation from previous

SIX_BYTES_COMPUTATION - [ON],WITH_NOP_BEFORE_DECL - [OFF],BREAK_DEPENDENCY - [OFF]
    2.185 "Cycles" <--- Computation then decl + jnz

SIX_BYTES_COMPUTATION - [ON],WITH_NOP_BEFORE_DECL - [ON],BREAK_DEPENDENCY - [OFF]
    1.945 "Cycles" <--- Performance improvement from previous

SIX_BYTES_COMPUTATION - [ON],WITH_NOP_BEFORE_DECL - [OFF],BREAK_DEPENDENCY - [ON]
    1.919 "Cycles" <--- Breaking dependencies has best performance

SIX_BYTES_COMPUTATION - [ON],WITH_NOP_BEFORE_DECL - [ON],BREAK_DEPENDENCY - [ON]
    2.046 "Cycles" <--- nop hurts performance when breaking dependencies
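To put the deltas above in relative terms, here is a small Python sketch (Python being the language of the run script used later in this post). The helper name `pct_change` is hypothetical, and the pairing of each "without nop" and "with nop" timing follows the order of the listing above.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change in percent; a negative value means `after` is faster."""
    return (after - before) / before * 100.0


# (label, cycles without the nop, cycles with the nop), copied from the listing above
pairs = [
    ("6 nops in the body", 2.080, 2.363),
    ("computation in the body", 2.185, 1.945),
    ("computation + dependency break", 1.919, 2.046),
]

for label, without_nop, with_nop in pairs:
    change = pct_change(without_nop, with_nop)
    print(f"{label}: {change:+.1f}% cycles per iteration with the nop")
```

By this arithmetic the nop costs roughly 14% with the 6-nop body, saves roughly 11% with the computation body, and costs roughly 7% once the dependencies are broken.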


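Could this have something to do with the register file filling up? I found a potentially interesting metric, uops_issued.stall_cycles [Cycles when RAT does not issue Uops to RS for the thread], which gives the following output: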

SIX_BYTES_COMPUTATION - [OFF],WITH_NOP_BEFORE_DECL - [OFF],BREAK_DEPENDENCY - [OFF]
    473,647      uops_issued.stall_cycles

SIX_BYTES_COMPUTATION - [OFF],WITH_NOP_BEFORE_DECL - [ON],BREAK_DEPENDENCY - [OFF]
    495,380      uops_issued.stall_cycles

SIX_BYTES_COMPUTATION - [ON],WITH_NOP_BEFORE_DECL - [OFF],BREAK_DEPENDENCY - [OFF]
    1,406,244      uops_issued.stall_cycles

SIX_BYTES_COMPUTATION - [ON],WITH_NOP_BEFORE_DECL - [ON],BREAK_DEPENDENCY - [OFF]
    875,364      uops_issued.stall_cycles

SIX_BYTES_COMPUTATION - [ON],WITH_NOP_BEFORE_DECL - [OFF],BREAK_DEPENDENCY - [ON]
    647,297      uops_issued.stall_cycles

SIX_BYTES_COMPUTATION - [ON],WITH_NOP_BEFORE_DECL - [ON],BREAK_DEPENDENCY - [ON]
    501,015      uops_issued.stall_cycles

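It seems to correspond with SIX_BYTES_COMPUTATION being on and WITH_NOP_BEFORE_DECL being on or off, but I'm not sure why the nop would save space in the register file.


I'm fairly sure this is not an alignment issue, because the first 6 bytes of the loop body and the decl + jnz after the .p2align 4 will be in different 16-byte aligned regions, and the performance difference depends on what is in the loop body (if it were an alignment effect, it should not matter whether the loop body is nops or computation).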
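The alignment argument can be checked with a small layout sketch. It is only an illustration: the helper `layout` is hypothetical, and the instruction sizes (2 bytes for each xorl, movl, decl and short jnz, 1 byte for nop) are assumptions that hold only when no REX prefix is needed, which the name SIX_BYTES_COMPUTATION suggests for the three body instructions.

```python
def layout(items, base=0):
    """Assign byte offsets to (name, size) items.

    An item of the form (".p2align", k) emits nothing itself and
    rounds the current offset up to the next multiple of 2**k.
    Returns a list of (name, offset, index of the 16-byte block).
    """
    offset, placed = base, []
    for name, arg in items:
        if name == ".p2align":
            align = 1 << arg
            offset = (offset + align - 1) & ~(align - 1)
            continue
        placed.append((name, offset, offset // 16))
        offset += arg
    return placed


# Assumed encodings; offsets are relative to the 64-byte aligned loop label.
body = [("xorl", 2), ("xorl", 2), ("movl", 2), (".p2align", 4)]
tail = [("decl", 2), ("jnz", 2)]

for with_nop in (False, True):
    items = body + ([("nop", 1)] if with_nop else []) + tail
    print("WITH_NOP_BEFORE_DECL =", int(with_nop))
    for name, off, block in layout(items):
        print(f"  {name:5s} offset {off:2d}  16-byte block {block}")
```

Under these assumptions the body occupies bytes 0 to 5 of the first 16-byte block, and the nop, decl and jnz all land in the second block whether or not the nop is present.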


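I think it might have something to do with a dependency-chain issue, because if I break the dependency on v0 and v1 at the end of the loop, then turning WITH_NOP_BEFORE_DECL on causes a performance degradation. I may well be wrong, though, since I have no idea why a nop before the end of the loop would affect any dependency issue.

It is almost certainly not related to port scheduling. I was thinking something odd might be going on where the nop happens to cause better scheduling, but there is no meaningful difference in the uops dispatched on ports 0, 1, 5 and 6 with or without the nop: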

Per-port uop counts with WITH_NOP_BEFORE_DECL off and SIX_BYTES_COMPUTATION on:

SIX_BYTES_COMPUTATION - [ON],WITH_NOP_BEFORE_DECL - [OFF]
    1,147,196      uops_dispatched.port_0
    1,114,665      uops_dispatched.port_1
    1,138,238      uops_dispatched.port_5
    1,266,212      uops_dispatched.port_6

Per-port uop counts with WITH_NOP_BEFORE_DECL on and SIX_BYTES_COMPUTATION on:

SIX_BYTES_COMPUTATION - [ON],WITH_NOP_BEFORE_DECL - [ON]
    1,177,092      uops_dispatched.port_0
    1,081,734      uops_dispatched.port_1
    1,103,314      uops_dispatched.port_5
    1,296,546      uops_dispatched.port_6

My leading theory is that there is some inefficiency in the register renaming process that limits performance without the nop, and that the nop happens to hide the problem, but I'm not at all confident in this.
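The per-port counts quoted above can be compared as shares of the total, which makes the "no meaningful difference" claim easier to judge. This is a minimal sketch; the helper `port_shares` is a hypothetical name, and the numbers are copied from the two uops_dispatched listings.

```python
def port_shares(counts):
    """Return each port's fraction of all uops dispatched on the listed ports."""
    total = sum(counts.values())
    return {port: n / total for port, n in counts.items()}


# SIX_BYTES_COMPUTATION on, WITH_NOP_BEFORE_DECL off
without_nop = {"p0": 1_147_196, "p1": 1_114_665, "p5": 1_138_238, "p6": 1_266_212}
# SIX_BYTES_COMPUTATION on, WITH_NOP_BEFORE_DECL on
with_nop = {"p0": 1_177_092, "p1": 1_081_734, "p5": 1_103_314, "p6": 1_296_546}

shares_off = port_shares(without_nop)
shares_on = port_shares(with_nop)

for port in without_nop:
    delta = shares_on[port] - shares_off[port]
    print(f"{port}: {shares_off[port]:.3f} -> {shares_on[port]:.3f} ({delta:+.3f})")
```

Each port's share moves by less than one percentage point between the two runs, which is consistent with the observation that port pressure does not explain the timing difference.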

Can anyone help me understand this behavior?

Edit: Full cpp code and new timings, including warm-up and an lfence before rdtsc.

New code:

#include <assert.h>
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <x86intrin.h>
#include <type_traits>

#define BENCH_FUNC_ATTR __attribute__((aligned(64),noclone))

#ifndef SIX_BYTES_COMPUTATION
#define SIX_BYTES_COMPUTATION 0
#endif
#ifndef WITH_NOP_BEFORE_DECL
#define WITH_NOP_BEFORE_DECL 0
#endif
#ifndef BREAK_DEPENDENCY
#define BREAK_DEPENDENCY 0
#endif
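void BENCH_FUNC_ATTR
bench() {
    uint64_t       start,end;
    const uint64_t N        = (1UL << 24);
    const uint64_t WARMUP_N = N << 3;
    uint64_t       v0,v1,dst,loop_cnt;

    // Warm-up: same loop shape as the first listing, run WARMUP_N times.
    asm volatile(
        "xorl %k[v0],%k[v0]\n\t"
        "movl $1,%k[v1]\n\t"
        "movl %[N],%k[loop_cnt]\n\t"
        ".p2align 6\n\t"
        "1:\n\t"
        "xorl %k[loop_cnt],%k[v0]\n\t"
        "xorl %k[loop_cnt],%k[v1]\n\t"
        "movl %k[v0],%k[dst]\n\t"
        ".p2align 4\n\t"
        "decl %k[loop_cnt]\n\t"
        "jnz 1b\n\t"
        : [ v0 ] "=&r"(v0),[ v1 ] "=&r"(v1),[ dst ] "=&r"(dst),[ loop_cnt ] "=&r"(loop_cnt)
        : [ N ] "i"(WARMUP_N)
        : "cc","memory");

    asm volatile("lfence\n\t" : : : "memory");
    start = _rdtsc();
    // Timed loop: body and #if switches follow the first listing.
    asm volatile(
        "xorl %k[v0],%k[v0]\n\t"
        "movl $1,%k[v1]\n\t"
        "movl %[N],%k[loop_cnt]\n\t"
        "lfence\n\t"
        ".p2align 6\n\t"
        "1:\n\t"
#if SIX_BYTES_COMPUTATION
        "xorl %k[loop_cnt],%k[v0]\n\t"
        "xorl %k[loop_cnt],%k[v1]\n\t"
        "movl %k[v0],%k[dst]\n\t"
#else
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
#endif
        ".p2align 4\n\t"
#if WITH_NOP_BEFORE_DECL
        "nop\n\t"
#endif
#if BREAK_DEPENDENCY
        "xorl %k[v0],%k[v0]\n\t"
        "xorl %k[v1],%k[v1]\n\t"
#endif
        "decl %k[loop_cnt]\n\t"
        "jnz 1b\n\t"
        "lfence\n\t"
        : [ v0 ] "=&r"(v0),[ v1 ] "=&r"(v1),[ dst ] "=&r"(dst),[ loop_cnt ] "=&r"(loop_cnt)
        : [ N ] "i"(N)
        : "cc","memory");
    end = _rdtsc();

    double dif = end - start;
    dif /= N;
    printf(
        "SIX_BYTES_COMPUTATION - [%s],WITH_NOP_BEFORE_DECL - [%s],"
        "BREAK_DEPENDENCY - [%s]\n\t",SIX_BYTES_COMPUTATION ? "ON" : "OFF",WITH_NOP_BEFORE_DECL ? "ON" : "OFF",BREAK_DEPENDENCY ? "ON" : "OFF");

    printf("%.3lf \"Cycles\"\n",dif);
}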


int
main(int argc,char ** argv) {
    bench();
}

The new timings follow the same trend as before, except that they are all much faster.

Edit: performance data on Icelake:

SIX_BYTES_COMPUTATION - [OFF],BREAK_DEPENDENCY - [OFF]
    0.674 "Cycles"
SIX_BYTES_COMPUTATION - [OFF],BREAK_DEPENDENCY - [OFF]
    0.799 "Cycles"
SIX_BYTES_COMPUTATION - [ON],BREAK_DEPENDENCY - [OFF]
    0.747 "Cycles"
SIX_BYTES_COMPUTATION - [ON],BREAK_DEPENDENCY - [OFF]
    0.650 "Cycles"
SIX_BYTES_COMPUTATION - [ON],BREAK_DEPENDENCY - [OFF]
    0.727 "Cycles"
SIX_BYTES_COMPUTATION - [ON],BREAK_DEPENDENCY - [ON]
    0.645 "Cycles"

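Edit: I am now certain this has nothing to do with dependency chains or code bytes. Adding a nop (a uop that never reaches the back end) in certain places really does help performance. The benchmark further down, after the counter data that follows, shows this quite clearly, I think.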

SIX_BYTES_COMPUTATION - [OFF],BREAK_DEPENDENCY - [OFF]
    0.681 "Cycles"
    18,522,385,353      lsd.uops                                                    
         1,038,665      idq.dsb_uops                                                
     4,270,402,172      cpu-cycles                                                  


SIX_BYTES_COMPUTATION - [OFF],BREAK_DEPENDENCY - [OFF]
    0.778 "Cycles"
    20,669,567,680      lsd.uops                                                    
         1,049,193      idq.dsb_uops                                                
     4,807,261,565      cpu-cycles                                                  


SIX_BYTES_COMPUTATION - [ON],BREAK_DEPENDENCY - [OFF]
    0.734 "Cycles"
    12,080,048,840      lsd.uops                                                    
         1,035,128      idq.dsb_uops                                                
     4,552,666,461      cpu-cycles                                                  


SIX_BYTES_COMPUTATION - [ON],BREAK_DEPENDENCY - [OFF]
    0.659 "Cycles"
    14,232,154,418      lsd.uops                                                    
         1,150,777      idq.dsb_uops                                                
     4,134,501      cpu-cycles                                                  


SIX_BYTES_COMPUTATION - [ON],BREAK_DEPENDENCY - [OFF]
    0.735 "Cycles"
    12,166,963      lsd.uops                                                    
           982,311      idq.dsb_uops                                                
     4,553,457,015      cpu-cycles                                                  


SIX_BYTES_COMPUTATION - [ON],BREAK_DEPENDENCY - [ON]
    0.644 "Cycles"
    16,374,872,770      lsd.uops                                                    
         1,022,379      idq.dsb_uops                                                
     4,055,306,960      cpu-cycles                                                  
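The counter pairs above can be reduced to uops delivered by the LSD per core cycle. The sketch below is illustrative only: `uops_per_cycle` is a hypothetical helper, and it uses just the four runs whose lsd.uops and cpu-cycles values both appear complete in the listing (the other two look truncated and are left out). Note that cpu-cycles covers the whole process, so the ratio is a slight underestimate for the loop itself.

```python
def uops_per_cycle(lsd_uops: int, cpu_cycles: int) -> float:
    """Average number of LSD-delivered uops per core clock cycle."""
    return lsd_uops / cpu_cycles


# (reported "Cycles" per iteration, lsd.uops, cpu-cycles), copied from the listing above
samples = [
    (0.681, 18_522_385_353, 4_270_402_172),
    (0.778, 20_669_567_680, 4_807_261_565),
    (0.734, 12_080_048_840, 4_552_666_461),
    (0.644, 16_374_872_770, 4_055_306_960),
]

for ref_cycles, lsd_uops, cpu_cycles in samples:
    rate = uops_per_cycle(lsd_uops, cpu_cycles)
    print(f'{ref_cycles:.3f} "Cycles": {rate:.2f} uops per core cycle')
```

The run at 0.734 stands out at about 2.65 uops per cycle, while the other three sit between 4.0 and 4.4.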

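Results: you can basically see that with 1 extra uop that is not executed in the back end, performance is ~0.39 ref-cycles per iteration for the 5-uop loop (the ICL front-end width). Otherwise, with no NOP or xor-zeroing padding, it is ~0.54 ref-cycles per iteration for the 4-uop loop. The benchmark code, followed by the numbers: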

#include <assert.h>
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <x86intrin.h>
#include <type_traits>


#ifndef UOP
#define UOP 0
#endif
#ifndef BYTE
#define BYTE 0
#endif
#ifndef NOP
#define NOP 0
#endif
#ifndef BREAK_DEP
#define BREAK_DEP 0
#endif
#ifndef COMPUTE_UOP
#define COMPUTE_UOP 0
#endif

#if BREAK_DEP && NOP
#error "Either define NOP or BREAK_DEP"
#endif

#define BENCH_FUNC_ATTR __attribute__((aligned(64),noclone))
void BENCH_FUNC_ATTR
bench() {
    uint64_t          start,end;
    const uint64_t    N        = (1UL << 31);
    const uint64_t    WARMUP_N = N >> 3;
    register uint64_t v0 asm("rdi");
    register uint64_t v1 asm("rsi");
    register uint64_t v2 asm("rdx");
#if COMPUTE_UOP
    register uint64_t v3 asm("rax");
#endif
    register uint64_t loop_cnt asm("rcx");


    // Warm-up run: same loop as the timed run, WARMUP_N iterations.
    asm volatile(
        "xorl %k[v0],%k[v0]\n\t"
        "xorl %k[v1],%k[v1]\n\t"
        "xorl %k[v2],%k[v2]\n\t"
#if COMPUTE_UOP
        "xorl %k[v3],%k[v3]\n\t"
#endif
        "movl %[N],%k[loop_cnt]\n\t"
        "lfence\n\t"
        ".p2align 6\n\t"
        "1:\n\t"

        // Extra instruction(s) under test, selected by UOP / BYTE / NOP / BREAK_DEP.
#if UOP == 1 && BYTE == 1 && NOP == 1
        "nop\n\t"
#elif UOP == 1 && BYTE == 2 && NOP == 1
        "xchg   %%ax,%%ax\n\t"
#elif UOP == 1 && BYTE == 4 && NOP == 1
        "nopl   0x0(%%rax)\n\t"
#elif UOP == 2 && BYTE == 2 && NOP == 1
        "nop\n\t"
        "nop\n\t"
#elif UOP == 2 && BYTE == 4 && NOP == 1
        "xchg   %%ax,%%ax\n\t"
        "xchg   %%ax,%%ax\n\t"
#elif UOP == 4 && BYTE == 4 && NOP == 1
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
#elif UOP == 2 && BYTE == 4 && BREAK_DEP == 1
        "xorl %k[v0],%k[v0]\n\t"
        "xorl %k[v1],%k[v1]\n\t"
#elif UOP == 1 && BYTE == 2 && BREAK_DEP == 1
        "xorl %k[v0],%k[v0]\n\t"
#elif COMPUTE_UOP
        "incl %k[v3]\n\t"
#endif

        "incl %k[v0]\n\t"
        "incl %k[v1]\n\t"
        "incl %k[v2]\n\t"

        "decl %k[loop_cnt]\n\t"
        "jnz 1b\n\t"
        "lfence\n\t"
        : [ v0 ] "=&r"(v0),[ v1 ] "=&r"(v1),[ v2 ] "=&r"(v2),
#if COMPUTE_UOP
          [ v3 ] "=&r"(v3),
#endif
          [ loop_cnt ] "=&r"(loop_cnt)
        : [ N ] "i"(WARMUP_N)
        : "cc","memory");


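    start = _rdtsc();
    // Timed run: same loop as the warm-up above, N iterations.
    asm volatile(
        "xorl %k[v0],%k[v0]\n\t"
        "xorl %k[v1],%k[v1]\n\t"
        "xorl %k[v2],%k[v2]\n\t"
#if COMPUTE_UOP
        "xorl %k[v3],%k[v3]\n\t"
#endif
        "movl %[N],%k[loop_cnt]\n\t"
        "lfence\n\t"
        ".p2align 6\n\t"
        "1:\n\t"

#if UOP == 1 && BYTE == 1 && NOP == 1
        "nop\n\t"
#elif UOP == 1 && BYTE == 2 && NOP == 1
        "xchg   %%ax,%%ax\n\t"
#elif UOP == 1 && BYTE == 4 && NOP == 1
        "nopl   0x0(%%rax)\n\t"
#elif UOP == 2 && BYTE == 2 && NOP == 1
        "nop\n\t"
        "nop\n\t"
#elif UOP == 2 && BYTE == 4 && NOP == 1
        "xchg   %%ax,%%ax\n\t"
        "xchg   %%ax,%%ax\n\t"
#elif UOP == 4 && BYTE == 4 && NOP == 1
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
#elif UOP == 2 && BYTE == 4 && BREAK_DEP == 1
        "xorl %k[v0],%k[v0]\n\t"
        "xorl %k[v1],%k[v1]\n\t"
#elif UOP == 1 && BYTE == 2 && BREAK_DEP == 1
        "xorl %k[v0],%k[v0]\n\t"
#elif COMPUTE_UOP
        "incl %k[v3]\n\t"
#endif

        "incl %k[v0]\n\t"
        "incl %k[v1]\n\t"
        "incl %k[v2]\n\t"

        "decl %k[loop_cnt]\n\t"
        "jnz 1b\n\t"
        "lfence\n\t"
        : [ v0 ] "=&r"(v0),[ v1 ] "=&r"(v1),[ v2 ] "=&r"(v2),
#if COMPUTE_UOP
          [ v3 ] "=&r"(v3),
#endif
          [ loop_cnt ] "=&r"(loop_cnt)
        : [ N ] "i"(N)
        : "cc","memory");
    end = _rdtsc();

    double dif = end - start;
    dif /= N;
    printf("UOP         -> %d\n",UOP);
    printf("BYTE        -> %d\n",BYTE);
    printf("NOP         -> %d\n",NOP);
    printf("BREAK_DEP   -> %d\n",BREAK_DEP);
    printf("COMPUTE_UOP -> %d\n",COMPUTE_UOP);
    printf("%.3lf \"Cycles\"\n",dif);
}

int
main(int argc,char ** argv) {
    bench();
}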

Output of the run script (fixed):

UOP         -> 1
BYTE        -> 1
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.391 "Cycles"
2,420,617,801      idq_uops_not_delivered.cycles_fe_was_ok                                   
    5,840,894      uops_issued.stall_cycles                                    
--------------------------------------------------------------

UOP         -> 1
BYTE        -> 2
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.389 "Cycles"
2,419,599,257      idq_uops_not_delivered.cycles_fe_was_ok                                   
    4,791,034      uops_issued.stall_cycles                                    
--------------------------------------------------------------

UOP         -> 1
BYTE        -> 4
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.391 "Cycles"
2,411,711      idq_uops_not_delivered.cycles_fe_was_ok                                   
    5,915,776      uops_issued.stall_cycles                                    
--------------------------------------------------------------

UOP         -> 2
BYTE        -> 2
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.554 "Cycles"
3,032,334      idq_uops_not_delivered.cycles_fe_was_ok                                   
  215,743      uops_issued.stall_cycles                                    
--------------------------------------------------------------

UOP         -> 2
BYTE        -> 4
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.555 "Cycles"
3,735      idq_uops_not_delivered.cycles_fe_was_ok                                   
  214,953,593      uops_issued.stall_cycles                                    
--------------------------------------------------------------

UOP         -> 4
BYTE        -> 4
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.683 "Cycles"
3,629,685,924      idq_uops_not_delivered.cycles_fe_was_ok                                   
    7,883,534      uops_issued.stall_cycles                                    
--------------------------------------------------------------

UOP         -> 1
BYTE        -> 2
NOP         -> 0
BREAK_DEP   -> 1
COMPUTE_UOP -> 0
0.395 "Cycles"
2,440,570      idq_uops_not_delivered.cycles_fe_was_ok                                   
   26,095,530      uops_issued.stall_cycles                                    
--------------------------------------------------------------

UOP         -> 2
BYTE        -> 4
NOP         -> 0
BREAK_DEP   -> 1
COMPUTE_UOP -> 0
0.520 "Cycles"
2,821,992,876      idq_uops_not_delivered.cycles_fe_was_ok                                   
    4,762,782      uops_issued.stall_cycles                                    
--------------------------------------------------------------

UOP         -> 1
BYTE        -> 2
NOP         -> 0
BREAK_DEP   -> 0
COMPUTE_UOP -> 1
0.624 "Cycles"
3,864,366,562      idq_uops_not_delivered.cycles_fe_was_ok                                   
1,450,508,248      uops_issued.stall_cycles                                    
--------------------------------------------------------------

UOP         -> 0
BYTE        -> 0
NOP         -> 0
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.539 "Cycles"
2,947,391,859      idq_uops_not_delivered.cycles_fe_was_ok                                   
1,341,303,591      uops_issued.stall_cycles   
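The "Cycles" figures above come from _rdtsc, which counts at a fixed reference frequency rather than at the current core clock, so they have to be rescaled before they can be read as core cycles per iteration. The sketch below shows the conversion. Both frequencies are assumptions chosen for illustration (the post states neither the TSC frequency nor the clock the core actually ran at), and `core_cycles_per_iter` is a hypothetical helper.

```python
def core_cycles_per_iter(ref_cycles_per_iter: float, tsc_hz: float, core_hz: float) -> float:
    """Rescale TSC reference ticks per iteration to core clock cycles per iteration."""
    return ref_cycles_per_iter * core_hz / tsc_hz


ASSUMED_TSC_HZ = 1.5e9   # assumption: TSC frequency, not given in the post
ASSUMED_CORE_HZ = 3.9e9  # assumption: sustained turbo clock, not given in the post

# (uops per iteration, reported ref-cycles per iteration), copied from the output above
cases = [(5, 0.391), (4, 0.539)]

for uops, ref_cycles in cases:
    core = core_cycles_per_iter(ref_cycles, ASSUMED_TSC_HZ, ASSUMED_CORE_HZ)
    print(f"{uops}-uop loop: {core:.2f} core cycles/iter, {uops / core:.2f} uops/cycle")
```

Under these assumed clocks the 5-uop loop would run at about one iteration per core cycle and the plain 4-uop loop at about 1.4 cycles per iteration. The ratio between the two cases does not depend on the assumed frequencies.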

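Run script that produced the output above:

import os
import sys

fname = "test-nop"
if (len(sys.argv) > 1):
    fname = sys.argv[1]

build_cmd = "g++ -DUOP={} -DBYTE={} -DNOP={} -DBREAK_DEP={} -DCOMPUTE_UOP={} -O3 -std=c++17 -march=native -mtune=native " + fname + ".cc -o " + fname
run_cmd = "perf stat -e idq_uops_not_delivered.cycles_fe_was_ok -e uops_issued.stall_cycles ./{}"

zero_one = [0, 1]
uop = [1, 2, 4]
byte = [1, 2, 4]
nop = [1]
break_dep = [1]
compute_uop = [1]

# NOP padding: every (uop, byte) combination with at least one byte per uop
for n in nop:
    for u in uop:
        for b in byte:
            if b < u:
                continue
            os.system(build_cmd.format(u, b, n, 0, 0))
            os.system(run_cmd.format(fname))

# xor-zeroing padding: each xorl is 1 uop and 2 bytes
for bd in break_dep:
    for u in uop:
        for b in byte:
            if b != 2 * u:
                continue
            if b < u:
                continue
            os.system(build_cmd.format(u, b, 0, bd, 0))
            os.system(run_cmd.format(fname))

# One extra incl that does execute in the back end
os.system(build_cmd.format(1, 2, 0, 0, 1))
os.system(run_cmd.format(fname))

# Baseline: no extra instruction
os.system(build_cmd.format(0, 0, 0, 0, 0))
os.system(run_cmd.format(fname))

Edit: Interesting thing. It seems that for the 5-uop loop with a nop to outperform the 4-uop loop, placement matters. The zero-idiom xorl, however, always improves performance (in the numbers below, the one exception is the slot directly before decl, at 0.543). Below are the numbers for 4 cases of the 5-uop loop, with the xorl / nop interleaved at different points. The nop version only improves when it is the first instruction, whereas the xorl version improves almost everywhere. This is a bit odd, considering that the nop does help in the first case. The only thing I can think of is that placement might affect where things land in the uop cache or the LSD buffer?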

Numbers:

################################################################
<nop,xorl,etc...>
incl
incl
incl
decl
jnz

UOP         -> 1
BYTE        -> 1
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.389 "Cycles"
2,418,941,957      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 2
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.389 "Cycles"
2,490,126      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 4
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.390 "Cycles"
2,125,302      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 2
BYTE        -> 2
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.553 "Cycles"
3,033,520,044      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 2
NOP         -> 0
BREAK_DEP   -> 1
COMPUTE_UOP -> 0
0.394 "Cycles"
2,442,515,834      idq_uops_not_delivered.cycles_fe_was_ok

################################################################
incl
<nop,etc...>
incl
incl
decl
jnz

UOP         -> 1
BYTE        -> 1
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.566 "Cycles"
3,390,955,219      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 2
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.563 "Cycles"
3,373,556,409      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 4
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.565 "Cycles"
3,380,145,525      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 2
NOP         -> 0
BREAK_DEP   -> 1
COMPUTE_UOP -> 0
0.391 "Cycles"
2,428,978,799      idq_uops_not_delivered.cycles_fe_was_ok

################################################################
incl
incl
<nop,etc...>
incl
decl
jnz

UOP         -> 1
BYTE        -> 1
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.564 "Cycles"
3,377,709,071      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 2
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.563 "Cycles"
3,494,813      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 4
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.564 "Cycles"
3,019,951      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 2
NOP         -> 0
BREAK_DEP   -> 1
COMPUTE_UOP -> 0
0.389 "Cycles"
2,319,618      idq_uops_not_delivered.cycles_fe_was_ok

################################################################
incl
incl
incl
<nop,etc...>
decl
jnz

UOP         -> 1
BYTE        -> 1
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.556 "Cycles"
3,329,607,623      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 2
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.559 "Cycles"
3,340,246,297      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 4
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.553 "Cycles"
3,254,092      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 2
NOP         -> 0
BREAK_DEP   -> 1
COMPUTE_UOP -> 0
0.543 "Cycles"
3,279,214,443      idq_uops_not_delivered.cycles_fe_was_ok

Edit: Trial data with 4 independent incl instructions in the loop, making it a 6-uop loop with the nop or a 5-uop loop without it. I was able to see a measurable and reproducible (though more modest) performance improvement when adding a 6th uop in the following cases: if the 6th uop is a nop (1, 2, or 4 bytes), it must sit between the first and second incl; if the 6th uop is a zero-idiom xor, it can go anywhere. Here are the results with the 6th instruction between the 1st and 2nd incl:

The loop looks like:

incl
<6th instruction>
incl
incl
incl
decl
jnz

Times:

UOP         -> 1
BYTE        -> 1
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.603 "Cycles"
3,242,400,541      idq_uops_not_delivered.cycles_fe_was_ok                                   

UOP         -> 1
BYTE        -> 2
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.604 "Cycles"
3,244,473,075      idq_uops_not_delivered.cycles_fe_was_ok                                   

UOP         -> 1
BYTE        -> 4
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.601 "Cycles"
3,239,305,874      idq_uops_not_delivered.cycles_fe_was_ok                                   

UOP         -> 2
BYTE        -> 2
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.641 "Cycles"
3,330,250      idq_uops_not_delivered.cycles_fe_was_ok                                   

UOP         -> 2
BYTE        -> 4
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.649 "Cycles"
3,334,019      idq_uops_not_delivered.cycles_fe_was_ok                                   

UOP         -> 4
BYTE        -> 4
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.788 "Cycles"
3,989,749,825      idq_uops_not_delivered.cycles_fe_was_ok                                   

UOP         -> 1
BYTE        -> 2
NOP         -> 0
BREAK_DEP   -> 1
COMPUTE_UOP -> 0
0.551 "Cycles"
2,893,829,059      idq_uops_not_delivered.cycles_fe_was_ok                                   

UOP         -> 2
BYTE        -> 4
NOP         -> 0
BREAK_DEP   -> 1
COMPUTE_UOP -> 0
0.604 "Cycles"
3,007,481,786      idq_uops_not_delivered.cycles_fe_was_ok                                   

UOP         -> 0
BYTE        -> 0
NOP         -> 0
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.620 "Cycles"
3,755,030,033      idq_uops_not_delivered.cycles_fe_was_ok                                   
