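Why does a NOP as the 5th uop speed up a 4-uop loop on Ice Lake?
All benchmarks were run on Icelake: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz (ark)
Edit: I was unable to reproduce this on Broadwell, and @PeterCordes was unable to reproduce it on Skylake.
I was trying to benchmark different ways of computing an integer min(a,b) and ran into some behavior I cannot explain, which I have boiled down to the following benchmark: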
#define BENCH_FUNC_ATTR __attribute__((aligned(64),noinline,noclone))
#define SIX_BYTES_COMPUTATION 1
#define WITH_NOP_BEFORE_DECL 0
#define BREAK_DEPENDENCY 0
void BENCH_FUNC_ATTR
bench() {
uint64_t start,end;
const uint64_t N = 1000000;
start = _rdtsc();
uint64_t v0,v1,dst,loop_cnt;
asm volatile(
"xorl %k[v0],%k[v0]\n\t"
"movl $1,%k[v1]\n\t"
"movl %[N],%k[loop_cnt]\n\t"
".p2align 6\n\t"
"1:\n\t"
#if SIX_BYTES_COMPUTATION
"xorl %k[loop_cnt],%k[v0]\n\t"
"xorl %k[loop_cnt],%k[v1]\n\t"
"movl %k[v0],%k[dst]\n\t"
#else
"nop\n\t"
"nop\n\t"
"nop\n\t"
"nop\n\t"
"nop\n\t"
"nop\n\t"
#endif
".p2align 4\n\t"
#if WITH_NOP_BEFORE_DECL
"nop\n\t"
#endif
#if BREAK_DEPENDENCY
"xorl %k[v0],%k[v0]\n\t"
"xorl %k[v1],%k[v1]\n\t"
#endif
// macro-fusion is NOT broken
"decl %k[loop_cnt]\n\t"
"jnz 1b\n\t"
: [ v0 ] "=&r"(v0),[ v1 ] "=&r"(v1),[ dst ] "=&r"(dst),[ loop_cnt ] "=&r"(loop_cnt)
: [ N ] "i"(N)
: "cc","memory");
end = _rdtsc();
double dif = end - start;
dif /= N;
printf(
"SIX_BYTES_COMPUTATION - [%s],WITH_NOP_BEFORE_DECL - [%s],"
"BREAK_DEPENDENCY - [%s]\n\t",SIX_BYTES_COMPUTATION ? "ON" : "OFF",WITH_NOP_BEFORE_DECL ? "ON" : "OFF",BREAK_DEPENDENCY ? "ON" : "OFF");
printf("%.3lf \"Cycles\"\n",dif);
}
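Turning on WITH_NOP_BEFORE_DECL, so that there is a nop before the decl + jnz, gives a measurable performance improvement when SIX_BYTES_COMPUTATION is on, but a measurable performance degradation when SIX_BYTES_COMPUTATION is off.
Here are the numbers: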
SIX_BYTES_COMPUTATION - [OFF],WITH_NOP_BEFORE_DECL - [OFF],BREAK_DEPENDENCY - [OFF]
2.080 "Cycles" <--- Just 6 nops
SIX_BYTES_COMPUTATION - [OFF],WITH_NOP_BEFORE_DECL - [ON],BREAK_DEPENDENCY - [OFF]
2.363 "Cycles" <--- Performance degradation from previous
SIX_BYTES_COMPUTATION - [ON],WITH_NOP_BEFORE_DECL - [OFF],BREAK_DEPENDENCY - [OFF]
2.185 "Cycles" <--- Computation then decl + jnz
SIX_BYTES_COMPUTATION - [ON],WITH_NOP_BEFORE_DECL - [ON],BREAK_DEPENDENCY - [OFF]
1.945 "Cycles" <--- Performance improvement from previous
SIX_BYTES_COMPUTATION - [ON],WITH_NOP_BEFORE_DECL - [OFF],BREAK_DEPENDENCY - [ON]
1.919 "Cycles" <--- Breaking dependencies has best performance
SIX_BYTES_COMPUTATION - [ON],WITH_NOP_BEFORE_DECL - [ON],BREAK_DEPENDENCY - [ON]
2.046 "Cycles" <--- nop hurts performance when breaking dependencies
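As a quick way to read the numbers above, the effect of the nop can be expressed as a relative change per configuration. This is only a sketch: the values are copied from the timings above, and `relative_change` and `TIMINGS` are hypothetical names used for illustration, not part of the benchmark.

```python
def relative_change(before, after):
    """Fractional change going from `before` to `after` (negative means faster)."""
    return (after - before) / before


# (cycles without the nop, cycles with the nop), copied from the timings above
TIMINGS = {
    "six nops": (2.080, 2.363),
    "computation": (2.185, 1.945),
    "computation + broken dependency": (1.919, 2.046),
}

if __name__ == "__main__":
    for name, (without_nop, with_nop) in TIMINGS.items():
        print(f"{name}: {relative_change(without_nop, with_nop):+.1%}")
```

Only the plain computation case gets faster with the nop; the other two get slower.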
Could this be related to the register file filling up? I found a potentially interesting metric, uops_issued.stall_cycles [Cycles when RAT does not issue Uops to RS for the thread], which gives the following output:
SIX_BYTES_COMPUTATION - [OFF],BREAK_DEPENDENCY - [OFF]
473,647 uops_issued.stall_cycles
SIX_BYTES_COMPUTATION - [OFF],BREAK_DEPENDENCY - [OFF]
495,380 uops_issued.stall_cycles
SIX_BYTES_COMPUTATION - [ON],BREAK_DEPENDENCY - [OFF]
1,406,244 uops_issued.stall_cycles
SIX_BYTES_COMPUTATION - [ON],BREAK_DEPENDENCY - [OFF]
875,364 uops_issued.stall_cycles
SIX_BYTES_COMPUTATION - [ON],BREAK_DEPENDENCY - [ON]
647,297 uops_issued.stall_cycles
SIX_BYTES_COMPUTATION - [ON],BREAK_DEPENDENCY - [ON]
501,015 uops_issued.stall_cycles
It seems to correlate with SIX_BYTES_COMPUTATION being on and WITH_NOP_BEFORE_DECL being on vs. off, but I am not sure why a nop would save space in the register file.
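I am fairly sure it is not an alignment issue, because the first 6 bytes of the loop body and the decl + jnz, with the .p2align 4 between them, will be in different 16-byte aligned regions with or without the nop, and the performance difference depends on what is in the loop body (if it were an alignment effect, it should not matter whether the loop body is nops or computation).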
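The alignment argument can be checked with a little offset arithmetic. The sketch below lays out the loop from the 64-byte-aligned label and reports which 16-byte block each instruction starts in. The instruction sizes are assumptions (2-byte encodings for `xorl`, `movl`, `decl` and `jnz`, which holds when the compiler picks registers that need no REX prefix, and a 1-byte `nop`), and `loop_layout` is a hypothetical helper, not part of the benchmark.

```python
def align_up(offset, alignment):
    """Round `offset` up to the next multiple of `alignment`."""
    return (offset + alignment - 1) // alignment * alignment


def loop_layout(with_nop):
    """Return (name, offset, 16-byte block index) for each loop instruction.

    Offsets are relative to the loop label, which sits on a 64-byte boundary.
    Sizes are assumed encodings, not taken from a disassembly.
    """
    body = [("xorl", 2), ("xorl", 2), ("movl", 2)]  # the 6 bytes of computation
    tail = ([("nop", 1)] if with_nop else []) + [("decl", 2), ("jnz", 2)]

    placed, offset = [], 0
    for name, size in body:
        placed.append((name, offset, offset // 16))
        offset += size
    offset = align_up(offset, 16)  # effect of ".p2align 4"
    for name, size in tail:
        placed.append((name, offset, offset // 16))
        offset += size
    return placed


if __name__ == "__main__":
    for with_nop in (False, True):
        print("with nop" if with_nop else "without nop", loop_layout(with_nop))
```

Under these assumptions the body stays in block 0 and decl + jnz stay in block 1 in both variants; the nop only shifts decl from offset 16 to offset 17.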
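I thought it might be related to some dependency chain issue, but if I break the dependencies on v0 and v1 at the end of the loop, then turning WITH_NOP_BEFORE_DECL on causes a performance degradation. I may well be wrong, since I do not know why a nop before the end of the loop would affect any dependency issue.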
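It is almost certainly not related to port scheduling. I wondered whether something odd was going on where the nop happened to cause better scheduling, but the uop distribution across ports 0, 1, 5 and 6 is no different with or without the nop: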
Uops dispatched per port with SIX_BYTES_COMPUTATION on and WITH_NOP_BEFORE_DECL off:
SIX_BYTES_COMPUTATION - [ON],WITH_NOP_BEFORE_DECL - [OFF]
1,147,196 uops_dispatched.port_0
1,114,665 uops_dispatched.port_1
1,138,238 uops_dispatched.port_5
1,266,212 uops_dispatched.port_6
Uops dispatched per port with SIX_BYTES_COMPUTATION on and WITH_NOP_BEFORE_DECL on:
SIX_BYTES_COMPUTATION - [ON],WITH_NOP_BEFORE_DECL - [ON]
1,177,092 uops_dispatched.port_0
1,081,734 uops_dispatched.port_1
1,103,314 uops_dispatched.port_5
1,296,546 uops_dispatched.port_6
My main theory is that there is some inefficiency in the register renaming process that limits performance without the nop, and that the nop happens to hide it, but I am not at all confident in this.
Can anyone help me understand this behavior?
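To put a number on the port-scheduling observation, the per-port counts above can be converted into shares of the total and compared. This is a minimal sketch using the counts exactly as reported; `port_shares` and `max_share_shift` are hypothetical helper names.

```python
PORTS = ("port_0", "port_1", "port_5", "port_6")

# uops_dispatched.* counts copied from the two runs above
WITHOUT_NOP = dict(zip(PORTS, (1_147_196, 1_114_665, 1_138_238, 1_266_212)))
WITH_NOP = dict(zip(PORTS, (1_177_092, 1_081_734, 1_103_314, 1_296_546)))


def port_shares(counts):
    """Fraction of all dispatched uops that went to each port."""
    total = sum(counts.values())
    return {port: n / total for port, n in counts.items()}


def max_share_shift(a, b):
    """Largest absolute difference in any port's share between two runs."""
    shares_a, shares_b = port_shares(a), port_shares(b)
    return max(abs(shares_a[p] - shares_b[p]) for p in shares_a)


if __name__ == "__main__":
    for label, counts in (("without nop", WITHOUT_NOP), ("with nop", WITH_NOP)):
        print(label, {p: f"{s:.1%}" for p, s in port_shares(counts).items()})
    print(f"largest shift: {max_share_shift(WITHOUT_NOP, WITH_NOP):.2%}")
```

With these counts no port's share moves by as much as one percentage point between the two runs.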
Edit: Full cpp code and new timings, including a warmup and an lfence before rdtsc.
New code:
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <x86intrin.h>
#include <type_traits>
#define BENCH_FUNC_ATTR __attribute__((aligned(64),noclone))
#ifndef SIX_BYTES_COMPUTATION
#define SIX_BYTES_COMPUTATION 0
#endif
#ifndef WITH_NOP_BEFORE_DECL
#define WITH_NOP_BEFORE_DECL 0
#endif
#ifndef BREAK_DEPENDENCY
#define BREAK_DEPENDENCY 0
#endif
void BENCH_FUNC_ATTR
bench() {
uint64_t start,end;
const uint64_t N = (1UL << 24);
const uint64_t WARMUP_N = N << 3;
uint64_t v0,loop_cnt;
asm volatile(
"xorl %k[v0],%k[loop_cnt]\n\t"
".p2align 6\n\t"
"1:\n\t"
"xorl %k[loop_cnt],%k[dst]\n\t"
".p2align 4\n\t"
"decl %k[loop_cnt]\n\t"
"jnz 1b\n\t"
: [ v0 ] "=&r"(v0),[ loop_cnt ] "=&r"(loop_cnt)
: [ N ] "i"(WARMUP_N)
: "cc","memory");
asm volatile("lfence\n\t" : : : "memory");
start = _rdtsc();
asm volatile(
"xorl %k[v0],%k[loop_cnt]\n\t"
"lfence\n\t"
".p2align 6\n\t"
"1:\n\t"
#if SIX_BYTES_COMPUTATION
"xorl %k[loop_cnt],%k[v1]\n\t"
#endif
"decl %k[loop_cnt]\n\t"
"jnz 1b\n\t"
"lfence\n\t"
: [ v0 ] "=&r"(v0),dif);
}
int
main(int argc,char ** argv) {
bench();
}
New timings. The trend is the same as before, only they are all a lot faster:
SIX_BYTES_COMPUTATION - [OFF],BREAK_DEPENDENCY - [OFF]
0.674 "Cycles"
SIX_BYTES_COMPUTATION - [OFF],BREAK_DEPENDENCY - [OFF]
0.799 "Cycles"
SIX_BYTES_COMPUTATION - [ON],BREAK_DEPENDENCY - [OFF]
0.747 "Cycles"
SIX_BYTES_COMPUTATION - [ON],BREAK_DEPENDENCY - [OFF]
0.650 "Cycles"
SIX_BYTES_COMPUTATION - [ON],BREAK_DEPENDENCY - [OFF]
0.727 "Cycles"
SIX_BYTES_COMPUTATION - [ON],BREAK_DEPENDENCY - [ON]
0.645 "Cycles"
Edit: Icelake perf data:
SIX_BYTES_COMPUTATION - [OFF],BREAK_DEPENDENCY - [OFF]
0.681 "Cycles"
18,522,385,353 lsd.uops
1,038,665 idq.dsb_uops
4,270,402,172 cpu-cycles
SIX_BYTES_COMPUTATION - [OFF],BREAK_DEPENDENCY - [OFF]
0.778 "Cycles"
20,669,567,680 lsd.uops
1,049,193 idq.dsb_uops
4,807,261,565 cpu-cycles
SIX_BYTES_COMPUTATION - [ON],BREAK_DEPENDENCY - [OFF]
0.734 "Cycles"
12,080,048,840 lsd.uops
1,035,128 idq.dsb_uops
4,552,666,461 cpu-cycles
SIX_BYTES_COMPUTATION - [ON],BREAK_DEPENDENCY - [OFF]
0.659 "Cycles"
14,232,154,418 lsd.uops
1,150,777 idq.dsb_uops
4,134,501 cpu-cycles
SIX_BYTES_COMPUTATION - [ON],BREAK_DEPENDENCY - [OFF]
0.735 "Cycles"
12,166,963 lsd.uops
982,311 idq.dsb_uops
4,553,457,015 cpu-cycles
SIX_BYTES_COMPUTATION - [ON],BREAK_DEPENDENCY - [ON]
0.644 "Cycles"
16,374,872,770 lsd.uops
1,022,379 idq.dsb_uops
4,055,306,960 cpu-cycles
Edit: I am now sure this has nothing to do with dependency chains or instruction bytes. Adding a nop (a uop that never executes in the backend) in certain places really does improve performance. Here is a benchmark that I think shows this quite clearly.
Summary of the results: if the padding is 1 uop that does not execute in the backend, performance is ~0.39 ref-cycles per iteration for the 5-uop loop (5 being the ICL frontend width). With no NOP or xor-zeroing padding, it is ~0.54 ref-cycles per iteration for the 4-uop loop.
The benchmark:
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <x86intrin.h>
#include <type_traits>
#ifndef UOP
#define UOP 0
#endif
#ifndef BYTE
#define BYTE 0
#endif
#ifndef NOP
#define NOP 0
#endif
#ifndef BREAK_DEP
#define BREAK_DEP 0
#endif
#ifndef COMPUTE_UOP
#define COMPUTE_UOP 0
#endif
#if BREAK_DEP && NOP
#error "Either define NOP or BREAK_DEP"
#endif
#define BENCH_FUNC_ATTR __attribute__((aligned(64),noclone))
void BENCH_FUNC_ATTR
bench() {
uint64_t start,end;
const uint64_t N = (1UL << 31);
const uint64_t WARMUP_N = N >> 3;
register uint64_t v0 asm("rdi");
register uint64_t v1 asm("rsi");
register uint64_t v2 asm("rdx");
#if COMPUTE_UOP
register uint64_t v3 asm("rax");
#endif
register uint64_t loop_cnt asm("rcx");
asm volatile(
"xorl %k[v0],%k[v0]\n\t"
"xorl %k[v1],%k[v1]\n\t"
"xorl %k[v2],%k[v2]\n\t"
#if COMPUTE_UOP
"xorl %k[v3],%k[v3]\n\t"
#endif
"movl %[N],%k[loop_cnt]\n\t"
"lfence\n\t"
".p2align 6\n\t"
"1:\n\t"
#if UOP == 1 && BYTE == 1 && NOP == 1
"nop\n\t"
#elif UOP == 1 && BYTE == 2 && NOP == 1
"xchg %%ax,%%ax\n\t"
#elif UOP == 1 && BYTE == 4 && NOP == 1
"nopl 0x0(%%rax)\n\t"
#elif UOP == 2 && BYTE == 2 && NOP == 1
"nop\n\t"
"nop\n\t"
#elif UOP == 2 && BYTE == 4 && NOP == 1
"xchg %%ax,%%ax\n\t"
"xchg %%ax,%%ax\n\t"
#elif UOP == 4 && BYTE == 4 && NOP == 1
"nop\n\t"
"nop\n\t"
"nop\n\t"
"nop\n\t"
#elif UOP == 2 && BYTE == 4 && BREAK_DEP == 1
"xorl %k[v0],%k[v0]\n\t"
"xorl %k[v1],%k[v1]\n\t"
#elif UOP == 1 && BYTE == 2 && BREAK_DEP == 1
"xorl %k[v0],%k[v0]\n\t"
#elif COMPUTE_UOP
"incl %k[v3]\n\t"
#endif
"incl %k[v0]\n\t"
"incl %k[v1]\n\t"
"incl %k[v2]\n\t"
"decl %k[loop_cnt]\n\t"
"jnz 1b\n\t"
"lfence\n\t"
: [ v0 ] "=&r"(v0),[ v1 ] "=&r"(v1),[ v2 ] "=&r"(v2),
#if COMPUTE_UOP
[ v3 ] "=&r"(v3),
#endif
[ loop_cnt ] "=&r"(loop_cnt)
: [ N ] "i"(WARMUP_N)
: "cc","memory");
start = _rdtsc();
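asm volatile(
// timed loop: same body as the warmup loop above, run N times
"xorl %k[v0],%k[v0]\n\t"
"xorl %k[v1],%k[v1]\n\t"
"xorl %k[v2],%k[v2]\n\t"
#if COMPUTE_UOP
"xorl %k[v3],%k[v3]\n\t"
#endif
"movl %[N],%k[loop_cnt]\n\t"
"lfence\n\t"
".p2align 6\n\t"
"1:\n\t"
#if UOP == 1 && BYTE == 1 && NOP == 1
"nop\n\t"
#elif UOP == 1 && BYTE == 2 && NOP == 1
"xchg %%ax,%%ax\n\t"
#elif UOP == 1 && BYTE == 4 && NOP == 1
"nopl 0x0(%%rax)\n\t"
#elif UOP == 2 && BYTE == 2 && NOP == 1
"nop\n\t"
"nop\n\t"
#elif UOP == 2 && BYTE == 4 && NOP == 1
"xchg %%ax,%%ax\n\t"
"xchg %%ax,%%ax\n\t"
#elif UOP == 4 && BYTE == 4 && NOP == 1
"nop\n\t"
"nop\n\t"
"nop\n\t"
"nop\n\t"
#elif UOP == 2 && BYTE == 4 && BREAK_DEP == 1
"xorl %k[v0],%k[v0]\n\t"
"xorl %k[v1],%k[v1]\n\t"
#elif UOP == 1 && BYTE == 2 && BREAK_DEP == 1
"xorl %k[v0],%k[v0]\n\t"
#elif COMPUTE_UOP
"incl %k[v3]\n\t"
#endif
"incl %k[v0]\n\t"
"incl %k[v1]\n\t"
"incl %k[v2]\n\t"
"decl %k[loop_cnt]\n\t"
"jnz 1b\n\t"
"lfence\n\t"
: [ v0 ] "=&r"(v0),[ v1 ] "=&r"(v1),[ v2 ] "=&r"(v2),
#if COMPUTE_UOP
[ v3 ] "=&r"(v3),
#endif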
[ loop_cnt ] "=&r"(loop_cnt)
: [ N ] "i"(N)
: "cc","memory");
end = _rdtsc();
double dif = end - start;
dif /= N;
printf("UOP -> %d\n",UOP);
printf("BYTE -> %d\n",BYTE);
printf("NOP -> %d\n",NOP);
printf("BREAK_DEP -> %d\n",BREAK_DEP);
printf("COMPUTE_UOP -> %d\n",COMPUTE_UOP);
printf("%.3lf \"Cycles\"\n",dif);
}
int
main(int argc,char ** argv) {
bench();
}
Results (produced by the fixed run script listed after them):
UOP -> 1
BYTE -> 1
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.391 "Cycles"
2,420,617,801 idq_uops_not_delivered.cycles_fe_was_ok
5,840,894 uops_issued.stall_cycles
--------------------------------------------------------------
UOP -> 1
BYTE -> 2
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.389 "Cycles"
2,419,599,257 idq_uops_not_delivered.cycles_fe_was_ok
4,791,034 uops_issued.stall_cycles
--------------------------------------------------------------
UOP -> 1
BYTE -> 4
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.391 "Cycles"
2,411,711 idq_uops_not_delivered.cycles_fe_was_ok
5,915,776 uops_issued.stall_cycles
--------------------------------------------------------------
UOP -> 2
BYTE -> 2
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.554 "Cycles"
3,032,334 idq_uops_not_delivered.cycles_fe_was_ok
215,743 uops_issued.stall_cycles
--------------------------------------------------------------
UOP -> 2
BYTE -> 4
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.555 "Cycles"
3,735 idq_uops_not_delivered.cycles_fe_was_ok
214,953,593 uops_issued.stall_cycles
--------------------------------------------------------------
UOP -> 4
BYTE -> 4
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.683 "Cycles"
3,629,685,924 idq_uops_not_delivered.cycles_fe_was_ok
7,883,534 uops_issued.stall_cycles
--------------------------------------------------------------
UOP -> 1
BYTE -> 2
NOP -> 0
BREAK_DEP -> 1
COMPUTE_UOP -> 0
0.395 "Cycles"
2,440,570 idq_uops_not_delivered.cycles_fe_was_ok
26,095,530 uops_issued.stall_cycles
--------------------------------------------------------------
UOP -> 2
BYTE -> 4
NOP -> 0
BREAK_DEP -> 1
COMPUTE_UOP -> 0
0.520 "Cycles"
2,821,992,876 idq_uops_not_delivered.cycles_fe_was_ok
4,762,782 uops_issued.stall_cycles
--------------------------------------------------------------
UOP -> 1
BYTE -> 2
NOP -> 0
BREAK_DEP -> 0
COMPUTE_UOP -> 1
0.624 "Cycles"
3,864,366,562 idq_uops_not_delivered.cycles_fe_was_ok
1,450,508,248 uops_issued.stall_cycles
--------------------------------------------------------------
UOP -> 0
BYTE -> 0
NOP -> 0
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.539 "Cycles"
2,947,391,859 idq_uops_not_delivered.cycles_fe_was_ok
1,341,303,591 uops_issued.stall_cycles
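Comparing this many runs by eye is error-prone, so here is a minimal sketch of how output in the format above could be parsed into records. It assumes each run prints its `NAME -> value` lines followed by a `"Cycles"` line, as in the results above; `parse_runs` is a hypothetical helper, and the perf counter lines are simply skipped.

```python
import re

SAMPLE = '''UOP -> 1
BYTE -> 1
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.391 "Cycles"
2,420,617,801 idq_uops_not_delivered.cycles_fe_was_ok
--------------------------------------------------------------
UOP -> 0
BYTE -> 0
NOP -> 0
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.539 "Cycles"
'''


def parse_runs(text):
    """Turn benchmark output into a list of dicts, one per run."""
    runs, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        setting = re.match(r'(\w+) -> (\d+)$', line)
        if setting:
            current[setting.group(1)] = int(setting.group(2))
            continue
        cycles = re.match(r'([\d.]+) "Cycles"$', line)
        if cycles:
            # the "Cycles" line closes the current run
            current["cycles"] = float(cycles.group(1))
            runs.append(current)
            current = {}
    return runs


if __name__ == "__main__":
    for run in parse_runs(SAMPLE):
        print(run)
```

Each record can then be compared against the unpadded run (all settings 0) to see which paddings beat the plain 4-uop loop.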
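Run script (fixed):
import os
import sys

fname = "test-nop"
if (len(sys.argv) > 1):
    fname = sys.argv[1]

# the five {} slots are UOP, BYTE, NOP, BREAK_DEP, COMPUTE_UOP, in that order
build_cmd = "g++ -DUOP={} -DBYTE={} -DNOP={} -DBREAK_DEP={} -DCOMPUTE_UOP={} -O3 -std=c++17 -march=native -mtune=native " + fname + ".cc -o " + fname
run_cmd = "perf stat -e idq_uops_not_delivered.cycles_fe_was_ok -e uops_issued.stall_cycles ./{}"

uop = [1, 2, 4]
byte = [1, 2, 4]
nop = [1]
break_dep = [1]

# nop padding: every (uop, byte) pair with at least one byte per uop
for n in nop:
    for u in uop:
        for b in byte:
            if b < u:
                continue
            os.system(build_cmd.format(u, b, n, 0, 0))
            os.system(run_cmd.format(fname))

# xor-zeroing padding: each xorl is 2 bytes, so byte == 2 * uop
for bd in break_dep:
    for u in uop:
        for b in byte:
            if b != 2 * u:
                continue
            os.system(build_cmd.format(u, b, 0, bd, 0))
            os.system(run_cmd.format(fname))

# one extra uop that does execute in the backend (incl)
os.system(build_cmd.format(1, 2, 0, 0, 1))
os.system(run_cmd.format(fname))

# baseline: the plain 4-uop loop
os.system(build_cmd.format(0, 0, 0, 0, 0))
os.system(run_cmd.format(fname))
Edit: An interesting thing. It appears that the placement of the nop matters for the 5-uop loop to outperform the 4-uop loop. The zeroing idiom xorl, however, always improves performance. Below are the numbers for 4 cases of the 5-uop loop, with the nop / xorl inserted at a different point in each. The nop version only improves when it is the first instruction, whereas the xorl version always shows an improvement. This is a bit odd given that the nop does help in the first case. The only thing I can think of is that placement might affect where things end up in the uop cache or the LSD buffer?
Numbers:
################################################################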
<nop,xorl,etc...>
incl
incl
incl
decl
jnz
UOP -> 1
BYTE -> 1
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.389 "Cycles"
2,418,941,957 idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP -> 1
BYTE -> 2
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.389 "Cycles"
2,490,126 idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP -> 1
BYTE -> 4
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.390 "Cycles"
2,125,302 idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP -> 2
BYTE -> 2
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.553 "Cycles"
3,033,520,044 idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP -> 1
BYTE -> 2
NOP -> 0
BREAK_DEP -> 1
COMPUTE_UOP -> 0
0.394 "Cycles"
2,442,515,834 idq_uops_not_delivered.cycles_fe_was_ok
################################################################
incl
<nop,etc...>
incl
incl
decl
jnz
UOP -> 1
BYTE -> 1
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.566 "Cycles"
3,390,955,219 idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP -> 1
BYTE -> 2
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.563 "Cycles"
3,373,556,409 idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP -> 1
BYTE -> 4
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.565 "Cycles"
3,380,145,525 idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP -> 1
BYTE -> 2
NOP -> 0
BREAK_DEP -> 1
COMPUTE_UOP -> 0
0.391 "Cycles"
2,428,978,799 idq_uops_not_delivered.cycles_fe_was_ok
################################################################
incl
incl
<nop,etc...>
incl
decl
jnz
UOP -> 1
BYTE -> 1
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.564 "Cycles"
3,377,709,071 idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP -> 1
BYTE -> 2
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.563 "Cycles"
3,494,813 idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP -> 1
BYTE -> 4
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.564 "Cycles"
3,019,951 idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP -> 1
BYTE -> 2
NOP -> 0
BREAK_DEP -> 1
COMPUTE_UOP -> 0
0.389 "Cycles"
2,319,618 idq_uops_not_delivered.cycles_fe_was_ok
################################################################
incl
incl
incl
<nop,etc...>
decl
jnz
UOP -> 1
BYTE -> 1
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.556 "Cycles"
3,329,607,623 idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP -> 1
BYTE -> 2
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.559 "Cycles"
3,340,246,297 idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP -> 1
BYTE -> 4
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.553 "Cycles"
3,254,092 idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP -> 1
BYTE -> 2
NOP -> 0
BREAK_DEP -> 1
COMPUTE_UOP -> 0
0.543 "Cycles"
3,279,214,443 idq_uops_not_delivered.cycles_fe_was_ok
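The four placement cases above are easier to compare side by side. The sketch below uses the 1-byte nop and the 2-byte xorl timings from each case, and measures them against the 0.539 cycles of the unpadded 4-uop loop from the earlier results; note that this baseline comes from a separate run, so small differences should be treated with care. All names here are hypothetical.

```python
BASELINE = 0.539  # unpadded 4-uop loop (the UOP -> 0 run in the earlier results)

# cycles per iteration by slot: 0 = before the first incl ... 3 = before decl
NOP_1B = [0.389, 0.566, 0.564, 0.556]
XORL = [0.394, 0.391, 0.389, 0.543]


def speedup(cycles, baseline=BASELINE):
    """How many times faster than the baseline a run is (>1 means faster)."""
    return baseline / cycles


def faster_slots(series, baseline=BASELINE):
    """Slots at which the padded loop beats the unpadded baseline."""
    return [slot for slot, cycles in enumerate(series) if cycles < baseline]


if __name__ == "__main__":
    for name, series in (("nop", NOP_1B), ("xorl", XORL)):
        summary = ", ".join(f"slot {i}: {speedup(c):.2f}x" for i, c in enumerate(series))
        print(f"{name}: {summary}")
        print(f"  faster than baseline at slots {faster_slots(series)}")
```

By these numbers the nop only helps in the first slot, while the xorl helps in the first three slots and lands at roughly the baseline (0.543 vs 0.539) when placed directly before decl.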
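Edit: Trial data with 4 independent incl instructions in the loop, making it a 6-uop loop with the nop or a 5-uop loop without it. I was able to see a measurable and reproducible (though more modest) performance improvement from adding a 6th uop in the following cases: if the 6th uop is a nop (1, 2 or 4 bytes), it must be between the first and second incl; if the 6th uop is a zeroing idiom xor, it can be anywhere. Here are the results when the 6th instruction is between the 1st and 2nd incl: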
The loop looks like:
incl
<6th instruction>
incl
incl
incl
decl
jnz
Times:
UOP -> 1
BYTE -> 1
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.603 "Cycles"
3,242,400,541 idq_uops_not_delivered.cycles_fe_was_ok
UOP -> 1
BYTE -> 2
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.604 "Cycles"
3,244,473,075 idq_uops_not_delivered.cycles_fe_was_ok
UOP -> 1
BYTE -> 4
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.601 "Cycles"
3,239,305,874 idq_uops_not_delivered.cycles_fe_was_ok
UOP -> 2
BYTE -> 2
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.641 "Cycles"
3,330,250 idq_uops_not_delivered.cycles_fe_was_ok
UOP -> 2
BYTE -> 4
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.649 "Cycles"
3,334,019 idq_uops_not_delivered.cycles_fe_was_ok
UOP -> 4
BYTE -> 4
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.788 "Cycles"
3,989,749,825 idq_uops_not_delivered.cycles_fe_was_ok
UOP -> 1
BYTE -> 2
NOP -> 0
BREAK_DEP -> 1
COMPUTE_UOP -> 0
0.551 "Cycles"
2,893,829,059 idq_uops_not_delivered.cycles_fe_was_ok
UOP -> 2
BYTE -> 4
NOP -> 0
BREAK_DEP -> 1
COMPUTE_UOP -> 0
0.604 "Cycles"
3,007,481,786 idq_uops_not_delivered.cycles_fe_was_ok
UOP -> 0
BYTE -> 0
NOP -> 0
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.620 "Cycles"
3,755,030,033 idq_uops_not_delivered.cycles_fe_was_ok