Why does the number of cache references reported by perf differ between two ways of doing matrix multiplication?
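We tried two different ways of multiplying three matrices (computing matrix1 × matrix2 and matrix1 × matrix3). The first way computes both products at a time, inside a single loop nest; the second computes them one after another. We would like to know why the number of cache references differs between the two cases. The observations were collected with the following command: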
sudo perf stat -B -e cache-references,cache-misses,cycles,instructions,branches,faults,migrations java Multiplication
This is the code we ran:
import java.util.Random;

public class Multiplication {

    public static void main(String args[]){
        int dim = 500; // Dimension of the Matrices
        Multiplication multiplication = new Multiplication();
        double[][] matrix1 = multiplication.createMatrix(dim);
        double[][] matrix2 = multiplication.createMatrix(dim);
        double[][] matrix3 = multiplication.createMatrix(dim);
        long start = System.currentTimeMillis();
        // multiplication.multiply(matrix1,matrix2,matrix3); // Multiplying at a time
        multiplication.multiply(matrix1,matrix2); // Multiplying one
        multiplication.multiply(matrix1,matrix3); // after another
        long end = System.currentTimeMillis();
        System.out.println((end - start) + " ms");
    }

    public void multiply(double[][] mat1,double[][] mat2) {
        int size = mat1.length;
        double[][] product = new double[size][size];
        for (int i = 0; i < size; i++) {
            for (int j = 0; j < size; j++) {
                for (int k = 0; k < size; k++) {
                    product[i][j] += mat1[i][k] * mat2[k][j];
                }
            }
        }
    }

    public void multiply(double[][] mat1,double[][] mat2,double[][] mat3){
        int size = mat1.length;
        double[][] product1 = new double[size][size];
        double[][] product2 = new double[size][size];
        for (int i = 0; i < size; i++) {
            for (int j = 0; j < size; j++) {
                for (int k = 0; k < size; k++) {
                    product1[i][j] += mat1[i][k] * mat2[k][j];
                    product2[i][j] += mat1[i][k] * mat3[k][j];
                }
            }
        }
    }

    public double[][] createMatrix(int dim) {
        double[][] mat = new double[dim][dim];
        Random r = new Random();
        for (int i = 0; i < dim; i++) {
            for (int j = 0; j < dim; j++) {
                mat[i][j] = r.nextDouble();
            }
        }
        return mat;
    }
}
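Switching between the two cases in the code above means commenting lines in and out and recompiling. A minimal sketch of a variant that picks the dimension and the case from the command line is shown here. The class name MultiplicationModes, the "fused" argument, the fixed random seeds, and the printed checksum are all assumptions added for illustration and are not part of the original program. Returning the products and printing a checksum keeps the results in use, which may change what the JIT compiler does compared with the original.

```java
import java.util.Random;

public class MultiplicationModes {

    // Fills a dim x dim matrix with pseudo-random values. The seed parameter is
    // an assumption added here so that repeated runs see identical inputs.
    static double[][] createMatrix(int dim, long seed) {
        Random r = new Random(seed);
        double[][] mat = new double[dim][dim];
        for (int i = 0; i < dim; i++) {
            for (int j = 0; j < dim; j++) {
                mat[i][j] = r.nextDouble();
            }
        }
        return mat;
    }

    // "One after another" building block: a single product with the same i-j-k loop order.
    static double[][] multiply(double[][] mat1, double[][] mat2) {
        int size = mat1.length;
        double[][] product = new double[size][size];
        for (int i = 0; i < size; i++) {
            for (int j = 0; j < size; j++) {
                for (int k = 0; k < size; k++) {
                    product[i][j] += mat1[i][k] * mat2[k][j];
                }
            }
        }
        return product;
    }

    // "At a time" case: both products accumulated inside one loop nest.
    // Returns {product1, product2}.
    static double[][][] multiplyFused(double[][] mat1, double[][] mat2, double[][] mat3) {
        int size = mat1.length;
        double[][] product1 = new double[size][size];
        double[][] product2 = new double[size][size];
        for (int i = 0; i < size; i++) {
            for (int j = 0; j < size; j++) {
                for (int k = 0; k < size; k++) {
                    product1[i][j] += mat1[i][k] * mat2[k][j];
                    product2[i][j] += mat1[i][k] * mat3[k][j];
                }
            }
        }
        return new double[][][] {product1, product2};
    }

    public static void main(String[] args) {
        // Hypothetical arguments: <dim> [fused]
        int dim = args.length > 0 ? Integer.parseInt(args[0]) : 500;
        boolean fused = args.length > 1 && args[1].equals("fused");

        double[][] matrix1 = createMatrix(dim, 1L);
        double[][] matrix2 = createMatrix(dim, 2L);
        double[][] matrix3 = createMatrix(dim, 3L);

        long start = System.currentTimeMillis();
        double checksum;
        if (fused) {
            double[][][] products = multiplyFused(matrix1, matrix2, matrix3);
            checksum = products[0][0][0] + products[1][0][0];
        } else {
            double[][] product1 = multiply(matrix1, matrix2);
            double[][] product2 = multiply(matrix1, matrix3);
            checksum = product1[0][0] + product2[0][0];
        }
        long end = System.currentTimeMillis();
        System.out.println((end - start) + " ms, checksum " + checksum);
    }
}
```

With this sketch, the same perf command could be run as `java MultiplicationModes 1000 fused` for the "at a time" case and `java MultiplicationModes 1000` for the "one after another" case.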
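Here are the observations:
The observations cover 4 different dimensions (500, 1000, 1500, 2000). In each block, the first line is the dimension, the second line is the elapsed time printed by the program, and the rest is the perf stat output.
At a time: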
500
398 ms
Performance counter stats for 'java Multiplication':
411,915,294 cache-references
1,399,148 cache-misses # 0.340 % of all cache refs
2,042,614,097 cycles
3,437,759,396 instructions # 1.68 insn per cycle
475,939,754 branches
5,627 faults
10 migrations
0.455140444 seconds time elapsed
0.476102000 seconds user
0.011804000 seconds sys
1000
11993 ms
Performance counter stats for 'java Multiplication':
11,704,565,015 cache-references
333,223,106 cache-misses # 2.847 % of all cache refs
52,324,922,900 cycles
23,763,328,245 instructions # 0.45 insn per cycle
3,124,574,763 branches
12,999 faults
9 migrations
12.108233932 seconds time elapsed
12.086936000 seconds user
0.059856000 seconds sys
1500
53069 ms
Performance counter stats for 'java Multiplication':
47,426,797,689 cache-references
1,435,495,676 cache-misses # 3.027 % of all cache refs
229,231,012,094 cycles
78,724,694,438 instructions # 0.34 insn per cycle
10,291,361,842 branches
34,704 faults
28 migrations
53.251270984 seconds time elapsed
53.135376000 seconds user
0.143987000 seconds sys
2000
148669 ms
Performance counter stats for 'java Multiplication':
122,810,341,708 cache-references
3,628,091,933 cache-misses # 2.954 % of all cache refs
626,161,537,985 cycles
185,767,651,022 instructions # 0.30 insn per cycle
24,266,992,254 branches
58,795 faults
127 migrations
148.950934773 seconds time elapsed
149.186738000 seconds user
0.243997000 seconds sys
One after another:
500
388 ms
Performance counter stats for 'java Multiplication':
146,687,848 cache-references
1,556,581 cache-misses # 1.061 % of all cache refs
1,971,147,110 cycles
3,270,586,420 instructions # 1.66 insn per cycle
467,409,752 branches
5,450 faults
13 migrations
0.483391040 seconds time elapsed
0.505959000 seconds user
0.008294000 seconds sys
1000
6470 ms
Performance counter stats for 'java Multiplication':
4,229,871,529 cache-references
18,662,663 cache-misses # 0.441 % of all cache refs
28,984,959,677 cycles
22,751,499,355 instructions # 0.78 insn per cycle
3,123,552,014 branches
12,842 faults
7 migrations
6.579694810 seconds time elapsed
6.600153000 seconds user
0.016010000 seconds sys
1500
48918 ms
Performance counter stats for 'java Multiplication':
38,902,944,785 cache-references
1,166,289,245 cache-misses # 2.998 % of all cache refs
213,672,056,122 cycles
75,416,350,352 instructions # 0.35 insn per cycle
10,306,363,917 branches
34,750 faults
26 migrations
49.121111199 seconds time elapsed
49.172314000 seconds user
0.088022000 seconds sys
2000
120057 ms
Performance counter stats for 'java Multiplication':
97,707,304,381 cache-references
3,208,749,714 cache-misses # 3.284 % of all cache refs
516,080,961,402 cycles
177,793,621,137 instructions # 0.34 insn per cycle
24,272,120,015 branches
45,235 faults
87 migrations
120.368185469 seconds time elapsed
120.439402000 seconds user
0.152049000 seconds sys
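As the observations show, the number of cache references differs considerably between the two cases. To make sure this was not a one-off, we ran both cases several times and got similar results. We have not been able to work out what causes the difference in cache references and would appreciate help with this.
Thanks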
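To put a number on that gap, the sketch below divides the cache-references count of the first set of results by that of the second set, for each dimension. The counts are copied from the perf output above. The class name CacheReferenceRatio and the array names are hypothetical, and the sketch assumes the first set is the "at a time" case and the second set is the "one after another" case.

```java
public class CacheReferenceRatio {

    // Matrix dimensions used in the measurements above.
    static final int[] DIMS = {500, 1000, 1500, 2000};

    // cache-references from the first set of results (assumed: "at a time").
    static final long[] FIRST_SET = {
        411_915_294L, 11_704_565_015L, 47_426_797_689L, 122_810_341_708L
    };

    // cache-references from the second set of results (assumed: "one after another").
    static final long[] SECOND_SET = {
        146_687_848L, 4_229_871_529L, 38_902_944_785L, 97_707_304_381L
    };

    // Ratio of first-set to second-set cache references for one dimension.
    static double ratio(int index) {
        return (double) FIRST_SET[index] / SECOND_SET[index];
    }

    public static void main(String[] args) {
        for (int i = 0; i < DIMS.length; i++) {
            System.out.printf("dim=%d ratio=%.2f%n", DIMS[i], ratio(i));
        }
    }
}
```

With these counts the ratio comes out at roughly 2.8 for dimensions 500 and 1000, and roughly 1.2 to 1.3 for dimensions 1500 and 2000, so the relative gap is widest for the smaller matrices.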