Ant*_*nio 18 java memory hardware performance benchmarking
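My laptop has an Intel Core 2 Duo 2.4 GHz CPU and 2x4 GB DDR3 modules at 1066 MHz.
I expected this memory to run at 1067 MiB/sec and, since there are two channels, at a maximum of 2134 MiB/sec (if the OS memory dispatcher allows it).
I wrote a small Java app to test this: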
private static final int size = 256 * 1024 * 1024; // 256 MiB
private static final byte[] storage = new byte[size];
private static final int s = 1024; // 1 KiB
private static final int duration = 10; // 10 sec

public static void main(String[] args) {
    long start = System.currentTimeMillis();
    Random rnd = new Random();
    byte[] buf1 = new byte[s];
    rnd.nextBytes(buf1);
    long count = 0;
    while (System.currentTimeMillis() - start < duration * 1000) {
        long begin = (long) (rnd.nextDouble() * (size - s));
        System.arraycopy(buf1, 0, storage, (int) begin, s);
        ++count;
    }
    double totalSeconds = (System.currentTimeMillis() - start) / 1000.0;
    double speed = count * s / totalSeconds / 1024 / 1024;
    System.out.println(count * s + " bytes transferred in " + totalSeconds + " secs (" + speed + " MiB/sec)");

    byte[] buf2 = new byte[s];
    count = 0;
    start = System.currentTimeMillis();
    while (System.currentTimeMillis() - start < duration * 1000) {
        long begin = (long) (rnd.nextDouble() * (size - s));
        System.arraycopy(storage, (int) begin, buf2, 0, s);
        Arrays.fill(buf2, (byte) 0);
        ++count;
    }
    totalSeconds = (System.currentTimeMillis() - start) / 1000.0;
    speed = count * s / totalSeconds / 1024 / 1024;
    System.out.println(count * s + " bytes transferred in " + totalSeconds + " secs (" + speed + " MiB/sec)");
}
I expected the result to be below 2134 MiB/sec, but I got the following:
17530212352 bytes transferred in 10.0 secs (1671.811328125 MiB/sec)
31237926912 bytes transferred in 10.0 secs (2979.080859375 MiB/sec)
How is it possible that the speed is almost 3 GiB/sec?

Tur*_*g85 20
There are several things at work here.
First: the formula for the memory transfer rate of DDR3 is
memory clock rate
× 4 (for bus clock multiplier)
× 2 (for data rate)
× 64 (number of bits transferred)
/ 8 (number of bits/byte)
= memory clock rate × 64 (in MB/s)
For DDR3-1066 (which is clocked at 133⅓ MHz), we obtain a theoretical memory bandwidth of 8533⅓ MB/s or 8138.02083333... MiB/s for single channel, and 17066⅔ MB/s or 16276.0416666... MiB/s for dual channel.
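The arithmetic above can be checked with a few lines of Java. This is only a sketch of the formula as stated in this answer; the class and method names (`Ddr3Bandwidth`, `peakBytesPerSecond`, `toMiBPerSecond`) are made up for illustration.

```java
public class Ddr3Bandwidth {
    /** Theoretical peak transfer rate in bytes per second, following the formula above. */
    public static double peakBytesPerSecond(double memoryClockMHz, int channels) {
        // clock (MHz) x 4 (bus clock multiplier) x 2 (data rate) x 64 bits / 8 bits per byte
        double megabytesPerSecond = memoryClockMHz * 4 * 2 * 64 / 8;
        return megabytesPerSecond * 1_000_000 * channels;
    }

    /** Converts bytes per second to MiB per second (1 MiB = 1024 * 1024 bytes). */
    public static double toMiBPerSecond(double bytesPerSecond) {
        return bytesPerSecond / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        double clockMHz = 400.0 / 3.0; // DDR3-1066: 133 1/3 MHz memory clock
        for (int channels = 1; channels <= 2; channels++) {
            double bytesPerSecond = peakBytesPerSecond(clockMHz, channels);
            System.out.printf("%d channel(s): %.1f MB/s = %.1f MiB/s%n",
                    channels, bytesPerSecond / 1e6, toMiBPerSecond(bytesPerSecond));
        }
    }
}
```

Both figures are upper bounds on what the bus can deliver, not what a program will observe.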
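Second: transferring one big chunk of data is faster than transferring many small chunks.
Third: your test completely ignores caching effects.
Fourth: for time measurements, you should use System.nanoTime(). It is more precise than System.currentTimeMillis().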
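As a rough illustration of the second and fourth points, the sketch below copies the same array in pieces of different sizes and times each pass with System.nanoTime(). The class name `ChunkedCopy`, the 32 MiB array size and the chunk sizes are arbitrary choices for this example, and the numbers it prints will vary with the JVM, the hardware and cache state.

```java
import java.util.Arrays;

public class ChunkedCopy {
    /** Copies src into dest in pieces of chunkSize bytes; returns elapsed nanoseconds. */
    public static long copyInChunks(byte[] src, byte[] dest, int chunkSize) {
        long start = System.nanoTime();
        for (int offset = 0; offset < src.length; offset += chunkSize) {
            int length = Math.min(chunkSize, src.length - offset);
            System.arraycopy(src, offset, dest, offset, length);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        byte[] src = new byte[32 * 1024 * 1024]; // 32 MiB, arbitrary size for illustration
        byte[] dest = new byte[src.length];
        Arrays.fill(src, (byte) 1);
        for (int chunk : new int[] {1024, 64 * 1024, src.length}) {
            copyInChunks(src, dest, chunk); // warm-up pass, result discarded
            long nanos = copyInChunks(src, dest, chunk);
            double gibPerSecond = src.length / (nanos / 1e9) / (1L << 30);
            System.out.printf("chunk %d bytes: %.2f GiB/s%n", chunk, gibPerSecond);
        }
    }
}
```

A single warm-up pass is a crude stand-in for proper benchmarking; a harness such as JMH would give more trustworthy numbers.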
Here is a rewritten version of the test program.¹
import java.util.Random;

public class Main {
    public static void main(String... args) {
        final int SIZE = 1024 * 1024 * 1024; // 1 GiB per array
        final int RUNS = 8;
        final int THREADS = 8;
        final int TSIZE = SIZE / THREADS;    // bytes copied by each thread
        assert (TSIZE * THREADS == SIZE) : "THREADS must divide SIZE!";
        byte[] src = new byte[SIZE];
        byte[] dest = new byte[SIZE];
        Random r = new Random();
        long timeNano = 0;
        Thread[] threads = new Thread[THREADS];
        for (int i = 0; i < RUNS; ++i) {
            System.out.print("Initializing src... ");
            for (int idx = 0; idx < SIZE; ++idx) {
                src[idx] = ((byte) r.nextInt(256));
            }
            System.out.println("done!");
            System.out.print("Starting test... ");
            for (int idx = 0; idx < THREADS; ++idx) {
                final int from = TSIZE * idx;
                // each thread copies its own slice to the same offset in dest
                threads[idx]
                        = new Thread(() -> {
                            System.arraycopy(src, from, dest, from, TSIZE);
                        });
            }
            // only thread start, copy and join are timed
            long start = System.nanoTime();
            for (int idx = 0; idx < THREADS; ++idx) {
                threads[idx].start();
            }
            for (int idx = 0; idx < THREADS; ++idx) {
                try {
                    threads[idx].join();
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
            timeNano += System.nanoTime() - start;
            System.out.println("done!");
        }
        double timeSecs = timeNano / 1_000_000_000d;
        System.out.println("Transferred " + (long) SIZE * RUNS
                + " bytes in " + timeSecs + " seconds.");
        System.out.println("-> "
                + ((long) SIZE * RUNS / timeSecs / 1024 / 1024 / 1024)
                + " GiB/s");
    }
}
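This way, as many "other computations" as possible are eliminated, and (almost) only the memory copy rate via System.arraycopy(...) is measured. This algorithm may still have issues with regard to caching.
On my system (dual-channel DDR3-1600) I get around 6 GiB/s, whereas the theoretical limit is around 25 GB/s, or about 23.8 GiB/s (including dual channel).
As MagicM18 pointed out, the JVM introduces some overhead. It is therefore expected that you cannot reach the theoretical limit.
¹ Side note: to run the program, you must give the JVM more heap space. In my case, 4096 MB was sufficient.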
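The heap limit is normally raised with the standard `-Xmx` option, for example `java -Xmx4096m Main`. As a small sketch (the class name `HeapCheck` and its method are hypothetical, not part of the program above), a benchmark could check the configured limit before allocating its arrays:

```java
public class HeapCheck {
    /** Returns true if the JVM's maximum heap is at least requiredBytes. */
    public static boolean heapAtLeast(long requiredBytes) {
        return Runtime.getRuntime().maxMemory() >= requiredBytes;
    }

    public static void main(String[] args) {
        // two arrays of 1 GiB each (src and dest), ignoring any other heap usage
        long required = 2L * 1024 * 1024 * 1024;
        if (heapAtLeast(required)) {
            System.out.println("Max heap looks large enough for two 1 GiB arrays.");
        } else {
            System.out.println("Max heap is "
                    + Runtime.getRuntime().maxMemory() / (1024 * 1024)
                    + " MiB; consider starting the JVM with a larger -Xmx value.");
        }
    }
}
```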
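Your testing method is ill-designed in many respects, and so is your interpretation of the RAM rating.
Let's start with the rating. Ever since the introduction of SDRAM, marketing has named modules after their bus specification, that is, the bus clock frequency paired with the burst transfer rate. That is the best case, and in practice it cannot be sustained continuously.
The parameters omitted by that label are the actual access time (also known as latency) and the total cycle time (also known as precharge time). These can be worked out by looking at the "timing" specs (the 2-3-3 stuff). Look up an article that explains these in detail. Also, the CPU normally does not transfer single bytes but entire cache lines (e.g. 8 entries of 8 bytes each = 64 bytes).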
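To make the cache-line point concrete, here is a small sketch that counts how many 64-byte lines a block touches. It assumes a 64-byte line and treats the offset as if the array started on a line boundary, which the JVM does not guarantee; `CacheLineMath` and `linesTouched` are names invented for this example.

```java
public class CacheLineMath {
    public static final int LINE_SIZE = 64; // assumed cache line size in bytes

    /** Number of cache lines touched by length bytes starting at byte address start. */
    public static long linesTouched(long start, long length) {
        if (length <= 0) {
            return 0;
        }
        long firstLine = start / LINE_SIZE;
        long lastLine = (start + length - 1) / LINE_SIZE;
        return lastLine - firstLine + 1;
    }

    public static void main(String[] args) {
        System.out.println(linesTouched(0, 1024)); // aligned 1 KiB block: prints 16
        System.out.println(linesTouched(1, 1024)); // unaligned 1 KiB block: prints 17
        System.out.println(linesTouched(0, 1));    // a single byte still costs a whole line: prints 1
    }
}
```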
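Your test code is ill-designed because you are doing random access with relatively small blocks that are not aligned to actual data boundaries. This random access also causes frequent page misses in the MMU (read up on what the TLB is and does). So you are measuring a mixture of different system aspects.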
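One way to separate some of these aspects is to precompute the target offsets, so that random number generation stays out of the timed loop, and then compare sequential, block-aligned offsets against random ones. The sketch below does that; the class `AccessPattern`, the 64 MiB storage size and the seed are assumptions for illustration, and any difference it shows depends heavily on the machine.

```java
import java.util.Random;

public class AccessPattern {
    /** Copies block to every offset in storage; returns elapsed nanoseconds. */
    public static long copyToOffsets(byte[] block, byte[] storage, int[] offsets) {
        long start = System.nanoTime();
        for (int offset : offsets) {
            System.arraycopy(block, 0, storage, offset, block.length);
        }
        return System.nanoTime() - start;
    }

    /** Offsets 0, blockSize, 2 * blockSize, ... (sequential, block-aligned). */
    public static int[] sequentialOffsets(int count, int blockSize) {
        int[] offsets = new int[count];
        for (int i = 0; i < count; i++) {
            offsets[i] = i * blockSize;
        }
        return offsets;
    }

    /** Uniformly random offsets such that a whole block always fits in storage. */
    public static int[] randomOffsets(int count, int blockSize, int storageSize, long seed) {
        Random rnd = new Random(seed);
        int[] offsets = new int[count];
        for (int i = 0; i < count; i++) {
            offsets[i] = rnd.nextInt(storageSize - blockSize + 1);
        }
        return offsets;
    }

    public static void main(String[] args) {
        int blockSize = 1024;
        byte[] storage = new byte[64 * 1024 * 1024]; // 64 MiB, arbitrary for illustration
        byte[] block = new byte[blockSize];
        int count = storage.length / blockSize;
        int[] sequential = sequentialOffsets(count, blockSize);
        int[] random = randomOffsets(count, blockSize, storage.length, 42L);
        copyToOffsets(block, storage, sequential); // warm-up pass, result discarded
        System.out.println("sequential: " + copyToOffsets(block, storage, sequential) / 1e6 + " ms");
        System.out.println("random:     " + copyToOffsets(block, storage, random) / 1e6 + " ms");
    }
}
```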