Program exceeds theoretical memory transfer rate

Question

Ant*_*nio 18 java memory hardware performance benchmarking

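My laptop has an Intel Core 2 Duo 2.4 GHz CPU and 2x4 GB DDR3 modules rated at 1066 MHz.

I expected this memory to run at 1067 MiB/sec and, since there are two channels, to reach a maximum of 2134 MiB/sec (if the OS memory scheduler allows it).

I wrote a small Java application to test this: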

private static final int size = 256 * 1024 * 1024; // 256 MiB
private static final byte[] storage = new byte[size];

private static final int s = 1024; // 1 KiB
private static final int duration = 10; // 10 seconds

public static void main(String[] args) {
    long start = System.currentTimeMillis();
    Random rnd = new Random();
    byte[] buf1 = new byte[s];
    rnd.nextBytes(buf1);
    long count = 0;
    while (System.currentTimeMillis() - start < duration * 1000) {
        long begin = (long) (rnd.nextDouble() * (size - s));
        System.arraycopy(buf1, 0, storage, (int) begin, s);
        ++count;
    }
    double totalSeconds = (System.currentTimeMillis() - start) / 1000.0;
    double speed = count * s / totalSeconds / 1024 / 1024;
    System.out.println(count * s + " bytes transferred in " + totalSeconds + " secs (" + speed + " MiB/sec)");

    byte[] buf2 = new byte[s];
    count = 0;
    start = System.currentTimeMillis();
    while (System.currentTimeMillis() - start < duration * 1000) {
        long begin = (long) (rnd.nextDouble() * (size - s));
        System.arraycopy(storage, (int) begin, buf2, 0, s);
        Arrays.fill(buf2, (byte) 0);
        ++count;
    }
    totalSeconds = (System.currentTimeMillis() - start) / 1000.0;
    speed = count * s / totalSeconds / 1024 / 1024;
    System.out.println(count * s + " bytes transferred in " + totalSeconds + " secs (" + speed + " MiB/sec)");
}

I expected the result to be below 2134 MiB/sec, but I got the following:

17530212352 bytes transferred in 10.0 secs (1671.811328125 MiB/sec)
31237926912 bytes transferred in 10.0 secs (2979.080859375 MiB/sec)

How can the speed be almost 3 GiB/sec?

[Image: photo of the DDR3 module]

Answer 1

Tur*_*g85 20

There are several things at work here.

First: the formula for the memory transfer rate of DDR3 is

memory clock rate
× 4  (for bus clock multiplier)
× 2  (for data rate)
× 64 (number of bits transferred)
/ 8  (number of bits/byte)
=    memory clock rate × 64 (in MB/s)

For DDR3-1066 (which is clocked at about 133.33 MHz), we obtain a theoretical memory bandwidth of about 8533.33 MB/s (8138.02 MiB/s) for single channel and about 17066.67 MB/s (16276.04 MiB/s) for dual channel.
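The arithmetic above can be checked with a few lines of Java. This is only a sketch of the formula as stated in this answer; the class and method names (`Ddr3Bandwidth`, `peakMBps`, `toMiBps`) are made up for illustration.

```java
// Sketch: theoretical peak DDR3 bandwidth derived from the memory clock.
final class Ddr3Bandwidth {

    /** Peak transfer rate in MB/s (10^6 bytes per second). */
    static double peakMBps(double memoryClockMHz, int channels) {
        // clock x 4 (bus clock multiplier) x 2 (double data rate)
        // x 64 bits per transfer / 8 bits per byte
        return memoryClockMHz * 4 * 2 * 64 / 8 * channels;
    }

    /** Converts MB/s (decimal) to MiB/s (binary). */
    static double toMiBps(double mbPerSec) {
        return mbPerSec * 1_000_000 / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        double clock = 400.0 / 3.0; // DDR3-1066: about 133.33 MHz
        for (int channels = 1; channels <= 2; channels++) {
            double mb = peakMBps(clock, channels);
            System.out.printf("%d channel(s): %.2f MB/s = %.2f MiB/s%n",
                    channels, mb, toMiBps(mb));
        }
    }
}
```

Passing `200.0` as the clock gives the corresponding figures for DDR3-1600.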

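Second: transferring one big chunk of data is faster than transferring many small chunks.

Third: your test ignores caching effects, which can occur.

Fourth: if you measure time, you should use System.nanoTime(). It is more precise than System.currentTimeMillis().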
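To illustrate the second and fourth points, here is a minimal sketch that copies the same array in chunks of different sizes and times each pass with System.nanoTime(). The class name `ChunkedCopy`, the 64 MiB array size, and the chunk sizes are assumptions chosen for illustration. The numbers it prints depend on the machine and the JVM, so treat them as a rough indication only.

```java
import java.util.Arrays;

final class ChunkedCopy {

    /** Copies src to dest in pieces of chunk bytes; returns elapsed nanoseconds. */
    static long copyInChunks(byte[] src, byte[] dest, int chunk) {
        long start = System.nanoTime();
        for (int off = 0; off < src.length; off += chunk) {
            int len = Math.min(chunk, src.length - off);
            System.arraycopy(src, off, dest, off, len);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int size = 64 * 1024 * 1024; // 64 MiB (assumed; larger than typical CPU caches)
        byte[] src = new byte[size];
        byte[] dest = new byte[size];
        Arrays.fill(src, (byte) 1);

        for (int chunk : new int[] {1024, 1024 * 1024, size}) {
            long best = Long.MAX_VALUE;
            // Repeat each measurement so JIT warm-up does not dominate the result.
            for (int run = 0; run < 5; run++) {
                best = Math.min(best, copyInChunks(src, dest, chunk));
            }
            double gibPerSec = size / (best / 1e9) / (1024.0 * 1024 * 1024);
            System.out.printf("chunk %9d bytes: %.2f GiB/s%n", chunk, gibPerSec);
        }
    }
}
```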

Here is a rewritten version of the test program¹.

import java.util.Random;

public class Main {

  public static void main(String... args) {
    final int SIZE = 1024 * 1024 * 1024;
    final int RUNS = 8;
    final int THREADS = 8;
    final int TSIZE = SIZE / THREADS;
    assert (TSIZE * THREADS == SIZE) : "THREADS must divide SIZE!";
    byte[] src = new byte[SIZE];
    byte[] dest = new byte[SIZE];
    Random r = new Random();
    long timeNano = 0;

    Thread[] threads = new Thread[THREADS];
    for (int i = 0; i < RUNS; ++i) {
      System.out.print("Initializing src... ");
      for (int idx = 0; idx < SIZE; ++idx) {
        src[idx] = ((byte) r.nextInt(256));
      }
      System.out.println("done!");
      System.out.print("Starting test... ");
      for (int idx = 0; idx < THREADS; ++idx) {
        final int from = TSIZE * idx;
        threads[idx]
            = new Thread(() -> {
          // Each thread copies its own slice, so threads do not overwrite each other.
          System.arraycopy(src, from, dest, from, TSIZE);
        });
      }
      long start = System.nanoTime();
      for (int idx = 0; idx < THREADS; ++idx) {
        threads[idx].start();
      }
      for (int idx = 0; idx < THREADS; ++idx) {
        try {
          threads[idx].join();
        } catch (InterruptedException e) {
          e.printStackTrace();
        }
      }
      timeNano += System.nanoTime() - start;
      System.out.println("done!");
    }
    double timeSecs = timeNano / 1_000_000_000d;

    System.out.println("Transferred " + (long) SIZE * RUNS
        + " bytes in " + timeSecs + " seconds.");

    System.out.println("-> "
        + ((long) SIZE * RUNS / timeSecs / 1024 / 1024 / 1024)
        + " GiB/s");
  }
}

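This way, as much "other computation" as possible is removed, and you measure (almost) only the memory copy rate of System.arraycopy(...). The algorithm may still have issues with regard to caching.

On my system (dual-channel DDR3-1600), I get around 6 GiB/s, whereas the theoretical limit is about 25.6 GB/s, or roughly 23.8 GiB/s (including dual channel).

As MagicM18 pointed out, the JVM introduces some overhead. It is therefore expected that you cannot reach the theoretical limit.

¹ Side note: to run the program, you must give the JVM more heap space. In my case, 4096 MB was sufficient.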
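As a rough way to check the heap requirement before allocating the two 1 GiB arrays, a sketch like the following compares the required size with Runtime.getRuntime().maxMemory(). The class name `HeapCheck` and the helper `fits` are hypothetical. The check is only approximate, because the JVM needs heap for other objects as well.

```java
final class HeapCheck {

    /** True if the maximum heap is larger than the number of array bytes needed. */
    static boolean fits(long requiredBytes, long maxHeapBytes) {
        return requiredBytes < maxHeapBytes;
    }

    public static void main(String[] args) {
        long required = 2L * 1024 * 1024 * 1024; // src + dest, 1 GiB each
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap: " + maxHeap / (1024 * 1024) + " MiB");
        if (!fits(required, maxHeap)) {
            // Assumes the benchmark class is called Main, as in the listing above.
            System.out.println("Heap too small; try: java -Xmx4096m Main");
        }
    }
}
```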

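@Antonio Why did you reduce `SIZE`? To get close to the theoretical limit, the chunks should be as large as possible. It is better to give the JVM more heap space (`java -Xmx4096m ...`) than to reduce the chunk size. (2 upvotes)
@Antonio Your theoretical limit (including dual channel) is about 16 GiB/s, so you are not exceeding it. Keep in mind that my benchmark has to transfer everything twice (from memory to the CPU and from the CPU back to memory). So my program actually transfers 2.84 GiB/s. Your benchmark, however, probably profits a lot from caching (your source array is only 1 KB in size, so it may be cached completely). So basically both benchmarks show the same performance. (2 upvotes)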

Answer 2

Dur*_*dal 8

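Your test methodology is ill-designed in many respects, and so is your interpretation of the RAM rating.

Let's start with the rating. Ever since SDRAM was introduced, marketing has named modules after their bus specification, that is, the bus clock frequency paired with the burst transfer rate. That is the best case; in practice it cannot be sustained continuously.

The parameters this label omits are the actual access time (also known as latency) and the total cycle time (also known as precharge time). You can work these out by looking at the "timing" specifications (the 2-3-3 stuff). Look up an article that explains them in detail. In practice, the CPU normally does not transfer single bytes but entire cache lines (for example, 8 entries of 8 bytes each = 64 bytes).

Your test code is ill-designed because you perform random accesses with relatively small blocks that are not aligned to actual data boundaries. This random access also causes frequent page misses in the MMU (learn what a TLB is and does). As a result, you are measuring a mixture of different system aspects.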
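The effect of misalignment can be made concrete with a small sketch that counts how many cache lines and pages a 1 KiB copy overlaps for a given start offset. Everything here is an assumption for illustration: the class name `TouchedUnits`, the 64-byte cache line (taken from the example above), the 4096-byte page, and the simplification that offset 0 sits on an aligned address, which a Java array does not guarantee.

```java
final class TouchedUnits {

    /** Number of unit-aligned blocks overlapped by the range [offset, offset + length). */
    static long unitsTouched(long offset, long length, long unit) {
        if (length <= 0) {
            return 0;
        }
        long first = offset / unit;
        long last = (offset + length - 1) / unit;
        return last - first + 1;
    }

    public static void main(String[] args) {
        long line = 64;   // assumed cache line size in bytes
        long page = 4096; // assumed page size in bytes; depends on OS and CPU
        long[] offsets = {0, 100, 4000};
        for (long off : offsets) {
            System.out.println("offset " + off + ": "
                    + unitsTouched(off, 1024, line) + " cache lines, "
                    + unitsTouched(off, 1024, page) + " page(s)");
        }
    }
}
```

Under these assumptions, an aligned 1 KiB block covers 16 cache lines, while an unaligned one covers 17 and may also straddle a page boundary.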

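Actually, my goal was to test the speed when the required memory block is not in the cache. Let me look at what the timings say, and I'll come back. Thanks for your answer. (2 upvotes)
@Antonio Start with the wiki entry mentioned in the comments on your question. CL stands for *column access latency*, which is just one of the parameters. Modules "know" their timings, i.e. the mainboard's BIOS reads the parameters and adjusts its timings to match the RAM. There are tools that can display these values. As for a "formula" that neatly calculates something useful from the timings: seriously, start with the wiki, and perhaps supplement it with the *basics* of DRAM access. It is a relatively complex and *broad* topic. I don't know every detail myself. (2 upvotes)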

Archived:	10 years, 5 months ago
Viewed:	739 times
Last activity:	6 years, 6 months ago