Ways to improve performance consistency

Pet*_*rey 46 java memory concurrency performance jvm

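In the following example, one thread sends "messages" through a ByteBuffer that a consumer thread reads. The best performance is very good, but it is not consistent.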

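import java.io.IOException;
import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicInteger;
import sun.misc.Unsafe;

public class Main {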
    public static void main(String... args) throws IOException {
        for (int i = 0; i < 10; i++)
            doTest();
    }

    public static void doTest() {
        final ByteBuffer writeBuffer = ByteBuffer.allocateDirect(64 * 1024);
        final ByteBuffer readBuffer = writeBuffer.slice();
        final AtomicInteger readCount = new PaddedAtomicInteger();
        final AtomicInteger writeCount = new PaddedAtomicInteger();

        for(int i=0;i<3;i++)
            performTiming(writeBuffer, readBuffer, readCount, writeCount);
        System.out.println();
    }

    private static void performTiming(ByteBuffer writeBuffer, final ByteBuffer readBuffer, final AtomicInteger readCount, final AtomicInteger writeCount) {
        writeBuffer.clear();
        readBuffer.clear();
        readCount.set(0);
        writeCount.set(0);

        Thread t = new Thread(new Runnable() {
            @Override
            public void run() {
                byte[] bytes = new byte[128];
                while (!Thread.interrupted()) {
                    int rc = readCount.get(), toRead;
                    while ((toRead = writeCount.get() - rc) <= 0) ;
                    for (int i = 0; i < toRead; i++) {
                        byte len = readBuffer.get();
                        if (len == -1) {
                            // rewind.
                            readBuffer.clear();
//                            rc++;
                        } else {
                            int num = readBuffer.getInt();
                            if (num != rc)
                                throw new AssertionError("Expected " + rc + " but got " + num) ;
                            rc++;
                            readBuffer.get(bytes, 0, len - 4);
                        }
                    }
                    readCount.lazySet(rc);
                }
            }
        });
        t.setDaemon(true);
        t.start();
        Thread.yield();
        long start = System.nanoTime();
        int runs = 30 * 1000 * 1000;
        int len = 32;
        byte[] bytes = new byte[len - 4];
        int wc = writeCount.get();
        for (int i = 0; i < runs; i++) {
            if (writeBuffer.remaining() < len + 1) {
                // reader has to catch up.
                while (wc - readCount.get() > 0) ;
                // rewind.
                writeBuffer.put((byte) -1);
                writeBuffer.clear();
            }
            writeBuffer.put((byte) len);
            writeBuffer.putInt(i);
            writeBuffer.put(bytes);
            writeCount.lazySet(++wc);
        }
        // reader has to catch up.
        while (wc - readCount.get() > 0) ;
        t.interrupt();
        t.stop();
        long time = System.nanoTime() - start;
        System.out.printf("Message rate was %.1f M/s offsets %d %d %d%n", runs * 1e3 / time
                , addressOf(readBuffer) - addressOf(writeBuffer)
                , addressOf(readCount) - addressOf(writeBuffer)
                , addressOf(writeCount) - addressOf(writeBuffer)
        );
    }

    // assumes -XX:+UseCompressedOops.
    public static long addressOf(Object... o) {
        long offset = UNSAFE.arrayBaseOffset(o.getClass());
        return UNSAFE.getInt(o, offset) * 8L;
    }

    public static final Unsafe UNSAFE = getUnsafe();
    public static Unsafe getUnsafe() {
        try {
            Field field = Unsafe.class.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            return (Unsafe) field.get(null);
        } catch (Exception e) {
            throw new AssertionError(e);
        }
    }

    private static class PaddedAtomicInteger extends AtomicInteger {
        public long p2, p3, p4, p5, p6, p7;

        public long sum() {
//            return 0;
            return p2 + p3 + p4 + p5 + p6 + p7;
        }
    }
}

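This prints the timings for the same block of data. The numbers at the end are the relative addresses of the objects, which show that they are laid out the same way in cache each time. Running longer tests of 10 shows that a given combination repeatedly produces the same performance.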

Message rate was 63.2 M/s offsets 136 200 264
Message rate was 80.4 M/s offsets 136 200 264
Message rate was 80.0 M/s offsets 136 200 264

Message rate was 81.9 M/s offsets 136 200 264
Message rate was 82.2 M/s offsets 136 200 264
Message rate was 82.5 M/s offsets 136 200 264

Message rate was 79.1 M/s offsets 136 200 264
Message rate was 82.4 M/s offsets 136 200 264
Message rate was 82.4 M/s offsets 136 200 264

Message rate was 34.7 M/s offsets 136 200 264
Message rate was 39.1 M/s offsets 136 200 264
Message rate was 39.0 M/s offsets 136 200 264

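Each set of buffers and counters is tested three times, and a given set appears to give similar results each time. So I believe there is something about the way these buffers are laid out in memory that I am not seeing.

Is there anything that might give the higher performance more often? It looks like a cache collision, but I can't see where this could be happening.

BTW: M/s is millions of messages per second, which is more than anyone is likely to need, but it would be good to understand how to make it consistently fast.


EDIT: Using synchronized with wait and notify makes the results much more consistent, but not faster.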

Message rate was 6.9 M/s
Message rate was 7.8 M/s
Message rate was 7.9 M/s
Message rate was 6.7 M/s
Message rate was 7.5 M/s
Message rate was 7.7 M/s
Message rate was 7.3 M/s
Message rate was 7.9 M/s
Message rate was 6.4 M/s
Message rate was 7.8 M/s

EDIT: By using taskset, I can make the performance consistent if I lock the two threads to the same core.

Message rate was 35.1 M/s offsets 136 200 216
Message rate was 34.0 M/s offsets 136 200 216
Message rate was 35.4 M/s offsets 136 200 216

Message rate was 35.6 M/s offsets 136 200 216
Message rate was 37.0 M/s offsets 136 200 216
Message rate was 37.2 M/s offsets 136 200 216

Message rate was 37.1 M/s offsets 136 200 216
Message rate was 35.0 M/s offsets 136 200 216
Message rate was 37.1 M/s offsets 136 200 216

If I use any two logical threads on different cores, I get the inconsistent behaviour:

Message rate was 60.2 M/s offsets 136 200 216
Message rate was 68.7 M/s offsets 136 200 216
Message rate was 55.3 M/s offsets 136 200 216

Message rate was 39.2 M/s offsets 136 200 216
Message rate was 39.1 M/s offsets 136 200 216
Message rate was 37.5 M/s offsets 136 200 216

Message rate was 75.3 M/s offsets 136 200 216
Message rate was 73.8 M/s offsets 136 200 216
Message rate was 66.8 M/s offsets 136 200 216

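EDIT: It appears that triggering a GC will shift the behaviour. These show repeated tests on the same buffer + counters, with a manually triggered GC halfway through.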

faster after GC

Message rate was 27.4 M/s offsets 136 200 216
Message rate was 27.8 M/s offsets 136 200 216
Message rate was 29.6 M/s offsets 136 200 216
Message rate was 27.7 M/s offsets 136 200 216
Message rate was 29.6 M/s offsets 136 200 216
[GC 14312K->1518K(244544K), 0.0003050 secs]
[Full GC 1518K->1328K(244544K), 0.0068270 secs]
Message rate was 34.7 M/s offsets 64 128 144
Message rate was 54.5 M/s offsets 64 128 144
Message rate was 54.1 M/s offsets 64 128 144
Message rate was 51.9 M/s offsets 64 128 144
Message rate was 57.2 M/s offsets 64 128 144

and slower

Message rate was 61.1 M/s offsets 136 200 216
Message rate was 61.8 M/s offsets 136 200 216
Message rate was 60.5 M/s offsets 136 200 216
Message rate was 61.1 M/s offsets 136 200 216
[GC 35740K->1440K(244544K), 0.0018170 secs]
[Full GC 1440K->1302K(244544K), 0.0071290 secs]
Message rate was 53.9 M/s offsets 64 128 144
Message rate was 54.3 M/s offsets 64 128 144
Message rate was 50.8 M/s offsets 64 128 144
Message rate was 56.6 M/s offsets 64 128 144
Message rate was 56.0 M/s offsets 64 128 144
Message rate was 53.6 M/s offsets 64 128 144

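EDIT: Using @BegemoT's library to print the core id used, I get the following on a 3.8 GHz i7 (home PC).

Note: the offsets are incorrect by a factor of 8. As the heap size was small, the JVM doesn't multiply the reference by 8 as it does with a heap that is larger (but less than 32 GB).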

writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#5]
Message rate was 54.4 M/s offsets 3392 3904 4416
writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#6]
Message rate was 54.2 M/s offsets 3392 3904 4416
writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#5]
Message rate was 60.7 M/s offsets 3392 3904 4416

writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#5]
Message rate was 25.5 M/s offsets 1088 1600 2112
writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#5]
Message rate was 25.9 M/s offsets 1088 1600 2112
writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#5]
Message rate was 26.0 M/s offsets 1088 1600 2112

writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#5]
Message rate was 61.0 M/s offsets 1088 1600 2112
writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#5]
Message rate was 61.8 M/s offsets 1088 1600 2112
writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#5]
Message rate was 60.7 M/s offsets 1088 1600 2112

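You can see that the same logical threads are being used, but the performance varies between runs, though not within a run (within a run the same objects are used).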
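To make the note about the factor of 8 concrete, here is a minimal sketch that undoes the `* 8L` applied by `addressOf` in the question, assuming the note is right that the JVM was storing unshifted references in that run. The class and method names are hypothetical.

```java
class OffsetCorrection {
    // Hypothetical helper: undo the "* 8L" that addressOf() applied when the
    // JVM was (per the note above) storing references without a shift.
    static long unscale(long reportedOffset) {
        return reportedOffset / 8;
    }

    public static void main(String[] args) {
        long[] reported = {3392, 3904, 4416}; // offsets printed in the runs above
        for (long r : reported) {
            System.out.println(r + " -> " + unscale(r));
        }
    }
}
```

On that assumption the corrected offsets are 424, 488 and 552 bytes, which puts the three objects 64 bytes apart.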


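I have found the problem. It was a memory layout issue, and I could see a simple way to resolve it. ByteBuffer cannot be extended, so you can't add padding to it; instead I create an object that I discard.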

    final ByteBuffer writeBuffer = ByteBuffer.allocateDirect(64 * 1024);
    final ByteBuffer readBuffer = writeBuffer.slice();
    new PaddedAtomicInteger();
    final AtomicInteger readCount = new PaddedAtomicInteger();
    final AtomicInteger writeCount = new PaddedAtomicInteger();

Without this extra padding (the object which is not used), the results look like this on a 3.8 GHz i7.

Message rate was 38.5 M/s offsets 3392 3904 4416
Message rate was 54.7 M/s offsets 3392 3904 4416
Message rate was 59.4 M/s offsets 3392 3904 4416

Message rate was 54.3 M/s offsets 1088 1600 2112
Message rate was 56.3 M/s offsets 1088 1600 2112
Message rate was 56.6 M/s offsets 1088 1600 2112

Message rate was 28.0 M/s offsets 1088 1600 2112
Message rate was 28.1 M/s offsets 1088 1600 2112
Message rate was 28.0 M/s offsets 1088 1600 2112

Message rate was 17.4 M/s offsets 1088 1600 2112
Message rate was 17.4 M/s offsets 1088 1600 2112
Message rate was 17.4 M/s offsets 1088 1600 2112

Message rate was 54.5 M/s offsets 1088 1600 2112
Message rate was 54.2 M/s offsets 1088 1600 2112
Message rate was 55.1 M/s offsets 1088 1600 2112

Message rate was 25.5 M/s offsets 1088 1600 2112
Message rate was 25.6 M/s offsets 1088 1600 2112
Message rate was 25.6 M/s offsets 1088 1600 2112

Message rate was 56.6 M/s offsets 1088 1600 2112
Message rate was 54.7 M/s offsets 1088 1600 2112
Message rate was 54.4 M/s offsets 1088 1600 2112

Message rate was 57.0 M/s offsets 1088 1600 2112
Message rate was 55.9 M/s offsets 1088 1600 2112
Message rate was 56.3 M/s offsets 1088 1600 2112

Message rate was 51.4 M/s offsets 1088 1600 2112
Message rate was 56.6 M/s offsets 1088 1600 2112
Message rate was 56.1 M/s offsets 1088 1600 2112

Message rate was 46.4 M/s offsets 1088 1600 2112
Message rate was 46.4 M/s offsets 1088 1600 2112
Message rate was 47.4 M/s offsets 1088 1600 2112

With the discarded padding object.

Message rate was 54.3 M/s offsets 3392 4416 4928
Message rate was 53.1 M/s offsets 3392 4416 4928
Message rate was 59.2 M/s offsets 3392 4416 4928

Message rate was 58.8 M/s offsets 1088 2112 2624
Message rate was 58.9 M/s offsets 1088 2112 2624
Message rate was 59.3 M/s offsets 1088 2112 2624

Message rate was 59.4 M/s offsets 1088 2112 2624
Message rate was 59.0 M/s offsets 1088 2112 2624
Message rate was 59.8 M/s offsets 1088 2112 2624

Message rate was 59.8 M/s offsets 1088 2112 2624
Message rate was 59.8 M/s offsets 1088 2112 2624
Message rate was 59.2 M/s offsets 1088 2112 2624

Message rate was 60.5 M/s offsets 1088 2112 2624
Message rate was 60.5 M/s offsets 1088 2112 2624
Message rate was 60.5 M/s offsets 1088 2112 2624

Message rate was 60.5 M/s offsets 1088 2112 2624
Message rate was 60.9 M/s offsets 1088 2112 2624
Message rate was 60.6 M/s offsets 1088 2112 2624

Message rate was 59.6 M/s offsets 1088 2112 2624
Message rate was 60.3 M/s offsets 1088 2112 2624
Message rate was 60.5 M/s offsets 1088 2112 2624

Message rate was 60.9 M/s offsets 1088 2112 2624
Message rate was 60.5 M/s offsets 1088 2112 2624
Message rate was 60.5 M/s offsets 1088 2112 2624

Message rate was 60.7 M/s offsets 1088 2112 2624
Message rate was 61.6 M/s offsets 1088 2112 2624
Message rate was 60.8 M/s offsets 1088 2112 2624

Message rate was 60.3 M/s offsets 1088 2112 2624
Message rate was 60.7 M/s offsets 1088 2112 2624
Message rate was 58.3 M/s offsets 1088 2112 2624

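Unfortunately, there is always the risk that after a GC the objects will not be laid out optimally. The only way to resolve this may be to add padding to the original class. :(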
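For classes you do control, such as the counters, one way to make the padding travel with the object is to pad on both sides of the hot field through a small class hierarchy. This is only a sketch: the class and field names are made up, and it assumes the JVM lays out superclass fields before subclass fields, which HotSpot does in practice but which is not guaranteed by the language.

```java
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

// Padding placed before the hot field (names are arbitrary).
class LeftPad {
    long p1, p2, p3, p4, p5, p6, p7;
}

// The only field the reader and writer actually contend on.
class HotValue extends LeftPad {
    volatile int value;
}

// Padding placed after the hot field, plus the accessors.
class PaddedCounter extends HotValue {
    long q1, q2, q3, q4, q5, q6, q7;

    private static final AtomicIntegerFieldUpdater<HotValue> UPDATER =
            AtomicIntegerFieldUpdater.newUpdater(HotValue.class, "value");

    int get() {
        return value;                 // volatile read
    }

    void set(int v) {
        value = v;                    // volatile write
    }

    void lazySet(int v) {
        UPDATER.lazySet(this, v);     // ordered write, as used in the question
    }

    public static void main(String[] args) {
        PaddedCounter c = new PaddedCounter();
        c.lazySet(5);
        System.out.println(c.get());
    }
}
```

Because the padding is part of the object itself, it moves with the object when the GC relocates it. It does nothing for ByteBuffer, which still cannot be extended.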

phi*_*lwb 24

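I'm not an expert in the area of processor caches, but I suspect your issue is essentially a cache issue or some other memory layout problem. Repeatedly allocating the buffers and counters without cleaning up the old objects may cause you to periodically get a very bad cache layout, which may lead to your inconsistent performance.

Using your code and making a few mods, I have been able to make the performance consistent (my test machine is an Intel Core2 Quad CPU Q6600 2.4GHz w/ Win7x64, so not quite the same but hopefully close enough to give relevant results). I've done this in two different ways, both of which have roughly the same effect.

First, move the creation of the buffers and counters outside of the doTest method so that they are created only once and then reused for each pass of the test. Now you get a single allocation, it sits nicely in the cache, and performance is consistent.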
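A rough sketch of that first change, assuming plain AtomicInteger counters in place of the question's PaddedAtomicInteger so that the example stands alone; the class name and constants are hypothetical:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicInteger;

class ReuseBuffers {
    // Allocated once, outside the per-test method, and reused on every pass.
    static final ByteBuffer WRITE_BUFFER = ByteBuffer.allocateDirect(64 * 1024);
    static final ByteBuffer READ_BUFFER = WRITE_BUFFER.slice();
    static final AtomicInteger READ_COUNT = new AtomicInteger();
    static final AtomicInteger WRITE_COUNT = new AtomicInteger();

    static void doTest() {
        // Reset state instead of allocating fresh objects.
        WRITE_BUFFER.clear();
        READ_BUFFER.clear();
        READ_COUNT.set(0);
        WRITE_COUNT.set(0);
        // The question's performTiming(...) would be called here with
        // the four shared objects.
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            doTest();
        }
        System.out.println("buffer capacity: " + READ_BUFFER.capacity());
    }
}
```

Every pass now touches the same four objects, so whatever layout the first allocation produced is the layout every run sees.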

Another way to get the same reuse, but with "different" buffers/counters, is to insert a gc after the performTiming loop:

for ( int i = 0; i < 3; i++ )
    performTiming ( writeBuffer, readBuffer, readCount, writeCount );
System.out.println ();
System.gc ();

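Here the result is more or less the same: the gc lets the buffers/counters be reclaimed, the next allocation ends up reusing the same memory (at least on my test system), and you end up in cache with consistent performance (I also added printing of the actual addresses to verify that the same locations are reused). My guess is that without the cleanup leading to reuse, you eventually end up with a buffer that doesn't fit into the cache, and your performance suffers as it is swapped in. I suspect that you could do some strange things with the order of allocation (for example, you can make the performance worse on my machine by moving the counter allocation in front of the buffers), or create some dead space around each run to "purge" the cache, if you don't want to eliminate the buffers from a prior loop.

Finally, as I said, processor caches and the fun of memory layouts aren't my area of expertise, so if the explanations are misleading or wrong, sorry about that.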


jta*_*orn 8

You are busy waiting. That is always a bad idea in user code.

Reader:

while ((toRead = writeCount.get() - rc) <= 0) ;

Writer:

while (wc - readCount.get() > 0) ;

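  • The reason for busy waiting is to avoid giving up the core and context switching, which can increase latency significantly. Using wait/notify is somewhat slower, but not as slow as I imagined. (7 upvotes)
  • +1. This is what `wait()`, `notify()` and `notifyAll()` are for. (3 upvotes)
  • Using wait/notify makes the performance more consistent, but at least 4x slower. (3 upvotes)
  • @z5h, wait/notify and lock/condition are both horrible ideas for this code (the more horrible). Park/unpark is the way to go after some busy spinning, possibly Thread.yield, then backing off. (2 upvotes)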
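The spin, then yield, then back off idea from the last comment could look roughly like the sketch below. It substitutes a timed `LockSupport.parkNanos` for a true park/unpark pair, because unpark would require the writer to hold a reference to the reader thread. The thresholds and names are made-up values for illustration, not tuned numbers.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;

class BackoffWait {
    static final int SPIN_LIMIT = 1_000;   // assumed: idle loops of pure spinning
    static final int YIELD_LIMIT = 100;    // assumed: idle loops of Thread.yield()
    static final long PARK_NANOS = 1_000;  // assumed: requested park time per loop

    // Waits until counter has advanced past 'seen' and returns how far.
    static int awaitAdvance(AtomicInteger counter, int seen) {
        int idleLoops = 0;
        int available;
        while ((available = counter.get() - seen) <= 0) {
            if (idleLoops < SPIN_LIMIT) {
                idleLoops++;                        // phase 1: busy spin
            } else if (idleLoops < SPIN_LIMIT + YIELD_LIMIT) {
                idleLoops++;
                Thread.yield();                     // phase 2: offer the core
            } else {
                LockSupport.parkNanos(PARK_NANOS);  // phase 3: back off
            }
        }
        return available;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger writeCount = new AtomicInteger();
        Thread writer = new Thread(() -> {
            LockSupport.parkNanos(5_000_000);       // pretend to be busy for ~5 ms
            writeCount.lazySet(3);                  // publish three messages
        });
        writer.start();
        int toRead = awaitAdvance(writeCount, 0);
        writer.join();
        System.out.println("toRead = " + toRead);
    }
}
```

The reader loop in the question would call something like `awaitAdvance(writeCount, rc)` in place of the empty `while` loop. Whether that helps or hurts the message rate is something to measure, since the spinning phase alone already matches the original behaviour when messages arrive quickly.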

jef*_*unt 6

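As a general approach to performance profiling:

  • Try jconsole. Start your application and, while it is running, type jconsole in a separate terminal window. This brings up the Java Console GUI, which lets you connect to a running JVM and see performance metrics, memory usage, thread counts and status, and so on.
  • Basically, you will have to figure out the correlation between the speed fluctuations and what you see the JVM doing. It may also help to bring up your task manager, see whether your system is actually just busy doing other things (paging to disk because of low memory, running a heavy background task, etc.), and put it side by side with the jconsole window.
  • Another alternative is to launch the JVM with the -Xprof option, which outputs the relative time spent in various methods on a per-thread basis. Ex. java -Xprof [your class file]
  • Finally, there is also JProfiler, but it is a commercial tool, if that matters to you.

  • I wouldn't use a profiler for this kind of test; it destroys the performance. Besides, there is hardly any code to profile. (3 upvotes)
  • Profilers work in two ways: sampling and code injection. Sampling is bad because it relies on safepoints to collect any stack traces, i.e. it depends on where the JVM places safepoints, and usually it shows nothing of use. Code injection is even worse: it changes the way the JVM compiles the code and kills a lot of optimizations. In short, simply profiling low-level stuff doesn't work. (2 upvotes)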

Mat*_*att 6

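EDIT: It appears that triggering a GC will shift the behaviour. These show repeated tests on the same buffer + counters, with a manually triggered GC halfway through.

GC means reaching a safepoint, which means all your threads have stopped executing bytecode and the GC threads have work to do. This can have various side effects. For example, in the absence of any explicit CPU affinity, you may restart execution on a different core, or cache lines may have been flushed. Can you track which cores your threads are running on?

What CPUs are these? Have you done anything about power management to prevent them from dropping into lower p and/or c states? Perhaps one thread is being scheduled onto a core that was in a different p state and hence shows a different performance profile.

EDIT

I tried running your test on a workstation running x64 Linux with two slightly old quad-core Xeons (E5504). It is generally consistent within a run (~17-18 M/s), with some runs much slower, which appears to correspond to thread migrations. I didn't plot this rigorously. So it seems your problem might be CPU architecture specific. You mention you are running an i7 at 4.6GHz; is that a typo? I thought the i7 topped out at 3.5GHz with a 3.9GHz turbo mode (with an earlier version at 3.3GHz to 3.6GHz turbo). Either way, are you sure you are not seeing an artefact of turbo mode kicking in and then dropping out? You could try repeating the test with turbo disabled to be sure.

A couple of other points

  • The padding values are all 0. Are you sure uninitialised values are not getting some special treatment? You could consider using the LogCompilation option to understand how the JIT is treating that method.
  • Intel VTune is free for a 30-day evaluation. If this is a cache line problem, you could use it to pin down the problem on your host.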


Beg*_*moT 6

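How do you actually pin your threads to cores? taskset is not the best way to pin a thread to a core, because it only pins the process to the cores, and all of its threads will share those cores. Recall that Java has many internal threads for its own needs, so they will all contend for the cores you bind them to.

To get more consistent results, you can use JNA to call sched_setaffinity() from only the threads you need. This pins only your benchmarking threads to exact cores, while the other Java threads spread across the other free cores and have less effect on your code's behaviour.

By the way, I have had similar problems with unstable performance when benchmarking highly optimised concurrent code. It seems that too many things can drastically influence performance when it is close to the hardware limits. You should either tune your OS somehow to give your code the best chance, or just run many experiments and use math to get averages and confidence intervals.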
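For the "many experiments plus math" route, a minimal sketch of the arithmetic follows. It uses the normal-approximation critical value 1.96 for a 95% interval, which is an assumption that only holds reasonably for larger sample counts; with a handful of runs a Student t value would be more appropriate. The class name is hypothetical and the sample rates are a few of the figures posted in the question.

```java
class RateStats {
    // Arithmetic mean of the samples.
    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    // Sample standard deviation (divides by n - 1, so it needs n >= 2).
    static double sampleStdDev(double[] xs) {
        double m = mean(xs);
        double sumSq = 0;
        for (double x : xs) {
            sumSq += (x - m) * (x - m);
        }
        return Math.sqrt(sumSq / (xs.length - 1));
    }

    // Half-width of an approximate 95% confidence interval for the mean,
    // using the normal critical value 1.96 (an approximation for small n).
    static double halfWidth95(double[] xs) {
        return 1.96 * sampleStdDev(xs) / Math.sqrt(xs.length);
    }

    public static void main(String[] args) {
        // Message rates in M/s taken from the padded runs in the question.
        double[] rates = {54.3, 53.1, 59.2, 58.8, 58.9, 59.3};
        System.out.printf("mean %.2f M/s, 95%% CI +/- %.2f M/s%n",
                mean(rates), halfWidth95(rates));
    }
}
```

Note that the runs in the question are clearly not independent draws from one distribution (rates cluster per allocation), so an interval like this describes the spread across layouts rather than measurement noise.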

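  • Also, you can take a look at Cliff Click's post http://www.azulsystems.com/blog/cliff/2011-09-23-a-pair-of-somebody-elses-concurrency-bugs (scroll down until he talks about the Disruptor). The Disruptor ring buffer is very similar to your code (a shared buffer with volatile reads/writes, plus membars forcing data transfer between threads; they even use lazySet as a volatile write optimization), and Cliff observed the same 3x performance instability, so his description may help you understand the problem. But he mostly claims thread affinity as the cause. (3 upvotes)
  • Well, if GC is the main issue, then compaction seems to be the reason: since the GC may defragment memory, moving objects here and there, the new object layout may simply be an unlucky one. CPU caches are not as simple as "cache lines" and "false sharing"; there is also such a thing as cache associativity. For example, readCount and writeCount, although padded, could be placed in memory regions that map to the same cache lines through a cache with limited associativity.... (2 upvotes)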
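To illustrate the associativity point, here is a small sketch of how a set-associative cache picks a set from an address. The geometry is an assumption chosen for the example (64-byte lines, 64 sets, as in a 32 KiB 8-way cache), and the addresses are hypothetical, not taken from the runs above.

```java
class CacheSetIndex {
    static final int LINE_SIZE = 64; // assumed cache line size in bytes
    static final int NUM_SETS = 64;  // assumed: 32 KiB / (64 B * 8 ways)

    // The set an address maps to: drop the offset within the line,
    // then wrap around the number of sets.
    static long setIndex(long address) {
        return (address / LINE_SIZE) % NUM_SETS;
    }

    public static void main(String[] args) {
        long a = 1088;                       // hypothetical address
        long b = a + LINE_SIZE * NUM_SETS;   // 4096 bytes further on
        long c = a + LINE_SIZE;              // the very next line
        System.out.println("set(a) = " + setIndex(a));
        System.out.println("set(b) = " + setIndex(b));
        System.out.println("set(c) = " + setIndex(c));
    }
}
```

Under these assumptions `a` and `b` land in the same set even though they are 4096 bytes apart and share no cache line, while `c` lands in the neighbouring set. Padding only guarantees distance, so it cannot rule out this kind of collision once the GC has moved objects.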