使用大堆的Java速度较慢,没有过多的gc-ing

Pen*_*eng 19 java performance jvm heap-memory

原始问题

我正在运行一个Java程序,比如program.jar,它带有一个"小"初始堆(1gb)和一个"大"初始堆(16gb).当我用小的初始堆运行它,即

java -jar -Xms1g -Xmx16g program.jar
Run Code Online (Sandbox Code Playgroud)

程序在12秒内终止(多次运行平均).相反,当我用大的初始堆运行它,即

java -jar -Xms16g -Xmx16g program.jar
Run Code Online (Sandbox Code Playgroud)

程序在30秒内终止(多次运行平均).

我从SO的其他问题中了解到,一般来说,大堆可能会导致过多的垃圾收集,从而减慢了程序的速度:

但是,当我使用flag运行program.jar时-verbose:gc,不会报告任何具有大型初始堆的gc-activity.初始堆很小,在开始测量时间之前,在程序的初始化阶段只有一些gc活动.因此,过多的垃圾收集似乎并未澄清我的观察结果.

为了让它更混乱(至少对我而言),我有一个功能相同的程序,比如program2.jar,它具有与program.jar相同的输入输出行为.主要区别在于program2.jar使用的效率低于program.jar,至少就内存而言(无论program2.jar在时间上效率是否也低,实际上我正在尝试确定).但无论我是使用小的初始堆还是大的初始堆运行program2.jar,它总是在大约22秒内终止(包括大约2-3秒的gc-ing).

所以,这是我的问题:(怎么样)大堆可以减慢程序,放弃过多的gc-ing?

(这个问题可能看起来类似于Georg在" Java慢大堆 "中的问题,但他的问题与堆无关.在我的情况下,我觉得它必须与堆有关,因为它是唯一的program.jar的两次运行之间的差异.)

以下是一些可能相关的细节.我正在使用Java 7,OpenJDK:

> java -version
java version "1.7.0_75"
OpenJDK Runtime Environment (rhel-2.5.4.0.el6_6-x86_64 u75-b13)
OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)
Run Code Online (Sandbox Code Playgroud)

我的机器有两个E5-2690V3处理器(http://ark.intel.com/products/81713)在两个插槽中(超线程和Turbo Boost禁用)并且有足够的内存(64gb),其中大约一半是免费的在运行程序之前:

> free
             total       used       free     shared    buffers     cached
Mem:      65588960   31751316   33837644         20     154616   23995164
-/+ buffers/cache:    7601536   57987424
Swap:      1023996      11484    1012512
Run Code Online (Sandbox Code Playgroud)

最后,程序有多个线程(大约70个).

编辑1

回应布鲁诺雷斯kdgregory

我收集了一些额外的统计数据.这是针对初始堆小的program.jar:

Command being timed: "java -Xms1g -Xmx16g -verbose:gc -jar program.jar"
User time (seconds): 339.11
System time (seconds): 29.86
Percent of CPU this job got: 701%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:52.61
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 12192224
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1
Minor (reclaiming a frame) page faults: 771372
Voluntary context switches: 7446
Involuntary context switches: 27788
Swaps: 0
File system inputs: 10216
File system outputs: 120
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Run Code Online (Sandbox Code Playgroud)

这适用于具有大型初始堆的program.jar:

Command being timed: "java -Xms16g -Xmx16g -verbose:gc -jar program.jar"
User time (seconds): 769.13
System time (seconds): 28.04
Percent of CPU this job got: 1101%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:12.34
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 10974528
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 16
Minor (reclaiming a frame) page faults: 687727
Voluntary context switches: 6769
Involuntary context switches: 68465
Swaps: 0
File system inputs: 2032
File system outputs: 160
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Run Code Online (Sandbox Code Playgroud)

(此处报告的挂钟时间与我原始帖子中报告的挂钟时间不同,因为之前未定时的初始化阶段.)

编辑2

回应8472的初步答案和后来的评论

我收集了一些有关缓存的统计信息.这适用于具有较小初始堆(平均超过30次运行)的program.jar:

  2719852136 cache-references                                              ( +-  1.56% ) [42.11%]
  1931377514 cache-misses              #   71.010 % of all cache refs      ( +-  0.07% ) [42.11%]
 56748034419 L1-dcache-loads                                               ( +-  1.34% ) [42.12%]
 16334611643 L1-dcache-load-misses     #   28.78% of all L1-dcache hits    ( +-  1.70% ) [42.12%]
 24886806040 L1-dcache-stores                                              ( +-  1.47% ) [42.12%]
  2438414068 L1-dcache-store-misses                                        ( +-  0.19% ) [42.13%]
           0 L1-dcache-prefetch-misses                                    [42.13%]
    23243029 L1-icache-load-misses                                         ( +-  0.66% ) [42.14%]
  2424355365 LLC-loads                                                     ( +-  1.73% ) [42.15%]
   278916135 LLC-stores                                                    ( +-  0.30% ) [42.16%]
   515064030 LLC-prefetches                                                ( +-  0.33% ) [10.54%]
 63395541507 dTLB-loads                                                    ( +-  0.17% ) [15.82%]
  7402222750 dTLB-load-misses          #   11.68% of all dTLB cache hits   ( +-  1.87% ) [21.08%]
 20945323550 dTLB-stores                                                   ( +-  0.69% ) [26.34%]
   294311496 dTLB-store-misses                                             ( +-  0.16% ) [31.60%]
    17012236 iTLB-loads                                                    ( +-  2.10% ) [36.86%]
      473238 iTLB-load-misses          #    2.78% of all iTLB cache hits   ( +-  2.88% ) [42.12%]
 29390940710 branch-loads                                                  ( +-  0.18% ) [42.11%]
    19502228 branch-load-misses                                            ( +-  0.57% ) [42.11%]

53.771209341 seconds time elapsed                                          ( +-  0.42% )
Run Code Online (Sandbox Code Playgroud)

这适用于具有大型初始堆(平均超过30次运行)的program.jar:

 10465831994 cache-references                                              ( +-  3.00% ) [42.10%]
  1921281060 cache-misses              #   18.358 % of all cache refs      ( +-  0.03% ) [42.10%]
 51072650956 L1-dcache-loads                                               ( +-  2.14% ) [42.10%]
 24282459597 L1-dcache-load-misses     #   47.54% of all L1-dcache hits    ( +-  0.16% ) [42.10%]
 21447495598 L1-dcache-stores                                              ( +-  2.46% ) [42.10%]
  2441970496 L1-dcache-store-misses                                        ( +-  0.13% ) [42.10%]
           0 L1-dcache-prefetch-misses                                    [42.11%]
    24953833 L1-icache-load-misses                                         ( +-  0.78% ) [42.12%]
 10234572163 LLC-loads                                                     ( +-  3.09% ) [42.13%]
   240843257 LLC-stores                                                    ( +-  0.17% ) [42.14%]
   462484975 LLC-prefetches                                                ( +-  0.22% ) [10.53%]
 62564723493 dTLB-loads                                                    ( +-  0.10% ) [15.80%]
 12686305321 dTLB-load-misses          #   20.28% of all dTLB cache hits   ( +-  0.01% ) [21.06%]
 19201170089 dTLB-stores                                                   ( +-  1.11% ) [26.33%]
   279236455 dTLB-store-misses                                             ( +-  0.10% ) [31.59%]
    16259758 iTLB-loads                                                    ( +-  4.65% ) [36.85%]
      466127 iTLB-load-misses          #    2.87% of all iTLB cache hits   ( +-  6.68% ) [42.11%]
 28098428012 branch-loads                                                  ( +-  0.13% ) [42.10%]
    18707911 branch-load-misses                                            ( +-  0.82% ) [42.10%]

73.576058909 seconds time elapsed                                          ( +-  0.54% )
Run Code Online (Sandbox Code Playgroud)

比较绝对数字,较大的初始堆导致L1-dcache-load-miss大约多50%,dTLB-load-miss大约多70%.在我的2.6 ghz机器上假设100次循环/未命中(源:维基百科),我对dTLB负载未命中进行了包络计算,这给出了大的初始堆延迟484秒而不是284秒小的延迟.我不知道如何将这个数字转换回每个核心的延迟(可能不仅仅是除以核心数量?),但数量级似乎是合理的.

在收集这些统计信息之后,我还-XX:+PrintFlagsFinal对大小初始堆的输出进行了区分(基于这两种情况中的每一种的一次运行):

<     uintx InitialHeapSize                          := 17179869184     {product}           
---
>     uintx InitialHeapSize                          := 1073741824      {product}  
Run Code Online (Sandbox Code Playgroud)

因此,没有其他标志似乎受到影响-Xms.这里也是-XX:+PrintGCDetailsprogram.jar 的输出,初始堆很小:

[GC [PSYoungGen: 239882K->33488K(306176K)] 764170K->983760K(1271808K), 0.0840630 secs] [Times: user=0.70 sys=0.66, real=0.09 secs]
[Full GC [PSYoungGen: 33488K->0K(306176K)] [ParOldGen: 950272K->753959K(1508352K)] 983760K->753959K(1814528K) [PSPermGen: 2994K->2993K(21504K)], 0.0560900 secs] [Times: user=0.20 sys=0.03, real=0.05 secs]
[GC [PSYoungGen: 234744K->33056K(306176K)] 988704K->983623K(1814528K), 0.0416120 secs] [Times: user=0.69 sys=0.03, real=0.04 secs]
[GC [PSYoungGen: 264198K->33056K(306176K)] 1214765K->1212999K(1814528K), 0.0489600 secs] [Times: user=0.61 sys=0.23, real=0.05 secs]
[Full GC [PSYoungGen: 33056K->0K(306176K)] [ParOldGen: 1179943K->1212700K(2118656K)] 1212999K->1212700K(2424832K) [PSPermGen: 2993K->2993K(21504K)], 0.1589640 secs] [Times: user=2.27 sys=0.10, real=0.16 secs]
[GC [PSYoungGen: 230538K->33056K(431616K)] 1443238K->1442364K(2550272K), 0.0523620 secs] [Times: user=0.69 sys=0.23, real=0.05 secs]
[GC [PSYoungGen: 427431K->33152K(557568K)] 1836740K->1835676K(2676224K), 0.0774750 secs] [Times: user=0.64 sys=0.72, real=0.08 secs]
[Full GC [PSYoungGen: 33152K->0K(557568K)] [ParOldGen: 1802524K->1835328K(2897920K)] 1835676K->1835328K(3455488K) [PSPermGen: 2993K->2993K(21504K)], 0.2019870 secs] [Times: user=2.74 sys=0.13, real=0.20 secs]
[GC [PSYoungGen: 492503K->33280K(647168K)] 2327831K->2327360K(3545088K), 0.0870810 secs] [Times: user=0.61 sys=0.92, real=0.09 secs]
[Full GC [PSYoungGen: 33280K->0K(647168K)] [ParOldGen: 2294080K->2326876K(3603968K)] 2327360K->2326876K(4251136K) [PSPermGen: 2993K->2993K(21504K)], 0.0512730 secs] [Times: user=0.09 sys=0.12, real=0.05 secs]

Heap
 PSYoungGen  total 647168K, used 340719K [0x00000006aaa80000, 0x00000006dd000000, 0x0000000800000000)
  eden space 613376K, 55% used [0x00000006aaa80000,0x00000006bf73bc10,0x00000006d0180000)
  from space 33792K, 0% used [0x00000006d2280000,0x00000006d2280000,0x00000006d4380000)
  to   space 33792K, 0% used [0x00000006d0180000,0x00000006d0180000,0x00000006d2280000)
 ParOldGen   total 3603968K, used 2326876K [0x0000000400000000, 0x00000004dbf80000, 0x00000006aaa80000)
  object space 3603968K, 64% used [0x0000000400000000,0x000000048e0572d8,0x00000004dbf80000)
 PSPermGen   total 21504K, used 3488K [0x00000003f5a00000, 0x00000003f6f00000, 0x0000000400000000)
  object space 21504K, 16% used [0x00000003f5a00000,0x00000003f5d68070,0x00000003f6f00000)
Run Code Online (Sandbox Code Playgroud)

对于具有大型初始堆的program.jar:

Heap
 PSYoungGen  total 4893696K, used 2840362K [0x00000006aaa80000, 0x0000000800000000, 0x0000000800000000)
  eden space 4194816K, 67% used [0x00000006aaa80000,0x000000075804a920,0x00000007aab00000)
  from space 698880K, 0% used [0x00000007d5580000,0x00000007d5580000,0x0000000800000000)
  to   space 698880K, 0% used [0x00000007aab00000,0x00000007aab00000,0x00000007d5580000)
 ParOldGen   total 11185152K, used 0K [0x00000003fff80000, 0x00000006aaa80000, 0x00000006aaa80000)
  object space 11185152K, 0% used [0x00000003fff80000,0x00000003fff80000,0x00000006aaa80000)
 PSPermGen   total 21504K, used 3489K [0x00000003f5980000, 0x00000003f6e80000, 0x00000003fff80000)
  object space 21504K, 16% used [0x00000003f5980000,0x00000003f5ce8400,0x00000003f6e80000)
Run Code Online (Sandbox Code Playgroud)

the*_*472 8

访问内存会占用CPU时间.访问更多内存不仅会花费更多的CPU时间,而且可能会增加缓存压力,从而导致错过率,从而使您的CPU时间超线性化.

运行程序perf stat java -jar ...以查看缓存未命中数.请参阅Perf教程


系统延迟的时间范围示例
图像来源:"系统性能:企业和云平装",Brendan Gregg,ISBN:978-0133390094


由于初始堆大小也会影响eden空间大小,并且较小的eden空间会触发GC,这会导致更紧凑的堆,这可以更加缓存(没有临时启动对象乱丢垃圾堆).

要减少两次运行之间的差异数量,请尝试将初始和最大年轻生成大小设置为两次运行的相同值,以便只有旧生成大小不同.这应该 - 可能 - 导致相同的表现.

暂且不说:您也可以尝试使用大页面启动JVM,它可能(您需要测量!)通过进一步减少TLB未命中率来获得一些额外的性能.

请注意未来的读者:限制新的gen大小并不一定能让你的JVM更快,它只是触发了一个GC,它恰好使@Peng的特定工作负载更快.
启动后手动触发GC会产生相同的效果.

  • 而已!如果我将年轻代的大小设置为相同的值,则小/大初始堆的性能变得可比.因此,总结一下:性能差异可以通过缓存行为的差异来解释,这可以通过gc行为来解释,gc行为可以通过年轻代大小来解释,这可以通过初始堆大小来解释.非常感谢你的帮助 :). (3认同)