为什么我会遇到长时间的非 GC 相关应用程序暂停?

jgo*_*etz 5 java memory performance garbage-collection

我一直在 GC 日志中看到非常长的应用程序暂停,这些暂停似乎与垃圾收集无关。该应用程序位于 16GB 堆上,分配的内存吞吐量非常大。我知道 GC 日志会显示与 GC 不一定相关的暂停,这里似乎就是这种情况。

此外,在这些暂停期间,CPU 达到 100%,没有磁盘或内存 I/O,也没有日志记录活动(由于所有应用程序线程都被停止)。

关于什么可能导致这些长时间停顿的任何想法,或者关于如何追踪根本原因的任何建议?

这是日志中的相关片段(注意 151 秒的暂停):

2015-04-21T06:30:04.834-0500: 217179.483: [GC (Allocation Failure)2015-04-21T06:30:04.849-0500: 217179.483: [ParNew
Desired survivor size 43614208 bytes, new threshold 6 (max 6)
- age   1:   19876896 bytes,   19876896 total
- age   2:   15903904 bytes,   35780800 total
- age   3:     617120 bytes,   36397920 total
- age   4:      34096 bytes,   36432016 total
- age   5:     399896 bytes,   36831912 total
- age   6:      71112 bytes,   36903024 total
: 706301K->45145K(766784K), 0.0681632 secs] 19029029K->18368371K(25080640K), 0.0684183 secs] [Times: user=0.44 sys=0.00, real=0.08 secs] 
2015-04-21T06:30:05.816-0500: 217180.453: Total time for which application threads were stopped: 1.8507910 seconds
2015-04-21T06:30:09.171-0500: 217183.812: [GC (Allocation Failure)2015-04-21T06:30:09.171-0500: 217183.812: [ParNew
Desired survivor size 43614208 bytes, new threshold 6 (max 6)
- age   1:   14812976 bytes,   14812976 total
- age   2:    3411968 bytes,   18224944 total
- age   3:   14952032 bytes,   33176976 total
- age   4:     430048 bytes,   33607024 total
- age   5:      32864 bytes,   33639888 total
- age   6:     374880 bytes,   34014768 total
: 726745K->43352K(766784K), 0.0673428 secs] 19049971K->18366645K(25080640K), 0.0675792 secs] [Times: user=0.48 sys=0.00, real=0.06 secs] 
2015-04-21T06:30:09.249-0500: 217183.886: Total time for which application threads were stopped: 0.0763406 seconds
2015-04-21T06:32:43.960-0500: 217338.603: Total time for which application threads were stopped: 151.1866188 seconds
2015-04-21T06:32:43.975-0500: 217338.607: [GC (GCLocker Initiated GC)2015-04-21T06:32:43.975-0500: 217338.607: [ParNew
Desired survivor size 43614208 bytes, new threshold 4 (max 6)
- age   1:   18173880 bytes,   18173880 total
- age   2:    9659800 bytes,   27833680 total
- age   3:    3400176 bytes,   31233856 total
- age   4:   14928944 bytes,   46162800 total
- age   5:     375056 bytes,   46537856 total
- age   6:      31880 bytes,   46569736 total
: 727682K->56678K(766784K), 0.0759093 secs] 19050975K->18380341K(25080640K), 0.0760795 secs] [Times: user=0.58 sys=0.00, real=0.08 secs] 
2015-04-21T06:32:44.053-0500: 217338.684: Total time for which application threads were stopped: 0.0769955 seconds
Run Code Online (Sandbox Code Playgroud)

这是我目前的 JVM 设置:

-Xmx16g
-XX:MaxPermSize=256m
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.authenticate=true
-Dcom.sun.management.jmxremote.login.config=JmxFileLoginModule
-Dcom.sun.management.jmxremote.port=8082
-Dcom.sun.management.jmxremote.ssl=false
-Dfile.encoding=UTF8
-Djava.awt.headless=true
-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
-Xloggc:{fileLoc}
-XX:+DisableExplicitGC
-XX:+PrintAdaptiveSizePolicy
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCCause
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-XX:+PrintTenuringDistribution
-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC
-XX:GCLogFileSize=50M
-XX:NumberOfGCLogFiles=10
-XX:ReservedCodeCacheSize=256M
-XX:-UseGCLogFileRotation
-XX:+PrintSafepointStatistics
-XX:+SafepointTimeout
-XX:SafepointTimeoutDelay=4000
-XX:+PrintSafepointStatistics
Run Code Online (Sandbox Code Playgroud)

更新:

我忘了在此处包含我的安全点日志记录。这是我的安全点日志在上面经历的 151 秒暂停时的样子:

217338.603: [sweeping nmethods, 0.0000036 secs]
             vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
    217187.422: GenCollectForAllocation          [     140         46              1    ]      [     1     0151184     2     0    ]  0
Run Code Online (Sandbox Code Playgroud)

区块时间为 0,但它出现在同步时间旁边,所以同步时间实际上是 151184。

对于其他与 GC 相关的暂停,安全点日志如下所示:

107100.562: [sweeping nmethods, 0.0000015 secs]
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
107100.563: GenCollectForAllocation          [     117          0              0    ]      [     0     0     0     0 38238    ]  0
Run Code Online (Sandbox Code Playgroud)

因此,GC 暂停显示为 vmop 时间,而非 GC 暂停显示为同步时间。我将不得不阅读更多关于这意味着什么。

apa*_*gin 2

我们遇到过类似的问题,似乎是 JVM bug JDK-8041984
我不确定这是否也是您的情况,但请考虑更新到 JDK 8u45 或 7u80。

每当您再次看到此问题时,请尝试在长时间暂停期间使用本机帧进行强制堆栈转储:

jstack -F -m <PID>
Run Code Online (Sandbox Code Playgroud)

即使进程没有响应,这也有助于识别虚拟机正在做什么。