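We recently noticed that many of our servers occasionally and suddenly (with no apparent gradual degradation) lock up with the following stack, obtained using jstack -F. All other threads are BLOCKED, IN_NATIVE, or IN_VM. The stack is truncated where our own code starts.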
Thread 18334: (state = IN_JAVA)
- java.util.Calendar.updateTime() @bci=1, line=2469 (Compiled frame; information may be imprecise)
- java.util.Calendar.getTimeInMillis() @bci=8, line=1088 (Compiled frame)
(truncated)
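The failure seems to occur shortly after a full GC, and top -H -p shows two busy threads: one appears to be the thread above, and the other is either a GC thread or jitc, per the output of pstack (not VMThread::run()):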
Thread 331 (Thread 0x7f59641bc700 (LWP 16461)):
#0 0x00007f63f9ed0ef8 in SafepointSynchronize::begin() () from /usr/java/jdk1.6.0_33/jre/lib/amd64/server/libjvm.so
#1 0x00007f63f9fbab7c in VMThread::loop() () from /usr/java/jdk1.6.0_33/jre/lib/amd64/server/libjvm.so
#2 0x00007f63f9fba68e in VMThread::run() () from /usr/java/jdk1.6.0_33/jre/lib/amd64/server/libjvm.so
#3 0x00007f63f9e5e7af in java_start(Thread*) () from /usr/java/jdk1.6.0_33/jre/lib/amd64/server/libjvm.so
#4 0x00000035bb807851 in start_thread () from /lib64/libpthread.so.0
#5 0x00000035bb4e811d in clone () from /lib64/libc.so.6
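Does anyone have any idea why this might have started happening?
We are using jdk1.6.0_33 on CentOS release 5.7 and 6.3, on servers with 24 cores (12 physical).
Here are some more stacks, with our code truncated: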
Thread 22561: (state = IN_VM)
- java.lang.String.toLowerCase(java.util.Locale) @bci=428, line=2782 (Compiled frame; information may be imprecise)
- java.lang.String.toLowerCase() @bci=4, line=2847 (Compiled frame)
(truncated)
Thread 22562: (state = IN_VM)
- java.util.HashMap.put(java.lang.Object, java.lang.Object) @bci=20, line=403 (Compiled frame; information may be imprecise)
- java.util.HashSet.add(java.lang.Object) @bci=8, line=200 (Compiled frame)
(truncated)
Thread 22558: (state = BLOCKED)
- sun.nio.ch.EPollSelectorImpl.wakeup() @bci=6, line=173 (Compiled frame)
- org.mortbay.io.nio.SelectorManager$SelectSet.wakeup() @bci=10, line=706 (Compiled frame)
- org.mortbay.io.nio.SelectChannelEndPoint.updateKey() @bci=135, line=344 (Compiled frame)
- org.mortbay.io.nio.SelectChannelEndPoint.undispatch() @bci=10, line=204 (Compiled frame)
- org.mortbay.jetty.nio.SelectChannelConnector$ConnectorEndPoint.undispatch() @bci=54, line=382 (Compiled frame)
- org.mortbay.io.nio.SelectChannelEndPoint.run() @bci=44, line=449 (Compiled frame)
- org.mortbay.thread.QueuedThreadPool$PoolThread.run() @bci=25, line=534 (Compiled frame)
Thread 22557: (state = BLOCKED)
- java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
- java.lang.Object.wait(long, int) @bci=58, line=443 (Compiled frame)
- com.stumbleupon.async.Deferred.doJoin(boolean, long) @bci=244, line=1148 (Compiled frame)
- com.stumbleupon.async.Deferred.join(long) @bci=3, line=1028 (Compiled frame)
(truncated)
Thread 20907: (state = IN_NATIVE)
- java.net.PlainSocketImpl.socketAccept(java.net.SocketImpl) @bci=0 (Interpreted frame)
- java.net.PlainSocketImpl.accept(java.net.SocketImpl) @bci=7, line=408 (Interpreted frame)
- java.net.ServerSocket.implAccept(java.net.Socket) @bci=60, line=462 (Interpreted frame)
- java.net.ServerSocket.accept() @bci=48, line=430 (Interpreted frame)
- sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop() @bci=55, line=369 (Interpreted frame)
- sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run() @bci=1, line=341 (Interpreted frame)
- java.lang.Thread.run() @bci=11, line=662 (Interpreted frame)
Thread 22901: (state = IN_NATIVE)
- sun.nio.ch.EPollArrayWrapper.epollWait(long, int, long, int) @bci=0 (Compiled frame; information may be imprecise)
- sun.nio.ch.EPollArrayWrapper.poll(long) @bci=18, line=210 (Compiled frame)
- sun.nio.ch.EPollSelectorImpl.doSelect(long) @bci=28, line=65 (Compiled frame)
- sun.nio.ch.SelectorImpl.lockAndDoSelect(long) @bci=37, line=69 (Compiled frame)
- sun.nio.ch.SelectorImpl.select(long) @bci=30, line=80 (Compiled frame)
- net.spy.memcached.MemcachedConnection.handleIO() @bci=126, line=188 (Compiled frame)
- net.spy.memcached.MemcachedClient.run() @bci=11, line=1591 (Compiled frame)
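If you use something like jmxtrans to collect a bunch of VM information every 5 minutes, and something like Graphite to graph the data, it can help debug this kind of thing.
You might think there is nothing to discern, but that is probably because you are only looking at one data point: response time. Collect all the different GC-related data points that the JVM exposes via JMX and see whether one of them does in fact give a warning. If your application regularly allocates and frees the same amount (x%) of heap, the lockup could be related to reaching x% of free heap space. You need to study the graphs at various scales (zoomed in and out) to get a sense of your application's normal behavior.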
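As a rough illustration of the kind of GC-related data points meant here, the following is a minimal sketch that reads heap usage and per-collector counters from the standard java.lang.management MXBeans inside the JVM itself. It is not how jmxtrans works internally; the class name GcSampler, the metric names, and the "jvm." prefix are made up for this example, and the output lines only approximate a "name value timestamp" layout that a graphing tool could ingest.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: the class and metric names are illustrative only.
public class GcSampler {

    /** Takes one snapshot of heap usage and per-collector GC counters. */
    public static Map<String, Long> sample() {
        Map<String, Long> metrics = new LinkedHashMap<String, Long>();

        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        metrics.put("heap.used", heap.getUsed());
        metrics.put("heap.committed", heap.getCommitted());
        metrics.put("heap.max", heap.getMax()); // -1 when the maximum is undefined

        // One entry pair per collector (young and old generation collectors).
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            String name = gc.getName().replace(' ', '_');
            metrics.put("gc." + name + ".count", gc.getCollectionCount());
            metrics.put("gc." + name + ".timeMillis", gc.getCollectionTime());
        }
        return metrics;
    }

    /** Usage: java GcSampler [samples] [intervalMillis]; defaults to a single sample. */
    public static void main(String[] args) throws InterruptedException {
        int samples = args.length > 0 ? Integer.parseInt(args[0]) : 1;
        long intervalMillis = args.length > 1 ? Long.parseLong(args[1]) : 5L * 60L * 1000L;

        for (int i = 0; i < samples; i++) {
            long epochSeconds = System.currentTimeMillis() / 1000L;
            for (Map.Entry<String, Long> e : sample().entrySet()) {
                System.out.println("jvm." + e.getKey() + " " + e.getValue() + " " + epochSeconds);
            }
            if (i + 1 < samples) {
                Thread.sleep(intervalMillis);
            }
        }
    }
}
```

Graphing the collection counts and times next to used heap over days, rather than response time alone, is what would make a recurring pattern before each lockup visible. Note that an in-process sampler like this stops reporting when the JVM itself hangs at a safepoint, so the interesting part of the graph is the lead-up to the gap.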