Cassandra error message: Not marking nodes down due to local pause. Why?

pet*_*ter 8 amazon-ec2 cassandra datastax apache-spark datastax-startup

I have 6 nodes: 1 Solr node and 5 Spark nodes, running DataStax Enterprise. My cluster is on servers similar to Amazon EC2, with EBS volumes. Each node has 3 EBS volumes combined into one logical data disk using LVM. In OpsCenter, the same node frequently becomes unresponsive, which causes connection timeouts in my data system. My data volume is about 400 GB with 3 replicas. I have 20 streaming jobs with a one-minute batch interval. This is my error message:

/var/log/cassandra/output.log:WARN 13:44:31,868 Not marking nodes down due to local pause of 53690474502 > 5000000000
/var/log/cassandra/system.log:WARN [GossipTasks:1] 2016-09-25 16:40:34,944 FailureDetector.java:258 - Not marking nodes down due to local pause of 64532052919 > 5000000000 
/var/log/cassandra/system.log:WARN [GossipTasks:1] 2016-09-25 16:59:12,023 FailureDetector.java:258 - Not marking nodes down due to local pause of 66027485893 > 5000000000 
/var/log/cassandra/system.log:WARN [GossipTasks:1] 2016-09-26 13:44:31,868 FailureDetector.java:258 - Not marking nodes down due to local pause of 53690474502 > 5000000000
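For reference, the two numbers in that warning are in nanoseconds, so the logged pause works out to roughly 53 seconds against a 5-second threshold:

```shell
# The warning compares a measured pause to a threshold, both in
# nanoseconds; converting the values from the first log line above:
pause_ns=53690474502
threshold_ns=5000000000
pause_s=$((pause_ns / 1000000000))
threshold_s=$((threshold_ns / 1000000000))
echo "local pause: ${pause_s}s (threshold: ${threshold_s}s)"
```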

Edit:

Here is my configuration in more detail. I would like to know whether I have done something wrong, and if so, exactly what it is and how to fix it.

Our heap is set to:

MAX_HEAP_SIZE="16G"
HEAP_NEWSIZE="4G"

Current heap:

[root@iZ11xsiompxZ ~]# jstat -gc 11399
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
 0.0   196608.0  0.0   196608.0 6717440.0 2015232.0 43417600.0 23029174.0 69604.0 68678.2  0.0    0.0     1041  131.437   0      0.000  131.437
[root@iZ11xsiompxZ ~]# jmap -heap 11399
Attaching to process ID 11399, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.102-b14

using thread-local object allocation.
Garbage-First (G1) GC with 23 thread(s)

Heap Configuration:

   MinHeapFreeRatio         = 40
   MaxHeapFreeRatio         = 70
   MaxHeapSize              = 51539607552 (49152.0MB)
   NewSize                  = 1363144 (1.2999954223632812MB)
   MaxNewSize               = 30920409088 (29488.0MB)
   OldSize                  = 5452592 (5.1999969482421875MB)
   NewRatio                 = 2
   SurvivorRatio            = 8
   MetaspaceSize            = 21807104 (20.796875MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 17592186044415 MB
   G1HeapRegionSize         = 16777216 (16.0MB)

Heap Usage:

G1 Heap:
   regions  = 3072
   capacity = 51539607552 (49152.0MB)
   used     = 29923661848 (28537.427757263184MB)
   free     = 21615945704 (20614.572242736816MB)
   58.059545404588185% used
G1 Young Generation:
Eden Space:
   regions  = 366
   capacity = 6878658560 (6560.0MB)
   used     = 6140461056 (5856.0MB)
   free     = 738197504 (704.0MB)
   89.26829268292683% used
Survivor Space:
   regions  = 12
   capacity = 201326592 (192.0MB)
   used     = 201326592 (192.0MB)
   free     = 0 (0.0MB)
   100.0% used
G1 Old Generation:
   regions  = 1443
   capacity = 44459622400 (42400.0MB)
   used     = 23581874200 (22489.427757263184MB)
   free     = 20877748200 (19910.572242736816MB)
   53.04110320109241% used

40076 interned Strings occupying 7467880 bytes.

I don't know why this is happening. Many thanks.

mar*_*rkc 4

The Not marking nodes down due to local pause message you are seeing is caused by a JVM pause. You've done the right thing here by posting the JVM information, but a good starting point is usually to look through /var/log/cassandra/system.log for entries such as ERROR and WARN. Also check the length and frequency of GC events by grepping for GCInspector.
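As a rough sketch of that triage (shown here against a tiny inline sample so the commands run anywhere; on your nodes, point LOG at the real /var/log/cassandra/system.log instead):

```shell
# Demo log: the FailureDetector line is copied from the question above;
# the GCInspector line is illustrative only (format approximate).
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
WARN  [Service Thread] 2016-09-26 13:44:20,100 GCInspector.java:282 - G1 Old Generation GC in 51200ms.
WARN  [GossipTasks:1] 2016-09-26 13:44:31,868 FailureDetector.java:258 - Not marking nodes down due to local pause of 53690474502 > 5000000000
EOF
problems=$(grep -cE 'ERROR|WARN' "$LOG")    # how many problem lines overall?
gc_events=$(grep -c 'GCInspector' "$LOG")   # how many long-GC reports?
echo "problems=$problems gc_events=$gc_events"
rm -f "$LOG"
```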

Tools like nodetool tpstats are your friend here: check whether mutations are backing up or being dropped, whether flush writers are blocked, and so on.
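For example, the columns to watch in the tpstats output are Blocked and All time blocked. A quick filter (with a made-up excerpt inlined here, since the real command needs a live node; pool names and numbers are illustrative only):

```shell
# Hypothetical excerpt of `nodetool tpstats` output.
sample='Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                     0         0        3816408         0                 0
FlushWriter                       0         0             33         0                12'
# Print any pool with a non-zero Blocked or "All time blocked" count.
suspects=$(echo "$sample" | awk 'NR > 1 && ($5 + 0 > 0 || $6 + 0 > 0) {print $1}')
echo "$suspects"
```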

The docs here have some good things to check: https://docs.datastax.com/en/landing_page/doc/landing_page/troubleshooting/cassandra/cassandraTrblTOC.html

Also check that your nodes have the recommended production settings; this is often overlooked:

http://docs.datastax.com/en/landing_page/doc/landing_page/recommendedSettingsLinux.html
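A couple of quick spot-checks for commonly missed items from that page (a sketch only; the exact recommended values depend on your OS and Cassandra version):

```shell
# Open-file limit: the recommendation for Cassandra is 100000 or higher.
nofile=$(ulimit -n)
echo "open files limit: $nofile"
# Swap: the recommendation is to disable it (vm.swappiness 0/1, or swapoff).
if [ -r /proc/sys/vm/swappiness ]; then
    echo "vm.swappiness: $(cat /proc/sys/vm/swappiness)"
fi
```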

One other thing to be aware of: Cassandra is quite sensitive to I/O, and "normal" EBS may not be fast enough for your needs. Throw Solr into the mix, and with Cassandra compactions and Lucene merges hitting the disk at the same time, you will see a lot of I/O contention.