突然 Mongodb 高连接/队列,db 停止响应

clo*_*pez 13 mongodb monitoring

问题

我们的 mongodb 设置有一个奇怪的问题。有时我们会遇到高连接和高队列的峰值,如果我们让队列和连接增加,mongodb 进程就会停止响应。我们需要使用带有htop 的sigkill重新启动实例。

好像是系统限制/mongodb配置阻止mongodb运行,​​因为硬件资源没问题。此问题的版本发生在单机上,然后在生产服务器上设置副本。详情在前面。

关于软件环境

这是一个独立的 mongodb 实例(非分片或副本集),它在专用机器上运行,并由其他机器查询。我在 Debian 7.7 下使用 mongodb-linux-x86_64-2.6.12。

查询 mongo 的机器使用 Django==1.7.4、Mongoengine=0.10.1 和 pymongo==2.8、nginx 1.6.2 和 gunicorn 19.1.1。

在 Django settings.py 文件中,我使用以下几行连接到数据库:

from mongoengine import connect

connect(
    MONGO_DB,
    username = MONGO_USER,
    password = MONGO_PWD,
    host = MONGO_HOST,
    port = MONGO_PORT
)
Run Code Online (Sandbox Code Playgroud)

彩信统计

正如您在 MMS 服务的以下 img 中看到的,我们在连接和队列上有峰值:

毫米

发生这种情况时,我们的 mongodb 进程完全冻结。我们必须使用SIGKILL来重启mongodb,这真的很糟糕。

图中有 3 个冻结事件

队列

正如 img 所示,当这种情况发生时,我们在非映射虚拟内存上也有一个峰值。

虚拟内存

我们还发现 Btree 图表在第 2 次和第 3 次冻结前后有所增加。

树

我们检查了日志,但没有可疑的查询,而且 Opcounters 也没有暴涨,似乎没有比平时更多的查询。

这是同一错误的另一个屏幕截图,但在另一天/时间: 更多关于错误

在所有情况下,DB 上的锁都没有显着增加,它有一个峰值,但甚至没有达到 4%:

在此处输入图片说明

OpCounter 降为零,似乎每个 op 都进入 mongodb 队列,因此数据库创建新连接以尝试执行新请求,所有这些连接也进入队列。

机器资源

在硬件方面,该机器是一个谷歌云计算实例,具有 4 个 Intel Xeon Cores、16 Gb ram、100 GB SSD 磁盘

没有检测到明显的高网络/io/CPU/ram 问题,没有资源峰值,即使 mongod 进程被冻结。

在此处输入图片说明

另一台机器上的 MySQL 也受到影响

我们还检测到,在队列和连接上的这个 mongod 峰值的同时,我们还在另一台机器上运行的 mysql 连接上出现了峰值。当我杀死 mongodb 进程时,所有 mysql 连接也被释放(不重新启动 mysql)。

在此处输入图片说明

超限

我们按照这篇 MongoDB 文章中的建议设置了系统限制,看看这是否是问题的原因,但似乎这并没有解决问题。

连接高峰仍在继续。当这个问题开始时,应用程序的每个请求似乎都进入了队列。

$ ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 60240
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 409600
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 60240
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
Run Code Online (Sandbox Code Playgroud)

db.currentOp

我刚刚添加了一个每 1 秒运行一次的 shell 脚本,内容如下:

var ops = db.currentOp().inprog
if (ops !== undefined && ops.length > 0){
    ops.forEach(function(op){
      if(op.secs_running > 0) printjson(op);
    })
}
Run Code Online (Sandbox Code Playgroud)

日志不会报告执行时间超过 1 秒的任何操作。我正在考虑在某件事上花费很长时间的过程,但似乎并非如此。

MongoDB 线程

与连接类似,我正在监视 mongod -f 进程的线程,这是发生了什么,类似于连接:

[Wed May 18 19:02:01 UTC 2016] MONGOD PROCESSES 1 THREADS 94
[Wed May 18 19:03:01 UTC 2016] MONGOD PROCESSES 1 THREADS 94
# starts
[Wed May 18 19:04:01 UTC 2016] MONGOD PROCESSES 1 THREADS 96
[Wed May 18 19:05:01 UTC 2016] MONGOD PROCESSES 1 THREADS 118
[Wed May 18 19:09:01 UTC 2016] MONGOD PROCESSES 1 THREADS 196
[Wed May 18 19:10:01 UTC 2016] MONGOD PROCESSES 1 THREADS 211
# sigkill to mongodb
[Wed May 18 19:11:01 UTC 2016] MONGOD PROCESSES 3 THREADS 6
[Wed May 18 19:12:01 UTC 2016] MONGOD PROCESSES 1 THREADS 43
[Wed May 18 19:13:01 UTC 2016] MONGOD PROCESSES 1 THREADS 48
Run Code Online (Sandbox Code Playgroud)

MongoDB 日志

关于 mongodb.log,这里是解决问题的完整 mongodb 日志。它只发生在日志行 361 上。连接开始上升,不再执行查询。我也不能调用 mongo shell,它说:

[Wed Feb 10 15:46:01 UTC 2016] 2016-02-10T15:48:31.940+0000 DBClientCursor::init call() failed
2016-02-10T15:48:31.941+0000 Error: DBClientBase::findN: transport error: 127.0.0.1:27000 ns: admin.$cmd query: { whatsmyuri: 1 } at src/mongo/shell/mongo.js:148
Run Code Online (Sandbox Code Playgroud)

日志提取

2016-02-10T15:41:39.930+0000 [initandlisten] connection accepted from 10.240.0.3:56611 #3665 (79 connections now open)
2016-02-10T15:41:39.930+0000 [conn3665] command admin.$cmd command: getnonce { getnonce: 1 } keyUpdates:0 numYields:0  reslen:65 0ms
2016-02-10T15:41:39.930+0000 [conn3665] command admin.$cmd command: ping { ping: 1 } keyUpdates:0 numYields:0  reslen:37 0ms
2016-02-10T15:41:39.992+0000 [conn3529] command db.$cmd command: count { count: "notification", fields: null, query: { read: false, recipient: 310 } } planSummary: IXSCAN { recipient: 1 } keyUpdates:0 numYields:0 locks(micros) r:215 reslen:48 0ms
2016-02-10T15:41:40.038+0000 [conn2303] query db.column query: { _id: ObjectId('56b395dfbe66324cbee550b8'), client_id: 20 } planSummary: IXSCAN { _id: 1 } ntoreturn:2 ntoskip:0 nscanned:1 nscannedObjects:1 keyUpdates:0 numYields:0 locks(micros) r:116 nreturned:1 reslen:470 0ms
2016-02-10T15:41:40.044+0000 [conn1871] update db.column query: { _id: ObjectId('56b395dfbe66324cbee550b8') } update: { $set: { last_request: new Date(1455118900040) } } nscanned:1 nscannedObjects:1 nMatched:1 nModified:1 fastmod:1 keyUpdates:0 numYields:0 locks(micros) w:126 0ms
2016-02-10T15:41:40.044+0000 [conn1871] command db.$cmd command: update { update: "column", writeConcern: { w: 1 }, updates: [ { q: { _id: ObjectId('56b395dfbe66324cbee550b8') }, u: { $set: { last_request: new Date(1455118900040) } }, multi: false, upsert: true } ] } keyUpdates:0 numYields:0  reslen:55 0ms
2016-02-10T15:41:40.048+0000 [conn1875] query db.user query: { sn: "mobile", client_id: 20, uid: "56990023700" } planSummary: IXSCAN { client_id: 1, uid: 1, sn: 1 } ntoreturn:2 ntoskip:0 nscanned:1 nscannedObjects:1 keyUpdates:0 numYields:0 locks(micros) r:197 nreturned:1 reslen:303 0ms
2016-02-10T15:41:40.056+0000 [conn2303] Winning plan had zero results. Not caching. ns: db.case query: { sn: "mobile", client_id: 20, created: { $gt: new Date(1454295600000), $lt: new Date(1456800900000) }, deleted: false, establishment_users: { $all: [ ObjectId('5637640afefa2654b5d863e3') ] }, is_closed: true, updated_time: { $gt: new Date(1455045840000) } } sort: { updated_time: 1 } projection: {} skip: 0 limit: 15 winner score: 1.0003 winner summary: IXSCAN { client_id: 1, is_closed: 1, deleted: 1, updated_time: 1 }
2016-02-10T15:41:40.057+0000 [conn2303] query db.case query: { $query: { sn: "mobile", client_id: 20, created: { $gt: new Date(1454295600000), $lt: new Date(1456800900000) }, deleted: false, establishment_users: { $all: [ ObjectId('5637640afefa2654b5d863e3') ] }, is_closed: true, updated_time: { $gt: new Date(1455045840000) } }, $orderby: { updated_time: 1 } } planSummary: IXSCAN { client_id: 1, is_closed: 1, deleted: 1, updated_time: 1 } ntoreturn:15 ntoskip:0 nscanned:26 nscannedObjects:26 keyUpdates:0 numYields:0 locks(micros) r:5092 nreturned:0 reslen:20 5ms
2016-02-10T15:41:40.060+0000 [conn300] command db.$cmd command: count { count: "notification", fields: null, query: { read: false, recipient: 309 } } planSummary: IXSCAN { recipient: 1 } keyUpdates:0 numYields:0 locks(micros) r:63 reslen:48 0ms
2016-02-10T15:41:40.547+0000 [conn3529] query db.case query: { $query: { answered: true, sn: "email", client_id: 1, establishment_users: { $all: [ ObjectId('5669b930fefa2626db389c0e') ] }, deleted: false, is_closed: { $ne: true } }, $orderby: { updated_time: -1 } } planSummary: IXSCAN { client_id: 1, establishment_users: 1, updated_time: 1 } ntoskip:0 nscanned:103 nscannedObjects:103 keyUpdates:0 numYields:0 locks(micros) r:9410 nreturned:0 reslen:20 9ms
2016-02-10T15:41:40.557+0000 [conn3529] Winning plan had zero results. Not caching. ns: db.case query: { answered: true, sn: "email", client_id: 1, establishment_users: { $all: [ ObjectId('5669b930fefa2626db389c0e') ] }, deleted: false, is_closed: { $ne: true } } sort: { updated_time: -1 } projection: {} skip: 0 limit: 15 winner score: 1.0003 winner summary: IXSCAN { client_id: 1, establishment_users: 1, updated_time: 1 }
2016-02-10T15:41:40.558+0000 [conn3529] query db.case query: { $query: { answered: true, sn: "email", client_id: 1, establishment_users: { $all: [ ObjectId('5669b930fefa2626db389c0e') ] }, deleted: false, is_closed: { $ne: true } }, $orderby: { updated_time: -1 } } planSummary: IXSCAN { client_id: 1, establishment_users: 1, updated_time: 1 } ntoreturn:15 ntoskip:0 nscanned:103 nscannedObjects:103 keyUpdates:0 numYields:0 locks(micros) r:7572 nreturned:0 reslen:20 7ms
2016-02-10T15:41:40.569+0000 [conn3028] command db.$cmd command: count { count: "notification", fields: null, query: { read: false, recipient: 145 } } planSummary: IXSCAN { recipient: 1 } keyUpdates:0 numYields:0 locks(micros) r:237 reslen:48 0ms
2016-02-10T15:41:40.774+0000 [conn3053] command db.$cmd command: count { count: "notification", fields: null, query: { read: false, recipient: 143 } } planSummary: IXSCAN { recipient: 1 } keyUpdates:0 numYields:0 locks(micros) r:372 reslen:48 0ms
2016-02-10T15:41:41.056+0000 [conn22] command admin.$cmd command: ping { ping: 1 } keyUpdates:0 numYields:0  reslen:37 0ms

#########################
HERE THE PROBLEM STARTS
#########################

2016-02-10T15:41:41.175+0000 [initandlisten] connection accepted from 127.0.0.1:43268 #3667 (80 connections now open)
2016-02-10T15:41:41.212+0000 [initandlisten] connection accepted from 10.240.0.6:46021 #3668 (81 connections now open)
2016-02-10T15:41:41.213+0000 [conn3668] command db.$cmd command: getnonce { getnonce: 1 } keyUpdates:0 numYields:0  reslen:65 0ms
2016-02-10T15:41:41.213+0000 [conn3668]  authenticate db: db { authenticate: 1, user: "xxx", nonce: "xxx", key: "xxx" }
2016-02-10T15:41:41.213+0000 [conn3668] command db.$cmd command: authenticate { authenticate: 1, user: "xxx", nonce: "xxx", key: "xxx" } keyUpdates:0 numYields:0  reslen:82 0ms
2016-02-10T15:41:41.348+0000 [initandlisten] connection accepted from 10.240.0.6:46024 #3669 (82 connections now open)
2016-02-10T15:41:41.349+0000 [conn3669] command db.$cmd command: getnonce { getnonce: 1 } keyUpdates:0 numYields:0  reslen:65 0ms
2016-02-10T15:41:41.349+0000 [conn3669]  authenticate db: db { authenticate: 1, user: "xxx", nonce: "xxx", key: "xxx" }
2016-02-10T15:41:41.349+0000 [conn3669] command db.$cmd command: authenticate { authenticate: 1, user: "xxx", nonce: "xxx", key: "xxx" } keyUpdates:0 numYields:0  reslen:82 0ms
2016-02-10T15:41:43.620+0000 [initandlisten] connection accepted from 10.240.0.6:46055 #3670 (83 connections now open)
2016-02-10T15:41:43.621+0000 [conn3670] command db.$cmd command: getnonce { getnonce: 1 } keyUpdates:0 numYields:0  reslen:65 0ms
2016-02-10T15:41:43.621+0000 [conn3670]  authenticate db: db { authenticate: 1, user: "xxx", nonce: "xxx", key: "xxx" }
2016-02-10T15:41:43.621+0000 [conn3670] command db.$cmd command: authenticate { authenticate: 1, user: "xxx", nonce: "xxx", key: "xxx" } keyUpdates:0 numYields:0  reslen:82 0ms
2016-02-10T15:41:43.655+0000 [initandlisten] connection accepted from 10.240.0.6:46058 #3671 (84 connections now open)
2016-02-10T15:41:43.656+0000 [conn3671] command db.$cmd command: getnonce { getnonce: 1 } keyUpdates:0 numYields:0  reslen:65 0ms
2016-02-10T15:41:43.656+0000 [conn3671]  authenticate db: db { authenticate: 1, user: "xxx", nonce: "xxx", key: "xxx" }
2016-02-10T15:41:43.656+0000 [conn3671] command db.$cmd command: authenticate { authenticate: 1, user: "xxx", nonce: "xxx", key: "xxx" } keyUpdates:0 numYields:0  reslen:82 0ms
2016-02-10T15:41:44.045+0000 [initandlisten] connection accepted from 10.240.0.6:46071 #3672 (85 connections now open)
2016-02-10T15:41:44.045+0000 [conn3672] command db.$cmd command: getnonce { getnonce: 1 } keyUpdates:0 numYields:0  reslen:65 0ms
2016-02-10T15:41:44.046+0000 [conn3672]  authenticate db: db { authenticate: 1, user: "xxx", nonce: "xxx", key: "xxx" }
2016-02-10T15:41:44.046+0000 [conn3672] command db.$cmd command: authenticate { authenticate: 1, user: "xxx", nonce: "xxx", key: "xxx" } keyUpdates:0 numYields:0  reslen:82 0ms
2016-02-10T15:41:44.083+0000 [initandlisten] connection accepted from 10.240.0.6:46073 #3673 (86 connections now open)
2016-02-10T15:41:44.084+0000 [conn3673] command db.$cmd command: getnonce { getnonce: 1 } keyUpdates:0 numYields:0  reslen:65 0ms
2016-02-10T15:41:44.084+0000 [conn3673]  authenticate db: db { authenticate: 1, user: "xxx", nonce: "xxx", key: "xxx" }
2016-02-10T15:41:44.084+0000 [conn3673] command db.$cmd command: authenticate { authenticate: 1, user: "xxx", nonce: "xxx", key: "xxx" } keyUpdates:0 numYields:0  reslen:82 0ms
2016-02-10T15:41:44.182+0000 [initandlisten] connection accepted from 10.240.0.6:46076 #3674 (87 connections now open)
Run Code Online (Sandbox Code Playgroud)

收集信息

目前我们的数据库包含163 个集合。重要的是messages,columncases,这是进行大量插入、更新和查询的那些。其余的如果用于分析,并且是许多集合,每个集合大约有 100 条记录:

{
    "ns" : "db.message",
    "count" : 2.96615e+06,
    "size" : 3906258304.0000000000000000,
    "avgObjSize" : 1316,
    "storageSize" : 9305935856.0000000000000000,
    "numExtents" : 25,
    "nindexes" : 21,
    "lastExtentSize" : 2.14643e+09,
    "paddingFactor" : 1.0530000000000086,
    "systemFlags" : 0,
    "userFlags" : 1,
    "totalIndexSize" : 7952525392.0000000000000000,
    "indexSizes" : {
        "_id_" : 1.63953e+08,
        "client_id_1_sn_1_mid_1" : 3.16975e+08,
        "client_id_1_created_1" : 1.89086e+08,
        "client_id_1_recipients_1_created_1" : 4.3861e+08,
        "client_id_1_author_1_created_1" : 2.29713e+08,
        "client_id_1_kind_1_created_1" : 2.37088e+08,
        "client_id_1_answered_1_created_1" : 1.90934e+08,
        "client_id_1_is_mention_1_created_1" : 1.8674e+08,
        "client_id_1_has_custom_data_1_created_1" : 1.9566e+08,
        "client_id_1_assigned_1_created_1" : 1.86838e+08,
        "client_id_1_published_1_created_1" : 1.94352e+08,
        "client_id_1_sn_1_created_1" : 2.3681e+08,
        "client_id_1_thread_root_1" : 1.88089e+08,
        "client_id_1_case_id_1" : 1.89266e+08,
        "client_id_1_sender_id_1" : 1.5182e+08,
        "client_id_1_recipient_id_1" : 1.49711e+08,
        "client_id_1_mid_1_sn_1" : 3.17662e+08,
        "text_text_created_1" : 3320641520.0000000000000000,
        "client_id_1_sn_1_kind_1_recipient_id_1_created_1" : 3.15226e+08,
        "client_id_1_sn_1_thread_root_1_created_1" : 3.06526e+08,
        "client_id_1_case_id_1_created_1" : 2.46825e+08
    },
    "ok" : 1.0000000000000000
}

{
    "ns" : "db.case",
    "count" : 497661,
    "size" : 5.33111e+08,
    "avgObjSize" : 1071,
    "storageSize" : 6.29637e+08,
    "numExtents" : 16,
    "nindexes" : 34,
    "lastExtentSize" : 1.68743e+08,
    "paddingFactor" : 1.0000000000000000,
    "systemFlags" : 0,
    "userFlags" : 1,
    "totalIndexSize" : 8.46012e+08,
    "indexSizes" : {
        "_id_" : 2.30073e+07,
        "client_id_1" : 1.99985e+07,
        "is_closed, deleted_1" : 1.31061e+07,
        "is_closed_1" : 1.36948e+07,
        "sn_1" : 2.1274e+07,
        "deleted_1" : 1.39728e+07,
        "created_1" : 1.97777e+07,
        "current_assignment_1" : 4.20819e+07,
        "assigned_1" : 1.33678e+07,
        "commented_1" : 1.36049e+07,
        "has_custom_data_1" : 1.42426e+07,
        "sentiment_start_1" : 1.36049e+07,
        "sentiment_finish_1" : 1.37275e+07,
        "updated_time_1" : 2.02192e+07,
        "identifier_1" : 1.73822e+07,
        "important_1" : 1.38256e+07,
        "answered_1" : 1.41772e+07,
        "client_id_1_is_closed_1_deleted_1_updated_time_1" : 2.90248e+07,
        "client_id_1_is_closed_1_updated_time_1" : 2.86569e+07,
        "client_id_1_sn_1_updated_time_1" : 3.58436e+07,
        "client_id_1_deleted_1_updated_time_1" : 2.8477e+07,
        "client_id_1_updated_time_1" : 2.79619e+07,
        "client_id_1_current_assignment_1_updated_time_1" : 5.6071e+07,
        "client_id_1_assigned_1_updated_time_1" : 2.87713e+07,
        "client_id_1_commented_1_updated_time_1" : 2.86896e+07,
        "client_id_1_has_custom_data_1_updated_time_1" : 2.88286e+07,
        "client_id_1_sentiment_start_1_updated_time_1" : 2.87223e+07,
        "client_id_1_sentiment_finish_1_updated_time_1" : 2.88776e+07,
        "client_id_1_identifier_1_updated_time_1" : 3.48216e+07,
        "client_id_1_important_1_updated_time_1" : 2.88776e+07,
        "client_id_1_answered_1_updated_time_1" : 2.85669e+07,
        "client_id_1_establishment_users_1_updated_time_1" : 3.93838e+07,
        "client_id_1_identifier_1" : 1.86413e+07,
        "client_id_1_sn_1_users_1_updated_time_1" : 4.47309e+07
    },
    "ok" : 1.0000000000000000
}
{
    "ns" : "db.column",
    "count" : 438,
    "size" : 218672,
    "avgObjSize" : 499,
    "storageSize" : 696320,
    "numExtents" : 4,
    "nindexes" : 2,
    "lastExtentSize" : 524288,
    "paddingFactor" : 1.0000000000000000,
    "systemFlags" : 0,
    "userFlags" : 1,
    "totalIndexSize" : 65408,
    "indexSizes" : {
        "_id_" : 32704,
        "client_id_1_owner_1" : 32704
    },
    "ok" : 1.0000000000000000
}
Run Code Online (Sandbox Code Playgroud)

蒙古斯塔特

以下是我们在正常操作期间运行 mongostat 的一些行:

insert  query update delete getmore command flushes mapped  vsize    res faults        locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn       time
    *0     34      2     *0       0    10|0       0  32.6g  65.5g  1.18g      0 db:0.1%          0       0|0     0|0     4k    39k    87   20:44:44
     2     31     13     *0       0     7|0       0  32.6g  65.5g  1.17g      3 db:0.8%          0       0|0     0|0     9k    36k    87   20:44:45
     1     18      2     *0       0     5|0       0  32.6g  65.5g  1.12g      0 db:0.4%          0       0|0     0|0     3k    18k    87   20:44:46
     5    200     57     *0       0    43|0       0  32.6g  65.5g  1.13g     12 db:2.3%          0       0|0     0|0    46k   225k    86   20:44:47
     1     78     23     *0       0     5|0       0  32.6g  65.5g  1.01g      1 db:1.6%          0       0|0     0|0    18k   313k    86   20:44:48
    *0     10      1     *0       0     5|0       0  32.6g  65.5g  1004m      0 db:0.2%          0       0|0     1|0     1k     8k    86   20:44:49
     3     48     23     *0       0    11|0       0  32.6g  65.5g  1.05g      4 db:1.1%          0       0|0     0|0    16k    48k    86   20:44:50
     2     38     13     *0       0     8|0       0  32.6g  65.5g  1.01g      8 db:0.9%          0       0|0     0|0    10k    76k    86   20:44:51
     3     28     16     *0       0     9|0       0  32.6g  65.5g  1.01g      7 db:1.1%          0       0|0     1|0    11k    62k    86   20:44:52
    *0      9      4     *0       0     8|0       0  32.6g  65.5g  1022m      1 db:0.4%          0       0|0     0|0     3k     6k    87   20:44:53
insert  query update delete getmore command flushes mapped  vsize    res faults        locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn       time
     3    107     34     *0       0     

clo*_*pez 1

问题是我们的 Gunicorn 配置有同步工作人员。在某些时候,工作人员陷入死锁,但仍然为新请求建立新连接,导致 mysql 和 mongodb 冻结。

我们将 Gunicorn 从 19.1.1 更新到 19.6.0,并切换到异步 greenlet 工作线程。还添加multi_accept on;到 nginx 配置文件中。