MySQL 在打开大约 70k 个文件时崩溃

the*_*uts 5 mysql monitoring mysql-8.0

运行社区 8.0.14 的服务器,几天前我们达到了打开文件的数量,所以我增加了限制,但过了一会儿,服务器在错误日志中崩溃了:

InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/8.0/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
03:07:21 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.

key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=221
max_threads=1000
thread_count=4
connection_count=3
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 403301 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7f9200000b20
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7fd571ceaca0 thread_stack 0x46000
/usr/sbin/mysqld(my_print_stacktrace(unsigned char*, unsigned long)+0x3d) [0x561bdb69757d]
/usr/sbin/mysqld(handle_fatal_signal+0x423) [0x561bda8277c3]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12890) [0x7fe5095f6890]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7) [0x7fe507981e97]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x141) [0x7fe507983801]
/usr/sbin/mysqld(ut_dbg_assertion_failed(char const*, char const*, unsigned long)+0x2d5) [0x561bdb94d2c5]
/usr/sbin/mysqld(+0x1fe243c) [0x561bdb94f43c]
/usr/sbin/mysqld(mem_heap_create_block_func(mem_block_info_t*, unsigned long, unsigned long)+0x9ba) [0x561bdb95638a]
/usr/sbin/mysqld(mem_heap_add_block(mem_block_info_t*, unsigned long)+0x72) [0x561bdb956562]
/usr/sbin/mysqld(trx_undo_update_rec_get_update(unsigned char const*, dict_index_t*, unsigned long, unsigned long, unsigned long, unsigned long, trx_t*, mem_block_info_t*, upd_t**, lob::undo_vers_t*, type_cmpl_t&)+0x1926) [0x561bdb924896]
/usr/sbin/mysqld(row_purge_step(que_thr_t*)+0x845) [0x561bdb8b7ad5]
/usr/sbin/mysqld(que_run_threads(que_thr_t*)+0x997) [0x561bdb859597]
/usr/sbin/mysqld(srv_worker_thread()+0x29d) [0x561bdb8e85bd]
/usr/sbin/mysqld(std::thread::_State_impl<std::thread::_Invoker<std::tuple<Runnable, void (*)()> > >::_M_run()+0x63) [0x561bdb759883]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd57f) [0x7fe5083a757f]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fe5095eb6db]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fe507a6488f]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0): Connection ID (thread ID): 0
Status: NOT_KILLED

The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
2019-01-27T03:07:37.554520Z 0 [System] [MY-010116] [Server] /usr/sbin/mysqld (mysqld 8.0.14) starting as process 56489

InnoDB: Progress in percents: 1InnoDB MEMCACHED: Memcached uses atomic increment
 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 1002019-01-27T03:12:07.973920Z 0 [System] [MY-010229] [Server] Starting crash recovery...
2019-01-27T03:12:07.982255Z 0 [System] [MY-010232] [Server] Crash recovery finished.
2019-01-27T03:12:10.581437Z 0 [Warning] [MY-010068] [Server] CA certificate ca.pem is self signed.
2
Run Code Online (Sandbox Code Playgroud)

机器为2x Xeon E5-2680v4 - 28c/56t - 2.4GHz /3.3GHz,512GB RAM,配置文件:

server-id = 1
log-bin   = /var/lib/mysql-binlogs/web07
relay-log = web07-relay-bin

binlog_expire_logs_seconds = 259200

open_files_limit = 10000
max_connections = 1000

innodb_flush_method=O_DIRECT
innodb_flush_log_at_trx_commit=0
innodb_log_file_size=38G

innodb_buffer_pool_size=300G
innodb_buffer_pool_instances=64

join_buffer_size=5M
tmp_table_size=1G
max_heap_table_size=512M
skip_name_resolve=1

table_open_cache=20000

sql-mode="ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION,NO_AUTO_VALUE_ON_ZERO"

relay-log-recovery=1

default-authentication-plugin=mysql_native_password
Run Code Online (Sandbox Code Playgroud)

在那次崩溃之后,如果我在打开的文件变得太多(大约每 24 小时一次)时不重新启动它,服务器就会再次崩溃,例如:

$ mysql -e "show global status like 'open_%';"
+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| Open_files               | 62507 |
| Open_streams             | 0     |
| Open_table_definitions   | 1843  |
| Open_tables              | 18779 |
| Opened_files             | 71902 |
| Opened_table_definitions | 3168  |
| Opened_tables            | 18803 |
+--------------------------+-------+
Run Code Online (Sandbox Code Playgroud)

在我写这个问题时,服务器崩溃了:

2019-01-29T22:27:51.031406Z 54298197 [ERROR] [MY-013132] [Server] The table '/tmp/#sql9ae8_33c8655_13' is full!

2019-01-29T22:28:01.844610Z 54321796 [ERROR] [MY-013132] [Server] The table '/tmp/#sql9ae8_33ce284_1' is full!

如何调试为什么有这么多打开的文件?流量很大,但通常打开的连接不超过10个,那么为什么临时表没有关闭?

在此处输入图片说明

lsof:(总共约 27k 行,重启后 10 小时)

COMMAND   PID  USER   FD      TYPE             DEVICE     SIZE/OFF       NODE NAME
mysqld  23964 mysql  cwd       DIR                9,3         4096   49549201 /var/lib/mysql
mysqld  23964 mysql  rtd       DIR                9,3         4096          2 /
mysqld  23964 mysql  txt       REG                9,3     56952368    6292097 /usr/sbin/mysqld
mysqld  23964 mysql  DEL       REG                9,3                12858561 /tmp/mysql_temptable.As3HWG
mysqld  23964 mysql  DEL       REG                9,3                12858557 /tmp/mysql_temptable.I4gP0F
mysqld  23964 mysql  DEL       REG                9,3                12858554 /tmp/mysql_temptable.W139aF
mysqld  23964 mysql  DEL       REG                9,3                12858553 /tmp/mysql_temptable.MhBwlE
mysqld  23964 mysql  DEL       REG                9,3                12858551 /tmp/mysql_temptable.mcw0vD
mysqld  23964 mysql  DEL       REG                9,3                12858550 /tmp/mysql_temptable.OFuZ94
mysqld  23964 mysql  DEL       REG                9,3                12858544 /tmp/mysql_temptable.2ffRvY
mysqld  23964 mysql  DEL       REG                9,3                12858549 /tmp/mysql_temptable.VvXQPw
mysqld  23964 mysql  DEL       REG                9,3                12858541 /tmp/mysql_temptable.KTTZRR
mysqld  23964 mysql  DEL       REG                9,3                12858542 /tmp/mysql_temptable.eMVSbq
mysqld  23964 mysql  DEL       REG                9,3                12858538 /tmp/mysql_temptable.6Kaay
Run Code Online (Sandbox Code Playgroud)

在第一行之后,所有文件都被(删除)。我认为这是问题所在,为什么不释放这些指针?

mysqld  27864 mysql 1658u      REG                9,3      1048584   12846342 /tmp/mysql_temptable.wazlKx (deleted)
mysqld  27864 mysql 1660u      REG                9,3      1048584   12846352 /tmp/mysql_temptable.S5Oo1m (deleted)
mysqld  27864 mysql 1662u      REG                9,3      1048584   12846343 /tmp/mysql_temptable.l1CZjK (deleted)
mysqld  27864 mysql 1666u      REG                9,3      1048584   12846366 /tmp/mysql_temptable.VDaTxb (deleted)
Run Code Online (Sandbox Code Playgroud)

2019-02-19:我按照建议增加了一些限制,并做了一个mysqlcheck --all-databases,没有任何变化。

正如评论中所写,即使我停止流量,打开文件的数量也不会减少。

该应用程序被证明是稳定的,并且在问题出现之前的几天内没有对代码或流量进行重大更改。(我 100% 肯定,因为那天是假期,并且 master 分支上没有提交)

the*_*uts 6

问题是 Mysql 错误(https://bugs.mysql.com/bug.php?id=94185

无论如何,一些调整缓解了这个问题并阻止了打开文件数量的增长。

具体来说,我们设置:

innodb_lru_scan_depth=100
innodb_flushing_avg_loops=5
thread_cache_size=216
Run Code Online (Sandbox Code Playgroud)

已在即将发布的 8.0.16 版本中修复,这是更改日志条目:

定义在错误范围内的静态线程局部变量在线程退出时未释放。