Service segmentation fault in AWS EC2

Question

Art*_*nov 9 .net amazon-ec2 amazon-web-services .net-core asp.net-core

My service runs on EC2 (under systemd). It is a self-contained application built for .NET Core 2.1. Occasionally (a few times a week) it crashes with SEGV.

Apr 30 21:20:51 ip-10-4-226-55 kernel: traps: App.Name[26176] general protection ip:7f22da3609da sp:7f1fedf11510 error:0 in libc-2.26.so[7f22da2e3000+1ad000]

Apr 30 21:20:51 ip-10-4-226-55 systemd: appname.service: main process exited, code=killed, status=11/SEGV

Apr 30 21:20:51 ip-10-4-226-55 systemd: Unit appname.service entered failed state.

Apr 30 21:20:51 ip-10-4-226-55 systemd: appname.service failed.

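For some reason no crash dump is created (even though I removed the size limit). How can I investigate the problem further? What could be the root cause?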

Answer 1

小智 3

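How can I investigate the problem further?

I use Arch Linux, so things may differ (although systemd is present on both), but here is what I would try:

Does the system create any core dumps at all?

Let's dump a harmless core to test. In a bash shell: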

sleep 200 & kill -11 "$!"
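
As a quick sanity check that the test process really died from signal 11, the same sleep/kill test can be wrapped so that it reports the exit status. This is only a minimal sketch: the function name crash_test_status is hypothetical and not part of the original answer. A shell reports 128 plus the signal number for a child killed by a signal, so SIGSEGV (11) gives 139, which corresponds to the status=11/SEGV that systemd logged for the service.

```shell
#!/bin/sh
# Start a harmless background process, send it SIGSEGV, and report its exit status.
crash_test_status() {
    sleep 200 &
    pid=$!
    kill -11 "$pid"
    # wait returns 128 + signal number for a child that was killed by a signal
    wait "$pid"
    echo $?
}

status=$(crash_test_status)
echo "exit status: $status"   # 139 = 128 + 11 (SIGSEGV)
```

Whether a core file is actually stored for this test crash still depends on the core dump configuration examined in the rest of this answer.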

After running the sleep/kill command, dmesg -w shows the following:

[17894.861369] systemd[1]: Started Process Core Dump (PID 31964/UID 0).
[17895.030166] systemd-coredump[31975]: Process 31963 (bash) of user 1000 dumped core.

               Stack trace of thread 31963:
               #0  0x00007c0aff6c642b kill (libc.so.6)
               #1  0x000056e836d6c56a termsig_handler.part.2 (bash)
               #2  0x000056e836d6c6d3 termsig_handler (bash)
               #3  0x000056e836d3a1b3 execute_simple_command (bash)
               #4  0x000056e836d3b20e execute_command_internal (bash)
               #5  0x000056e836d3b469 execute_command_internal (bash)
               #6  0x000056e836d3cf12 execute_command (bash)
               #7  0x000056e836d247f2 reader_loop (bash)
               #8  0x000056e836d2320d main (bash)
               #9  0x00007c0aff6b21bb __libc_start_main (libc.so.6)
               #10 0x000056e836d235ce _start (bash)

[17895.030324] systemd[1]: systemd-coredump@5-31964-0.service: Succeeded.

It is also listed as the most recent entry by coredumpctl -r | head -2:

TIME                            PID   UID   GID SIG COREFILE  EXE
Sat 2019-05-18 21:48:22 CEST  31963  1000  1000  11 present   /usr/bin/bash

Also:

$ ls -rlat /var/lib/systemd/coredump/|tail -n1
-rw-r-----+ 1 root root  3907584 18.05.2019 21:48 core.bash.1000.6d7dce73cd2342759a18d47914c16007.31963.1558208902000000

Because it is the most recent dump, I can simply run coredumpctl gdb to start gdb on it, and then type thread apply all bt full inside gdb to see some information:

$ coredumpctl gdb
           PID: 31963 (bash)
           UID: 1000 (user)
           GID: 1000 (user)
        Signal: 11 (SEGV)
     Timestamp: Sat 2019-05-18 21:48:22 CEST (3min 51s ago)
  Command Line: -bash
    Executable: /usr/bin/bash
 Control Group: /user.slice/user-1000.slice/session-1.scope
          Unit: session-1.scope
         Slice: user-1000.slice
       Session: 1
     Owner UID: 1000 (user)
       Boot ID: 6d7dce73cd2342759a18d47914c16007
    Machine ID: 5767ef25f523419aaa049f3d74481940
      Hostname: i87k
       Storage: /var/lib/systemd/coredump/core.bash.1000.6d7dce73cd2342759a18d47914c16007.31963.1558208902000000
       Message: Process 31963 (bash) of user 1000 dumped core.

                Stack trace of thread 31963:
                #0  0x00007c0aff6c642b kill (libc.so.6)
                #1  0x000056e836d6c56a termsig_handler.part.2 (bash)
                #2  0x000056e836d6c6d3 termsig_handler (bash)
                #3  0x000056e836d3a1b3 execute_simple_command (bash)
                #4  0x000056e836d3b20e execute_command_internal (bash)
                #5  0x000056e836d3b469 execute_command_internal (bash)
                #6  0x000056e836d3cf12 execute_command (bash)
                #7  0x000056e836d247f2 reader_loop (bash)
                #8  0x000056e836d2320d main (bash)
                #9  0x00007c0aff6b21bb __libc_start_main (libc.so.6)
                #10 0x000056e836d235ce _start (bash)

GNU gdb (GDB) 8.2.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/bash...done.
[New LWP 31963]
Core was generated by `-bash'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007c0aff6c642b in kill () at ../sysdeps/unix/syscall-template.S:78
78  ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) thread apply all bt full

Thread 1 (LWP 31963):
#0  0x00007c0aff6c642b in kill () at ../sysdeps/unix/syscall-template.S:78
No locals.
#1  0x000056e836d6c56a in termsig_handler.part ()
No symbol table info available.
#2  0x000056e836d6c6d3 in termsig_handler ()
No symbol table info available.
#3  0x000056e836d3a1b3 in execute_simple_command ()
No symbol table info available.
#4  0x000056e836d3b20e in execute_command_internal ()
No symbol table info available.
#5  0x000056e836d3b469 in execute_command_internal ()
No symbol table info available.
#6  0x000056e836d3cf12 in execute_command ()
No symbol table info available.
#7  0x000056e836d247f2 in reader_loop ()
No symbol table info available.
#8  0x000056e836d2320d in main ()
No symbol table info available.
(gdb)

There is not much to see, because bash was not compiled with debug symbols, or something stripped them. Recompile bash with extra debugging CFLAGS, including -O0, set before running ./configure ... && make. (If you want to preserve the current behaviour of your program you may not want -O0, otherwise it might not crash any more.)
Then rerunning the sleep test above to create a new core dump yields the following, much richer result:

$ coredumpctl gdb
           PID: 29241 (bash)
           UID: 1000 (user)
           GID: 1000 (user)
        Signal: 11 (SEGV)
     Timestamp: Sat 2019-05-18 22:01:41 CEST (13s ago)
  Command Line: -bash
    Executable: /usr/bin/bash
 Control Group: /user.slice/user-1000.slice/session-1.scope
          Unit: session-1.scope
         Slice: user-1000.slice
       Session: 1
     Owner UID: 1000 (user)
       Boot ID: 6d7dce73cd2342759a18d47914c16007
    Machine ID: 5767ef25f523419aaa049f3d74481940
      Hostname: i87k
       Storage: /var/lib/systemd/coredump/core.bash.1000.6d7dce73cd2342759a18d47914c16007.29241.1558209701000000
       Message: Process 29241 (bash) of user 1000 dumped core.

                Stack trace of thread 29241:
                #0  0x00007775d0d2642b kill (libc.so.6)
                #1  0x000060b781bce2c8 termsig_handler (bash)
                #2  0x000060b781b9107b execute_simple_command (bash)
                #3  0x000060b781b8aa1c execute_command_internal (bash)
                #4  0x000060b781b8dde0 execute_connection (bash)
                #5  0x000060b781b8ade5 execute_command_internal (bash)
                #6  0x000060b781b89f45 execute_command (bash)
                #7  0x000060b781b72e66 reader_loop (bash)
                #8  0x000060b781b70906 main (bash)
                #9  0x00007775d0d121bb __libc_start_main (libc.so.6)
                #10 0x000060b781b6fe2e _start (bash)

GNU gdb (GDB) 8.2.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/bash...done.
[New LWP 29241]
Core was generated by `-bash'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007775d0d2642b in kill () at ../sysdeps/unix/syscall-template.S:78
78  ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) thread apply all bt full

Thread 1 (LWP 29241):
#0  0x00007775d0d2642b in kill () at ../sysdeps/unix/syscall-template.S:78
No locals.
#1  0x000060b781bce2c8 in termsig_handler (sig=11) at sig.c:597
        handling_termsig = 1
        i = -2097452368
        core = 24759
        mask = {__val = {140729597269152, 106341271890333, 106341294191024, 106341271640912, 106341291662592, 106341294178848, 
            140729597269200, 106341271910973, 140729597269200, 106341294191024, 106341271640912, 106341271911463, 
            106341272462763, 0, 140729597269232, 106341271911163}}
#2  0x000060b781b9107b in execute_simple_command (simple_command=0x60b78310a8c0, pipe_in=-1, pipe_out=-1, async=1, 
    fds_to_close=0x60b7831196b0) at execute_cmd.c:4394
        words = 0x60b78310b1b0
        lastword = 0x7ffe29a79910
        command_line = 0x0
        lastarg = 0x0
        temp = 0x0
        first_word_quoted = 0
        result = 0
        builtin_is_special = 0
        already_forked = 1
        dofork = 1
        old_last_async_pid = -1
        builtin = 0x0
        func = 0x0
        old_builtin = 0
        old_command_builtin = -2098586400
#3  0x000060b781b8aa1c in execute_command_internal (command=0x60b783107410, asynchronous=1, pipe_in=-1, pipe_out=-1, 
    fds_to_close=0x60b7831196b0) at execute_cmd.c:845
        exec_result = 0
        user_subshell = 0
        invert = 0
        ignore_return = 0
        was_error_trap = 0
        my_undo_list = 0x0
        exec_undo_list = 0x0
        tcmd = 0x0
        save_line_number = 1
        ofifo = 0
        nfifo = 0
        osize = 0
        saved_fifo = 0
        ofifo_list = 0x5b0000006e <error: Cannot access memory at address 0x5b0000006e>
#4  0x000060b781b8dde0 in execute_connection (command=0x60b783119680, asynchronous=0, pipe_in=-1, pipe_out=-1, 
    fds_to_close=0x60b7831196b0) at execute_cmd.c:2690
        tc = 0x60b783107410
--Type <RET> for more, q to quit, c to continue without paging--c
        second = 0x0
        ignore_return = 0
        exec_result = -2098586400
        was_error_trap = 0
        invert = 3
        save_line_number = 0
#5  0x000060b781b8ade5 in execute_command_internal (command=0x60b783119680, asynchronous=0, pipe_in=-1, pipe_out=-1, fds_to_close=0x60b7831196b0) at execute_cmd.c:1018
        exec_result = 0
        user_subshell = 0
        invert = 0
        ignore_return = 0
        was_error_trap = 32766
        my_undo_list = 0x0
        exec_undo_list = 0x0
        tcmd = 0x0
        save_line_number = -2117800288
        ofifo = 24759
        nfifo = -2096071056
        osize = 24759
        saved_fifo = 0
        ofifo_list = 0x60b781b89d9c <dispose_fd_bitmap> "UH\211\345H\203\354\020H\211}\370H\213E\370H\213@\bH\205\300t\020H\213E\370H\213@\bH\211\307\350\253R\376\377H\213E\370H\211\307\350\237R\376\377\220\311\303UH\211\345SH\203\354\030H\211}\350H\203", <incomplete sequence \350>
#6  0x000060b781b89f45 in execute_command (command=0x60b783119680) at execute_cmd.c:394
        bitmap = 0x60b7831196b0
        result = 0
#7  0x000060b781b72e66 in reader_loop () at eval.c:175
        code = 0
        our_indirection_level = 1
        current_command = 0x60b783119680
#8  0x000060b781b70906 in main (argc=1, argv=0x7ffe29a79918, env=0x7ffe29a79928) at shell.c:805
        i = 20
        code = 0
        old_errexit_flag = 0
        saverst = 0
        locally_skip_execution = 0
        arg_index = 1
        top_level_arg_index = 1
(gdb)

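However, the core dump may well be created and then cleaned up or deleted by systemd after a while (for example, all of my core dumps older than 3 days are reported as missing by coredumpctl; I am not sure why, given my settings; maybe you are hitting something similar?), or it may not be created at all because of space limits (see /etc/systemd/coredump.conf and everything else mentioned below).
Let's check: is systemd-coredump even set up to run when a core dump is produced?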

$ sysctl -a |grep kernel.core
kernel.core_pattern = |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e
kernel.core_pipe_limit = 0
kernel.core_uses_pid = 1
$ ls -la /usr/lib/systemd/systemd-coredump
-rwxr-xr-x 1 root root 55296 13.05.2019 11:46 /usr/lib/systemd/systemd-coredump*
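
The value of kernel.core_pattern decides where a core ends up, and on an EC2 image it may well differ from the Arch Linux value shown above. The sketch below classifies a pattern into the three cases that matter here; the function name classify_core_pattern and its output labels are made up for illustration. If the pattern is not piped to systemd-coredump, coredumpctl will not know about the dump, and with a plain file pattern the kernel writes the core file relative to the crashing process's working directory.

```shell
#!/bin/sh
# Classify a kernel.core_pattern value:
#   piped to systemd-coredump, piped to some other handler, or a plain file template.
classify_core_pattern() {
    case "$1" in
        "|"*systemd-coredump*) echo "systemd-coredump" ;;
        "|"*)                  echo "other-pipe-handler" ;;
        *)                     echo "plain-file" ;;
    esac
}

# Inspect the running kernel if the sysctl file is readable.
if [ -r /proc/sys/kernel/core_pattern ]; then
    classify_core_pattern "$(cat /proc/sys/kernel/core_pattern)"
fi
```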

Does the kernel support core dumping?

$ zcat /proc/config.gz |grep -i 'core.*dump'
CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS=y
CONFIG_COREDUMP=y
CONFIG_ALLOW_DEV_COREDUMP=y
# CONFIG_PROC_VMCORE_DEVICE_DUMP is not set

CONFIG_COREDUMP=y is probably enough.
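
Not every kernel exposes /proc/config.gz. Many distributions instead ship the build configuration as /boot/config-$(uname -r); whether your EC2 image does is an assumption to verify. The following sketch checks whichever of the two files is readable. The helper name has_coredump_support is hypothetical.

```shell
#!/bin/sh
# Succeed if the given kernel config file (plain or gzip-compressed)
# contains CONFIG_COREDUMP=y.
has_coredump_support() {
    case "$1" in
        *.gz) zcat "$1" ;;
        *)    cat "$1" ;;
    esac | grep -q '^CONFIG_COREDUMP=y$'
}

# Use the first config file that is readable on this machine, if any.
for cfg in /proc/config.gz "/boot/config-$(uname -r)"; do
    if [ -r "$cfg" ]; then
        if has_coredump_support "$cfg"; then
            echo "$cfg: CONFIG_COREDUMP=y"
        else
            echo "$cfg: CONFIG_COREDUMP is not enabled"
        fi
        break
    fi
done
```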

Other things I would look at:

$ systemctl|grep core
systemd-coredump.socket                                                                  loaded active listening Process Core Dump Socket

$ cat /etc/systemd/coredump.conf
#  This file is part of systemd.
#
#  systemd is free software; you can redistribute it and/or modify it
#  under the terms of the GNU Lesser General Public License as published by
#  the Free Software Foundation; either version 2.1 of the License, or
#  (at your option) any later version.
#
# Entries in this file show the compile time defaults.
# You can change settings by editing this file.
# Defaults can be restored by simply deleting this file.
#
# See coredump.conf(5) for details.

[Coredump]
#Storage=external
#Compress=yes
Compress=no
#ProcessSizeMax=2G
ProcessSizeMax=10G
#ExternalSizeMax=2G
ExternalSizeMax=10G
#JournalSizeMax=767M
JournalSizeMax=10G
#MaxUse=
#KeepFree=
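
Because the file mixes commented-out defaults with real overrides, it can help to print only the lines that are actually in effect. This is a small sketch; the function name active_settings is an invention for this example.

```shell
#!/bin/sh
# Print only the active (non-comment, non-blank) lines of a config file.
active_settings() {
    grep -Ev '^[[:space:]]*(#|$)' "$1"
}

conf=/etc/systemd/coredump.conf
if [ -r "$conf" ]; then
    # grep exits non-zero when every line is a comment; that is not an error here.
    active_settings "$conf" || true
fi
```

For the file shown above this would leave the [Coredump] header plus the four overrides (Compress, ProcessSizeMax, ExternalSizeMax, JournalSizeMax).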

man 5 coredump.conf gives some information:

       All options are configured in the "[Coredump]" section:

       Storage=
           Controls where to store cores. One of "none", "external", and "journal". When "none", the core dumps may be
           logged (including the backtrace if possible), but not stored permanently. When "external" (the default), cores
           will be stored in /var/lib/systemd/coredump/. When "journal", cores will be stored in the journal and rotated
           following normal journal rotation patterns.

           When cores are stored in the journal, they might be compressed following journal compression settings, see
           journald.conf(5). When cores are stored externally, they will be compressed by default, see below.

       Compress=
           Controls compression for external storage. Takes a boolean argument, which defaults to "yes".

       ProcessSizeMax=
           The maximum size in bytes of a core which will be processed. Core dumps exceeding this size may be stored, but
           the backtrace will not be generated.

           Setting Storage=none and ProcessSizeMax=0 disables all coredump handling except for a log entry.

       ExternalSizeMax=, JournalSizeMax=
           The maximum (uncompressed) size in bytes of a core to be saved.

       MaxUse=, KeepFree=
           Enforce limits on the disk space taken up by externally stored core dumps.  MaxUse= makes sure that old core
           dumps are removed as soon as the total disk space taken up by core dumps grows beyond this limit (defaults to 10%
           of the total disk size).  KeepFree= controls how much disk space to keep free at least (defaults to 15% of the
           total disk size). Note that the disk space used by core dumps might temporarily exceed these limits while core
           dumps are processed. Note that old core dumps are also removed based on time via systemd-tmpfiles(8). Set either
           value to 0 to turn off size-based clean-up.

       The defaults for all values are listed as comments in the template /etc/systemd/coredump.conf file that is installed
       by default.

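These settings seem to be in effect for me. (If you change them, a sudo systemctl daemon-reload is needed.)

See also man 8 systemd-coredump, which says that core dumps are stored in /var/lib/systemd/coredump; you may find other useful information there (and a pointer to man 5 core).

Another thing I changed: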

$ colordiff -up /etc/security/limits.conf.ORIG /etc/security/limits.conf
--- /etc/security/limits.conf.ORIG  2017-12-29 21:26:09.000000000 +0100
+++ /etc/security/limits.conf   2017-12-29 21:26:09.000000000 +0100
@@ -47,4 +47,11 @@
 #ftp             hard    nproc           0
 #@student        -       maxlogins       4

+#*               soft    core            unlimited
+#^ this doesn't affect the root user!! what the!
+#@root               soft    core            unlimited
+0:               soft    core            unlimited
+#^ all uids from 0 upwards! so what I thought * was doing!
+#hmm works with su -, but not with ssh !
+
 # End of file

I.e. I am using this line:
0: soft core unlimited
instead of the usually recommended line:
* soft core unlimited
although I now notice that Arch Linux recommends:
* hard core 0
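
Note that /etc/security/limits.conf is applied by pam_limits to login sessions; a service started by systemd takes its limits from the unit instead (for example the LimitCORE= setting), so it is worth checking what the running process actually got. The sketch below reads the soft core limit from a /proc/<pid>/limits file. The function name soft_core_limit is hypothetical, and the example inspects the current shell only as a stand-in for the service's main PID.

```shell
#!/bin/sh
# Extract the soft "Max core file size" limit from a /proc/<pid>/limits file.
# Columns of that line: Max core file size <soft> <hard> <units>
soft_core_limit() {
    awk '/^Max core file size/ { print $5 }' "$1"
}

# Example: the limit of the current shell; for a service, use its main PID instead of $$.
if [ -r "/proc/$$/limits" ]; then
    echo "soft core limit: $(soft_core_limit "/proc/$$/limits")"
fi
```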

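Another thing I would do is recompile glibc with full debug information and symbols, so that they are available the next time the program crashes in libc-2.26.so. The way I do that is to make sure strip (from the PKGBUILD) does not run, and I use: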

CPPFLAGS="${CPPFLAGS} -fno-omit-frame-pointer -ftrack-macro-expansion=2 -ggdb -fvar-tracking-assignments -O2"
CXXFLAGS="${CXXFLAGS} -fno-omit-frame-pointer -ftrack-macro-expansion=2 -ggdb -fvar-tracking-assignments"
CFLAGS="${CFLAGS} -fno-omit-frame-pointer -ftrack-macro-expansion=2 -ggdb -fvar-tracking-assignments"

If you still do not get a core dump (for your program!), maybe look at /proc/<pid>/coredump_filter, which is described in the kernel's Documentation/filesystems/proc.txt.
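
The value in coredump_filter is a hexadecimal bitmask selecting which memory mappings go into the core. The bit meanings below follow core(5), where the default is 0x33. The decoder itself is only an illustrative sketch, and decode_coredump_filter is a made-up name.

```shell
#!/bin/sh
# Decode a coredump_filter bitmask given in hex without a 0x prefix,
# which is how /proc/<pid>/coredump_filter prints it.
decode_coredump_filter() {
    mask=$(( 0x$1 ))
    bit=0
    for name in \
        "anonymous private mappings" "anonymous shared mappings" \
        "file-backed private mappings" "file-backed shared mappings" \
        "ELF headers" "private huge pages" "shared huge pages" \
        "private DAX pages" "shared DAX pages"
    do
        if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
            echo "bit $bit: $name"
        fi
        bit=$(( bit + 1 ))
    done
}

# Example: decode the filter of the current shell, if available.
if [ -r "/proc/$$/coredump_filter" ]; then
    decode_coredump_filter "$(cat "/proc/$$/coredump_filter")"
fi
```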

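UPDATE: Since you only have a single dmesg line (and no core dump), maybe this answer can help you extract some information from it. You will probably need the source code of the glibc 2.26 that CentOS is using, unless you are willing to just read assembly code ;)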
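
One thing that can be recovered from the dmesg line alone is where inside libc the fault happened: the line gives the instruction pointer (ip) and, in brackets, the start address and size of the mapping. Subtracting the two yields an offset into the library. The sketch below does that arithmetic for the values from the question; fault_offset is a hypothetical helper name, and treating the result as an address inside the library file assumes the mapping shown is the library's lowest one.

```shell
#!/bin/sh
# Offset of the faulting instruction inside the mapped library:
# instruction pointer minus the mapping's start address (both hex, no 0x prefix).
fault_offset() {
    printf '0x%x\n' $(( 0x$1 - 0x$2 ))
}

# Values taken from the kernel log line in the question.
fault_offset 7f22da3609da 7f22da2e3000   # prints 0x7d9da
```

With a matching libc-2.26.so and its debug information at hand, such an offset can then be looked up with tools like addr2line -e or objdump -d to find the function involved.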

UPDATE2: Try running coredumpctl info 26176; even if the core itself is missing, you should still see a stack trace, for example:

$ coredumpctl -S '2019-05-04 23:37:56' -U '2019-05-05 23:37:56'
TIME                            PID   UID   GID SIG COREFILE  EXE
Sat 2019-05-04 23:37:56 CEST   3888     0     0   7 missing   /usr/bin/mc
Sat 2019-05-04 23:40:08 CEST   3916     0     0   7 missing   /usr/bin/mc
$ coredumpctl info 3888
           PID: 3888 (mc)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 7 (BUS)
     Timestamp: Sat 2019-05-04 23:37:56 CEST (2 weeks 0 days ago)
  Command Line: mc
    Executable: /usr/bin/mc
 Control Group: /user.slice/user-0.slice/session-5.scope
          Unit: session-5.scope
         Slice: user-0.slice
       Session: 5
     Owner UID: 0 (root)
       Boot ID: ce932e7af1f04bc3af1c9573c70a912d
    Machine ID: 5767ef25f523419aaa049f3d74481940
      Hostname: i87k
       Storage: /var/lib/systemd/coredump/core.mc.0.ce932e7af1f04bc3af1c9573c70a912d.3888.1557005876000000 (inaccessible)
       Message: Process 3888 (mc) of user 0 dumped core.

                Stack trace of thread 3888:
                #0  0x00007f54782d427e __memcmp_avx2_movbe (libc.so.6)
                #1  0x000055db1382fdad n/a (mc)
                #2  0x000055db137cb126 n/a (mc)
                #3  0x000055db1380102d n/a (mc)
                #4  0x000055db13801bff n

Archived:	6 years, 9 months ago
Viewed:	385 times
Last activity:	6 years, 9 months ago