Confusing systemd behaviour with OnFailure= and Restart=

Question

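I am using systemd 231 on an embedded system, and I am trying to create a service that monitors a hardware component in the system. Here is a rough description of what I am trying to do: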

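When the service, foo.service, is started, it launches an application, foo_app.
foo_app monitors the hardware component and runs continuously.
If foo_app detects a hardware failure, it exits with return code 1. This should trigger a reboot of the system.
If foo_app crashes, systemd should restart foo_app.
If foo_app crashes repeatedly, systemd should reboot the system.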

Here is my attempt at implementing this as a service:

[Unit]
Description=Foo Hardware Monitor

# If the application fails 3 times in 30 seconds, something has gone wrong,
# and the state of the hardware can't be guaranteed. Reboot the system here.
StartLimitBurst=3
StartLimitIntervalSec=30
StartLimitAction=reboot

# StartLimitAction=reboot will reboot the box if the app fails repeatedly,
# but if the app exits voluntarily, the reboot should trigger immediately
OnFailure=systemd-reboot.service

[Service]
ExecStart=/usr/bin/foo_app

# If the app fails from an abnormal condition (e.g. crash), try to
# restart it (within the limits of StartLimit*).
Restart=on-abnormal

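From the documentation (systemd.unit and systemd.service), I would expect that if foo_app is killed in a way that triggers Restart=on-abnormal (e.g. killall -9 foo_app), systemd should give priority to Restart=on-abnormal over OnFailure=systemd-reboot.service and should not start systemd-reboot.service.

However, this is not what I am seeing. As soon as I kill foo_app once, the system reboots immediately.

Here are some relevant snippets from the documentation: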

OnFailure=

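A space-separated list of one or more units that are activated when this unit enters the "failed" state. A service unit using Restart= enters the failed state only after the start limits are reached.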

Restart=

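[snip] Note that service restart is subject to unit start rate limiting configured with StartLimitIntervalSec= and StartLimitBurst=, see systemd.unit(5) for details. A restarted service enters the failed state only after the start limits are reached.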

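The documentation seems pretty clear:

The service(s) specified in OnFailure should run only when the service enters the "failed" state.
The service should enter the "failed" state only after the limits set by StartLimitIntervalSec and StartLimitBurst have been reached.

This is not what I am seeing.

To confirm this, I edited my service file to the following: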

[Unit]
Description=Foo Hardware Monitor
  
StartLimitBurst=3
StartLimitIntervalSec=30
StartLimitAction=none

[Service]
ExecStart=/usr/bin/foo_app
Restart=on-abnormal

By removing OnFailure and setting StartLimitAction=none, I was able to see how systemd responds to foo_app dying. Here is a test in which I repeatedly kill foo_app with SIGKILL.

[root@device ~]
# systemctl start foo.service
[root@device ~]
# journalctl -f -o cat -u foo.service &
[1] 2107
Started Foo Hardware Monitor.
[root@device ~]
# killall -9 foo_app
foo.service: Main process exited, code=killed, status=9/KILL
foo.service: Unit entered failed state.
foo.service: Failed with result 'signal'
foo.service: Service hold-off time over, scheduling restart.
Stopped foo.
Started foo.

[root@device ~]
# killall -9 foo_app
foo.service: Main process exited, code=killed, status=9/KILL
foo.service: Unit entered failed state.
foo.service: Failed with result 'signal'
foo.service: Service hold-off time over, scheduling restart.
Stopped foo.
Started foo.

[root@device ~]
# killall -9 foo_app
foo.service: Main process exited, code=killed, status=9/KILL
foo.service: Unit entered failed state.
foo.service: Failed with result 'signal'
foo.service: Service hold-off time over, scheduling restart.
Stopped foo.
foo.service: Start request repeated too quickly
Failed to start foo.
foo.service: Unit entered failed state.
foo.service: Failed with result 'start-limit-hit'

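This mostly makes sense. When foo_app is killed, systemd restarts it until StartLimitBurst is hit and then gives up. This is what I want, except with StartLimitAction=reboot.

What is unusual is that systemd prints "foo.service: Unit entered failed state." every time foo_app is killed, even when it is about to be restarted through Restart=on-abnormal. This seems to directly contradict these lines from the documentation quoted above:

A service unit using Restart= enters the failed state only after the start limits are reached.

A restarted service enters the failed state only after the start limits are reached.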
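The mismatch between the quoted documentation and the test output above can be checked mechanically. The following is a minimal sketch, assuming the journal message texts are exactly those shown in the test output (they may differ in other systemd versions); the function names count_premature_failures and sample_log are made up for illustration. It counts how many times the unit "entered failed state" without the start limit having been hit. Under the documented behaviour that number would be 0.

```shell
#!/bin/sh
# Read journal text on stdin and print the number of "entered failed state"
# messages that are NOT accounted for by a 'start-limit-hit' result.
# Assumption: message wording matches the systemd 231 output shown above.
count_premature_failures() {
    awk '
        /Unit entered failed state\./          { failed++ }
        /Failed with result .start-limit-hit./ { limited++ }
        END { print failed - limited }
    '
}

# The last kill cycle from the test output above, copied verbatim.
sample_log() {
    cat <<'EOF'
foo.service: Main process exited, code=killed, status=9/KILL
foo.service: Unit entered failed state.
foo.service: Failed with result 'signal'
foo.service: Service hold-off time over, scheduling restart.
Stopped foo.
foo.service: Start request repeated too quickly
Failed to start foo.
foo.service: Unit entered failed state.
foo.service: Failed with result 'start-limit-hit'
EOF
}

# Two "failed state" messages, one start-limit-hit: prints 1
sample_log | count_premature_failures
```

On a live system the same function could be fed with journalctl -o cat -u foo.service | count_premature_failures.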

All of this has me quite confused. Am I misunderstanding any of these systemd options? Is this a bug in systemd? Any help is appreciated.

Answer 1

cun*_*mp3 9

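EDIT 2019/08/12: Per therealjumbo's comment, the fix for this has been merged and released with systemd v239, so if you are not pinned to a version by your distro (looking at you, CentOS), have fun!

TL;DR - Known documentation issue, currently still an open issue for the systemd project

It turns out that since you asked this, it has been reported and identified as a discrepancy between the systemd documentation and the actual behaviour. According to my understanding (and my reading of the GitHub issue), your expectations match the documentation, so you are not crazy.

Currently systemd sets the state to failed after every start attempt, regardless of whether the start limit has been reached. In the issue, the OP wrote an amusing anecdote about learning to ride a bike, which I highly recommend you take a look at.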
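Since the edit note at the top of this answer says the fix shipped with systemd v239, a quick version check tells you which behaviour to expect. This is only a sketch: the function names systemd_version and has_failed_state_fix are made up here, and it assumes the first line of systemctl --version has the form "systemd <number> ...". Distributions may also backport fixes, so treat the result as a hint, not a guarantee.

```shell
#!/bin/sh
# Print the numeric systemd version found on the first line of the
# `systemctl --version` output passed as $1, e.g. "systemd 231" -> 231.
systemd_version() {
    printf '%s\n' "$1" | awk 'NR == 1 { v = $2; sub(/[^0-9].*/, "", v); print v }'
}

# Succeed if the version is at least 239, the release named in the edit note.
has_failed_state_fix() {
    ver=$(systemd_version "$1")
    [ "${ver:-0}" -ge 239 ]
}

# Only query the running system if systemctl is actually available.
if command -v systemctl >/dev/null 2>&1; then
    if has_failed_state_fix "$(systemctl --version)"; then
        echo "systemd >= 239: failed state should follow the documentation"
    else
        echo "systemd < 239: expect 'failed' after every failed start attempt"
    fi
fi
```

On the systemd 231 system from the question, this check would take the second branch.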

Archived:	8 years, 3 months ago
Viewed:	10784 times
Last activity:	5 years, 10 months ago