How to handle job cancellation in Slurm?

Question

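I am using the Slurm job manager on an HPC cluster. Sometimes a job is cancelled because it reaches its time limit, and in that case I would like my program to finish gracefully.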

As far as I understand, cancellation happens in two stages, precisely so that a software developer can finish the program gracefully:

srun: Job step aborted: Waiting up to 62 seconds for job step to finish.                                                                                                                           
slurmstepd: error: *** JOB 18522559 ON ncm0317 CANCELLED AT 2020-12-14T19:42:43 DUE TO TIME LIMIT ***

As you can see, I am given 62 seconds to finish the job the way I want it to finish (by saving some files, etc.).

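Question: how do I do this? I understand that some Unix signal is first sent to my job and that I need to respond to it correctly. However, I cannot find anything in the Slurm documentation about which signal this is. Besides, I do not know how to handle it in Python; probably through exception handling.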

Answer 1

dam*_*ois 8

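In Slurm, you can decide which signal is sent, and at which moment before your job reaches its time limit.

From the sbatch man page:

--signal=[[R][B]:]<sig_num>[@<sig_time>] When a job is within sig_time seconds of its end time, send it the signal sig_num.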

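So setting

#SBATCH --signal=B:TERM@300

will make Slurm send SIGTERM to the job 5 minutes (300 seconds; sig_time is given in seconds) before the allocation ends. Note that depending on how you start your job, you might need to remove the B: part: with B: only the batch shell is signalled, whereas without it the signal goes to the job steps (for example, processes launched with srun) and not to the batch shell.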

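In your Python script, use the signal module. You need to define a "signal handler", a function that is called when the signal is received, and "register" that function for a specific signal. Because that function interrupts the normal flow of the program when it is called, keep it short to avoid unwanted side effects, especially in multithreaded code.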
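
As a minimal sketch of that define-and-register pattern, the snippet below installs a handler for SIGTERM and then sends the signal to its own process to show that the handler runs. The names `received` and `on_sigterm` are illustrative only, and sending the signal with `os.kill` merely stands in for what Slurm would do; a POSIX system is assumed, as on a typical cluster.

```python
import os
import signal
import time

# Signals seen so far; the handler only records them, real work happens elsewhere
received = []

def on_sigterm(signum, frame):
    # Keep the handler tiny: just note which signal arrived
    received.append(signum)

# Register the handler for SIGTERM (must be done from the main thread)
signal.signal(signal.SIGTERM, on_sigterm)

# Stand-in for Slurm: deliver SIGTERM to this very process
os.kill(os.getpid(), signal.SIGTERM)

# The Python-level handler runs in the main thread shortly after delivery
time.sleep(0.01)

print(received == [signal.SIGTERM])
```

Without a registered handler, the default action for SIGTERM is to terminate the process, which is why the registration has to happen before the long-running work starts.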

A typical scheme in a Slurm environment is to have a script skeleton like this:

#!/usr/bin/env python3

import signal
import sys

# Global Boolean variable that indicates that a signal has been received
interrupted = False

# Global Boolean variable that indicates the natural end of the computations
converged = False

# Definition of the signal handler. All it does is flip the 'interrupted' variable
def signal_handler(signum, frame):
    global interrupted
    interrupted = True

# Register the signal handler
signal.signal(signal.SIGTERM, signal_handler)

try:
    # Try to recover a state file with the relevant variables stored
    # from previous stop if any
    with open('state', 'r') as file:
        state = file.read()
except FileNotFoundError:
    # Otherwise bootstrap (start from scratch).
    # init_computation() is a placeholder for your own code.
    state = init_computation()

while not interrupted and not converged:
    # do_computation_iteration() is a placeholder for your own code: it is
    # expected to advance the state and report whether the run has converged
    state, converged = do_computation_iteration(state)

# Save current state
if interrupted:
    with open('state', 'w') as file:
        file.write(state)
    sys.exit(99)
sys.exit(0)

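This first tries to resume a computation left over from a previous run of the job, and bootstraps it otherwise. If it is interrupted, it lets the current loop iteration finish properly and then saves the needed variables to disk. It then exits with return code 99. If Slurm is configured for it (for example, through the RequeueExit option in slurm.conf), this allows the job to be requeued automatically for further iterations.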
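
The resume-or-bootstrap logic can be tried without a cluster. The sketch below is a toy version under stated assumptions: the state is a small dict stored as JSON, the "computation" just sums integers, and the interruption is simulated with a hypothetical `stop_after` parameter instead of a real signal. None of these names come from the answer.

```python
import json
import os
import tempfile

def run(state_path, max_iterations, stop_after=None):
    """Sum the integers 0..max_iterations-1, resuming from state_path if present.

    stop_after simulates an interruption after that many iterations in this run.
    Returns (state, finished).
    """
    try:
        with open(state_path) as f:
            state = json.load(f)          # resume from a previous run
    except FileNotFoundError:
        state = {"i": 0, "total": 0}      # bootstrap from scratch

    done_this_run = 0
    while state["i"] < max_iterations:
        if stop_after is not None and done_this_run >= stop_after:
            # "Interrupted": write the checkpoint to a temporary file first,
            # then rename it, so a kill mid-write cannot truncate the old one
            tmp_path = state_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump(state, f)
            os.replace(tmp_path, state_path)
            return state, False
        state["total"] += state["i"]
        state["i"] += 1
        done_this_run += 1
    return state, True

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "state.json")

# First run is cut short after 4 iterations; the second run picks up from there
first, finished_first = run(path, max_iterations=10, stop_after=4)
second, finished_second = run(path, max_iterations=10)
print(first, finished_first)
print(second, finished_second)
```

Writing the checkpoint through a temporary file and `os.replace` is a design choice for the limited grace period: if the job is killed while saving, the previous checkpoint is still intact.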

If Slurm is not configured for it, you can do it manually in your submission script like this:

python myscript.py || scontrol requeue $SLURM_JOB_ID
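
The same idea can be expressed as a small Python wrapper, sketched here under a few assumptions: the function names are hypothetical, the job ID is read from the `SLURM_JOB_ID` environment variable that Slurm sets for the job, and the requeue is issued only for exit code 99, whereas the shell one-liner above requeues on any non-zero exit.

```python
import os
import subprocess

REQUEUE_EXIT_CODE = 99  # the exit code the skeleton uses for "interrupted"

def requeue_command(env):
    """Return the scontrol command for the current job, or None outside Slurm."""
    job_id = env.get("SLURM_JOB_ID")
    if not job_id:
        return None
    return ["scontrol", "requeue", job_id]

def run_and_maybe_requeue(argv, env=os.environ, runner=subprocess.run):
    """Run argv; if it exits with REQUEUE_EXIT_CODE, ask Slurm to requeue the job.

    runner is injectable so the logic can be exercised without Slurm installed.
    """
    result = runner(argv)
    if result.returncode == REQUEUE_EXIT_CODE:
        command = requeue_command(env)
        if command is not None:
            runner(command)
    return result.returncode
```

Inside a submission script this could be called as `run_and_maybe_requeue(["python", "myscript.py"])`. Restricting the requeue to code 99 avoids endlessly resubmitting a job that fails for an unrelated reason.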