Restarting nodes that are in the down state

pro*_*123 4 centos slurm

After a power cut, my nodes went into the down state.

sinfo -a

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
partMain  up      infinite      4   down* node[001-004]
part1*    up      infinite      3   down* node[002-004]
part2     up      infinite      1   down* node001

I ran these commands:

 /etc/init.d/slurm stop
 /etc/init.d/slurm start

sinfo -a

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
partMain  up      infinite      4   down node[001-004]
part1*    up      infinite      3   down node[002-004]
part2     up      infinite      1   down node001

How can I bring my nodes back up?


sinfo -R

REASON               USER      TIMESTAMP           NODELIST
Not responding       root      2019-07-23T08:40:25 node[001-004]

$ scontrol update nodename=node001 state=idle    
$ scontrol update nodename=node[001-004] state=resume

# The state changes to idle* for a few seconds, then returns to down*

$ service --status-all | grep 'slurm'
slurmctld (pid 24000) is running...
slurmdbd (pid 4113) is running...


$ systemctl status -l slurm
● slurm.service - LSB: slurm daemon management
   Loaded: loaded (/etc/rc.d/init.d/slurm; bad; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2019-07-24 13:45:38 CEST; 257ms ago
     Docs: man:systemd-sysv-generator(8)
  Process: 30094 ExecStop=/etc/rc.d/init.d/slurm stop (code=exited, status=1/FAILURE)
  Process: 30061 ExecStart=/etc/rc.d/init.d/slurm start (code=exited, status=0/SUCCESS)
 Main PID: 30069 (code=exited, status=1/FAILURE)

dam*_*ois 8

Check why they are marked as down with sinfo -R. Most likely they will be listed as "unexpectedly rebooted". You can resume them with the following command:

scontrol update nodename=node[001-004] state=resume

The ReturnToService parameter in slurm.conf controls whether compute nodes become active again when they come back up after an unexpected reboot.
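As a rough sketch of the check-then-resume flow in this answer, the shell snippet below pulls the NODELIST column out of sinfo -R output and prints the matching scontrol update commands instead of running them. The helper name nodes_from_reasons is made up for illustration and is not part of Slurm; it assumes the node list is the last whitespace-separated field of each row, as in the output shown in the question.

```shell
#!/bin/sh
# Hypothetical helper (not a Slurm command): read `sinfo -R` output on stdin
# and print the NODELIST column, assumed to be the last field of each row.
nodes_from_reasons() {
    # NR > 1 skips the header line (REASON USER TIMESTAMP NODELIST).
    awk 'NR > 1 && NF > 0 { print $NF }'
}

# Only touch the cluster tools if they are actually installed on this host.
if command -v sinfo >/dev/null 2>&1 && command -v scontrol >/dev/null 2>&1; then
    # Show the current ReturnToService setting from the running configuration.
    scontrol show config | grep -i 'ReturnToService'

    # Dry run: print one resume command per distinct node list.
    sinfo -R | nodes_from_reasons | sort -u | while read -r nodes; do
        echo scontrol update nodename="$nodes" state=resume
    done
fi
```

Remove the echo to apply the change for real. scontrol update normally has to be run as root or as the configured SlurmUser, so running it as an ordinary user is one likely cause of an "Invalid user id" error.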


Bub*_*nja 7

After starting the daemons, try this:

scontrol update nodename=node001 state=idle

  • Hi, I get this error: **slurm_update error: Invalid user id** (2 upvotes)