Predicting the PID of a previously started SSH command

Mal*_*ppa 4 bash ssh ssh-tunneling shell-script

This is the weirdest thing.

In a script, I start an SSH tunnel like so:

ssh -o StrictHostkeyChecking=no -fND 8080 foo@bar

This starts an ssh instance that goes into the background, and script execution continues. Next, I save its PID (for killing it later) by using bash's $! variable. For this to work, I append & to the ssh command even though it already goes into the background by itself (otherwise $! doesn't contain anything). Thus, for example the following script:

#!/bin/bash

ssh -o StrictHostkeyChecking=no -fND 8080 foo@bar &
echo $!
pgrep -f "ssh -o StrictHostkeyChecking=no -fND 8080 foo@bar"

outputs

(some ssh output)
28062
28062

...two times the same PID, as expected. But, now, when I execute this exact sequence of commands from the terminal, the PID output by $! is wrong (in the sense that it is not the PID of the ssh instance). From the terminal:

$ ssh -o StrictHostkeyChecking=no -fND 8080 foo@bar &
[1] 28178
(some ssh output)
$ echo $!
28178
$ pgrep -f "ssh -o StrictHostkeyChecking=no -fND 8080 foo@bar"
28181

It's not always 3 numbers apart, either. I've also observed a difference of 1 or 2. But it is never the same PID, as I would have expected and as is indeed the case when this sequence of commands is run within a script.

  1. Can someone explain why this is happening? I thought it might be due to the initial ssh call actually forking another process, but then why does it work from within a script?

  2. This also made me doubt whether using $! in my script to get the ssh PID as described above will indeed always work (though it has so far). Is this indeed reliable? I felt it was "cleaner" than using pgrep...

Tom*_*unt 7

The shell's $! variable only knows the pid of the process started by the shell. As you suspected, the ssh call using -f forks its own process so it can go to background, so the overall process tree looks like [1]:

shell
|
+--ssh<1> (pid is $!)
   |
   +--ssh<2> (pid is different)

ssh<1> exits shortly after invocation, as soon as it has forked ssh<2>; therefore, the value in $! is unlikely to be useful. It's ssh<2> that carries on the remote communication and does the tunneling for you, and the only way to reliably get its PID is by examining the process table, as you're doing with pgrep [2]. The pgrep method is likely to be the correct one here.
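As a rough sketch of the pgrep lookup described above: the helper below prints the PIDs of your own processes whose full command line matches a given command exactly. The name find_by_cmdline is hypothetical, and the ssh lines are left as comments because they need a reachable host. The example assumes a procps-style pgrep.

```shell
#!/bin/bash
# find_by_cmdline CMD [ARGS...] (hypothetical helper name):
# print the PIDs of the current user's processes whose full command
# line is exactly "CMD ARGS...".
find_by_cmdline() {
    # -f matches against the full command line, -x requires the whole
    # line to match, -u restricts the search to our own processes.
    # pgrep treats the joined words as a regular expression, so regex
    # metacharacters in the arguments would need escaping.
    pgrep -u "$(id -u)" -x -f -- "$*"
}

# Intended use with the tunnel from the question (not executed here):
#   ssh -o StrictHostkeyChecking=no -fND 8080 foo@bar
#   tunnel_pid=$(find_by_cmdline ssh -o StrictHostkeyChecking=no -fND 8080 foo@bar)
#   ...
#   kill "$tunnel_pid"
```

Requiring an exact match with -x is a design choice to avoid picking up unrelated processes that merely contain the same text, such as an editor that has the script open.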

As to why it works in the script but not interactively, this is probably a race condition. Because you put the first ssh in the background, the shell and ssh are executing concurrently, and ssh does some moderately CPU-heavy cryptographic authentication and some network roundtrips. It's likely that the pgrep you run in the script is simply running before ssh<1> forks itself to go to background. At that point pgrep matches ssh<1>, which has the same command line and whose PID is $!, so the two numbers agree. From the terminal, by the time you have typed the pgrep command, ssh<1> has already forked and exited, so pgrep finds ssh<2> instead. To get around this, run pgrep later, either after a sleep call, or just by calling it only when you actually need the PID later on.
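The race can be reproduced without ssh. The sketch below uses a stand-in launcher (a bash -c one-liner, not ssh) that pauses, starts a long-lived child, and exits. The tag race-demo-$$ is a made-up marker that appears on both command lines, much as ssh<1> and ssh<2> share the same arguments. Waiting for the launcher with wait is one way to "run pgrep later" without guessing a sleep duration; it assumes the launcher exits only after its child exists.

```shell
#!/bin/bash
# Stand-in for `ssh -f ... &`: the launcher pauses (a placeholder for
# authentication), then starts a long-lived child and exits.
# "race-demo-$$" is a hypothetical tag used only so pgrep can find both.
tag="race-demo-$$"

bash -c "sleep 3; exec -a '$tag' sleep 60 &" &
launcher_pid=$!                # plays the role of ssh<1>

sleep 1
early_pid=$(pgrep -f "$tag")   # launcher still alive: pgrep reports its PID

wait "$launcher_pid"           # launcher has started its child and exited
late_pid=$(pgrep -f "$tag")    # only the long-lived child matches now

echo "\$!:    $launcher_pid"
echo "early: $early_pid"
echo "late:  $late_pid"

kill "$late_pid"               # clean up the stand-in child
```

The early lookup agrees with $!, as in the script from the question, while the late lookup reports a different PID, as seen from the terminal.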

[1]: Technically, it might be more complicated than this, if ssh is using a classic double-fork to background. In that case, there would be another, ephemeral ssh process between the two.

[2]: Unless you're systemd, using cgroups or something similar to keep track of all your children. You aren't.