Charmrun should test for SSH failures when node programs fail to launch
When charmrun sees a failure in ssh-ing to a compute node, the fault could be either in the SSH connection or in the node script or compute process itself. To distinguish these cases, charmrun could try a really simple ssh connection running something like `echo hello`. If that works, SSH is happy and the problem is in the script or executable; if it fails, then we can unambiguously report that the problem lies with SSH.
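A minimal sketch of the proposed probe, written in Python for illustration only (charmrun itself does not contain this code). The function name `classify_launch_failure`, the injectable `run` parameter, and the returned labels are all hypothetical; `BatchMode=yes` is a standard OpenSSH option that prevents the probe from blocking on a password prompt.

```python
import subprocess

def classify_launch_failure(host, run=subprocess.run, timeout=10):
    """Guess the fault domain after a failed launch on `host`.

    Returns "ssh" if a trivial remote command cannot be run, and
    "node-program" if SSH works, so the fault must lie elsewhere.
    """
    cmd = ["ssh", "-o", "BatchMode=yes", host, "echo", "hello"]
    try:
        result = run(cmd, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        # ssh binary missing, or the connection hung: blame SSH.
        return "ssh"
    if result.returncode == 0 and result.stdout.strip() == "hello":
        return "node-program"
    return "ssh"
```

Running the probe only after a launch has already failed, rather than before every launch, would keep the extra connection off the normal startup path.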
#7 Updated by Evan Ramos over 1 year ago
Would doing this imply that we create an SSH connection just to run `echo hello` or similar as a diagnostic, then disconnect, create a new one, and proceed as before? If so, this has the potential to double the time it takes to launch. If not, I don't immediately understand how we could achieve the separation of fault domains.
#9 Updated by Evan Ramos over 1 year ago
- Status changed from New to Feedback
The following lines are printed by the script run on the remote ssh instance with ++verbose:
Charmrun remote shell(localhost1.0)> remote responding...
Charmrun remote shell(localhost1.0)> starting node-program...
Charmrun remote shell(localhost1.0)> remote shell phase successful.
I believe the presence or absence of these lines should be sufficient to diagnose the problem as described in this issue. Feedback?
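As a rough illustration of that diagnosis, the sketch below scans captured `++verbose` output for the three messages quoted above and reports the first one that is missing. The function name `diagnose_verbose_output` and the wording of each verdict are assumptions made for this example; in particular, the mapping from a missing message to a fault domain is an interpretation, not documented charmrun behavior.

```python
# Marker text is taken from the ++verbose lines quoted above; the
# verdict attached to each missing marker is an assumed interpretation.
PHASES = [
    ("remote responding...", "no response from remote shell: suspect SSH"),
    ("starting node-program...", "remote shell responded but did not reach the node program: suspect the node script"),
    ("remote shell phase successful.", "node program start was attempted but the shell phase did not finish: suspect the node script or executable"),
]

def diagnose_verbose_output(output):
    """Return a verdict for the first expected message that is absent."""
    for marker, verdict in PHASES:
        if marker not in output:
            return verdict
    return "remote shell phase completed: suspect the compute process"
```

For example, output that contains only the `remote responding...` line would be reported as a suspected node script problem rather than an SSH problem.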
#11 Updated by Evan Ramos about 1 year ago
- Target version set to 6.9.0
- Status changed from Feedback to Implemented
++verbose is definitely enough to diagnose the described problem, but I've added this fact to the documentation in order to satisfy this issue: https://charm.cs.illinois.edu/gerrit/4163