Feature #789

Charmrun should test for SSH failures when node programs fail to launch

Added by Phil Miller about 4 years ago. Updated about 1 year ago.




When charmrun sees a failure while ssh-ing to a compute node, the fault could be either in the SSH connection or in the node script or compute process itself. To distinguish these cases, charmrun could try a really simple ssh connection running something like `echo hello`. If that works, SSH is happy and the problem is in the script or executable; if it fails, then we can unambiguously report that the problem lies with SSH.
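A minimal sketch of that fallback probe, in Python rather than charmrun's own code. The function name `classify_launch_failure`, the injectable `run` parameter, the 10-second timeout, and the use of OpenSSH's `BatchMode=yes` option (so the probe never blocks on a password prompt) are all assumptions made for illustration, not anything charmrun is known to do:

```python
import subprocess


def classify_launch_failure(host, run=subprocess.run, timeout=10):
    """Call this only after a node launch has already failed.

    Returns "ssh" if a trivial SSH command also fails, and
    "node-program" if SSH works and the fault must lie elsewhere.
    """
    try:
        # Hypothetical probe: the simplest possible remote command.
        result = run(
            ["ssh", "-o", "BatchMode=yes", host, "echo", "hello"],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # ssh binary missing, or the connection hung: blame SSH.
        return "ssh"
    if result.returncode == 0 and result.stdout.strip() == "hello":
        return "node-program"
    return "ssh"
```

Because the probe runs only on the failure path, a successful launch pays no extra connection cost.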

Related issues

Related to Charm++ - Bug #709: Net charmrun should disable SSH password authentication (Merged, 03/19/2015)


#1 Updated by Nikhil Jain almost 4 years ago

  • Assignee set to Phil Miller

#2 Updated by Nikhil Jain over 3 years ago

  • Target version changed from 6.7.0 to 6.7.1

#3 Updated by Sam White over 3 years ago

  • Target version changed from 6.7.1 to 6.8.0

#4 Updated by Phil Miller over 2 years ago

  • Target version changed from 6.8.0 to 6.8.1

It'd be nice to have, but nothing is broken by missing this.

#5 Updated by Eric Bohm almost 2 years ago

  • Target version changed from 6.8.1 to 6.9.0

#6 Updated by Eric Bohm over 1 year ago

  • Assignee changed from Phil Miller to Evan Ramos

#7 Updated by Evan Ramos over 1 year ago

Would doing this imply that we create an SSH connection just to run `echo hello` or similar as a diagnostic, then disconnect, create a new one, and proceed as before? If so, this has the potential to double the time it takes to launch. If not, I don't immediately understand how we could achieve the separation of fault domains.

#8 Updated by Sam White over 1 year ago

I think the idea is to flip the ordering of what you said: first try to actually connect, then, if there is an error, do the 'echo hello' test to narrow down the cause of the error.

#9 Updated by Evan Ramos over 1 year ago

  • Status changed from New to Feedback

The following lines are printed by the script run on the remote ssh instance with ++verbose:

Charmrun remote shell(localhost1.0)> remote responding...
Charmrun remote shell(localhost1.0)> starting node-program...
Charmrun remote shell(localhost1.0)> remote shell phase successful.

I believe the presence or absence of these lines should be sufficient to diagnose the problem as described in this issue. Feedback?
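To illustrate how the presence or absence of those lines could be read, here is a small Python sketch that scans captured ++verbose output for the three messages quoted above. The function name `diagnose`, the returned labels, and the mapping from a missing line to a fault domain are my own assumptions for illustration; charmrun does not necessarily expose anything like this:

```python
# The three messages quoted above, in the order they are printed.
MARKERS = [
    "remote responding...",
    "starting node-program...",
    "remote shell phase successful.",
]


def diagnose(verbose_output):
    """Guess the fault domain from captured ++verbose output.

    Returns "ssh" if the remote shell never responded,
    "script-or-node-program" if it responded but a later phase is
    missing, and None if all three messages are present.
    """
    lines = [line.rstrip() for line in verbose_output.splitlines()]
    for index, marker in enumerate(MARKERS):
        # Each message follows a "Charmrun remote shell(...)> " prefix.
        if not any(line.endswith("> " + marker) for line in lines):
            return "ssh" if index == 0 else "script-or-node-program"
    return None
```

If all three messages appear and the job still fails, the remote shell phase completed, so the cause would have to be looked for after that point.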

#10 Updated by Evan Ramos over 1 year ago

  • Target version deleted (6.9.0)

Removing version tag because I consider this issue satisfied, but am waiting on feedback before marking it closed.

#11 Updated by Evan Ramos about 1 year ago

  • Target version set to 6.9.0
  • Status changed from Feedback to Implemented

++verbose is definitely enough to diagnose the described problem, but I've added this fact to the documentation in order to satisfy this issue:

#12 Updated by Sam White about 1 year ago

  • Status changed from Implemented to Merged
