Feature #789

Charmrun should test for SSH failures when node programs fail to launch

Added by Phil Miller over 3 years ago. Updated 7 months ago.

Status: Merged
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 07/09/2015
Due date:
% Done: 0%


Description

When charmrun sees a failure in ssh-ing to a compute node, the fault could be in the SSH connection itself, in the node script, or in the compute process. To distinguish these cases, charmrun could try a really simple ssh connection running something like `echo hello`. If that works, SSH is happy and the problem is in the script or executable; if it fails, then we can unambiguously report that the problem lies with SSH.
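A minimal sketch of that probe, in Python for illustration only (charmrun itself is not written this way). The names `probe_ssh` and `diagnose_launch_failure` are hypothetical, as is the choice of timeout; `BatchMode` and `ConnectTimeout` are standard OpenSSH client options, used here so the probe cannot hang on a password prompt or an unreachable host.

```python
import subprocess


def probe_ssh(host, run=subprocess.run, timeout=10):
    """Return True if a trivial command can be run on `host` over SSH.

    `run` is injectable so the logic can be exercised without a real SSH
    connection; by default it is subprocess.run.
    """
    cmd = [
        "ssh",
        "-o", "BatchMode=yes",                # never prompt for a password
        "-o", f"ConnectTimeout={timeout}",    # give up on unreachable hosts
        host,
        "echo", "hello",
    ]
    try:
        result = run(cmd, capture_output=True, text=True, timeout=timeout + 5)
    except (OSError, subprocess.TimeoutExpired):
        # ssh binary missing, or the probe itself hung: treat as an SSH fault.
        return False
    # Accept a login banner before the echoed text, but require the echo.
    return result.returncode == 0 and result.stdout.strip().endswith("hello")


def diagnose_launch_failure(host, run=subprocess.run):
    """Call only after a node launch has already failed on `host`."""
    if probe_ssh(host, run=run):
        return "node script or executable"
    return "ssh"
```

Because the probe runs only after a launch has failed, it adds no cost to a successful launch.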


Related issues

Related to Charm++ - Bug #709: Net charmrun should disable SSH password authentication Merged 03/19/2015

History

#1 Updated by Nikhil Jain over 3 years ago

  • Assignee set to Phil Miller

#2 Updated by Nikhil Jain about 3 years ago

  • Target version changed from 6.7.0 to 6.7.1

#3 Updated by Sam White over 2 years ago

  • Target version changed from 6.7.1 to 6.8.0

#4 Updated by Phil Miller over 1 year ago

  • Target version changed from 6.8.0 to 6.8.1

It'd be nice to have, but nothing is broken by missing this.

#5 Updated by Eric Bohm over 1 year ago

  • Target version changed from 6.8.1 to 6.9.0

#6 Updated by Eric Bohm about 1 year ago

  • Assignee changed from Phil Miller to Evan Ramos

#7 Updated by Evan Ramos 11 months ago

Would doing this imply that we create an SSH connection just to run `echo hello` or similar as a diagnostic, then disconnect, create a new one, and proceed as before? If so, this has the potential to double the time it takes to launch. If not, I don't immediately understand how we could achieve the separation of fault domains.

#8 Updated by Sam White 11 months ago

I think the idea is to flip the ordering of what you said: first try to actually connect, then, if there is an error, do the `echo hello` test to narrow down the cause of the error.

#9 Updated by Evan Ramos 11 months ago

  • Status changed from New to Feedback

The following lines are printed by the script run on the remote ssh instance with ++verbose:

Charmrun remote shell(localhost1.0)> remote responding...
Charmrun remote shell(localhost1.0)> starting node-program...
Charmrun remote shell(localhost1.0)> remote shell phase successful.

I believe the presence or absence of these lines should be sufficient to diagnose the problem as described in this issue. Feedback?
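As a rough illustration of how those lines could be used, the sketch below scans captured ++verbose output and reports the furthest phase each remote shell reached. It is written in Python purely for illustration; the function names and the wording of the interpretations are assumptions, and only the three message strings are taken from the output quoted above.

```python
import re

# The three messages quoted above, in the order they are printed.
PHASES = [
    "remote responding...",
    "starting node-program...",
    "remote shell phase successful.",
]

LINE_RE = re.compile(r"^Charmrun remote shell\((?P<node>[^)]*)\)> (?P<msg>.*)$")


def last_phase_per_node(output):
    """Map each node label to the index in PHASES of the furthest phase seen."""
    reached = {}
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        msg = match.group("msg").strip()
        if msg in PHASES:
            node = match.group("node")
            reached[node] = max(reached.get(node, -1), PHASES.index(msg))
    return reached


def describe_progress(index):
    """Hypothetical interpretation of how far a remote shell got."""
    if index < 0:
        return "no response from remote shell: suspect SSH"
    if index == 0:
        return "SSH worked: failure before the node program was started"
    if index == 1:
        return "SSH worked: failure while starting the node program"
    return "remote shell phase succeeded: failure is elsewhere"
```

A node that never appears in the result can be passed to `describe_progress` as `-1`, which corresponds to the case where none of the lines were printed.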

#10 Updated by Evan Ramos 10 months ago

  • Target version deleted (6.9.0)

Removing version tag because I consider this issue satisfied, but am waiting on feedback before marking it closed.

#11 Updated by Evan Ramos 7 months ago

  • Target version set to 6.9.0
  • Status changed from Feedback to Implemented

++verbose is definitely enough to diagnose the described problem, but I've added this fact to the documentation in order to satisfy this issue: https://charm.cs.illinois.edu/gerrit/4163

#12 Updated by Sam White 7 months ago

  • Status changed from Implemented to Merged
