Bug #1641

charmrun with nodelist option (++nodelist) fails on campus cluster

Added by Nitin Bhat 5 months ago. Updated 2 months ago.

Status: New
Priority: Normal
Assignee:
Category: -
Target version: -
Start date: 07/24/2017
Due date:
% Done: 0%


Description

[nbhat4@golub162 pingpong]$ ~/gennodelist.sh > ./nodelist && ./charmrun +p2 ++nodelist ./nodelist ./pgm
Charmrun> scalable start enabled.
Charmrun> IBVERBS version of charmrun
Charmrun> Error 255 returned from remote shell (golub162:0)
Charmrun> Reconnection attempt 1 of 3
Charmrun> Error 255 returned from remote shell (golub161:1)
Charmrun> Reconnection attempt 1 of 3
Charmrun> Error 255 returned from remote shell (golub162:0)
Charmrun> Reconnection attempt 2 of 3
Charmrun> Error 255 returned from remote shell (golub161:1)
Charmrun> Reconnection attempt 2 of 3
Charmrun> Error 255 returned from remote shell (golub162:0)
Charmrun> Reconnection attempt 3 of 3
Charmrun> Error 255 returned from remote shell (golub161:1)
Charmrun> Reconnection attempt 3 of 3
Charmrun> Error 255 returned from remote shell (golub162:0)
Charmrun> Too many reconnection attempts; bailing out

This is a recent issue; I remember it working fine a couple of months ago. I suspect it is specific to Golub/Taub, since ++nodelist works fine on the lab machines.
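One way to narrow this down, assuming charmrun's remote shell here is ssh: ssh itself exits with status 255 when the connection or authentication fails, so a non-interactive ssh to each listed host may show whether the failure reproduces outside charmrun. The sketch below is not part of Charm++. The function names are hypothetical, and it assumes the file written by gennodelist.sh uses the usual "host NAME" lines.

```shell
#!/bin/sh
# Hypothetical helpers; assumes nodelist lines of the form "host NAME [options]".

# Print the host names found in the nodelist file given as $1.
nodelist_hosts() {
    awk '$1 == "host" { print $2 }' "$1"
}

# Attempt a non-interactive ssh to every host in the nodelist and report
# the exit status. A status of 255 comes from ssh itself, not from the
# remote command.
check_nodelist_ssh() {
    for h in $(nodelist_hosts "$1"); do
        ssh -o BatchMode=yes -o ConnectTimeout=5 "$h" true </dev/null
        echo "$h: ssh exit status $?"
    done
}

# Usage: check_nodelist_ssh ./nodelist
```

If every host reports 255 here as well, the problem is likely in the cluster's ssh setup (keys, host-based access between compute nodes) rather than in charmrun.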

An alternative is to launch jobs with ++mpiexec. The resolution of this bug should at least document ++mpiexec as an alternative way of launching jobs, and probably also document the connection failure on Golub/Taub as a known issue.
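For reference, a minimal sketch of the ++mpiexec launch for the same pingpong run. The wrapper function and the DRY_RUN variable are hypothetical conveniences, not part of charmrun. It presumably has to be run from inside a job allocation where the system's mpiexec is available.

```shell
#!/bin/sh
# Hypothetical wrapper: launch through mpiexec instead of a nodelist.
# $1 is the number of processes; the remaining arguments are the program
# and its arguments. Set DRY_RUN to a non-empty value to print the
# command instead of executing it.
launch_with_mpiexec() {
    np=$1
    shift
    ${DRY_RUN:+echo} ./charmrun +p"$np" ++mpiexec "$@"
}

# Usage (equivalent to typing: ./charmrun +p2 ++mpiexec ./pgm):
#   launch_with_mpiexec 2 ./pgm
```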

History

#1 Updated by Eric Bohm 2 months ago

  • Assignee set to Evan Ramos

#2 Updated by Evan Ramos 2 months ago

Would it be possible to grant me access to Golub/Taub so I can test this directly?
