Support #1674

Add 'ofi' target to autobuild

Added by Sam White 9 months ago. Updated 4 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Build & Test Automation
Target version:
Start date: 09/11/2017
Due date:
% Done: 0%


Related issues

Blocked by Charm++ - Bug #1738: Failure at LrtsInit with OFI build with verbs provider on Golub (New, 11/06/2017)

History

#1 Updated by Sam White 9 months ago

  • Assignee set to Jaemin Choi

ofi builds could be added on the same target machine (campus cluster) as verbs, so assigning to Jaemin.

#2 Updated by Sam White 9 months ago

Bridges at PSC would be an ideal candidate for autobuild, if we can set it up to run without 2-factor authentication.

#3 Updated by Jim Phillips 9 months ago

Bridges doesn't use 2-factor authentication.

#4 Updated by Phil Miller 9 months ago

  • Tracker changed from Bug to Support

#5 Updated by Sam White 8 months ago

  • Target version set to 6.9.0
  • Priority changed from Normal to High

Is there a problem with ofi on golub? This needs to get done.

#6 Updated by Sam White 8 months ago

Bump. Any reason to not add this on golub?

#7 Updated by Jaemin Choi 8 months ago

I've been working on getting it running on golub, but there is an issue with ++mpiexec: if ppn in qsub is set larger than the total number of PEs used by a test program, the job uses only a single physical node. This is problematic for SMP builds, and it affects the verbs autobuilds as well. I've found a workaround: set ppn to 1 in qsub, create a nodelist from PBS_NODEFILE, and use that. I'll go forward with it.
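The nodelist workaround described above could be sketched roughly as follows. This is only an illustration, not the script used on golub: the function names are hypothetical, and it assumes the usual charmrun nodelist layout of a `group main` line followed by one `host` line per node.

```python
import os


def unique_hosts(nodefile_lines):
    """Return host names in first-seen order, dropping blanks and repeats."""
    seen = set()
    hosts = []
    for line in nodefile_lines:
        host = line.strip()
        if host and host not in seen:
            seen.add(host)
            hosts.append(host)
    return hosts


def make_nodelist(hosts, group="main"):
    """Render hosts as nodelist text: a group line, then one host per line."""
    lines = [f"group {group}"]
    lines.extend(f"  host {h}" for h in hosts)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # PBS sets PBS_NODEFILE inside a job; outside a job this does nothing.
    nodefile = os.environ.get("PBS_NODEFILE")
    if nodefile:
        with open(nodefile) as f:
            print(make_nodelist(unique_hosts(f)), end="")
```

The output would be redirected to a file and handed to charmrun with ++nodelist. With ppn set to 1 each node should appear only once in PBS_NODEFILE, so the de-duplication is just a safeguard.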

#8 Updated by Jaemin Choi 8 months ago

  • Blocked by Bug #1738: Failure at LrtsInit with OFI build with verbs provider on Golub added

#9 Updated by Jaemin Choi 8 months ago

  • Status changed from New to In Progress

#10 Updated by Sam White 8 months ago

We want this to be on Bridges since that has an Omni-Path interconnect and we know OFI works there. We need to ensure we have enough allocation to run autobuild on it.

#11 Updated by Sam White 6 months ago

Bump, this needs to get done.

#12 Updated by Jaemin Choi 6 months ago

I don't have allocation anymore on Bridges, but Karthik does.
We'll try using his account to set up autobuild there.

#13 Updated by Jaemin Choi 5 months ago

  • Status changed from In Progress to Resolved

#14 Updated by Sam White 5 months ago

  • Status changed from Resolved to In Progress

Autobuild for OFI is not passing yet. I don't think it has even gotten to building charm yet. Here is the failure from last night's run:

./instead_test.sh: line 15: cd: charm/ofi-linux-x86_64/tmp: No such file or directory

#15 Updated by Sam White 5 months ago

Last night autobuild was able to log onto Bridges successfully but failed to unzip charm:

remote> gunzip -f charm.tar.gz
gzip: charm.tar.gz: No such file or directory

#16 Updated by Sam White 5 months ago

The build works, but then the jobs are pretty consistently timing out for whatever reason now:

In testdir charm/ofi-linux-x86_64/tmp
Submitting batch job for> make test OPTS=
 using the command> sbatch /home/skk3/autobuild/ofi/charmrun_script.31865.sh
Job enqueued under job ID 2272723
Job in state 
Job in state RUNNING
Job in state RUNNING
Job in state RUNNING
...
...
Job in state TIMEOUT
Job in state TIMEOUT
Job in state TIMEOUT
...
...
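For reference, a polling loop of the kind that produces the log above might look like the sketch below. The autobuild's actual polling code is not shown in this thread, so everything here is an assumption: `get_state` is a hypothetical callable (in practice it might wrap a scheduler query), and the set of terminal states is a subset of the standard Slurm job states. This version stops at the first terminal state, so a TIMEOUT would be reported once.

```python
import time

# Slurm job states after which the job will not run any further.
TERMINAL_STATES = {"COMPLETED", "FAILED", "TIMEOUT", "CANCELLED", "NODE_FAIL"}


def wait_for_job(get_state, max_polls=1000, interval_s=0.0):
    """Poll get_state() until a terminal state or the poll budget runs out.

    Returns (final_state, polls_used); final_state is None when the
    budget is exhausted before the job reaches a terminal state.
    """
    for polls in range(1, max_polls + 1):
        state = get_state()
        print(f"Job in state {state}")
        if state in TERMINAL_STATES:
            return state, polls
        time.sleep(interval_s)
    return None, max_polls
```

Passing the state source in as a callable keeps the loop testable without a scheduler, and the poll budget keeps the autobuild from waiting forever on a job that never leaves the queue.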

#17 Updated by Sam White 5 months ago

Once we get the non-SMP build running, we'll want to add a second target that is SMP.

#18 Updated by Jaemin Choi 5 months ago

An issue that I thought was resolved by https://charm.cs.illinois.edu/gerrit/#/c/3452/ seems to have resurfaced: "There seems to be an issue with the OFI build that +p1 passed to an application is regarded as argv[1], and the pingpong benchmark (tests/charm++/pingpong) with ./pgm +p1 hangs as it tries to use +p1 as the payload which is ultimately set to 0."

#19 Updated by Jaemin Choi 5 months ago

Actually, the problem this time doesn't seem to be caused by +p1. The command that causes the hang is ../../../bin/testrun ./pgm +p1 ++timeout 180 +isomalloc_sync, and the ++timeout 180 part is the culprit (removing it makes the run work). But I think this problem happens whenever something unparsable is passed, because even +timeout 180 causes the same hang. The same thing happens if I use charmrun instead of ../../../bin/testrun.
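A toy model of the suspected mechanism, for illustration only. This is not Charm++ code: the set of recognized flags, the default payload, and all function names are hypothetical. It only shows how a flag the runtime does not consume could end up as argv[1] and then be converted to 0 by a C-style atoi.

```python
import re

# Hypothetical flags that this toy "runtime" recognizes and removes.
KNOWN_FLAGS = {"+isomalloc_sync"}


def strip_runtime_args(argv):
    """Drop +pN and known flags; everything else is left for the application."""
    remaining = [argv[0]]
    for arg in argv[1:]:
        if re.fullmatch(r"\+p\d+", arg) or arg in KNOWN_FLAGS:
            continue
        remaining.append(arg)
    return remaining


def atoi(text):
    """Mimic C atoi: optional sign followed by digits, otherwise 0."""
    match = re.match(r"\s*([+-]?\d+)", text)
    return int(match.group(1)) if match else 0


def payload_size(app_argv, default=100):
    """Read the payload from argv[1] the way a simple benchmark might."""
    return atoi(app_argv[1]) if len(app_argv) > 1 else default


leftover = strip_runtime_args(["./pgm", "+p1", "++timeout", "180", "+isomalloc_sync"])
print(leftover, payload_size(leftover))
```

In this model the unconsumed ++timeout becomes argv[1] and the payload is 0, while a command line with only recognized flags falls back to the default payload.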

#20 Updated by Sam White 5 months ago

We still need +isomalloc_sync. The tests ran last night but failed in an AMPI test that needs that flag.

#21 Updated by Sam White 5 months ago

I added +isomalloc_sync and it passed last night. All that is needed now is to add an SMP target.

#22 Updated by Jaemin Choi 5 months ago

Added SMP target to system_list and created ofi-smp folder along with instead_test.sh on Bridges.

#23 Updated by Sam White 4 months ago

  • Status changed from In Progress to Resolved

ofi non-SMP and SMP passed yesterday. The SMP build seems to often hang, so that should still be monitored and addressed, but we can mark this resolved now.
