Support #2031

Add a new target (potentially benchmarks) for Vesta autobuilds to avoid exceeding the maximum execution time

Added by Evan Ramos 7 months ago. Updated 3 months ago.

Status: In Progress
Priority: Normal
Assignee:
Category: Build & Test Automation
Target version:
Start date: 11/19/2018
Due date:
% Done: 0%

Related issues

Related to Charm++ - Cleanup #1872: Move performance tests and benchmarks from "make test" to a new "make benchmark" (Implemented, 04/17/2018)

History

#1 Updated by Eric Bohm 7 months ago

  • Assignee set to Nitin Bhat

#2 Updated by Eric Bohm 3 months ago

  • Target version changed from 6.9.1 to 6.10.0

#3 Updated by Evan Ramos 3 months ago

This bug should be reassigned to a PPL member with BG/Q access.

#4 Updated by Nitin Bhat 3 months ago

  • Status changed from New to In Progress

Since the BG/Q autobuilds began running successfully only yesterday, I am going to wait about 10 days to make sure this issue doesn't recur.

#5 Updated by Nitin Bhat 3 months ago

On examining the job logs for the reported failures, I found that all of them were caused by exceeding the maximum execution time (as shown in the job logs below).

Jobid: 708986
qsub -A CharmRTS -q default -n 1 -t 90 -o /projects/CharmRTS/nbhat/autobuild/pamilrts-clang-smp/charmrun_script.31693.stdout -e /projects/CharmRTS/nbhat/autobuild/pamilrts-clang-smp/charmrun_script.31693.error --debuglog /projects/CharmRTS/nbhat/autobuild/pamilrts-clang-smp/charmrun_script.31693.joblog --mode script /projects/CharmRTS/nbhat/autobuild/pamilrts-clang-smp/charmrun_script.31693.sh
Sun Nov 11 10:56:26 2018 +0000 (UTC) submitted with cwd set to: /gpfs/vesta-fs0/projects/CharmRTS/nbhat/autobuild/pamilrts-clang-smp/charm/pamilrts-bluegeneq-smp/tmp
Sun Nov 11 10:57:03 2018 +0000 (UTC) nbhat/708986: Initiating boot at location VST-00240-11351-32.
Sun Nov 11 10:57:20 2018 +0000 (UTC)
Sun Nov 11 10:57:20 2018 +0000 (UTC) Command: '/projects/CharmRTS/nbhat/autobuild/pamilrts-clang-smp/charmrun_script.31693.sh'
Sun Nov 11 10:57:20 2018 +0000 (UTC)
Sun Nov 11 10:57:20 2018 +0000 (UTC) Environment:
Sun Nov 11 10:57:20 2018 +0000 (UTC) SHELL=/bin/bash
Sun Nov 11 10:57:20 2018 +0000 (UTC) COBALT_NODEFILE=/tmp/tmp_PCnCm
Sun Nov 11 10:57:20 2018 +0000 (UTC) COBALT_PARTNAME=VST-00240-11351-32
Sun Nov 11 10:57:20 2018 +0000 (UTC) COBALT_JOBID=708986
Sun Nov 11 10:57:20 2018 +0000 (UTC) COBALT_STARTTIME=1541933790
Sun Nov 11 10:57:20 2018 +0000 (UTC) LOGNAME=nbhat
Sun Nov 11 10:57:20 2018 +0000 (UTC) USER=nbhat
Sun Nov 11 10:57:20 2018 +0000 (UTC) COBALT_ENDTIME=1541939190
Sun Nov 11 10:57:20 2018 +0000 (UTC) COBALT_PARTSIZE=32
Sun Nov 11 10:57:20 2018 +0000 (UTC) HOME=/home/nbhat
Sun Nov 11 10:57:20 2018 +0000 (UTC) COBALT_JOBSIZE=1
Sun Nov 11 10:57:20 2018 +0000 (UTC)
Sun Nov 11 10:57:20 2018 +0000 (UTC) Info: stdin received from /dev/null
Sun Nov 11 10:57:20 2018 +0000 (UTC) Info: stdout sent to /projects/CharmRTS/nbhat/autobuild/pamilrts-clang-smp/charmrun_script.31693.stdout
Sun Nov 11 10:57:20 2018 +0000 (UTC) Info: stderr sent to /projects/CharmRTS/nbhat/autobuild/pamilrts-clang-smp/charmrun_script.31693.error
Sun Nov 11 10:57:20 2018 +0000 (UTC)
Sun Nov 11 10:57:24 2018 +0000 (UTC) nbhat/708986: Block VST-00240-11351-32 for location VST-00240-11351-32 successfully booted (Initiating).
Sun Nov 11 12:26:35 2018 +0000 (UTC) Info: maximum execution time exceeded; initiating job termination
Jobid: 709062
qsub -A CharmRTS -q default -n 1 -t 90 -o /projects/CharmRTS/nbhat/autobuild/pamilrts-clang-nosmp/charmrun_script.42606.stdout -e /projects/CharmRTS/nbhat/autobuild/pamilrts-clang-nosmp/charmrun_script.42606.error --debuglog /projects/CharmRTS/nbhat/autobuild/pamilrts-clang-nosmp/charmrun_script.42606.joblog --mode script /projects/CharmRTS/nbhat/autobuild/pamilrts-clang-nosmp/charmrun_script.42606.sh
Mon Nov 12 11:48:56 2018 +0000 (UTC) submitted with cwd set to: /gpfs/vesta-fs0/projects/CharmRTS/nbhat/autobuild/pamilrts-clang-nosmp/charm/pamilrts-bluegeneq/tmp
Mon Nov 12 11:49:23 2018 +0000 (UTC) nbhat/709062: Initiating boot at location VST-20060-31171-32.
Mon Nov 12 11:49:41 2018 +0000 (UTC)
Mon Nov 12 11:49:41 2018 +0000 (UTC) Command: '/projects/CharmRTS/nbhat/autobuild/pamilrts-clang-nosmp/charmrun_script.42606.sh'
Mon Nov 12 11:49:41 2018 +0000 (UTC)
Mon Nov 12 11:49:41 2018 +0000 (UTC) Environment:
Mon Nov 12 11:49:41 2018 +0000 (UTC) SHELL=/bin/bash
Mon Nov 12 11:49:41 2018 +0000 (UTC) COBALT_NODEFILE=/tmp/tmpphhq4s
Mon Nov 12 11:49:41 2018 +0000 (UTC) COBALT_PARTNAME=VST-20060-31171-32
Mon Nov 12 11:49:41 2018 +0000 (UTC) COBALT_JOBID=709062
Mon Nov 12 11:49:41 2018 +0000 (UTC) COBALT_STARTTIME=1542023339
Mon Nov 12 11:49:41 2018 +0000 (UTC) LOGNAME=nbhat
Mon Nov 12 11:49:41 2018 +0000 (UTC) USER=nbhat
Mon Nov 12 11:49:41 2018 +0000 (UTC) COBALT_ENDTIME=1542028739
Mon Nov 12 11:49:41 2018 +0000 (UTC) COBALT_PARTSIZE=32
Mon Nov 12 11:49:41 2018 +0000 (UTC) HOME=/home/nbhat
Mon Nov 12 11:49:41 2018 +0000 (UTC) COBALT_JOBSIZE=1
Mon Nov 12 11:49:41 2018 +0000 (UTC)
Mon Nov 12 11:49:41 2018 +0000 (UTC) Info: stdin received from /dev/null
Mon Nov 12 11:49:41 2018 +0000 (UTC) Info: stdout sent to /projects/CharmRTS/nbhat/autobuild/pamilrts-clang-nosmp/charmrun_script.42606.stdout
Mon Nov 12 11:49:41 2018 +0000 (UTC) Info: stderr sent to /projects/CharmRTS/nbhat/autobuild/pamilrts-clang-nosmp/charmrun_script.42606.error
Mon Nov 12 11:49:41 2018 +0000 (UTC)
Mon Nov 12 11:49:44 2018 +0000 (UTC) nbhat/709062: Block VST-20060-31171-32 for location VST-20060-31171-32 successfully booted (Initiating).
Mon Nov 12 13:19:06 2018 +0000 (UTC) Info: maximum execution time exceeded; initiating job termination
Jobid: 709250
qsub -A CharmRTS -q default -n 1 -t 90 -o /projects/CharmRTS/nbhat/autobuild/pamilrts-clang-smp/charmrun_script.9348.stdout -e /projects/CharmRTS/nbhat/autobuild/pamilrts-clang-smp/charmrun_script.9348.error --debuglog /projects/CharmRTS/nbhat/autobuild/pamilrts-clang-smp/charmrun_script.9348.joblog --mode script /projects/CharmRTS/nbhat/autobuild/pamilrts-clang-smp/charmrun_script.9348.sh
Wed Nov 14 10:53:57 2018 +0000 (UTC) submitted with cwd set to: /gpfs/vesta-fs0/projects/CharmRTS/nbhat/autobuild/pamilrts-clang-smp/charm/pamilrts-bluegeneq-smp/tmp
Wed Nov 14 10:54:34 2018 +0000 (UTC) nbhat/709250: Initiating boot at location VST-22260-33371-32.
Wed Nov 14 10:54:54 2018 +0000 (UTC)
Wed Nov 14 10:54:54 2018 +0000 (UTC) Command: '/projects/CharmRTS/nbhat/autobuild/pamilrts-clang-smp/charmrun_script.9348.sh'
Wed Nov 14 10:54:54 2018 +0000 (UTC)
Wed Nov 14 10:54:54 2018 +0000 (UTC) Environment:
Wed Nov 14 10:54:54 2018 +0000 (UTC) SHELL=/bin/bash
Wed Nov 14 10:54:54 2018 +0000 (UTC) COBALT_NODEFILE=/tmp/tmpIJX2lM
Wed Nov 14 10:54:54 2018 +0000 (UTC) COBALT_PARTNAME=VST-22260-33371-32
Wed Nov 14 10:54:54 2018 +0000 (UTC) COBALT_JOBID=709250
Wed Nov 14 10:54:54 2018 +0000 (UTC) COBALT_STARTTIME=1542192845
Wed Nov 14 10:54:54 2018 +0000 (UTC) LOGNAME=nbhat
Wed Nov 14 10:54:54 2018 +0000 (UTC) USER=nbhat
Wed Nov 14 10:54:54 2018 +0000 (UTC) COBALT_ENDTIME=1542198245
Wed Nov 14 10:54:54 2018 +0000 (UTC) COBALT_PARTSIZE=32
Wed Nov 14 10:54:54 2018 +0000 (UTC) HOME=/home/nbhat
Wed Nov 14 10:54:54 2018 +0000 (UTC) COBALT_JOBSIZE=1
Wed Nov 14 10:54:54 2018 +0000 (UTC)
Wed Nov 14 10:54:54 2018 +0000 (UTC) Info: stdin received from /dev/null
Wed Nov 14 10:54:54 2018 +0000 (UTC) Info: stdout sent to /projects/CharmRTS/nbhat/autobuild/pamilrts-clang-smp/charmrun_script.9348.stdout
Wed Nov 14 10:54:54 2018 +0000 (UTC) Info: stderr sent to /projects/CharmRTS/nbhat/autobuild/pamilrts-clang-smp/charmrun_script.9348.error
Wed Nov 14 10:54:54 2018 +0000 (UTC)
Wed Nov 14 10:54:55 2018 +0000 (UTC) nbhat/709250: Block VST-22260-33371-32 for location VST-22260-33371-32 successfully booted (Initiating).
Wed Nov 14 12:24:11 2018 +0000 (UTC) Info: maximum execution time exceeded; initiating job termination
Jobid: 709400
qsub -A CharmRTS -q default -n 1 -t 90 -o /projects/CharmRTS/nbhat/autobuild/pamilrts-clang-nosmp/charmrun_script.42061.stdout -e /projects/CharmRTS/nbhat/autobuild/pamilrts-clang-nosmp/charmrun_script.42061.error --debuglog /projects/CharmRTS/nbhat/autobuild/pamilrts-clang-nosmp/charmrun_script.42061.joblog --mode script /projects/CharmRTS/nbhat/autobuild/pamilrts-clang-nosmp/charmrun_script.42061.sh
Fri Nov 16 10:46:55 2018 +0000 (UTC) submitted with cwd set to: /gpfs/vesta-fs0/projects/CharmRTS/nbhat/autobuild/pamilrts-clang-nosmp/charm/pamilrts-bluegeneq/tmp
Fri Nov 16 10:47:30 2018 +0000 (UTC) nbhat/709400: Initiating boot at location VST-20060-31171-32.
Fri Nov 16 10:47:47 2018 +0000 (UTC)
Fri Nov 16 10:47:47 2018 +0000 (UTC) Command: '/projects/CharmRTS/nbhat/autobuild/pamilrts-clang-nosmp/charmrun_script.42061.sh'
Fri Nov 16 10:47:47 2018 +0000 (UTC)
Fri Nov 16 10:47:47 2018 +0000 (UTC) Environment:
Fri Nov 16 10:47:47 2018 +0000 (UTC) SHELL=/bin/bash
Fri Nov 16 10:47:47 2018 +0000 (UTC) COBALT_NODEFILE=/tmp/tmptlPbNS
Fri Nov 16 10:47:47 2018 +0000 (UTC) COBALT_PARTNAME=VST-20060-31171-32
Fri Nov 16 10:47:47 2018 +0000 (UTC) COBALT_JOBID=709400
Fri Nov 16 10:47:47 2018 +0000 (UTC) COBALT_STARTTIME=1542365218
Fri Nov 16 10:47:47 2018 +0000 (UTC) LOGNAME=nbhat
Fri Nov 16 10:47:47 2018 +0000 (UTC) USER=nbhat
Fri Nov 16 10:47:47 2018 +0000 (UTC) COBALT_ENDTIME=1542370618
Fri Nov 16 10:47:47 2018 +0000 (UTC) COBALT_PARTSIZE=32
Fri Nov 16 10:47:47 2018 +0000 (UTC) HOME=/home/nbhat
Fri Nov 16 10:47:47 2018 +0000 (UTC) COBALT_JOBSIZE=1
Fri Nov 16 10:47:47 2018 +0000 (UTC)
Fri Nov 16 10:47:47 2018 +0000 (UTC) Info: stdin received from /dev/null
Fri Nov 16 10:47:47 2018 +0000 (UTC) Info: stdout sent to /projects/CharmRTS/nbhat/autobuild/pamilrts-clang-nosmp/charmrun_script.42061.stdout
Fri Nov 16 10:47:47 2018 +0000 (UTC) Info: stderr sent to /projects/CharmRTS/nbhat/autobuild/pamilrts-clang-nosmp/charmrun_script.42061.error
Fri Nov 16 10:47:47 2018 +0000 (UTC)
Fri Nov 16 10:47:51 2018 +0000 (UTC) nbhat/709400: Block VST-20060-31171-32 for location VST-20060-31171-32 successfully booted (Initiating).
Fri Nov 16 12:17:12 2018 +0000 (UTC) Info: maximum execution time exceeded; initiating job termination
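
The timeouts in job logs like the ones above can also be checked mechanically. The following is a minimal Python sketch, not part of the autobuild scripts: it assumes the job log format shown above, and the function name `summarize_joblog` is hypothetical. It derives the requested limit from `COBALT_STARTTIME` and `COBALT_ENDTIME` and reports whether the termination message appears.

```python
# Excerpt of the job log for job 708986, copied from the logs above.
SAMPLE = """\
Sun Nov 11 10:57:20 2018 +0000 (UTC) COBALT_JOBID=708986
Sun Nov 11 10:57:20 2018 +0000 (UTC) COBALT_STARTTIME=1541933790
Sun Nov 11 10:57:20 2018 +0000 (UTC) COBALT_ENDTIME=1541939190
Sun Nov 11 10:57:24 2018 +0000 (UTC) nbhat/708986: Block VST-00240-11351-32 for location VST-00240-11351-32 successfully booted (Initiating).
Sun Nov 11 12:26:35 2018 +0000 (UTC) Info: maximum execution time exceeded; initiating job termination
"""


def summarize_joblog(text):
    """Hypothetical helper: summarize one Cobalt job log given as a string."""
    env = {}
    timed_out = False
    for line in text.splitlines():
        # Every timestamped line contains " (UTC)"; the message follows it.
        _, sep, msg = line.partition(" (UTC)")
        if not sep:
            continue  # e.g. the "Jobid:" header or the qsub command line
        msg = msg.strip()
        if msg.startswith("COBALT_") and "=" in msg:
            key, _, value = msg.partition("=")
            env[key] = value
        elif "maximum execution time exceeded" in msg:
            timed_out = True
    # Both values are Unix timestamps in seconds.
    limit_s = int(env["COBALT_ENDTIME"]) - int(env["COBALT_STARTTIME"])
    return {
        "job_id": env.get("COBALT_JOBID"),
        "limit_minutes": limit_s / 60,
        "timed_out": timed_out,
    }


if __name__ == "__main__":
    print(summarize_joblog(SAMPLE))
```

For job 708986 the two timestamps differ by 5400 seconds, which matches the `-t 90` passed to `qsub`.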

I have increased the execution time for all the build jobs to 120 minutes (the maximum allowed for the queues that we have access to).
However, last night there was another failure because the maximum execution time was exceeded: http://charm.cs.illinois.edu/autobuild/old.2019_03_26__01_01/pamilrts-bluegeneq-async-smp.txt

Jobid: 717537
qsub -A CharmRTS -q low -n 1 -t 120 -o /projects/CharmRTS/jaemin/autobuild/pamilrts-clang-smp-async/charmrun_script.29545.stdout -e /projects/CharmRTS/jaemin/autobuild/pamilrts-clang-smp-async/charmrun_script.29545.error --debuglog /projects/CharmRTS/jaemin/autobuild/pamilrts-clang-smp-async/charmrun_script.29545.joblog --mode script /projects/CharmRTS/jaemin/autobuild/pamilrts-clang-smp-async/charmrun_script.29545.sh
Tue Mar 26 10:02:07 2019 +0000 (UTC) submitted with cwd set to: /gpfs/vesta-fs0/projects/CharmRTS/jaemin/autobuild/pamilrts-clang-smp-async/charm/pamilrts-bluegeneq-async-smp/tmp
Tue Mar 26 10:02:43 2019 +0000 (UTC) jaemin/717537: Initiating boot at location VST-22060-33171-32.
Tue Mar 26 10:03:03 2019 +0000 (UTC)
Tue Mar 26 10:03:03 2019 +0000 (UTC) Command: '/projects/CharmRTS/jaemin/autobuild/pamilrts-clang-smp-async/charmrun_script.29545.sh'
Tue Mar 26 10:03:03 2019 +0000 (UTC)
Tue Mar 26 10:03:03 2019 +0000 (UTC) Environment:
Tue Mar 26 10:03:03 2019 +0000 (UTC) SHELL=/bin/bash
Tue Mar 26 10:03:03 2019 +0000 (UTC) COBALT_NODEFILE=/tmp/tmpkebbcJ
Tue Mar 26 10:03:03 2019 +0000 (UTC) COBALT_PARTNAME=VST-22060-33171-32
Tue Mar 26 10:03:03 2019 +0000 (UTC) COBALT_JOBID=717537
Tue Mar 26 10:03:03 2019 +0000 (UTC) COBALT_STARTTIME=1553594537
Tue Mar 26 10:03:03 2019 +0000 (UTC) LOGNAME=jaemin
Tue Mar 26 10:03:03 2019 +0000 (UTC) USER=jaemin
Tue Mar 26 10:03:03 2019 +0000 (UTC) COBALT_ENDTIME=1553601737
Tue Mar 26 10:03:03 2019 +0000 (UTC) COBALT_PARTSIZE=32
Tue Mar 26 10:03:03 2019 +0000 (UTC) HOME=/home/jaemin
Tue Mar 26 10:03:03 2019 +0000 (UTC) COBALT_JOBSIZE=1
Tue Mar 26 10:03:03 2019 +0000 (UTC)
Tue Mar 26 10:03:03 2019 +0000 (UTC) Info: stdin received from /dev/null
Tue Mar 26 10:03:03 2019 +0000 (UTC) Info: stdout sent to /projects/CharmRTS/jaemin/autobuild/pamilrts-clang-smp-async/charmrun_script.29545.stdout
Tue Mar 26 10:03:03 2019 +0000 (UTC) Info: stderr sent to /projects/CharmRTS/jaemin/autobuild/pamilrts-clang-smp-async/charmrun_script.29545.error
Tue Mar 26 10:03:03 2019 +0000 (UTC)
Tue Mar 26 10:03:05 2019 +0000 (UTC) jaemin/717537: Block VST-22060-33171-32 for location VST-22060-33171-32 successfully booted (Initiating).
Tue Mar 26 12:02:25 2019 +0000 (UTC) Info: maximum execution time exceeded; initiating job termination

The failure above occurred even though we had set the execution time to the maximum allowed, i.e., 120 minutes. The output indicates that the program did not hang at any stage. I suspect that some of the programs sporadically run slowly and cause the entire job to exceed the execution time. We have already split the tests and programs into two separate 120-minute jobs (240 minutes in total). We may have to split the work into another job or, alternatively, identify the slow programs and remove them from autobuild testing.
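
One way to reason about splitting the work into additional jobs is sketched below in Python. This is illustrative only: the test names, the per-test durations, the 20-minute safety margin, and the function `split_into_jobs` are all hypothetical, and nothing here reflects how the autobuild scripts actually work. The sketch uses a greedy first-fit decreasing heuristic to pack tests into jobs so that each job stays under the 120-minute limit minus the margin.

```python
def split_into_jobs(durations_min, budget_min=120, margin_min=20):
    """Greedily pack tests into jobs whose total time stays within the cap.

    durations_min: dict mapping test name -> measured duration in minutes
                   (hypothetical data; real timings would have to be collected).
    budget_min:    per-job limit, 120 minutes as discussed above.
    margin_min:    assumed safety margin for tests that sporadically run slowly.
    """
    cap = budget_min - margin_min
    jobs = []  # each job is {"tests": [names], "total": minutes}
    # First-fit decreasing: place the longest tests first.
    ordered = sorted(durations_min.items(), key=lambda kv: kv[1], reverse=True)
    for name, minutes in ordered:
        if minutes > cap:
            raise ValueError(f"{name} alone exceeds the per-job cap of {cap} min")
        for job in jobs:
            if job["total"] + minutes <= cap:
                job["tests"].append(name)
                job["total"] += minutes
                break
        else:
            # No existing job has room, so open a new one.
            jobs.append({"tests": [name], "total": minutes})
    return jobs


if __name__ == "__main__":
    # Made-up durations, in minutes, purely for illustration.
    example = {"test_a": 60, "test_b": 50, "test_c": 40, "test_d": 30, "test_e": 10}
    for i, job in enumerate(split_into_jobs(example), start=1):
        print(f"job {i}: {job['tests']} ({job['total']} min)")
```

A heuristic like this does not guarantee the minimum number of jobs, but it makes explicit how much headroom each job has before it reaches the queue limit.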

#6 Updated by Evan Ramos 3 months ago

My task to separate benchmarks from tests might help here.

#7 Updated by Nitin Bhat 3 months ago

  • Related to Cleanup #1872: Move performance tests and benchmarks from "make test" to a new "make benchmark" added

#8 Updated by Nitin Bhat 3 months ago

  • Subject changed from pamilrts-bluegeneq occasionally hangs randomly in autobuild to Add a new target (potentially benchmarks) for Vesta autobuilds to avoid exceeding the maximum execution time
  • Tracker changed from Bug to Support
  • Category set to Build & Test Automation

Yes, I think it'll be good to get back to this after the reorganization of tests, examples, and benchmarks.

This change is currently waiting on https://charm.cs.illinois.edu/gerrit/c/charm/+/4218 to be completed and merged.
