Bug #1711

syncft tests: unclear failure

Added by Phil Miller 9 months ago. Updated 3 months ago.

Status: In Progress
Priority: Normal
Assignee:
Category: -
Target version: -
Start date: 10/10/2017
Due date:
% Done: 0%


Description

http://ppl-jenkins:8080/job/Nightly-Build/label=trusty,platform=net-linux-x86_64-syncft/1338/console

../../../bin/testrun  ./jacobi 4 2 2 200 +vp16 +p8 +balancer DummyLB +isomalloc_sync +killFile kill_02.txt  
DISPLAY "(null)" invalid; disabling X11 forwarding
Charmrun> scalable start enabled. 
Charmrun> started all node programs in 1.863 seconds.
Converse/Charm++ Commit ID: d62df12
Charm++> synchronizing isomalloc memory region...
Program finished after 655.080204 seconds.
Fatal socket error: code 93610-- Timeout on socket recv!
Charmrun> error on request socket to node 6 'localhost'--
Socket closed before recv.
Socket 11 failed 
DISPLAY "(null)" invalid; disabling X11 forwarding
charmrun says Processor 6 failed on Node 6
socket_index 7 crashed_node 6 reconnected fd 11  
Charmrun finished launching new process in 1.236517s
Program finished after 1256.395664 seconds.
Fatal socket error: code 93610-- Timeout on socket recv!
Charmrun> error on request socket to node 0 'localhost'--
Socket closed before recv.
Socket 9 failed 
DISPLAY "(null)" invalid; disabling X11 forwarding
charmrun says Processor 0 failed on Node 0
socket_index 5 crashed_node 0 reconnected fd 9  
ERROR> Charmrun detected multiple crashes.
Charmrun finished launching new process in 1.281417s
make[3]: Leaving directory `/scratch/jenkins/builds/Nightly-Build/label=trusty,platform=net-linux-x86_64-syncft@1338/charm/net-linux-x86_64-syncft/tests/ampi/jacobi3d'
make[3]: *** [syncfttest] Error 1
make[2]: *** [syncfttest] Error 1


Related issues

Related to Charm++ - Bug #1710: syncft tests: warning and crash on init_checkpt New 10/10/2017

History

#1 Updated by Phil Miller 9 months ago

  • Related to Bug #1710: syncft tests: warning and crash on init_checkpt added

#2 Updated by Phil Miller 9 months ago

Possibly similar / the same: http://ppl-jenkins:8080/job/Nightly-Build/label=trusty,platform=net-linux-x86_64-syncft/1304/console

../../../bin/testrun  ./jacobi 2 2 2 200 +vp8 +p8 +balancer DummyLB +isomalloc_sync +killFile kill_01.txt  
DISPLAY "(null)" invalid; disabling X11 forwarding
Charmrun> scalable start enabled. 
Charmrun> started all node programs in 1.862 seconds.
Converse/Charm++ Commit ID: 54b77a7
Charm++> synchronizing isomalloc memory region...
Fatal socket error: code 93610-- Timeout on socket recv!
Program finished after 601.856377 seconds.
Charmrun> error on request socket to node 1 'localhost'--
Socket closed before recv.
Socket 5 failed 
DISPLAY "(null)" invalid; disabling X11 forwarding
charmrun says Processor 1 failed on Node 1
socket_index 1 crashed_node 1 reconnected fd 5  
Caught SIGPIPE.
Caught SIGPIPE.
Charmrun finished launching new process in 1.244774s
Program finished after 602.148137 seconds.
Caught SIGPIPE.
Caught SIGPIPE.
Fatal socket error: code 93610-- Timeout on socket recv!
Charmrun> error on request socket to node 7 'localhost'--
Socket closed before recv.
Socket 10 failed 
DISPLAY "(null)" invalid; disabling X11 forwarding
charmrun says Processor 7 failed on Node 7
socket_index 6 crashed_node 7 reconnected fd 10  
Charmrun finished launching new process in 1.229961s
ERROR> Charmrun detected multiple crashes.
make[3]: Leaving directory `/scratch/jenkins/builds/Nightly-Build/label=trusty,platform=net-linux-x86_64-syncft@1304/charm/net-linux-x86_64-syncft/tests/ampi/jacobi3d'
make[3]: *** [syncfttest] Error 1
make[2]: *** [syncfttest] Error 1

#3 Updated by Eric Bohm 9 months ago

  • Assignee set to Juan Galvez

#4 Updated by Sam White 4 months ago

I believe Juan gave an update on these in Core a week or two ago, saying that he couldn't reproduce them. Is that right?

#5 Updated by Sam White 4 months ago

netlrts-linux-x86_64-syncft failed autobuild last night in tests/charm++/jacobi3d:

testrun  ./jacobi3d 256 256 256 64 64 32 +p7 +balancer DummyLB +killFile kill_02.txt  ++no-va-randomization
Charmrun> scalable start enabled. 
Charmrun> started all node programs in 1.377 seconds.
Charm++> Running in non-SMP mode: numPes 7
Converse/Charm++ Commit ID: d5ee81e
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
[0] killFlag set to true for file kill_02.txt
CharmLB> DummyLB created.
Charm++> CkMemCheckPTInit mainchare is created!

STENCIL COMPUTATION WITH BARRIERS
Running Jacobi on 7 processors with (4, 4, 8) chares
Array Dimensions: 256 256 256
[3] To be killed after 35.000000 s (MEMCKPT) 
Block Dimensions: 64 64 32
Start of iteration 0 at 0.030232
Start of iteration 10 at 0.824019
Start of iteration 20 at 1.505496
Start of iteration 30 at 2.175031
Start of iteration 40 at 2.854789
Start of iteration 50 at 3.514263
Start of iteration 60 at 4.161078
Start of iteration 70 at 4.789726
Start of iteration 80 at 5.517502
Start of iteration 90 at 6.144269
[0] Start checkpointing  starter: 0... 
[0] Checkpoint finished in 0.371149 seconds, sending callback ... 
Start of iteration 100 at 7.166621
[0] Checkpoint Processor data: 4085 
Start of iteration 110 at 7.814572
Start of iteration 120 at 8.470773
Start of iteration 130 at 9.170448
Start of iteration 140 at 9.854583
Start of iteration 150 at 10.534283
Start of iteration 160 at 11.299033
Start of iteration 170 at 11.926815
Start of iteration 180 at 12.649926
Start of iteration 190 at 13.344304
[0] Start checkpointing  starter: 0... 
[0] Checkpoint finished in 0.404142 seconds, sending callback ... 
Start of iteration 200 at 14.490503
[0] Checkpoint Processor data: 4085 
Start of iteration 210 at 15.196615
Start of iteration 220 at 15.888155
Start of iteration 230 at 16.644942
Start of iteration 240 at 17.349082
Start of iteration 250 at 18.029976
Start of iteration 260 at 18.759734
Start of iteration 270 at 19.604271
Start of iteration 280 at 20.310760
Start of iteration 290 at 21.018600
[0] Start checkpointing  starter: 0... 
[0] Checkpoint finished in 0.438347 seconds, sending callback ... 
Start of iteration 300 at 22.143916
[0] Checkpoint Processor data: 4085 
Start of iteration 310 at 22.856771
Start of iteration 320 at 23.551625
Start of iteration 330 at 24.294874
Start of iteration 340 at 25.016657
Start of iteration 350 at 25.706305
Start of iteration 360 at 26.424058
Start of iteration 370 at 27.235238
Start of iteration 380 at 27.974727
Start of iteration 390 at 28.824218
[0] Start checkpointing  starter: 0... 
[0] Checkpoint finished in 0.345677 seconds, sending callback ... 
Start of iteration 400 at 29.870442
[0] Checkpoint Processor data: 4085 
Start of iteration 410 at 30.584224
Start of iteration 420 at 31.264333
Start of iteration 430 at 32.154231
Start of iteration 440 at 32.874257
Start of iteration 450 at 33.614286
Start of iteration 460 at 34.394219
Charmrun> error on request socket to node 3 'localhost'--
Socket closed before recv.
Socket 7 failed 
Charmrun> All hosts crashed, aborting.

#6 Updated by Sam White 4 months ago

This has failed the past 3 nights in autobuild.

#7 Updated by Juan Galvez 4 months ago

The autobuild failures are caused by this commit: "Charmrun: Distribute PEs among hosts using two phases of communication" (baf85b535465c1e58e355bc4e1892b4d0e49d364).

They weren't appearing earlier because the syncft tests usually finish before a process is killed, and therefore before any actual restart from checkpoint is attempted.

However, the autobuild failures of the last few days are probably unrelated to the bug I'm pursuing in this issue, because the issue was reported months before the above commit.

#8 Updated by Juan Galvez 4 months ago

  • Status changed from New to In Progress

#9 Updated by Sam White 4 months ago

From Jenkins syncft build:

Start of iteration 90 at 25.647321
[0] Start checkpointing  starter: 0... 
Charmrun> error on request socket to node 3 'localhost'--
Socket closed before recv.
Socket 9 failed 
DISPLAY "(null)" invalid; disabling X11 forwarding
charmrun says process 3 failed (on host localhost)
crashed_node 3 reconnected fd 9  
Charmrun finished launching new process in 1.557419s
Charmrun> continue node: 3
[3] Restarting after crash 
[3] I am restarting  cur_restart_phase:2 at time: 0.000691
[3] I am restarting  cur_restart_phase:2 discard charm message at time: 0.000745
[4] askProcDataHandler called with '3' cur_restart_phase:2 at time 36.692923.
[4] no checkpoint found for processor 3. This could be due to a crash before the first checkpointing.
------------- Processor 4 Exiting: Called CmiAbort ------------
Reason: no checkpoint found
[4] Stack Traceback:
  [4:0] CmiAbortHelper+0x63  [0x58cef3]
  [4:1]   [0x536e96]
  [4:2] CsdScheduleForever+0x50  [0x5932c0]
  [4:3] CsdScheduler+0x2d  [0x5935bd]
  [4:4] ConverseInit+0x45a  [0x59123a]
  [4:5] main+0x21  [0x48ae21]
  [4:6] __libc_start_main+0xf5  [0x7ffff7212f45]
  [4:7]   [0x48b6fd]
Fatal error on PE 4> no checkpoint found

#10 Updated by Juan Galvez 4 months ago

I'm trying to replicate the recent errors observed in Jenkins, but haven't been able to. I ran the same test case that failed in a bash loop of 50 iterations, with artificial background CPU load. All runs completed successfully.
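
For anyone who wants to repeat this kind of stress run, here is a minimal sketch of a repeat-and-collect-failures harness. It is written in Python rather than bash, and it is not the loop actually used above. The function name `run_repeatedly` and its parameters are hypothetical, and the background CPU load is assumed to be started separately.

```python
import subprocess


def run_repeatedly(cmd, iterations=50, timeout=None):
    """Run `cmd` (a list of arguments) `iterations` times.

    Returns a list of (iteration_index, return_code) pairs for every run
    that did not exit with status 0. A run that exceeds `timeout` seconds
    is recorded with a return code of None.
    """
    failures = []
    for i in range(iterations):
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                timeout=timeout,
            )
            code = result.returncode
        except subprocess.TimeoutExpired:
            code = None  # the run hung past the timeout
        if code != 0:
            failures.append((i, code))
    return failures


# Hypothetical usage, with the command line copied from a failing test log:
#   failures = run_repeatedly(["./charmrun", "./jacobi3d", "..."], iterations=50)
#   print(failures or "all runs passed")
```

An empty result only shows that the failure did not occur under the local conditions; it does not rule out a timing-dependent failure on a more heavily loaded machine.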

I did notice a long pause during checkpointing, but in my case the program always continues successfully. Judging from the output of the failed Jenkins build, the test is running extremely slowly there (maybe the CPU is overloaded running other jobs?). My guess is that nothing is wrong with the syncft test itself: charmrun may simply be hitting a socket timeout while the PE is checkpointing, and concluding that the PE has crashed when it hasn't.
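
One rough way to quantify "running extremely slowly" is to compare how long each run took to reach the same iteration, using the "Start of iteration N at T" lines already quoted in this issue. The sketch below does that for iteration 90 in the autobuild log from comment #5 and the Jenkins excerpt from comment #9. The helper names are hypothetical, and the comparison is only suggestive: the Jenkins excerpt does not show its command line, so the two runs are assumed (not known) to use the same problem size.

```python
import re

# Matches lines such as "Start of iteration 90 at 6.144269".
ITER_RE = re.compile(r"Start of iteration (\d+) at (\d+(?:\.\d+)?)")


def iteration_times(log_text):
    """Return {iteration_number: seconds} for every matching line in the log."""
    return {int(m.group(1)): float(m.group(2)) for m in ITER_RE.finditer(log_text)}


# Lines copied from the logs quoted in this issue.
autobuild_log = """\
Start of iteration 80 at 5.517502
Start of iteration 90 at 6.144269
"""
jenkins_log = """\
Start of iteration 90 at 25.647321
"""

fast = iteration_times(autobuild_log)[90]
slow = iteration_times(jenkins_log)[90]
slowdown = slow / fast
print(f"Jenkins reached iteration 90 about {slowdown:.1f}x later than autobuild")
```

On these two excerpts the Jenkins run reaches iteration 90 roughly four times later, which fits the overloaded-machine guess but does not by itself confirm the socket-timeout explanation.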

#11 Updated by Sam White 3 months ago

  • Target version deleted (6.9.0)
