PMPI: Record times and sizes of sent/received messages
Opening this so that the work is visible, both to external users and for internal workload balancing.
#4 Updated by Osman Sarood about 5 years ago
Here is the list of changes I am planning to make:
1. MPI_Recv: Currently I add the '1 5 0...' row to the trace for every received message. I think we should add it only for MPI_Send. For MPI_Recv we would record just the msg size and the src rank in the following BEGIN_PROCESSING block.
2. I will replicate this behavior for MPI_Isend/MPI_Ssend and MPI_Irecv.
3. MPI_Reduce: record the message sent by each process to the root by adding the appropriate '1 5 0..' entries. Once the reduction is complete, we record just the msg size, which is the length of one reduction message (?). What should the source of this message be on the root PE (and on the other ranks as well)?
4. MPI_Allreduce: similar to MPI_Reduce, but we should also have a set of sends that distribute the result to each rank. These would be reflected in the BEGIN_PROCESSING for each rank (including msg size and src rank) after the reduction is complete.
5. MPI_Bcast: These are simply 'n' message sends and receives and should be treated like MPI_Send/MPI_Recv pairs. We just need to distinguish the root from everyone else.
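To make items 3-5 concrete, here is a minimal sketch of how the collectives could be expanded into logical point-to-point messages. Everything in it is hypothetical: `LoggedMsg`, `CollKind` and `expand_collective` are illustration names, not existing PMPI tracing code, and the record is not the actual '1 5 0...' trace row. Since MPI_Allreduce has no root argument, the sketch assumes the caller picks a nominal root (for example rank 0) for logging purposes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for one logical message (not the real trace row). */
typedef struct {
    int    src;    /* rank that sends the message     */
    int    dst;    /* rank that receives the message  */
    size_t bytes;  /* payload size in bytes           */
} LoggedMsg;

typedef enum { COLL_BCAST, COLL_REDUCE, COLL_ALLREDUCE } CollKind;

/*
 * Expand one collective into flat logical messages.
 * 'out' must have room for 2 * (nranks - 1) entries.
 * Returns the number of entries written.
 */
size_t expand_collective(CollKind kind, int root, int nranks,
                         size_t bytes, LoggedMsg *out)
{
    assert(nranks > 0 && root >= 0 && root < nranks);
    size_t n = 0;

    /* Reduction phase: every non-root rank sends one message to the root. */
    if (kind == COLL_REDUCE || kind == COLL_ALLREDUCE) {
        for (int r = 0; r < nranks; r++)
            if (r != root)
                out[n++] = (LoggedMsg){ r, root, bytes };
    }

    /* Distribution phase: the root sends one message to every other rank. */
    if (kind == COLL_BCAST || kind == COLL_ALLREDUCE) {
        for (int r = 0; r < nranks; r++)
            if (r != root)
                out[n++] = (LoggedMsg){ root, r, bytes };
    }
    return n;
}
```

With this scheme, a broadcast over 4 ranks yields 3 entries, and an allreduce yields 6 (3 towards the nominal root, 3 back out). Each entry whose `dst` is the local rank would supply the msg size and src rank for that rank's BEGIN_PROCESSING.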
#5 Updated by Osman Sarood over 4 years ago
In order to track collectives, we need to keep track of all the messages sent by MPI. The underlying MPI implementation might be using a hypercube or some other technique for the bcast/multicast. Ideally we should find out which messaging pattern is actually adopted and record the messages as they occur.
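To illustrate why the pattern matters, here is a small sketch of one textbook scheme, a binomial-tree broadcast. This is only an assumed example: real MPI implementations may pick a different algorithm depending on message size and communicator size, and `TreeEdge` and `binomial_bcast_edges` are hypothetical names for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record: one hop of the broadcast tree. */
typedef struct {
    int src;  /* rank forwarding the data */
    int dst;  /* rank receiving the data  */
} TreeEdge;

/*
 * Enumerate the hops of a binomial-tree broadcast rooted at 'root'.
 * In the round with distance 'mask', every rank that already holds the
 * data (relative rank < mask) forwards it to relative rank + mask.
 * 'out' must have room for nranks - 1 entries.
 * Returns the number of entries written.
 */
size_t binomial_bcast_edges(int root, int nranks, TreeEdge *out)
{
    assert(nranks > 0 && root >= 0 && root < nranks);
    size_t n = 0;

    for (int mask = 1; mask < nranks; mask <<= 1) {
        for (int rel = 0; rel < mask; rel++) {
            int peer = rel + mask;
            if (peer < nranks) {
                /* Map relative ranks back to ranks in the communicator. */
                out[n++] = (TreeEdge){ (rel + root) % nranks,
                                       (peer + root) % nranks };
            }
        }
    }
    return n;
}
```

For 8 ranks rooted at 0 this still gives 7 messages in total, but the root sends only 3 of them (to ranks 1, 2 and 4) and the rest are forwarded by intermediate ranks. A trace that logs 7 sends from the root would therefore attribute sources incorrectly under this pattern.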
However, a simpler but imperfect way would be to just identify the root and the members of the communicator (to which the message is being broadcast) and add a message entry for each of them, along with the message size. We need to find a way to access the members of an MPI communicator. I can think of a few ways, but they require communication and are therefore of no use here, i.e. we do not want to communicate just for logging events. We also need to keep in mind that a mechanism that simply assumes a hypercube or any other technique for the bcast will NOT show the real picture, and may be of limited utility because the information it shows could be incorrect.
The correct way of going about it is to get the actual pattern in which communication is done for each collective and insert entries into PMPI projections accordingly. To do that, we need to figure out a way of detecting the communication pattern for a particular MPI installation.