Hi folks,
I am currently working on integrating Adios with Sobol'. Since our Adios writers send their data independently, we can do the grouping and then compute the Sobol' indices. But before jumping into the actual coding, I have noticed some strange behaviour with Open MPI. When Sobol' is enabled, the launcher submits the following command per group:
mpirun <args1> <client1> : <args1> <client2> : ...
When we launch the Adios writers this way, all of the writers start writing and sending messages as if a successful handshake with all the readers had been established.
However, on the reader side, only client 0 (the first simulation in each group) gets a successful handshake. So all of client 0's time steps are delivered to the server, while the reader keeps looking for the engine files of the other writers, which have already assumed successful delivery and closed their respective files, and it eventually times out.
I have recreated this bug with a simple reader/writer scenario.
===EDIT===
In the shared reader/writer example, I split the MPI world communicator so that each process group gets its own communicator, and it now works fine.
However, in our heatc executables the splitting already occurs for each writer process, yet the study still fails. This needs more investigation.
===EDIT===
Alternatively, can we drop this way of invoking mpirun and instead submit the client scripts separately for each group?
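For clarity, what I have in mind is something like the following launch-script fragment (a sketch only; the placeholders are the same ones as in the MPMD command above, and the backgrounding/wait scheme is an assumption, not our actual launcher code):

```shell
# One mpirun per client instead of a single MPMD line.
# Each client then starts with its own MPI_COMM_WORLD,
# so no communicator split is needed before handing it to Adios.
mpirun <args1> <client1> &
mpirun <args1> <client2> &
# ... one line per client in the group
wait   # wait for every client in the group to finish
```

This would avoid the MPMD handshake problem entirely, at the cost of losing the single-command submission per group.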
~Abhishek.