Bug in the relaunch_group method

tterraz · May 3, 2023, 1:06pm

Hello,

there is a bug in the relaunch_group method of the BaseServer (see base_server.py):

self.groups[group_id].simulations[sim_id].parameters,

should be

self.groups[group_id].simulations[group_id * self.group_size + sim_id].parameters,
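To illustrate why the group offset matters, here is a minimal sketch. It assumes, as the corrected line implies, that simulations is keyed by a global simulation id rather than by the position inside the group. The helper global_sim_id and the toy values are hypothetical and not taken from base_server.py.

```python
group_size = 3  # toy value, stands in for self.group_size


def global_sim_id(group_id, sim_id):
    """Map a position inside a group to a study-wide simulation id."""
    return group_id * group_size + sim_id


# Toy version of the simulations held by group 2: keys are global ids 6, 7, 8.
group_id = 2
simulations = {
    global_sim_id(group_id, s): f"parameters-{s}" for s in range(group_size)
}

# Looking up by the local index fails for any group other than group 0 ...
local_lookup_works = 0 in simulations

# ... while the offset lookup finds the entry.
first_parameters = simulations[global_sim_id(group_id, 0)]
```

With these toy values the local index 0 is not a key of the dictionary, which is the kind of failure the original line would produce for every group except the first.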

Sincerely,

Théophile Terraz

mschoule · May 3, 2023, 2:00pm

Hi Théophile,

That’s correct, thank you very much.
MR !116 has been created to fix this.

tterraz · May 11, 2023, 3:34pm

Hello,

there is one more bug in the relaunch_group method of the BaseServer (see base_server.py):

del self.groups[group_id]

This line should not exist. It removes the item at index group_id from the groups list, shifts the rest of the list down by one, and breaks the study. Without it, the relaunch system works well.

Sincerely,

Théophile Terraz

mschoule · May 12, 2023, 7:45am

Hi Théophile,

Thank you for your input.

I am a bit confused about how this command breaks the study. Could you please elaborate on the context in which you encountered the issue?

For the record, this deletion was introduced to make sure that if a group is relaunched from scratch (i.e. with new inputs), its associated simulations are re-instantiated with new Simulation objects (this ensures received_simulation_data and received_time_steps are empty dictionaries again for instance).

In addition, as you probably noticed, groups is a dictionary. Therefore, although removing the group_id entry will indeed change the order of the dictionary entries, the keys are preserved. Since the deletion you are referring to is followed by this:

for sim in range(self.group_size):
    self.generate_client_scripts(
        group_id * self.group_size + sim, 1, None
    )

the group_id key will immediately be re-introduced among the dictionary entries. I therefore don’t understand how the change in key order can become a problem.
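To make the list versus dictionary distinction concrete, here is a small sketch using toy stand-ins (plain strings instead of the real group objects; none of these variable names come from base_server.py):

```python
# Toy stand-ins for the server's groups container.
groups_as_list = ["group 0", "group 1", "group 2"]
groups_as_dict = {0: "group 0", 1: "group 1", 2: "group 2"}

group_id = 1

# Deleting from a list shifts every later element down by one index,
# so index 1 now silently refers to what used to be group 2.
del groups_as_list[group_id]
shifted_entry = groups_as_list[1]

# Deleting from a dict removes only that key; the other keys still
# point to the same groups.
del groups_as_dict[group_id]
untouched_entry = groups_as_dict[2]

# Re-creating the group puts the key back, at the end of the insertion order.
groups_as_dict[group_id] = "group 1 (fresh)"
key_order = list(groups_as_dict)
```

Only the iteration order changes in the dictionary case; lookups by key are unaffected.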

Are you sure that’s where the bug lies?

tterraz · May 12, 2023, 8:33am

Hello,

I had a kind of deadlock that no longer occurs without this line: the server ran until the walltime without errors but without results. I thought groups was a list, so I came up with that explanation, but in fact I did not investigate further. But you are right, with my solution we keep the first timesteps of the failed simulations as “received”.

mschoule · May 12, 2023, 9:18am

As you know, it is very cumbersome to thoroughly test fault-tolerance features, so thank you very much for your feedback.

If there is a chance of running into a deadlock situation, that’s concerning…

I do not know how complex your working environment is, but if there is an easy way for us to reproduce this bug, feel free to let us know so that we can investigate it in more detail.

Regards,
Marc

tterraz · June 2, 2023, 1:32pm

Hello,

I found the deadlock and why commenting out the “del” solved it:
If you delete the group, then you will enter this “if” section and then increment “self.n_submitted_simulations”.
Then, you will never enter this “if” section, causing the server to wait indefinitely.
One dirty fix is to add:

self.n_submitted_simulations -= 1

before the call to “self.generate_client_scripts” in “relaunch_group”.
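The mechanism can be sketched with a toy server. This is not the actual BaseServer code: the two “if” conditions are not quoted in this thread, so the membership test and the exit condition below are assumptions, and ToyServer, finish, study_is_over and n_finished_simulations are hypothetical names. Only groups, n_submitted_simulations, generate_client_scripts and relaunch_group mirror names used above.

```python
class ToyServer:
    """Hypothetical stand-in used only to illustrate the counter drift."""

    def __init__(self, total):
        self.total = total                  # simulations expected in the study
        self.groups = {}                    # group_id -> toy state
        self.n_submitted_simulations = 0
        self.n_finished_simulations = 0     # hypothetical counter

    def generate_client_scripts(self, group_id):
        # Assumed condition: an unknown group counts as a new submission.
        if group_id not in self.groups:
            self.groups[group_id] = "running"
            self.n_submitted_simulations += 1

    def relaunch_group(self, group_id, compensate):
        del self.groups[group_id]           # force a fresh group entry
        if compensate:
            self.n_submitted_simulations -= 1   # the "dirty fix"
        self.generate_client_scripts(group_id)

    def finish(self, group_id):
        self.groups[group_id] = "done"
        self.n_finished_simulations += 1

    def study_is_over(self):
        # Assumed exit condition: every submission finished, none extra.
        return (
            self.n_submitted_simulations
            == self.n_finished_simulations
            == self.total
        )


def run(compensate):
    server = ToyServer(total=2)
    server.generate_client_scripts(0)
    server.generate_client_scripts(1)
    server.relaunch_group(1, compensate)    # group 1 failed once
    server.finish(0)
    server.finish(1)
    return server.n_submitted_simulations, server.study_is_over()


without_fix = run(compensate=False)
with_fix = run(compensate=True)
```

Without the decrement, the relaunched group is counted twice, so in this toy model the exit condition can never become true; with it, the counter stays consistent.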

Sincerely,

Théophile Terraz

mschoule · June 5, 2023, 7:14am

Very good catch! We’ll fix this ASAP, thank you very much

edit: MR !126 was created to fix this.
