Bug in the relaunch_group method

Hello,

There is a bug in the relaunch_group method of BaseServer (see base_server.py):

self.groups[group_id].simulations[sim_id].parameters,

should be

self.groups[group_id].simulations[group_id * self.group_size + sim_id].parameters,
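To illustrate the indexing, here is a minimal sketch. It assumes that each group stores its simulations in a dictionary keyed by the global simulation index; that layout is inferred from the fix above and not copied from base_server.py, and the helper name global_sim_index is hypothetical.

```python
# Hypothetical stand-in for the server's data layout; names are illustrative only.
group_size = 3


def global_sim_index(group_id: int, sim_id: int, group_size: int) -> int:
    """Map a (group, local simulation) pair to a global simulation index."""
    return group_id * group_size + sim_id


# Assumed layout: group 2 stores its simulations under global indices 6, 7, 8.
simulations_of_group_2 = {
    global_sim_index(2, local, group_size): f"parameters of local sim {local}"
    for local in range(group_size)
}

# Looking up the local index alone would fail, because 0 is not a key here.
local_lookup_works = 0 in simulations_of_group_2

# Looking up the global index finds the expected entry.
first_parameters = simulations_of_group_2[global_sim_index(2, 0, group_size)]
```

With this layout, indexing by sim_id alone would raise a KeyError for every group except possibly the first one.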

Sincerely,

Théophile Terraz

Hi Théophile,

That’s correct, thank you very much.
MR !116 has been created to fix this.

Hello,

There is one more bug in the relaunch_group method of BaseServer (see base_server.py):

del self.groups[group_id]

This line should not exist. It removes the item at index group_id from the groups list, shifts every subsequent item down by one, and breaks the study. Without it, the relaunch mechanism works well.

Sincerely,

Théophile Terraz

Hi Théophile,

Thank you for your input.

I am a bit confused about how this command breaks the study. Could you please elaborate on the context in which you encountered the issue?

For the record, this deletion was introduced to make sure that if a group is relaunched from scratch (i.e. with new inputs), its associated simulations are re-instantiated with new Simulation objects (this ensures received_simulation_data and received_time_steps are empty dictionaries again for instance).

In addition, as you probably noticed, groups is a dictionary. Therefore, although removing the group_id entry does change the order of the dictionary entries, the keys are preserved. Since the deletion you are referring to is followed by this:

for sim in range(self.group_size):
    self.generate_client_scripts(
        group_id * self.group_size + sim, 1, None
    )

the group_id key will immediately be re-introduced among the dictionary entries. Hence, I don’t understand how the change in key order can become a problem.
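The difference between the two container types can be sketched with plain Python objects. The variable names below are illustrative and do not come from base_server.py.

```python
# With a list, deleting an element shifts every later element down by one.
groups_list = ["group 0", "group 1", "group 2"]
del groups_list[1]
# Index 1 now refers to what used to be group 2.
shifted_entry = groups_list[1]

# With a dictionary, deleting a key leaves all other keys untouched.
groups_dict = {0: "group 0", 1: "group 1", 2: "group 2"}
del groups_dict[1]
# Key 2 still maps to group 2.
preserved_entry = groups_dict[2]

# Re-inserting the key restores the lookup; only the iteration order changes,
# since dictionaries keep insertion order (Python 3.7+).
groups_dict[1] = "group 1 (re-instantiated)"
key_order = list(groups_dict)
```

Only code that relies on iteration order, rather than on key lookup, would notice the deletion and re-insertion.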

Are you sure that’s where the bug lies?

Hello,

I had a kind of deadlock that no longer occurs without this line: the server ran until the walltime without errors but also without results. I thought groups was a list, so I came up with that explanation, but in fact I did not investigate further. You are right, though: with my solution we keep the first timesteps of the failed simulations as “received”.

As you know, it is very cumbersome to thoroughly test fault-tolerance features, so thank you very much for your feedback.

If there is a chance of running into a deadlock situation, that’s concerning…

I do not know how complex your working environment is, but if there is an easy way for us to reproduce this bug, feel free to let us know so that we can investigate it in more detail.

Regards,
Marc

Hello,

I found the deadlock and why commenting out the “del” solved it:
If you delete the group, you enter this “if” section and increment “self.n_submitted_simulations”.
After that, you never enter this “if” section, causing the server to wait indefinitely.
One dirty fix is to add:

self.n_submitted_simulations -= 1

before the call to “self.generate_client_scripts” in “relaunch_group.py”.
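The mechanism can be sketched with a toy model. Everything in it is an assumption made for illustration: the ToyServer class, the n_groups attribute, the all_submitted check, and the exact conditions of both “if” sections are hypothetical and are not the actual BaseServer code.

```python
class ToyServer:
    """Toy model of the counter drift; not the real BaseServer."""

    def __init__(self, n_groups):
        self.n_groups = n_groups
        self.groups = {}
        self.n_submitted_simulations = 0

    def generate_client_scripts(self, group_id):
        # Assumed first "if": an unknown group is treated as a new submission.
        if group_id not in self.groups:
            self.groups[group_id] = object()
            self.n_submitted_simulations += 1

    def relaunch_group(self, group_id, compensate=False):
        # Deleting the group makes it look new on the next submission.
        del self.groups[group_id]
        if compensate:
            # The "dirty fix": cancel out the upcoming extra increment.
            self.n_submitted_simulations -= 1
        self.generate_client_scripts(group_id)

    def all_submitted(self):
        # Assumed second "if": an equality test that an inflated counter
        # can never satisfy, so the server would wait forever.
        return self.n_submitted_simulations == self.n_groups


def run(compensate):
    """Submit two groups, relaunch the first, and report the counter state."""
    server = ToyServer(n_groups=2)
    for group_id in range(2):
        server.generate_client_scripts(group_id)
    server.relaunch_group(0, compensate=compensate)
    return server.n_submitted_simulations, server.all_submitted()
```

In this toy model, run(False) leaves the counter one above the number of groups, while run(True) keeps it consistent.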

Sincerely,

Théophile Terraz

Very good catch! We’ll fix this ASAP, thank you very much :grinning:

edit: MR !126 was created to fix this.