Sobol results changes for server ranks > 0 (Develop Branch)

I am testing the main branch (ZMQ) with SA for result consistency when we change the server ranks. Below you will find that the results do not change in case of non-sobol runs while the sobol error increases proportionally to the number of server ranks.

Basic config for all runs

{
    "server_filename": "heatpde_sa_server.py",
    "server_class": "HeatPDEServerSA",
    "output_dir": "STUDY_OUT",
    "study_options": {
        "field_names": [
            "temperature"
        ],
        "parameter_sweep_size": 10,
        "num_samples": 100,
        "nb_parameters": 2,
        "parameter_range": [100, 500],
        "simulation_timeout": 10,
        "crashes_before_redraw": 1000,
        "verbosity": 2
    },
    "sa_config": {
        "mean": true,
        "variance": true,
        "skewness": true,
        "kurtosis": true,
        "sobol_indices": false,
        "checkpoint_interval": 50
    },
    "launcher_config": {
        "scheduler": "openmpi",
        "scheduler_arg_client": ["-n", "1","--timeout", "60", "-x", "MELISSA_COUPLING=1"],
        "scheduler_arg_server": ["-n", "1","--timeout", "3600"],
        "fault_tolerance": false,
        "job_limit": 2,
        "timer_delay": 1,
        "verbosity": 2
    }

L2 norm

def calculate_l2_norm(file1, file2):

    n1 = np.loadtxt(file1, dtype=np.float64)
    n2 = np.loadtxt(file2, dtype=np.float64)
    first_n = min(len(n1), len(n2))
    l2_norm = np.sqrt(np.sum((n1[:first_n] - n2[:first_n]) ** 2))

    print(f"file={os.path.split(file1)[-1]}\t"
          f"l2-norm={l2_norm:.3e}")
    return l2_norm

I store the results in separate directories and then use the function below to compute l2 norms for each file with same names.

>>> ls
l2norm.py  nonsobol_ranks_1  nonsobol_ranks_4  sobol_ranks_1  sobol_ranks_4


>>> python3 l2norm.py nonsobol_ranks_1 nonsobol_ranks_4
Dir 1: nonsobol_ranks_1
Dir 2: nonsobol_ranks_4

file=results.temperature_skewness.100	l2-norm=0.000e+00
file=results.temperature_kurtosis.001	l2-norm=0.000e+00
file=results.temperature_variance.100	l2-norm=0.000e+00
file=results.temperature_mean.100	l2-norm=0.000e+00
file=results.temperature_mean.001	l2-norm=0.000e+00
file=results.temperature_variance.001	l2-norm=0.000e+00
file=results.temperature_kurtosis.100	l2-norm=0.000e+00
file=results.temperature_skewness.001	l2-norm=0.000e+00



>>> python3 l2norm.py sobol_ranks_1 sobol_ranks_4
Dir 1: sobol_ranks_1
Dir 2: sobol_ranks_4

file=results.temperature_sobol0.001	l2-norm=3.108e+00
file=results.temperature_sobol_tot1.001	l2-norm=4.139e+00
file=results.temperature_sobol_tot0.100	l2-norm=4.139e+00
file=results.temperature_sobol1.100	l2-norm=2.532e+00
file=results.temperature_skewness.100	l2-norm=0.000e+00
file=results.temperature_kurtosis.001	l2-norm=0.000e+00
file=results.temperature_sobol_tot0.001	l2-norm=4.139e+00
file=results.temperature_variance.100	l2-norm=0.000e+00
file=results.temperature_mean.100	l2-norm=0.000e+00
file=results.temperature_mean.001	l2-norm=0.000e+00
file=results.temperature_sobol1.001	l2-norm=3.815e+00
file=results.temperature_variance.001	l2-norm=0.000e+00
file=results.temperature_kurtosis.100	l2-norm=0.000e+00
file=results.temperature_sobol_tot1.100	l2-norm=4.139e+00
file=results.temperature_skewness.001	l2-norm=0.000e+00
file=results.temperature_sobol0.100	l2-norm=2.533e+00

Vimdiff

First two values are different for rank > 0.
In this case, Each server rank receives (10000 / 4) values.

I run a single group of 4 clients for 1 and 2 server ranks. This is how the communication is taking place with 2 rank server where rank 0 only takes the top-half of the mesh and rank 1 the bottom-half. I see no phantom cells being sent before computing the stats after comparing 2 rank server study with a single rank server.

print(X.shape) # X contains 0th timestep serialized for a single rank server study.
print(Y1.shape) # Y1 contains 0th timestep serialized top-half arrays at server rank 0
print(Y2.shape) # Y2 contains 0th timestep serialized bottom-half arrays at server rank 1
i1 = 0
i2 = 0
for _ in range(4):
    print((X[i1:i1 + 10000] == np.concatenate((Y1[i2:i2 + 5000], Y2[i2:i2 + 5000]), axis=0)).all())
    i1 += 10000
    i2 += 5000


(40000,)
(20000,)
(20000,)
True
True
True
True

Another thing I noticed is that in sensitivity_analysis_server.py (regardless of the number of server ranks),

always update the first two values w.r.t each rank’s d_buffer. That is why, we probably see the difference in values between single-rank server and multi-rank server studies (vimdiff). I guess even the single rank server’s first two sobol result values could be wrong as well.

The main issue with these high errors was caused due to the incorrect usage of pearson_A and pearson_B attributes of IterativeSensitivityMartinez. Seems the previously developers just copy-pasted the way other moments were gathered and forgot to access these matrices across one single dimension like pearson_A[param] which was only updating the first two values in each rank’s buffers which were storing kurtosis values. By doing, pearson_A[:, param], we ensure we update all values in the buffer.

But, we still need to update the iterative_stats to ensure that the memory access is contiguous across each param. Right now, [:, param] causes column-wise slicing which is very bad.