Sobol results changes for server ranks > 0 (Develop Branch)

AbhishekP · June 26, 2024, 3:25pm

I am testing the main branch (ZMQ) with SA for result consistency when we change the server ranks. Below you will find that the results do not change in case of non-sobol runs while the sobol error increases proportionally to the number of server ranks.

Basic config for all runs

{
    "server_filename": "heatpde_sa_server.py",
    "server_class": "HeatPDEServerSA",
    "output_dir": "STUDY_OUT",
    "study_options": {
        "field_names": [
            "temperature"
        ],
        "parameter_sweep_size": 10,
        "num_samples": 100,
        "nb_parameters": 2,
        "parameter_range": [100, 500],
        "simulation_timeout": 10,
        "crashes_before_redraw": 1000,
        "verbosity": 2
    },
    "sa_config": {
        "mean": true,
        "variance": true,
        "skewness": true,
        "kurtosis": true,
        "sobol_indices": false,
        "checkpoint_interval": 50
    },
    "launcher_config": {
        "scheduler": "openmpi",
        "scheduler_arg_client": ["-n", "1","--timeout", "60", "-x", "MELISSA_COUPLING=1"],
        "scheduler_arg_server": ["-n", "1","--timeout", "3600"],
        "fault_tolerance": false,
        "job_limit": 2,
        "timer_delay": 1,
        "verbosity": 2
    }

L2 norm

def calculate_l2_norm(file1, file2):

    n1 = np.loadtxt(file1, dtype=np.float64)
    n2 = np.loadtxt(file2, dtype=np.float64)
    first_n = min(len(n1), len(n2))
    l2_norm = np.sqrt(np.sum((n1[:first_n] - n2[:first_n]) ** 2))

    print(f"file={os.path.split(file1)[-1]}\t"
          f"l2-norm={l2_norm:.3e}")
    return l2_norm

I store the results in separate directories and then use the function below to compute l2 norms for each file with same names.

>>> ls
l2norm.py  nonsobol_ranks_1  nonsobol_ranks_4  sobol_ranks_1  sobol_ranks_4


>>> python3 l2norm.py nonsobol_ranks_1 nonsobol_ranks_4
Dir 1: nonsobol_ranks_1
Dir 2: nonsobol_ranks_4

file=results.temperature_skewness.100	l2-norm=0.000e+00
file=results.temperature_kurtosis.001	l2-norm=0.000e+00
file=results.temperature_variance.100	l2-norm=0.000e+00
file=results.temperature_mean.100	l2-norm=0.000e+00
file=results.temperature_mean.001	l2-norm=0.000e+00
file=results.temperature_variance.001	l2-norm=0.000e+00
file=results.temperature_kurtosis.100	l2-norm=0.000e+00
file=results.temperature_skewness.001	l2-norm=0.000e+00



>>> python3 l2norm.py sobol_ranks_1 sobol_ranks_4
Dir 1: sobol_ranks_1
Dir 2: sobol_ranks_4

file=results.temperature_sobol0.001	l2-norm=3.108e+00
file=results.temperature_sobol_tot1.001	l2-norm=4.139e+00
file=results.temperature_sobol_tot0.100	l2-norm=4.139e+00
file=results.temperature_sobol1.100	l2-norm=2.532e+00
file=results.temperature_skewness.100	l2-norm=0.000e+00
file=results.temperature_kurtosis.001	l2-norm=0.000e+00
file=results.temperature_sobol_tot0.001	l2-norm=4.139e+00
file=results.temperature_variance.100	l2-norm=0.000e+00
file=results.temperature_mean.100	l2-norm=0.000e+00
file=results.temperature_mean.001	l2-norm=0.000e+00
file=results.temperature_sobol1.001	l2-norm=3.815e+00
file=results.temperature_variance.001	l2-norm=0.000e+00
file=results.temperature_kurtosis.100	l2-norm=0.000e+00
file=results.temperature_sobol_tot1.100	l2-norm=4.139e+00
file=results.temperature_skewness.001	l2-norm=0.000e+00
file=results.temperature_sobol0.100	l2-norm=2.533e+00

Vimdiff

First two values are different for rank > 0.
In this case, Each server rank receives (10000 / 4) values.

AbhishekP · September 24, 2024, 4:11pm

I run a single group of 4 clients for 1 and 2 server ranks. This is how the communication is taking place with 2 rank server where rank 0 only takes the top-half of the mesh and rank 1 the bottom-half. I see no phantom cells being sent before computing the stats after comparing 2 rank server study with a single rank server.

print(X.shape) # X contains 0th timestep serialized for a single rank server study.
print(Y1.shape) # Y1 contains 0th timestep serialized top-half arrays at server rank 0
print(Y2.shape) # Y2 contains 0th timestep serialized bottom-half arrays at server rank 1
i1 = 0
i2 = 0
for _ in range(4):
    print((X[i1:i1 + 10000] == np.concatenate((Y1[i2:i2 + 5000], Y2[i2:i2 + 5000]), axis=0)).all())
    i1 += 10000
    i2 += 5000


(40000,)
(20000,)
(20000,)
True
True
True
True

Another thing I noticed is that in sensitivity_analysis_server.py (regardless of the number of server ranks),

Line 575 pearson_b is always [nan nan]
Line 598 pearson_a is always [0. 0.]

always update the first two values w.r.t each rank’s d_buffer. That is why, we probably see the difference in values between single-rank server and multi-rank server studies (vimdiff). I guess even the single rank server’s first two sobol result values could be wrong as well.

AbhishekP · October 10, 2024, 8:56am

The main issue with these high errors was caused due to the incorrect usage of pearson_A and pearson_B attributes of IterativeSensitivityMartinez. Seems the previously developers just copy-pasted the way other moments were gathered and forgot to access these matrices across one single dimension like pearson_A[param] which was only updating the first two values in each rank’s buffers which were storing kurtosis values. By doing, pearson_A[:, param], we ensure we update all values in the buffer.

But, we still need to update the iterative_stats to ensure that the memory access is contiguous across each param. Right now, [:, param] causes column-wise slicing which is very bad.

Topic		Replies	Views
Statistical differences noticed from ZMQ -> ADIOS2 Sensitivity Analysis	3	36	May 9, 2024
Strange behaviour of Adios with mpirun (when group size > 1 for sobol) Bugs and Error Reporting	0	29	April 4, 2024
Iterative statistics update - [breaking change] Software Updates	0	52	June 29, 2023
OAR fails with deep learning heat-pde example Bugs and Error Reporting	4	30	March 11, 2024
Adios integration General	4	67	September 23, 2024

Sobol results changes for server ranks > 0 (Develop Branch)

Basic config for all runs

L2 norm

Vimdiff

Related topics