Melissa Virtual Cluster Issues

Hey @rob,

Issues with the commit

Firstly, I checked out your commit and, after running the build-and-install.sh script, I got the following errors.

Building and installing gitlab-runner in virtual cluster temp-
Error: Failed to get definition: files.*.generator must be one of [dump copy template hostname hosts remove cloud-init incus-agent fstab]
ERROR  [2024-03-04T12:57:11+01:00] Failed running distrobuilder                  err="Failed to get definition: files.*.generator must be one of [dump copy template hostname hosts remove cloud-init incus-agent fstab]"
INFO   [2024-03-04T12:57:11+01:00] Removing cache directory                     
Error: open lxd.tar.xz: no such file or directory

I simply replaced lxd.tar.xz with incus.tar.xz as it was the only tarball available.
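
For reference, the first error comes from the image definition itself: newer distrobuilder versions only accept the generators listed in the message, i.e. incus-agent instead of the old lxd-agent (which is presumably what the definition still used). A rough sketch of both fixes; the file names are assumptions on my side and the repo may use different ones:

# Sketch only -- file names are assumptions, adjust to what build-and-install.sh actually uses.
# 1) Newer distrobuilder renamed the "lxd-agent" generator to "incus-agent":
sed -i 's/generator: lxd-agent/generator: incus-agent/' image-definition.yaml
# 2) The build now emits incus.tar.xz instead of lxd.tar.xz, so import that tarball:
lxc image import incus.tar.xz rootfs.squashfs --alias melissa-vc-image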

Secondly, when the script calls launch-virtual-cluster.py, it tries to download gitlab-runner over the network. However, the LXC containers are not able to reach the host network. To resolve this issue, I flushed the iptables as instructed here.
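
The flush itself is nothing fancy; roughly what I ran on the host (heavy-handed, and the rules come back after a reboot):

# Clear the host filter rules and allow forwarding so the containers can reach the network:
sudo iptables -F
sudo iptables -P FORWARD ACCEPT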

Lastly, there were also some minor issues, such as the PYTHONPATH for ADIOS2 needing to point to a different location.
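
Concretely, that just means pointing PYTHONPATH at wherever the ADIOS2 Python bindings actually ended up; the exact subdirectory below is an assumption and depends on the install prefix and Python version:

# Adjust the subdirectory to match the actual adios2-install layout:
export PYTHONPATH="$HOME/adios2-install/lib/python3/dist-packages:$PYTHONPATH"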

So, my commits deal with all of these issues; after resolving them I was finally able to build the VC.

SLURM-specific

Logging into the VC's head node melissa-slurm-0 and checking that everything is in place:

(.cienv) john@melissa-slurm-0:~$ ll
total 333229
drwxr-xr-x  9 john john        16 Feb 28 17:13 ./
drwx--x--x  5 root root         5 Feb 29 12:12 ../
-rw-------  1 john john      9881 Feb 29 16:50 .bash_history
-rw-rw-r--  1 john john       252 Feb 28 17:12 .bashrc
drwxrwxr-x  3 john john         3 Feb 23 10:08 .cache/
drwxrwxr-x  6 john john         8 Feb 23 10:11 .cienv/
drwxrwxr-x  2 john john         3 Feb 23 13:57 .keras/
-rw-------  1 john john        68 Feb 28 17:13 .python_history
drwxrwxr-x 16 john john        31 Feb 28 17:03 ADIOS2/
-rwxrwxr-x  1 john john 341691657 Feb 23 12:45 NVIDIA-Linux-x86_64-535.154.05.run*
drwxrwxr-x 17 john john        36 Feb 28 17:10 adios2-build/
drwxrwxr-x  7 john john         7 Feb 28 17:10 adios2-install/
drwxrwxr-x 12 john john        31 Feb 28 15:01 melissa/
-rw-r--r--  1 john john       278 Feb 23 10:08 requirements.txt
-rw-r--r--  1 john john       151 Feb 23 10:08 requirements_deep_learning.txt
-rw-r--r--  1 john john       386 Feb 23 10:08 requirements_dev.txt
(.cienv) john@melissa-slurm-0:~$ melissa-
melissa-launcher  melissa-monitor   melissa-server

By default, the compute partition that is created in SLURM is always in state=DOWN.

(.cienv) john@melissa-slurm-0:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up    1:00:00     11  down* melissa-slurm-[1-11]

I update the node state manually using

lxc exec melissa-slurm-0 -- scontrol update nodename=melissa-slurm-[1-11] state=IDLE

However, once I reboot my computer, the partition goes back to state=DOWN.
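
A possible way to avoid re-running scontrol after every reboot might be Slurm's ReturnToService option (value 2 lets a node marked DOWN rejoin as soon as it registers with a valid configuration). I have not verified this in the VC, and the path to slurm.conf is an assumption:

# Untested sketch -- slurm.conf location is assumed:
lxc exec melissa-slurm-0 -- bash -c \
    'grep -q "^ReturnToService" /etc/slurm/slurm.conf || echo "ReturnToService=2" >> /etc/slurm/slurm.conf'
lxc exec melissa-slurm-0 -- scontrol reconfigure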

I haven’t confirmed the following yet:

If I do NOT flush the iptables but still execute scontrol update state=IDLE, the partition goes back to state=DOWN within a few seconds.

So, the sequence flush iptables > update node states > run SLURM with the following configuration seems to work:

// Please make sure that all entries preceded by a comment including
// the "FIXME" keyword are changed before running Melissa with this
// config file
{
    // FIXME: the server_filename enables to switch between torch (default) and tensorflow servers (tf_heat_pde_dl.py)
    "server_filename": "heatpde_dl_server.py",
    "server_class": "HeatPDEServerDL",
    "output_dir": "STUDY_OUT",
    "study_options": {
        "field_names": [
            "temperature"
        ],
        // parameter_sweep_size is the number of clients (i.e. simulations) to execute
        "parameter_sweep_size": 5,
        // num_samples is the number *expected* from the simulation, not the set number
        // if this number is not provided the server will get it at client finalization
        "num_samples": 100,
        "nb_parameters": 5,
        "parameter_range": [100, 500],
        // this option sets Nx = Ny = mesh_size
        "mesh_size": 100,
        // this option yields dt = 1 / time_discretization but does not change num_samples
        "time_discretization": 100,
        "seed": 123,
        "simulation_timeout": 400,
        "crashes_before_redraw": 1000,
        "verbosity": 2
    },
    "dl_config": {
        "n_batches_update": 10,
        "batch_size": 10,
        "per_server_watermark": 100,
        "buffer_size": 6000,
        "zmq_hwm": 10,
        "buffer": "FIFO"
    },
    "launcher_config": {
        "scheduler": "slurm",
        "scheduler_arg_server": [
            "--nodes=1",
            "--ntasks=1",
            "--threads-per-core=1",
            "--hint=nomultithread",
            "--time=01:00:00"
        ],
        "scheduler_arg_client": [
            "--nodes=1",
            "--ntasks=2",
            "--time=00:30:00",
            "--exclusive"
        ],
        "job_limit": 5,
        "timer_delay": 1,
        "fault_tolerance": false,
        "verbosity": 2
    },
    "client_config": {
        // FIXME: the executable command needs to be replaced with the appropriate path
        "executable_command": "~/melissa/examples/heat-pde/executables/build/heatc",
        // all bash commands to be executed on the job node prior to melissa study
        "preprocessing_commands": [
                "source ~/.cienv/bin/activate"
        ]
    },
    "server_config": {
        "preprocessing_commands": [
                "source ~/.cienv/bin/activate"
        ]
    }
}
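
For reference, the study is then launched from the head node inside the .cienv environment; the exact flag and config file name below are from memory, so double-check with melissa-launcher --help:

# Launch the study with the configuration above (flag and file name assumed):
melissa-launcher --config_name config_heatpde_slurm.json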

It produces the STUDY_OUT/checkpoints folder.

Questions:

  1. Did you face any issues with SLURM nodes going down like this?
  2. Is the sample configuration correct for testing SLURM?
  3. Probably unnecessary, but were you able to use the local GPU with LXC?
    I tried running lxc config set melissa-slurm-0 nvidia.driver.capabilities=all nvidia.runtime=true
    and also installed the cuda-toolkit inside the container, but it did not work (a rough sketch of what I tried is below).
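
For reference, a sketch of the GPU passthrough steps as I understand them; only the lxc config set line was actually run, while the device add and the nvidia-smi check are assumptions on my part about what a complete setup would need:

# GPU passthrough sketch -- only the "config set" line was actually run; device name "gpu0" is arbitrary.
lxc config set melissa-slurm-0 nvidia.driver.capabilities=all nvidia.runtime=true
lxc config device add melissa-slurm-0 gpu0 gpu
lxc restart melissa-slurm-0
lxc exec melissa-slurm-0 -- nvidia-smi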

Hello Abhishek,

Please ignore the Virtual-Cluster-CI repository; it is not meant for installing Melissa on your machine, nor for learning how to use Melissa with SLURM.

Virtual-Cluster-CI is only for configuring the Gitlab runner for the Melissa CI. The Gitlab runner is a machine that automatically runs checks on new commits to the Melissa repository to ensure functionality. Nothing more, nothing less.

Please start over and follow these instructions, Running a virtual cluster - MELISSA, to create your virtual cluster. Please let me know which parts of this procedure do not work for you.

Thank you,

Robert

Oh, okay. I will redo the installation and put the CI part aside for the time being.

Thanks,
Abhishek