Hey @rob,
Issues with the commit
Firstly, I checked out your commit and, after running the `build-and-install.sh` script, I got the following errors:
```
Building and installing gitlab-runner in virtual cluster temp-
Error: Failed to get definition: files.*.generator must be one of [dump copy template hostname hosts remove cloud-init incus-agent fstab]
ERROR [2024-03-04T12:57:11+01:00] Failed running distrobuilder err="Failed to get definition: files.*.generator must be one of [dump copy template hostname hosts remove cloud-init incus-agent fstab]"
INFO [2024-03-04T12:57:11+01:00] Removing cache directory
Error: open lxd.tar.xz: no such file or directory
```
I simply replaced `lxd.tar.xz` with `incus.tar.xz` as it was the only tarball available.
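For reference, the change on my side boils down to swapping the tarball name the build script looks for. Roughly something like this (where exactly the name appears in `build-and-install.sh` may differ, this is just an illustration):

```sh
# Hypothetical one-liner illustrating the fix: point the script at the tarball
# the incus tooling actually produces (incus.tar.xz) instead of lxd.tar.xz.
sed -i 's/lxd\.tar\.xz/incus.tar.xz/g' build-and-install.sh
```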
Secondly, when the script calls `launch-virtual-cluster.py`, it tries to download `gitlab-runner` over the network. However, the LXC containers are not able to access the host network. To resolve this issue, I flushed the iptables rules as instructed here.
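For reference, the flush on the host side is something like the following (note this is a blunt workaround that wipes all existing filter rules, so only do it on a machine where you don't rely on other iptables rules):

```sh
# Host side: drop all filter rules so traffic from the LXC bridge is no longer blocked.
sudo iptables -F
# If forwarding is still blocked (e.g. Docker sets the FORWARD policy to DROP),
# this may also be needed:
sudo iptables -P FORWARD ACCEPT
```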
Lastly, there were also some minor issues, such as the `PYTHONPATH` for ADIOS2 pointing to a different location and needing to be updated.
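On the head node this amounts to exporting the new install prefix, along these lines (the exact bindings sub-path is an assumption on my side and depends on the Python version, so check your `adios2-install` tree):

```sh
# ADIOS2 is built into ~/adios2-install (see the listing below);
# the python sub-directory here is an assumption.
export PYTHONPATH="$HOME/adios2-install/lib/python3/dist-packages:$PYTHONPATH"
```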
So, my commits deal with all of these issues; I was able to resolve them and finally build the VC.
SLURM-specific
Logging into the VC’s head node `slurm-0` and checking that everything is in place:
```
(.cienv) john@melissa-slurm-0:~$ ll
total 333229
drwxr-xr-x 9 john john 16 Feb 28 17:13 ./
drwx--x--x 5 root root 5 Feb 29 12:12 ../
-rw------- 1 john john 9881 Feb 29 16:50 .bash_history
-rw-rw-r-- 1 john john 252 Feb 28 17:12 .bashrc
drwxrwxr-x 3 john john 3 Feb 23 10:08 .cache/
drwxrwxr-x 6 john john 8 Feb 23 10:11 .cienv/
drwxrwxr-x 2 john john 3 Feb 23 13:57 .keras/
-rw------- 1 john john 68 Feb 28 17:13 .python_history
drwxrwxr-x 16 john john 31 Feb 28 17:03 ADIOS2/
-rwxrwxr-x 1 john john 341691657 Feb 23 12:45 NVIDIA-Linux-x86_64-535.154.05.run*
drwxrwxr-x 17 john john 36 Feb 28 17:10 adios2-build/
drwxrwxr-x 7 john john 7 Feb 28 17:10 adios2-install/
drwxrwxr-x 12 john john 31 Feb 28 15:01 melissa/
-rw-r--r-- 1 john john 278 Feb 23 10:08 requirements.txt
-rw-r--r-- 1 john john 151 Feb 23 10:08 requirements_deep_learning.txt
-rw-r--r-- 1 john john 386 Feb 23 10:08 requirements_dev.txt
(.cienv) john@melissa-slurm-0:~$ melissa-
melissa-launcher melissa-monitor melissa-server
```
By default, the compute partition that is created in SLURM is always in `state=DOWN`.
```
(.cienv) john@melissa-slurm-0:~$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 1:00:00 11 down* melissa-slurm-[1-11]
```
I update the state using:

```
lxc exec melissa-slurm-0 -- scontrol update nodename=melissa-slurm-[1-11] state=IDLE
```

However, even though I update it manually, once I reboot my computer the partition goes back to `state=DOWN`.
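A possibly more persistent fix, which I have not tried yet, would be setting `ReturnToService=2` in `slurm.conf`, so that a DOWN node becomes usable again as soon as its slurmd registers with a valid configuration. Roughly (the slurm.conf location inside the container is an assumption, and if the parameter is already set it should be edited rather than appended):

```sh
# Untested: let DOWN nodes return to service automatically on slurmd registration,
# instead of requiring a manual scontrol update after every reboot.
lxc exec melissa-slurm-0 -- bash -c \
  'echo "ReturnToService=2" >> /etc/slurm/slurm.conf && scontrol reconfigure'
```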
I haven’t confirmed the following yet: if I do NOT flush the iptables rules and still execute `scontrol update state=IDLE`, the partition goes back to `state=DOWN` within a few seconds.
So, the sequence flush iptables > update node states > run SLURM with the following configuration seems to be working:
```json
// Please make sure that all entries preceded by a comment including
// the "FIXME" keyword are changed before running Melissa with this
// config file
{
    // FIXME: the server_filename enables to switch between torch (default) and tensorflow servers (tf_heat_pde_dl.py)
    "server_filename": "heatpde_dl_server.py",
    "server_class": "HeatPDEServerDL",
    "output_dir": "STUDY_OUT",
    "study_options": {
        "field_names": [
            "temperature"
        ],
        // parameter_sweep_size is the number of clients (i.e. simulations) to execute
        "parameter_sweep_size": 5,
        // num_samples is the number *expected* from the simulation, not the set number
        // if this number is not provided the server will get it at client finalization
        "num_samples": 100,
        "nb_parameters": 5,
        "parameter_range": [100, 500],
        // this option sets Nx = Ny = mesh_size
        "mesh_size": 100,
        // this option yields dt = 1 / time_discretization but does not change num_samples
        "time_discretization": 100,
        "seed": 123,
        "simulation_timeout": 400,
        "crashes_before_redraw": 1000,
        "verbosity": 2
    },
    "dl_config": {
        "n_batches_update": 10,
        "batch_size": 10,
        "per_server_watermark": 100,
        "buffer_size": 6000,
        "zmq_hwm": 10,
        "buffer": "FIFO"
    },
    "launcher_config": {
        "scheduler": "slurm",
        "scheduler_arg_server": [
            "--nodes=1",
            "--ntasks=1",
            "--threads-per-core=1",
            "--hint=nomultithread",
            "--time=01:00:00"
        ],
        "scheduler_arg_client": [
            "--nodes=1",
            "--ntasks=2",
            "--time=00:30:00",
            "--exclusive"
        ],
        "job_limit": 5,
        "timer_delay": 1,
        "fault_tolerance": false,
        "verbosity": 2
    },
    "client_config": {
        // FIXME: the executable command needs to be replaced with the appropriate path
        "executable_command": "~/melissa/examples/heat-pde/executables/build/heatc",
        // all bash commands to be executed on the job node prior to melissa study
        "preprocessing_commands": [
            "source ~/.cienv/bin/activate"
        ]
    },
    "server_config": {
        "preprocessing_commands": [
            "source ~/.cienv/bin/activate"
        ]
    }
}
```
It produces the `STUDY_OUT/checkpoints` folder.
Questions:
- Did you face any issues with SLURM nodes going down like this?
- Is the sample configuration above correct for testing SLURM?
- Probably unnecessary, but were you able to use the local GPU with LXC?
  I tried running `lxc config set melissa-slurm-0 nvidia.driver.capabilities=all nvidia.runtime=true` and also installed the CUDA toolkit inside the container, but it did not work (see the sketch below for what I would try next).
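For what it's worth, the next thing I would try (just a guess on my side, not verified) is attaching the GPU as an explicit device on top of the `nvidia.runtime` settings:

```sh
# Attach the host GPU to the container with the standard LXD/Incus gpu device type,
# then check whether the driver is visible from inside.
lxc config device add melissa-slurm-0 gpu0 gpu
lxc exec melissa-slurm-0 -- nvidia-smi
```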