Adios integration

Hey everyone!

We are experimenting with replacing ZMQ with Adios2 for the API. Our goals are to:

  • increase speed (leverage the InfiniBand network)
  • increase data transit flexibility (changing sizes, types, etc.)
  • ease data structure requirements (Adios checks that it is receiving the right structure)

If you want to follow along, check out the MR progress here

Cheers,

Rob

Some feedback from our journey to use ADIOS for Melissa NxM data communications:

  • Making each server process an ADIOS writer is the way to go to keep the same code for melissa-dl and melissa-sa. Using a single writer for the whole server implies strong synchronization between the server processes, which Melissa neither enforces nor needs, and which leads to deadlocks in some cases.
  • The Melissa code is simpler, as ADIOS fully takes care of the NxM data exchange, whereas we had to implement it ourselves on top of ZMQ before.
  • We made a slight change to the Melissa API, basically introducing the step concept from ADIOS, so it becomes easier for an application to push several fields to the server.
  • A simulation code already instrumented with ADIOS can be used directly in Melissa without needing the Melissa API.
  • First tests of Melissa+Adios2 on the Jean-Zay supercomputer are positive. We still need to check whether we can measure significant performance gains compared to ZMQ.
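To make the NxM point above concrete, here is a minimal sketch of the bookkeeping that has to happen somewhere when N writers send a field to M readers. It is not the former Melissa/ZMQ code and not what ADIOS does internally; it assumes a 1D field split as evenly as possible on both sides, and the names `block_range` and `nxm_plan` are hypothetical.

```python
from typing import List, Tuple


def block_range(total: int, parts: int, rank: int) -> Tuple[int, int]:
    """Half-open [start, stop) range owned by `rank` when `total` elements
    are split as evenly as possible across `parts` processes."""
    base, extra = divmod(total, parts)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def nxm_plan(total: int, n_writers: int, m_readers: int) -> List[Tuple[int, int, int, int]]:
    """Return (writer, reader, start, stop) for every non-empty overlap
    between a writer's block and a reader's block."""
    plan = []
    for w in range(n_writers):
        w_start, w_stop = block_range(total, n_writers, w)
        for r in range(m_readers):
            r_start, r_stop = block_range(total, m_readers, r)
            start, stop = max(w_start, r_start), min(w_stop, r_stop)
            if start < stop:  # the two blocks actually intersect
                plan.append((w, r, start, stop))
    return plan


if __name__ == "__main__":
    # 12 elements, 3 writers, 2 readers
    for transfer in nxm_plan(12, 3, 2):
        print(transfer)
```

With 3 writers and 2 readers, the middle writer has to split its block between both readers. That is the kind of case we no longer handle ourselves.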

While trying to redeploy the Melissa+Adios stack with the latest ADIOS version, I ran into errors in the ADIOS Python API. It seems that they changed their API (python: rewrite Python high level API in python · ornladios/ADIOS2@ab3ab37 · GitHub). I will push a commit once I manage to make it run again.

This may be quite fast and easy. Here is the transition example:

https://adios2.readthedocs.io/en/latest/api_python/api_python.html#transition-from-old-api-to-new-api

Looks like all we need to do is convert the adios2.open call:

# OLD API
import adios2
fr = adios2.open(args.instream, "r", mpi.comm_app, "adios2.xml", "SimulationOutput")

# NEW API
from adios2 import Adios, Stream
adios = Adios("adios2.xml", mpi.comm_app)
io = adios.declare_io("SimulationOutput")
fr = Stream(io, args.instream, "r", mpi.comm_app)
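If we want Melissa to keep working with both ADIOS versions during the transition, something like the following could work. It is only a sketch: `open_stream` is a hypothetical helper, it uses only the calls from the transition example above, and detecting the new API by the presence of `Adios` and `Stream` in the module is an assumption I have not checked against every release.

```python
def open_stream(adios2_module, path, mode, comm, config_file, io_name):
    """Open an ADIOS2 stream with whichever Python API the module exposes.

    Assumption: the rewritten API is recognised by the presence of
    `Adios` and `Stream`; otherwise we fall back to the old `adios2.open`.
    """
    if hasattr(adios2_module, "Adios") and hasattr(adios2_module, "Stream"):
        # New API: explicit Adios object, then an IO, then the Stream
        adios = adios2_module.Adios(config_file, comm)
        io = adios.declare_io(io_name)
        return adios2_module.Stream(io, path, mode, comm)
    # Old API: a single call does everything
    return adios2_module.open(path, mode, comm, config_file, io_name)
```

Usage would then look like `import adios2` followed by `fr = open_stream(adios2, args.instream, "r", mpi.comm_app, "adios2.xml", "SimulationOutput")`. Passing the module in as an argument keeps the helper easy to test without an ADIOS install.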

Summary of Adios exploration

The detailed issue can be found here: Summarizing ZMQ vs ADIOS (#29) · Issues · melissa / Melissa · GitLab

  • ZMQ vs Adios Comparison:

    • Ease of Use: ZMQ requires knowledge of socket programming, while Adios offers a more user-friendly API.
    • Communication: ZMQ uses TCP and goes through the OS kernel, while Adios can leverage faster RDMA transports (e.g., InfiniBand).
    • Performance: ZMQ handles simultaneous connections better with threads; Adios performs faster with RDMA but faces overhead due to sequential processing.
    • Data Handling: ZMQ lacks native support for multi-dimensional data, while Adios supports it.
    • Debugging: ZMQ is easier to debug (thread-safe), while Adios is harder to debug because of threading issues.
  • Adios-RDMA Issues:

    • Installation Complexity: Setting up Adios with RDMA is hardware-dependent and complex.
    • Usability: Requires knowledge of low-level libraries (UCX, PSM2, OmniPath, MPI) to fully utilize RDMA.
    • Resource & Memory Issues: Adios often encounters segmentation faults related to memory and RDMA device configuration (e.g., hfi, mlx).
  • Benchmarking on Jean-Zay:

    • Jean-Zay has an Omni-Path (PSM2) interconnect with 100 Gb/s bandwidth per CPU node.
    • Single Client: Achieved 19.1 Gb/s with Adios and 18.5 Gb/s with ZMQ.
    • Multi-client Scenarios: Adios suffers from sequential processing, CPU idle time, and difficulty managing multiple clients, leading to performance bottlenecks.
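To put the single-client numbers in perspective, here is the arithmetic spelled out. It uses only the figures quoted above and treats the 100 Gb/s per node as the reference link rate, which is a simplification.

```python
# Figures taken from the benchmark summary above
LINK_GBPS = 100.0   # Omni-Path bandwidth per CPU node
ADIOS_GBPS = 19.1   # single client, Adios
ZMQ_GBPS = 18.5     # single client, ZMQ

adios_utilization = ADIOS_GBPS / LINK_GBPS
zmq_utilization = ZMQ_GBPS / LINK_GBPS
relative_gain = (ADIOS_GBPS - ZMQ_GBPS) / ZMQ_GBPS

print(f"Adios link utilization: {adios_utilization:.1%}")  # 19.1%
print(f"ZMQ link utilization:   {zmq_utilization:.1%}")    # 18.5%
print(f"Adios gain over ZMQ:    {relative_gain:.1%}")      # 3.2%
```

So with a single client both transports use under a fifth of the link, and the Adios advantage is about 3%.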