Optimizations on NVIDIA multi-GPUs 
==================================

.. _strategies:

This chapter describes optimizations for **R&G distributed** runs of QuantumESPRESSO on NVIDIA GPUs, in particular on **Leonardo Booster**.

Reducing the memory footprint 
---------------------------------

.. note::

   The optimizations described in this section improve the **memory footprint** of QE simulations; by enabling runs on a smaller number of nodes, they reduce the overall cost of PWSCF simulations.

OpenACC data management
^^^^^^^^^^^^^^^^^^^^^^^

The offload of PWSCF to NVIDIA GPUs is based on both **OpenACC** and **CUDA Fortran**. When QE runs with high verbosity, the logfile reports the amount of GPU memory used at each SCF iteration. In the develop version after the 7.2 release, we observed an unexpected increase of GPU memory per SCF step; this increase prevents reaching convergence on the minimum number of nodes for some workloads, e.g. the Cri3 case benchmarked with older versions of the suite. Tracing the CUDA memory usage of the application with Nsight Systems attributed this increase to OpenACC memory management.

.. note::

   Since allocation/deallocation is expensive, when data are deleted from the GPU with OpenACC, the runtime might not actually release memory from the OpenACC memory pool back to the CUDA driver, so that it can be re-used for the next allocation. An out-of-memory error can therefore arise in codes where memory is managed by both the CUDA driver and OpenACC: the memory pool of the latter can grow to the point that a CUDA allocation fails.

To disable this pooling optimization, which reduces the runtime but increases the memory footprint on GPUs, the environment variable

.. code-block:: console

   export PGI_ACC_MEM_MANAGE=0

can be used. This solved the issue for the non-aware version of the code, but not for the aware version, whose logfile still reports a much larger memory usage (see the snapshot below).

.. code-block:: console

   # non aware case, output of $grep "GPU memory" cri3.out
   ./cri3.out:     GPU memory used/free/total (MiB): 14627 / 50341 / 64969
   ./cri3.out:     GPU memory used/free/total (MiB): 13259 / 51709 / 64969
   ./cri3.out:     GPU memory used/free/total (MiB): 13451 / 51517 / 64969
   ./cri3.out:     GPU memory used/free/total (MiB): 13451 / 51517 / 64969
   ./cri3.out:     GPU memory used/free/total (MiB): 13823 / 51145 / 64969
   ./cri3.out:     GPU memory used/free/total (MiB): 16065 / 48903 / 64969
   ./cri3.out:     GPU memory used/free/total (MiB): 16257 / 48711 / 64969
   ./cri3.out:     GPU memory used/free/total (MiB): 19481 / 45487 / 64969
   ./cri3.out:     GPU memory used/free/total (MiB): 13109 / 51859 / 64969
   ./cri3.out:     GPU memory used/free/total (MiB): 10587 / 54381 / 64969
   ./cri3.out:     GPU memory used/free/total (MiB): 10587 / 54381 / 64969
   ./cri3.out:     GPU memory used/free/total (MiB): 10587 / 54381 / 64969
   ./cri3.out:     GPU memory used/free/total (MiB): 10587 / 54381 / 64969
   ./cri3.out:     GPU memory used/free/total (MiB): 10587 / 54381 / 64969
   ./cri3.out:     GPU memory used/free/total (MiB): 10587 / 54381 / 64969
   ./cri3.out:     GPU memory used/free/total (MiB): 10587 / 54381 / 64969

   # aware case, output of $grep "GPU memory" cri3.out
   ./cri3.out:     GPU memory used/free/total (MiB): 28273 / 36695 / 64969
   ./cri3.out:     GPU memory used/free/total (MiB): 42121 / 22847 / 64969
   ./cri3.out:     GPU memory used/free/total (MiB): 46437 / 18531 / 64969
   ./cri3.out:     GPU memory used/free/total (MiB): 46437 / 18531 / 64969
   ./cri3.out:     GPU memory used/free/total (MiB): 46437 / 18531 / 64969
   ./cri3.out:     GPU memory used/free/total (MiB): 48717 / 16251 / 64969
   ./cri3.out:     GPU memory used/free/total (MiB): 49367 / 15601 / 64969
   ./cri3.out:     GPU memory used/free/total (MiB): 49743 / 15225 / 64969
   ./cri3.out:     GPU memory used/free/total (MiB): 27151 / 37817 / 64969
   ./cri3.out:     GPU memory used/free/total (MiB): 40999 / 23969 / 64969
   ./cri3.out:     GPU memory used/free/total (MiB): 45315 / 19653 / 64969
   ./cri3.out:     GPU memory used/free/total (MiB): 45315 / 19653 / 64969
   ./cri3.out:     GPU memory used/free/total (MiB): 45315 / 19653 / 64969
   ./cri3.out:     GPU memory used/free/total (MiB): 47595 / 17373 / 64969
   ./cri3.out:     GPU memory used/free/total (MiB): 48245 / 16723 / 64969
   ./cri3.out:     GPU memory used/free/total (MiB): 48621 / 16347 / 64969

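The two listings can be compared by extracting the peak of the ``used`` column. The following is a minimal sketch, assuming the log line format shown above; the function name ``gpu_mem_peak`` is hypothetical and not part of QE.

.. code-block:: bash

   #!/bin/bash
   # Hypothetical helper: print the largest "used" value (MiB) among the
   # "GPU memory used/free/total (MiB): U / F / T" lines of a PWSCF logfile.
   gpu_mem_peak() {
       awk -F': *' '/GPU memory used/ {
           split($NF, field, " / ")    # field[1] = used, field[2] = free
           if (field[1] + 0 > peak) peak = field[1] + 0
       }
       END { print peak + 0 }' "$1"
   }

Calling ``gpu_mem_peak cri3.out`` on logs containing the lines listed above would print 19481 for the non-aware case and 49743 for the aware case; the function prints 0 if no matching line is found.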

IPC caching
^^^^^^^^^^^

Further application tracing with Nsight Systems attributed this additional increase to the **IPC transport layer**, which is used in FFTXlib when distributing over multiple nodes. Indeed, we observed that in FFTXlib the non-blocking MPI calls for intra-node communications (``MPI_Isend`` + ``MPI_Irecv`` + ``MPI_Waitall``) invoke the CUDA API *cuIpcOpenMemHandle* without a corresponding *cuIpcCloseMemHandle*.

.. note::
   This behavior is due to a caching mechanism implemented in the library, which holds on to registrations even after the data transfer is complete, because creating them is expensive.
      
This issue has recently been experienced by other codes implementing similar communications with OpenMPI as well, as reported in `this issue <https://github.com/open-mpi/ompi/issues/12849>`_.

In the case of QuantumESPRESSO, when inter-node MPI communications dominate the time spent in MPI, the IPC transport layer can be disabled in favour of alternative ones (e.g. cuda_copy) with

.. code-block:: console
 
   export UCX_TLS=^cuda_ipc

.. warning::
   Despite reducing the efficiency of intra-node communications, this environment variable improves overall performance: in the distributed FFTXlib, MPI tasks communicate in **All-to-All patterns** and the transfer time is dominated by the slowest path, that is, inter-node communication (InfiniBand). Limiting the cache size is an alternative solution, still to be tested, that would preserve NVLink for intra-node communications.
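
Since the benefit is tied to inter-node traffic, one possible way to apply the setting is to restrict it to multi-node jobs. The sketch below assumes a SLURM job, where ``SLURM_NNODES`` is set by the scheduler, and uses the node count as a rough proxy for the regime in which inter-node communications dominate; the function name and this criterion are assumptions, not part of QE or UCX.

.. code-block:: bash

   #!/bin/bash
   # Hypothetical helper: exclude the cuda_ipc transport only when the job
   # spans more than one node. The node count is taken from the first
   # argument, or from SLURM_NNODES when no argument is given.
   configure_ucx_tls() {
       local nnodes="${1:-${SLURM_NNODES:-1}}"
       if [ "$nnodes" -gt 1 ]; then
           export UCX_TLS='^cuda_ipc'
       fi
   }

   configure_ucx_tls
   echo "UCX_TLS=${UCX_TLS:-<UCX default>}"

On single-node runs the variable is left untouched, so intra-node transfers keep using the default transport selection.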


Interconnect optimizations for FFTXlib
--------------------------------------

.. note::

   The optimizations described in this section reduce the **time to solution** for QE runs distributed with R&G over multiple nodes.

The distributed version of FFTXlib for GPUs (batched, aware) uses non-blocking MPI communications based on the ``MPI_Isend`` + ``MPI_Irecv`` + ``MPI_Waitall`` APIs. In these *All-to-All* patterns, when using the OpenMPI library, communication follows either the *rendez-vous* or the *eager* protocol, for large and small messages respectively. When using the minimum number of nodes allowed by memory constraints, the protocol is typically *rendez-vous*.

NIC-MPI binding to increase BW
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

On Leonardo, the communication setup of the OpenMPI implementation is controlled by **UCX parameters**. The most relevant for this discussion are

.. code-block:: console

   UCX_MAX_RNDV_RAILS=2
   UCX_NET_DEVICES=all

On Leonardo Booster each node is equipped with 4 NICs to communicate with other nodes, and these NICs are equally distant from the CPU. When launching aware inter-node communications between two nodes with 4 MPI processes, each bound to one GPU, the four processes see the NICs as equally distant and, because of the maximum number of rails allowed for the rendez-vous protocol (``UCX_MAX_RNDV_RAILS=2``), all of them use the first two NICs for communications. This can be checked with Nsight Systems by tracing the application on 2+ nodes with the ``--nic-metrics`` tracing option. The timeline view below shows the NIC metrics for a 2-node run on Leonardo without any optimization.

.. figure:: pictures/nic2.svg
  :width: 80%
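
A tracing run of this kind could be launched as sketched below. The node count, executable, input and report names are placeholders, and the exact ``nsys`` option syntax is an assumption that should be checked against the installed Nsight Systems version.

.. code-block:: bash

   # One report per MPI rank: nsys expands %q{VAR} from the environment.
   srun --nodes=2 --ntasks-per-node=4 \
        nsys profile --trace=cuda,mpi --nic-metrics=true \
        --output=report_rank%q{SLURM_PROCID} \
        pw.x -i cri3.in

The per-rank reports can then be opened in the Nsight Systems GUI to inspect the traffic on each NIC.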

This configuration is not efficient for FFTXlib when distributing the R&G spaces over a limited number of nodes, because each message passing through the NIC in the all-to-all communication pattern is large enough to saturate the bandwidth of a single NIC (the transfer is bandwidth-bound rather than latency-bound). To improve the overall bandwidth between two nodes in aware communications, we can bind each MPI task and its GPU to the nearest NIC, by using the following wrapper:

.. code-block:: bash

   #!/bin/bash
   # Bind each MPI task to the GPU and the NIC with the same local index.
   case "${SLURM_LOCALID}" in
   0) export UCX_NET_DEVICES=mlx5_0:1 CUDA_VISIBLE_DEVICES=0 ;;
   1) export UCX_NET_DEVICES=mlx5_1:1 CUDA_VISIBLE_DEVICES=1 ;;
   2) export UCX_NET_DEVICES=mlx5_2:1 CUDA_VISIBLE_DEVICES=2 ;;
   3) export UCX_NET_DEVICES=mlx5_3:1 CUDA_VISIBLE_DEVICES=3 ;;
   esac

   echo "Launching on ${UCX_NET_DEVICES}"

   # Run the wrapped command, preserving its arguments.
   "$@"

ensuring that all 4 NICs are used. 
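
Assuming the wrapper is saved as ``binder.sh`` (a hypothetical file name), it could be placed between the launcher and the executable as sketched below; the node count, executable and input names are placeholders.

.. code-block:: bash

   chmod +x binder.sh
   # Each of the 4 tasks per node runs the wrapper, which sets its own
   # UCX_NET_DEVICES and CUDA_VISIBLE_DEVICES before starting pw.x.
   srun --nodes=2 --ntasks-per-node=4 ./binder.sh pw.x -i cri3.in > cri3.out

The wrapper prints one ``Launching on ...`` line per task, which makes it easy to verify that four distinct NICs are selected on each node.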

Tests performed on FFTXlib with the above binding show an overall improvement, as reported in the graphs at the end of the chapter.

Favoring rendezvous over eager protocol
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When increasing the number of nodes for the R&G distribution, the size of the exchanged buffers decreases, up to the point where the eager protocol is favored over rendez-vous. In this regime, enabling more rails reduces the efficiency. To prevent this behavior, we can lower the threshold for the rendez-vous protocol with

.. code-block:: console

   export UCX_RNDV_THRESHOLD=0

Similar performance can be achieved with the HPC-X implementation of MPI available from the NVIDIA HPC SDK compiler suite.

The pictures below show the performance of the FFTXlib library for 3 different grid sizes and different combinations of compile-time/runtime optimizations: (i) an OpenMPI installation, (ii) an OpenMPI installation with the binder mapping NICs to MPI tasks, (iii) an OpenMPI installation with the binder and the environment variable for *rendez-vous*, (iv) an HPC-X installation.

.. image:: pictures/comm/ecut-225-180.svg
  :width: 30%

.. image:: pictures/comm/ecut-800-324.svg
  :width: 30%

.. image:: pictures/comm/ecut-2000-512.svg
  :width: 30%

As highlighted in the two pictures below, the HPC-X installation improves the time to solution as well as the scaling efficiency.

.. image:: pictures/comm/openmpi.svg
  :width: 49%

.. image:: pictures/comm/hpcx.svg
  :width: 49%

Overall gain in QE
------------------

To demonstrate the overall performance achieved with the above optimizations for memory and inter-node bandwidth, we ran the simulation of the Cri3 system (a few steps) with an increasing number of optimizations, following the discussion in this chapter. We compare the following setups:

#. No optimization envs: aware case vs non aware case

#. NIC binding: aware case vs non aware case

#. NIC binding without IPC: aware case vs non aware case

#. NIC binding without IPC and reduced rendez-vous threshold: aware case vs non aware case
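
The four setups can be reproduced by combining the settings introduced in this chapter. The sketch below is one possible mapping, not the script used for the benchmark: the function name ``setup_env``, the ``WRAPPER`` variable and the wrapper file name ``binder.sh`` are hypothetical, and the choice between the aware and non-aware versions of the code is not covered.

.. code-block:: bash

   #!/bin/bash
   # Hypothetical helper: configure the environment for setup 1-4 of the
   # list above. WRAPPER holds the NIC-binding wrapper, or is empty.
   setup_env() {
       unset UCX_TLS UCX_RNDV_THRESHOLD
       WRAPPER=""
       case "$1" in
       1) ;;                                   # no optimization
       2) WRAPPER="./binder.sh" ;;             # NIC binding
       3) WRAPPER="./binder.sh"                # NIC binding without IPC
          export UCX_TLS='^cuda_ipc' ;;
       4) WRAPPER="./binder.sh"                # ... and rendez-vous threshold
          export UCX_TLS='^cuda_ipc' UCX_RNDV_THRESHOLD=0 ;;
       *) echo "unknown setup: $1" >&2; return 1 ;;
       esac
   }

After ``setup_env 3``, for instance, a job step would be launched as ``srun $WRAPPER pw.x ...``, with ``WRAPPER`` left unquoted so that an empty value disappears from the command line.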

The following picture provides the time to solution of the *electrons* routine for the above cases in an R&G distributed run of Cri3.

.. image:: pictures/comm/electrons-cri3.svg
  :width: 100%