GYSELA5D: a GYrokinetic SEmi-LAgrangian code in 5D

Memory Scalability Optimisations

Computing a gyrokinetic simulation requires a large amount of memory. Our work on memory optimisation is motivated by two main reasons. First, we aim to simulate large meshes in order to understand plasma turbulence more accurately. Second, it is expected that future Exascale machines will have less memory per core than current supercomputers.

To improve the memory scalability of a parallel application, the goal is to reduce the memory overhead introduced by the parallelization of the code: for example, MPI send/receive buffers, data structures private to each thread, or ghost areas introduced by the domain decomposition.

Memory Trace File

Thanks to an instrumentation of the Gysela code, information about memory consumption is collected during execution and stored in a trace file. For more information, please refer to this article.
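As an illustration, here is a minimal sketch of how such a trace could be replayed after the run. The record layout (one "alloc"/"free" event per line, with an array name and a size in bytes) is an assumption made for the example, not the actual Gysela trace format.

```python
# Minimal sketch: replay a memory trace and rebuild the running
# memory consumption. The record layout is hypothetical, not the
# actual Gysela trace specification.

def read_memory_curve(path):
    """Replay allocation events and return the running memory curve."""
    current, curve = 0, []
    with open(path) as trace:
        for line in trace:
            kind, name, size = line.split()
            current += int(size) if kind == "alloc" else -int(size)
            curve.append((name, current))
    return curve

curve = read_memory_curve("gysela_memory.trace")
name, peak = max(curve, key=lambda event: event[1])
print(f"peak of {peak / 2**30:.2f} GB reached around array {name}")
```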

Visualization tool

The visualization tool allows the developer to quickly locate the memory peak. On the two figures below, the X axis shows the entry/exit of subroutines and the Y axis indicates the amount of memory consumed.

Figures: curve of the memory consumption over time (left); array allocations over time (right).

On the left figure, the memory peak of the simulation is identified: it is easy to locate the subroutine in which the global memory peak happens. Once this location is known, the right figure shows which arrays are allocated at that moment.
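Here is a sketch of the data behind these two views, under the same hypothetical trace format as above: the running curve gives the left figure, and a snapshot of the live arrays at the maximum gives the right one (matplotlib is assumed for plotting).

```python
# Sketch: memory curve over the event index, plus the arrays still
# allocated when the global peak is reached. Same hypothetical trace
# format as in the previous snippet.
import matplotlib.pyplot as plt

current, peak, alive, at_peak, curve = 0, 0, {}, {}, []
with open("gysela_memory.trace") as trace:
    for line in trace:
        kind, name, size = line.split()
        if kind == "alloc":
            alive[name] = int(size)
            current += int(size)
        else:  # "free"
            current -= alive.pop(name, int(size))
        curve.append(current)
        if current > peak:
            peak, at_peak = current, dict(alive)

plt.plot(curve)
plt.xlabel("entry/exit of subroutines")
plt.ylabel("memory consumption (bytes)")
plt.show()

# The largest contributors at the peak are the candidates to move.
for name, size in sorted(at_peak.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {size / 2**30:.2f} GB live at the global peak")
```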

To decrease the memory peak, we must avoid overlapping array allocations: two large temporary arrays whose lifetimes overlap both contribute to the peak.

Figures: memory peak before (left) and after (right) the optimization.

The two figures above show the before/after optimization pictures. Thanks to this single optimization, the memory peak decreased by 4.58 GB, i.e. a memory gain of about 10 % on a simulation over 32K cores.
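The principle can be illustrated with a toy example (the sizes below are made up, not Gysela's actual arrays): the peak is the maximum of the running sum of live allocations, so freeing a large temporary before allocating the next one lowers it.

```python
def peak_memory(events):
    """events: ordered list of signed allocation sizes in GB."""
    current = peak = 0
    for delta in events:
        current += delta
        peak = max(peak, current)
    return peak

# Two temporaries of 6 GB and 5 GB: overlapping lifetimes vs. the
# second array allocated only after the first one is freed.
print(peak_memory([+6, +5, -6, -5]))  # 11 -> both alive at once
print(peak_memory([+6, -6, +5, -5]))  #  6 -> allocations no longer overlap
```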

Prediction tool

After instrumenting the whole code, one can notice that the bounds of the allocated arrays generally depend on a small set of input parameters. In our case, this set consists of the number of points in each direction of the mesh, plus a few parallelization parameters.

The idea of the prediction tool is to obtain, offline, the memory consumption for a given input set. To make this possible, the trace file must record the relationship, when it exists, between the array bounds and the input parameters. This tool allows us to study the memory scalability of Gysela offline. For more information, please refer to this article.
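A minimal sketch of such an offline prediction, assuming each traced array carries its size as an expression of the input parameters; the parameter names and the two expressions below are illustrative, not the real dependencies recorded in the Gysela trace.

```python
# Offline evaluation of array sizes from the input parameters alone.
# Names (Nr, Ntheta, ...) and expressions are illustrative assumptions.
params = {"Nr": 1024, "Ntheta": 4096, "Nphi": 1024,
          "Nvpar": 128, "Nmu": 2, "mpi_procs": 2048}

arrays = [  # (array name, number of 8-byte reals as an expression)
    ("f_local", "Nr * Ntheta * Nphi * Nvpar * Nmu // mpi_procs"),
    ("send_buffer", "Nr * Ntheta * Nvpar"),
]

total = sum(8 * eval(expr, {}, params) for _, expr in arrays)
print(f"predicted memory per MPI process: {total / 2**30:.1f} GB")
```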

Results

Before any optimisation, the memory consumption of Gysela on a 1024 × 4096 × 1024 × 128 × 2 mesh was as follows:

Number of cores                   2k      4k      8k     16k     32k
Number of MPI processes          128     256     512    1024    2048
Total per MPI process (GB)     311.5   179.6   114.2    83.0    67.5

Thanks to these tools, many allocations were moved and several arrays were removed. After all optimisations, the memory peak of the simulation over 32K cores is reduced to 50.8 % of its original value, i.e. roughly halved.

Number of cores                   2k      4k      8k     16k     32k
Number of MPI processes          128     256     512    1024    2048
Total per MPI process (GB)     261.5   145.9    81.9    52.3    34.3
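As a quick arithmetic check on the two tables (at 32K cores, 34.3/67.5 ≈ 50.8 %, the ratio quoted above):

```python
# Per-process memory gain at each core count, from the tables above.
before = {"2k": 311.5, "4k": 179.6, "8k": 114.2, "16k": 83.0, "32k": 67.5}
after  = {"2k": 261.5, "4k": 145.9, "8k": 81.9,  "16k": 52.3, "32k": 34.3}
for cores in before:
    gain = 100 * (before[cores] - after[cores]) / before[cores]
    print(f"{cores} cores: {gain:.1f} % less memory per MPI process")
```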