# LUMI General Course 30.5.–2.6.2023, 8:00–16:30 (CEST), 9:00–17:30 (EEST)

[Course archive: https://lumi-supercomputer.github.io/LUMI-training-materials/4day-20230530](https://lumi-supercomputer.github.io/LUMI-training-materials/4day-20230530)

:::danger
This is the archive of the Q&A document for the third day of the course. You can find the latest Q&A document [here](https://md.sigma2.no/lumi-general-course).
:::

---

[TOC]

## General Information

- Link to this document: [https://md.sigma2.no/2023-05-lumi-general-course-archive-day3](https://md.sigma2.no/2023-05-lumi-general-course-archive-day3)
- [Course archive: https://lumi-supercomputer.github.io/LUMI-training-materials/4day-20230530](https://lumi-supercomputer.github.io/LUMI-training-materials/4day-20230530)

## Slides, exercises & recordings

All slides will be made accessible during the training on LUMI at `/project/project_465000524/slides`. You need to join the training project via the link you received in the email after you signed up. For CSC users this involves setting up a new account via Puhuri.

Some training documents will also be published on https://lumi-supercomputer.github.io/LUMI-training-materials/4day-20230530/

# Q&A of day 3

:::danger
Please always ask new questions at the end of the document.
:::

## Ice breaker: Which software are you using or going to use on LUMI?

- Vlasiator (hybrid-Vlasov space plasma simulation)
- PyTorch + HIP (if it still makes sense after the presentations)
    - [Kurt] PyTorch makes sense; most of the time is spent in properly optimised code, but you may have to do something about your data set if it consists of many small files. We've had users getting very good results after some optimisation of how they store their data set. However, LUMI is good at AI in FP32, but compared to NVIDIA it is not spectacular in lower-precision data formats for inference.
        - Thanks
- NEMO ... hopefully, GETM for testing
- Geant4, if, hopefully, it supports GPU in the future

## Introduction to Perftools - Perftools-lite modules

::: info
You can find the downloads of Apprentice2 and Reveal on LUMI in `$CRAYPAT_ROOT/share/desktop_installers/`. This only works when the `perftools-base` module is loaded, but that is the case at login.
:::
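As a quick orientation before the exercises, here is a minimal sketch of the typical perftools-lite workflow. The program and file names (`my_prog`, `my_prog.c`) are placeholders, not part of the course material; adapt them to the exercise you are working on.

```bash
# Load the perftools-lite module *before* compiling: it instruments the
# executable at link time on top of the already loaded perftools-base.
module load perftools-lite

# Recompile/relink with the Cray compiler wrappers (cc/CC/ftn).
cc -O2 -o my_prog my_prog.c

# Run as usual; a performance summary is printed at the end of the run and an
# experiment directory (roughly my_prog+pat+<id>) is created for later use
# with pat_report and Apprentice2.
srun --ntasks=4 ./my_prog
```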
1. I get an error when running the perftools-lite example: `srun: error: Unable to create step for job 3622893: More processors requested than permitted`
    - **Answer**: Please copy the example again; it might be that we added `--exclusive` to the batch script, because we are using the small partitions which share nodes.
        - Thanks, that's it!
2. I tried perftools-lite on another example and got the following message from `pat_report`:
    ```
    Observation:  MPI Grid Detection

        There appears to be point-to-point MPI communication in a 4 X 128
        grid pattern. The 24.6% of the total execution time spent in MPI
        functions might be reduced with a rank order that maximizes
        communication between ranks on the same node. The effect of several
        rank orders is estimated below.

        No custom rank order was found that is better than the RoundRobin
        order.

        Rank Order    On-Node    On-Node  MPICH_RANK_REORDER_METHOD
                     Bytes/PE  Bytes/PE%
                               of Total
                                Bytes/PE

        RoundRobin  1.517e+11    100.00%  0
              Fold  1.517e+11    100.00%  2
               SMP  0.000e+00      0.00%  1
    ```
    Normally for this code, SMP rank ordering should make sure that collective communication is all intra-node and inter-node communication is limited to point-to-point MPI calls. So I don't really get why the recommendation is to switch to RoundRobin (if I understand this remark correctly)? Is this recommendation only based on analysing point-to-point communication?
    - **Answer**: Yes, you understood the remark correctly. This warning means that CrayPat detected a suboptimal communication topology and, according to the tool's estimate, a round-robin rank ordering should maximize intra-node communication. There is a session about this at the beginning of the afternoon.
    - **Reply**: I would be very surprised if round-robin rank ordering were beneficial in this case. I tried to run a job with it, but it failed with `srun: error: task 256 launch failed: Error configuring interconnect` and similar lines for each task. The job script looks as follows:
        ```
        module load LUMI/22.12 partition/C
        module load cpeCray/22.12
        module load cray-hdf5-parallel/1.12.2.1
        module load cray-fftw/3.3.10.3
        export MPICH_RANK_REORDER_METHOD=0
        srun ${executable}
        ```
        **Update**: A second run of the same job ran correctly. The round-robin ordering is about 10% slower than SMP. `pat_report` still claims that round-robin should be the best rank order.
3. ...

## Advanced Performance Analysis

4. Sorry, in Zoom the voice frequently disappears. Could you please speak directly into the microphone? Thank you, now it's much better!
    - It was a problem with the microphone setting in Zoom after the break, and the audio bar in Zoom still suggested decent audio quality, so it was not noticed here. There is unfortunately a bit too much to monitor; we can't have a dedicated "broadcast engineer". So thanks for the remark.
5. Do I understand correctly that Perftools can actually point me to / suggest the code that will improve or benefit from GPUs?
    - Not quite. Performance analysis is a prerequisite for any optimization work. If the code spends a lot of time in MPI or I/O, then concentrate on that. If you can identify areas of the code where computation is significant, then think about taking those to the GPU. /thanks, got it from the discussion/
6. I get `MPIDI_CRAY_init: GPU_SUPPORT_ENABLED is requested, but GTL library is not linked` while running the Python example (perftools-python).
    - Are you compiling without the compiler wrappers? There is an extra library that needs to be linked otherwise.
    - No compilation is involved, as I run a Python script. It is odd that something "compiler"-related comes up.
    - Are you using mpi4py from cray-python?
        - ```time srun -n 4 pat_run `which python` heat-p2p.py```, ah, yes, in the imports: `from mpi4py import MPI`
    - Are you online (remote) or in the room? **online**
    - For GPU applications built without the wrappers you need the libraries from `${PE_MPICH_GTL_DIR_amd_gfx90a}` and `${PE_MPICH_GTL_LIBS_amd_gfx90a}`. I need to get Alfio to look at that GTL library. (Alfio is looking but has a network issue at the moment.)
    - (Alfio) By any chance, do you have `MPICH_GPU_SUPPORT_ENABLED` set? **no idea, will check** ... **yes**. Should I unset it?
        - Yes, this variable is for GPU-to-GPU (G2G) MPI. There is a way to preload the library in Python, if needed (see the sketch below).
        - Works, thanks!
    - The issue here is that this environment variable tells MPI that you want to do GPU-to-GPU communication, avoiding the CPU, and for that it needs this extra library. As Alfio notes, this requires a special setup in Python to get this library. Glad this fixed it. We will talk a little more about Python in a later session.
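To make the two options from question 6 concrete, here is a hedged sketch. The library name (`libmpi_gtl_hsa.so`) and the assumption that `PE_MPICH_GTL_DIR_amd_gfx90a` holds a `-L<dir>` flag are what we would expect on LUMI-G, but verify them in your own environment before relying on this.

```bash
# Option 1: this mpi4py run does not need GPU-aware MPI at all,
# so simply make sure the environment variable is not set.
unset MPICH_GPU_SUPPORT_ENABLED
srun -n 4 python heat-p2p.py

# Option 2 (sketch): GPU-to-GPU MPI is wanted, but python was not linked with
# the Cray wrappers, so preload the GTL library for Cray MPICH instead.
# PE_MPICH_GTL_DIR_amd_gfx90a usually contains "-L<dir>"; strip the leading -L.
GTL_DIR=${PE_MPICH_GTL_DIR_amd_gfx90a#-L}
export MPICH_GPU_SUPPORT_ENABLED=1
LD_PRELOAD=${GTL_DIR}/libmpi_gtl_hsa.so srun -n 4 python heat-p2p.py
```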
Comment: Hi! I would like to emphasize that Python is a rapidly developing language, which warrants fast version changes. As the language evolves, it introduces new features that users may want to use for their benefit. It also introduces backward incompatibility, as always. I see it as important that users have a choice of versions already available as modules (userspace Python is a possibility, but a rather ugly one). The idea applies not only to the training but to LUMI in general.

**Answer to the comment:** As long as Python does not have decent version management and decent package management, what you ask is simply impossible. The Python community has turned Python into a complete mess. Breaking compatibility with older code every 18 months is just a crazy idea; it turns Python into an unstable platform. So users of Python should be prepared to deal with an unstable platform that cannot be properly supported. Or just look at the **extremely** poor code management in important projects such as NumPy. If you're looking for an example for a computer science course about how to make something unsupportable, NumPy is your perfect example. You don't realise how much time people who work on software installation tools lose with each version trying to get that package to work properly on new platforms. In that light it is not surprising that the version HPE Cray can provide to us is a bit behind the leading edge. Maybe the Python community should learn how to manage a project in an enterprise-quality way if they want enterprise-quality support for their tools.
By the way, I don't know if we mean the same thing by "user space software installation", but since on an HPC cluster the amount of software that can be installed in the system image is very limited, almost all software is installed in "user space", so an application that cannot be properly installed in "user space" is not suited for an HPC cluster. E.g., potential compatibility problems with new system software are not the only reason why we don't keep old versions of the PE on the system.
Pure Python is also **extremely** inefficient, and all efforts to make a proper JIT platform for Python have so far failed; they all work only in a few cases. Now that we are in an era where transistors no longer become cheaper, so that it is no longer possible to get more performance in the next machine by using more transistors without raising budgets considerably, it is actually becoming important to look at better languages that can run efficiently.
(Harvey) I think this is more a discussion for the pub.
(Philipp) I agree. I am old enough to have witnessed the 2.95 to 3.x transition in GCC, which makes me softer in these matters. Nevertheless, there is no right answer, indeed.
(Kurt) The 3.x to 4.x GCC transition wasn't exactly painless either. I remember using both in parallel until 4.4 or so.

## Understanding Cray MPI on Slingshot, rank reordering and MPMD launch

::: info
Currently heterogeneous jobs are broken in Slurm on LUMI; we are waiting for a fix but have no news yet.
- More specifically, this applies to heterogeneous applications that need to use both CPU and GPU nodes, for which we have no workaround. The `lumi-CPEtools` module mentioned yesterday also has support for displaying how heterogeneous jobs are distributed across nodes and cores (but not yet with GPU support).
:::

7. This was indeed perhaps the most useful talk for me :)!

## AMD ROCgdb debugger

::: info
The exercises are here: https://hackmd.io/@gmarkoma/lumi_training_ee#Debugging
:::

## Introduction to Rocprof Profiling Tool

::: info
The exercises are here: https://hackmd.io/@gmarkoma/lumi_training_ee#Rocprof
:::
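As a quick starting point for the rocprof exercises, a typical invocation looks roughly like the sketch below; `./my_hip_app` is a placeholder for your own HIP binary, and the output file names are the tool's defaults as we understand them.

```bash
# Per-kernel statistics (hotspot list); results are written to results.csv
# and results.stats.csv in the working directory.
srun -n 1 rocprof --stats ./my_hip_app

# HIP API + kernel trace; the resulting results.json can be opened in a
# Chrome-style trace viewer (chrome://tracing or Perfetto).
srun -n 1 rocprof --hip-trace ./my_hip_app
```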
9. General question (can you please answer here, not in Zoom): Does LUMI have any RocketChat or Slack channel for quick communication about the current state of the system and for technical questions? I am from the Swedish EuroCC.
    - **Answer**: We prefer communication to go via tickets. It is otherwise impossible to keep track of all the discussions, and our experience from phases where we have used RocketChat is that the chat is often abused by people trying to get ahead of the queue, so we are not really considering extending the use of chat further. Also, for us, what is in the chat is lost: RocketChat is a very bad tool to search in, and there is a policy to remove discussions after 90 days (as is also the case in a free Slack), so we have no way to look back at what we wrote before in case we get another question that we remember we already answered once. We also have the [LUMI Status Page](https://www.lumi-supercomputer.eu/lumi-service-status/) where we post status updates.
        - Ok, thanks