# Moving your AI training jobs to LUMI Workshop

04.--05.02.2025 9:00--16:30 (EET), 8:00--15:30 (CET)
Espoo & online

:::info
Please ask your questions at [the bottom of this document](#EOF) <-- click here
:::

---

[TOC]

## General Information

- Link to this document: [https://md.sigma2.no/lumi-ai-workshop-feb25](https://md.sigma2.no/lumi-ai-workshop-feb25?both)
- **Zoom link: https://cscfi.zoom.us/j/64948027353?pwd=nJ71W4e6TgT5GvewIpnfWf8uKgILp4.1**
- On-site in Espoo:
  Life Science Center Keilaniemi
  Keilaranta 14
  02150 Espoo
  https://csc.fi/en/about-us/contact-information/#espoo-directions-and-accessibility

## Schedule (all times in EET)

You can find the schedule here: https://lumi-supercomputer.github.io/LUMI-training-materials/ai-20250204/schedule/

## Events

### Next public HPC coffee break

**26.02.25, 13:00--13:45 (CET), 14:00--14:45 (EET)**

Meet the LUMI user support team, discuss problems, give feedback or suggestions on how to improve services, and get advice for your projects. (Almost) every last Wednesday of the month.

[Join via Zoom](https://cscfi.zoom.us/j/68857034104?pwd=UE9xV0FmemQ2QjZiQVFrbEpSSnVBQT09)

## Slides, exercises & recordings

The training material, including recordings, will be published in the [LUMI Training Materials archive](https://lumi-supercomputer.github.io/LUMI-training-materials/ai-20250204/) (shortcut). These pages also contain pointers to the recordings and a permanent place where the slides will be stored, also after the training project ends. The shortcut [lumi-supercomputer.github.io/AI-latest](https://lumi-supercomputer.github.io/AI-latest) will always refer to the latest AI training with complete materials.

The exercises can be found in this GitHub repository: https://github.com/Lumi-supercomputer/Getting_Started_with_AI_workshop

The LUMI AI Guide can be found here: https://github.com/Lumi-supercomputer/LUMI-AI-Guide

There are two Slurm reservations for the course, one for each day:

- First day: `AI_workshop_1` (on the small-g Slurm partition)
- Second day: `AI_workshop_2` (on the standard-g Slurm partition)

# Q&A of day 1

The Q&A of day 1 can be found here: https://md.sigma2.no/lumi-ai-workshop-feb25-archive#

## General Q&A

37. Prettier way to activate conda? Right now I am using `lumi-pytorch-rocm-6.2.3-python-3.12-pytorch-v2.5.1.sif`. I need to activate conda to have access to the PyTorch modules, so the current hack is to put this in my .sh file:

    ```
    srun singularity exec $CONTAINER \
        bash -c "\$WITH_CONDA; python3 $PYTHON_FILE --max-samples=256"
    ```

    Is there a better/prettier way to activate conda so I don't have to put all arguments inside the quotes and the bash call?

    - There is a trick and we use it in the EasyBuild modules that we provide for individual containers. The conda activate script just sets a number of environment variables, so you can set those from outside the container. The one tricky part, which is not really needed, is adapting the prompt to show that you're in the virtual environment. It has been a continuous discussion in LUST whether we should expose the container as much as possible as a container, which is the approach behind the `singularity-AI-bindings` module combined with the manual `$WITH_CONDA`, or try to hide as much of the complexity as we can, which is what we do with the individual modules for each container.
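      As an illustration, a minimal sketch of that trick (the environment path and name below are assumptions: check what `$WITH_CONDA` expands to in your container with `singularity exec $CONTAINER bash -c 'echo $WITH_CONDA'`):

      ```
      # Hypothetical path and name: adjust to whatever $WITH_CONDA reports for your container.
      CONDA_ENV_PREFIX=/opt/miniconda3/envs/pytorch
      # SINGULARITYENV_* variables are injected into the container's environment by singularity.
      export SINGULARITYENV_CONDA_PREFIX=$CONDA_ENV_PREFIX
      export SINGULARITYENV_CONDA_DEFAULT_ENV=pytorch
      export SINGULARITYENV_PREPEND_PATH=$CONDA_ENV_PREFIX/bin

      # No quotes and no extra bash call needed anymore:
      srun singularity exec $CONTAINER python3 $PYTHON_FILE --max-samples=256
      ```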
      In fact, what `$WITH_CONDA` does is not the same for all containers (if only because the name of the conda environment is not always the same), so that cannot be done in a module as generic as `singularity-AI-bindings`. Those EasyBuild modules are not perfect and not all modules implement all functionality, basically because our time is also finite and we don't want to invest time in things that we don't know will be used. But we're always open to further develop them on request. With some of the newer PyTorch EasyBuild containers, your command would become

      ```
      srun singularity exec $CONTAINER \
          python3 $PYTHON_FILE --max-samples=256
      ```

## Icebreaker question of day 2

To how many GPUs or nodes do you want to scale?

- 256+ GPUs, since pretraining LLMs at scale usually sees a significant drop-off in efficiency after this magic number on European HPC centers
- As many as needed (+2)
- About 4-8 GPUs
- 8 GPUs, a single node is enough for me
- 2-4 nodes x 8 GPUs for pretraining a vision foundation model

# Q&A of day 2

## Scaling AI training to multiple GPUs

38. If I ask for one complete node and use "ddp", what is the maximum num_workers I should use? Should it still be num_workers == cpus-per-task=7?

    - I assume you mean the number of workers for the dataloaders. These are "orthogonal" to the processes for the GPU. For each GPU, you will have one "main" process on the CPU, and torchrun will take care of running these for you. Each of those will then spawn the number of dataloader worker processes that you set up. These just handle preprocessing of data to be ready to be sent to the GPU. Since the throughput of data at each GPU shouldn't change (much) compared to running with a single GPU, if you needed 7 dataloader processes (num_workers), it makes sense to keep it the same when running with multiple GPUs. However, in general, if you really want to find out just how many you need, you should try different values and see how that affects performance, or profile your run. - Lukas

39. Is there a 'cpus-per-gpu' variable?

    - There is, but as I recall there was some issue with that in the SLURM setup on LUMI which results in the GPUs not being able to communicate with each other when using `cpus-per-gpu`. So you don't want to use that. - Lukas
    - This communication problem is actually explained in our [regular introductory courses](https://lumi-supercomputer.github.io/intro-latest) in - I think - the Slurm or the binding presentation. The version of Slurm that we have on LUMI tries to bind GPUs to tasks via a mechanism known as "control groups". Unfortunately, these control groups break the more efficient ways of communication between GPUs. There are several ways around it, but the way presented in the talk may be the better choice.
    - In general, Slurm has a lot of parameters that don't always work completely the way you expect. When reading the manual pages of `sbatch` and `srun`, really each word is important. Another common cause of the system refusing jobs is actually overspecifying resources and in that way introducing conflicts, e.g., several parameters that try to bind CPUs and/or GPUs to tasks.

40. Hi, which container has the lightning module? Lightning handles this DDP automatically, I guess. Can I install this in existing containers?
    - You can check what is in a container with `pip list`, e.g.,

      ```
      singularity exec lumi-pytorch-rocm-5.7.3-python-3.12-pytorch-v2.2.2.sif bash -c '$WITH_CONDA; pip list'
      ```

      or even

      ```
      singularity exec lumi-pytorch-rocm-5.7.3-python-3.12-pytorch-v2.2.2.sif bash -c '$WITH_CONDA; pip list' | grep lightning
      ```

      And a nice one-liner:

      ```
      for file in $(/bin/ls /appl/local/containers/sif-images/lumi-pytorch-rocm-*); do echo -e "\n$file\n" ; singularity exec $file bash -c '$WITH_CONDA; pip list' | grep lightning ; done
      ```

      that will show you that almost all containers have it.

    - And you can use the mechanism from the last talk of yesterday (the virtual environment) to add the package if it is missing in your otherwise favourite container.

41. What could cause this: using DDP and multiple GPUs, increasing batch_size slows down the process?

    - This should be expected, as the batch size represents the amount of data to process at each iteration. Increasing the batch size results in more data to process, thus more time to finish one iteration.
    - But with multiple GPUs there likely is a sweet spot for batch size. If the batch is too small, then each GPU will spend most of the time waiting on communication with other GPUs, so then you want to increase it a bit and can process more data in the same time. But if the batch (per device) is too large, then you would indeed expect to see that each iteration takes longer.

42. How to set module paths in Jupyter notebooks? E.g. to quickly test something with the packages installed in a container from a Jupyter notebook.

    - I guess you cannot use Jupyter in the LUMI web UI with a custom container, unless you make some kind of wrapper script. If you have a virtual env or some Python packages installed on disk, you can probably make them visible to Jupyter by setting the right environment variables using the Advanced -> Custom init: Text option when launching Jupyter.

43. Follow-up question to 'How can we make sure whether the parallel jobs accumulate the gradients like they should?': the model might still converge while effectively working on a single GPU, so I'm not sure checking convergence is the right way. Any more ideas on this?

    - Another way is to keep an eye on the number of items being processed over time. It's surprisingly tricky to be 100% sure, though, without going deeper into the code, as so much is hidden in the modern frameworks.

### Exercises

:::info
Find the exercises of this lesson here: https://github.com/Lumi-supercomputer/Getting_Started_with_AI_workshop/tree/main/08_Scaling_to_multiple_GPUs

Check the folder `reference_solution` if you need help, or describe your problem below.

The slides can be found here: https://lumi-supercomputer.github.io/LUMI-training-materials/ai-20250204/index.html#course-materials
:::

44. In a Slurm script, what would be the difference between 'torchrun script.py' and 'srun python script.py' (in terms of performance)?

    - Even though the behavior may vary slightly from case to case, in all the tests I've run there was no appreciable difference in terms of performance.
    - There is an important difference when not using torchrun: srun will take care of running `ntasks-per-node` many processes of the command that follows. So if you drop `srun` in that line, it will run only one `python ..` process, even if you set the `#SBATCH` number of tasks to a higher number.

45. For exercise 8, I checked the GPU utilization with `watch rocm-smi` and GPU% was 100% for every GPU except GPU 0. Also, PwrCap was 500W for only 4 of the 8 GPUs.
    I would like to check if I understood correctly: for this case, shouldn't it be 500W for all of them, since I was using all of them? Does this mean that it is not using all the GPUs properly? It was running for 5 minutes only.

    - The powercap part of the question: PwrCap is a parameter that is set by the sysadmins. It may not be exactly the same on all GPUs for performance reasons. Some GPUs can run faster at a given power than others; that is normal variability in semiconductor production. Sysadmins did experiment a bit with powercaps to get an equal speed from all GPUs (important to ease load distribution) while not exceeding power limits per rack, so you may have bumped into some better ones that require less power for a given speed. The power usage parameter is the one to look at, but even that doesn't really tell you if you are using the resource efficiently. It only tells you that you are using it to some degree, but not how good that use is. E.g., you may be spending a lot of power on moving data rather than on doing computations with the data, and that is something only advanced profiling can show you (with a tool like Omniperf).
    - I see. So, to check the quality of the GPU usage, do you recommend performing advanced profiling?

46. Does 'TOKENIZERS_PARALLELISM=false' mean that the user has to handle the parallelism themselves explicitly? If this is the case, could you point us to the Python lines in the exercise where this is done? Thanks!

    - The transformers package sometimes uses tokenizers written in some other language than Python (e.g. C). These are typically (much) faster than the equivalent Python implementation, and some of these do additional parallelism on top of the number of dataloader processes we instruct the script to use. Since we already use multiple processes for data loading, we set `TOKENIZERS_PARALLELISM=false`. - Lukas

## Extreme scale AI

47. Is there a difference between `HIP_VISIBLE_DEVICES` and `ROCR_VISIBLE_DEVICES`? Are they aliases?

    - `ROCR_VISIBLE_DEVICES` works at a lower level in the software stack. It applies to all applications using the ROCm stack. `HIP_VISIBLE_DEVICES` only works for applications using HIP under the hood. See also [this page in the ROCm documentation](https://rocm.docs.amd.com/en/latest/conceptual/gpu-isolation.html).

48. HIP is the CUDA-like abstraction layer, and ROCR is the low-level GPU kernel layer? Is HIP based on ROCR? How is ROCR related to ROCm?

    - ROCR stands for ROCm Runtime. HIP is indeed AMD's alternative to CUDA, but ROCm is not only low-level. The runtime is, but ROCm also contains a lot of libraries with similar functionality as the CUDA libraries, and many of those libraries also have HIP alternatives; those are then only translation libraries that translate CUDA-like library calls to either the ROCm libraries when compiling for AMD, or the CUDA libraries when compiling for NVIDIA. For a more complete overview, see the ["Introduction to the AMD ROCm Ecosystem" talk in the advanced training](https://lumi-supercomputer.github.io/LUMI-training-materials/4day-20241028/extra_2_05_Introduction_to_AMD_ROCm_Ecosystem/).

49. If building my own container, do I have to match the ROCm installation of LUMI (6.0.2) to a specific AWS-CXI plugin version?

    - These are developed independently from one another with their own update schedules, so not as far as I know. AWS in the name actually stands for Amazon Web Services, as the plugin was developed to run on AWS's own ethernet implementation, which also uses libfabric.

50.
    I tried adding aws-ofi-rccl to my own (sglang) container, but in the end got the error "nid005018:47819:47913 [0] create_nccl_ofi_component:817 NCCL WARN NET/OFI Couldn't open a fabric access domain. RC: -38, ERROR: Function not implemented" for my troubles. Any idea what caused this?

    - You might need to bind some additional LUMI system libraries for this to work properly, i.e., the ones set in the `singularity-AI-bindings` module that was shown yesterday. As was mentioned yesterday, building the containers from scratch is not trivial, since we cannot make all of the needed libraries available due to license restrictions. So it's usually best to start from the ROCm containers provided by AMD/LUST in `/appl/local/containers/sif-images/` and then use the additional binds, so that you know the basic setup is correct. - Lukas
    - I added libs from outside singularity until ldd was happy (except for the warning `./librccl-net.so: /lib/x86_64-linux-gnu/libnl-3.so.200: no version information available (required by /lib/x86_64-linux-gnu/libcxi.so.1)`).
    - As for basing on the LUMI images, I also have a workflow for that, but the problem is now that e.g. vLLM and sglang both do so many tweaks and things in their own Docker builds that matching them on top of a LUMI image starts to be insanely complex. It would be really much nicer if it could be made easier to inject the required additional LUMI stuff on top of these images.
    - For vLLM in particular there are pre-built containers for LUMI.
    - Yes, but they have too old vLLM versions ;). I did get a new vLLM built on top of the LUMI container, but it was a pain. And for sglang, there's even more complexity.
    - At least with the CSC PyTorch you should be able to add a newer vLLM like this (e.g. with a venv): https://github.com/CSCfi/singularity-recipes/blob/main/pytorch/pytorch_2.5_lumi.def#L144
    - The CSC PyTorch on LUMI BTW doesn't seem to have aws-ofi-rccl?
    - Yes, it does have aws-ofi-rccl, but if you don't use the module load, it might not set the right bind paths to get libcxi etc. from the host.
    - Oh. Where is it in the container? Couldn't find it when I was trying. Ah, okay, found it in `/usr/local/lib/librccl-net.so`.
    - The CSC PyTorch is built with the definition file linked above, so you can check from there.
    - Anyway, the core point is that a) stuff like vLLM and sglang move forward so fast that using versions from even a month ago is both limiting and inefficient, and b) trying to reproduce their builds on top of a custom container is hard, particularly because the Dockerfile.rocms for both are starting to include all kinds of important tweaks in complex hierarchies, e.g. https://github.com/vllm-project/vllm/blob/main/Dockerfile.rocm -> https://github.com/sgl-project/sglang/blob/main/docker/Dockerfile.rocm . So, an option to flip the tables, and a workflow/instructions for instead injecting the LUMI special requirements on top of the base ROCm images, would be extremely welcome.
    - I would assume you can also add aws-ofi-rccl on top of one of those containers (plus bind libcxi), although I haven't tried that myself.
    - Yeah, well, I tried that on top of the sglang image and got the error in the initial question :D. So, someone from LUST/CSC trying this on various images and writing instructions + observations on what needs to be taken into account for this to work would be awesome.
    - What you ask is simply not possible. Containers are not portable enough.
      If the developers of the container don't also take LUMI requirements into account when building their containers, there is no way to inject them afterwards. Contrary to popular belief, containers are not a miracle solution for portability. They rely on hardware properties, OS kernel, driver versions, etc., and if those are different, you may not be able to run the container on your system. There is not even a universal way to inject other libraries into a container, as what you need to inject depends on what is already in the container, and the library versions needed by the things you inject may conflict with the versions that are already in the container. If you want that ease of use, your only solution is to request compute time on a system that is as similar as possible to the one that the developers of those containers use.

    - Partially yes, but what I envision is some kind of accounting of precisely the boundaries here. I'm not looking to run a ready-made container off the bat on LUMI, but for more rounded info on what exactly needs to be tuned for LUMI, so that I could better know how I need to modify them. For example, the CSC container seems to build off of rocm/dev-ubuntu-22.04:6.2.4-complete, which is very similar to the sglang ROCm container I was trying to get working (and unlike the LUMI host system). And okay, aws-ofi-rccl seems to be working on the CSC version and not on my current sglang attempt, so clearly there are intricacies here that can't be fully explored, but I'd basically be happy with even just some "implementation notes" on issues the CSC and/or LUST people encountered when building your containers, from which starting point triaging the issues would be easier. Like, there are recipes now for mounting some required libs into singularity from outside, but I haven't seen a clear explanation anywhere on why exactly this is necessary. I assume libcxi is the key piece of proprietary code that can't be packaged/built in the container? But then, next step, what are e.g. its boundaries in terms of the versions of libraries it requires to work, and could they be mapped somehow? Could libcxi be linked statically so that it would not depend on further .so files? Is libfabric also something that strictly needs to be mounted from the outside/version-matched? Etc...
    - Nobody knows all those boundaries, not even the developers of those libraries, as they have only been tested on the OS versions that are relevant to them. You have to understand that Linux is not compatible with Linux due to too many different distributions. So the approach within LUST is to stick to a build as close as possible to the system, which is why we tend to start from a SUSE distribution equivalent to the one on LUMI. So at the moment, we'd use, e.g., OpenSUSE 15 SP5. Testing the things that you would like to be tested is really a combinatorial problem, as there are so many libraries involved...
    - Completely understand. Would still appreciate even more "notes-style" writings attached to the EasyBuild scripts etc. outlining the reasons for everything. Much of this is already there, but not the "non-end-user"-facing parts, which are interesting for people who'd like to not only use what is there, but to build off it.

51. Is `export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3` equivalent to `export NCCL_SOCKET_IFNAME=hsn`?

    - Yes, I believe so, assuming it works as described in the NVIDIA docs: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-socket-ifname .
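      One way to check which interfaces are actually picked up at runtime, whichever form you set, is to enable debug output (a sketch; the exact log lines vary between RCCL/NCCL versions):

      ```
      export NCCL_SOCKET_IFNAME=hsn   # or the explicit hsn0,hsn1,hsn2,hsn3 form
      export NCCL_DEBUG=INFO          # RCCL/NCCL then logs its network setup at init time
      # Run your job as usual and look for the interface selection in the output, e.g.:
      #   srun <your usual command> 2>&1 | grep -iE 'NET|Using'
      ```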
    - Unfortunately, I cannot find a proper explanation in the ROCm documentation, which only suggests the first form.
    - The CSC PyTorch module on LUMI uses the latter form and it appears to work correctly :-)

52. If I make my own container, is there some documentation on how to set up RCCL correctly? In my context, I will not use PyTorch (using Julia...), but I plan to use the generic ROCm containers from LUMI (e.g. `/appl/local/containers/sif-images/lumi-rocm-rocm-6.2.2.sif`).

    - If you have ROCm installed in your container, RCCL will be installed with it. These days, the latest PyTorch wheel files also come with RCCL. So typically, you don't need to do anything to set up RCCL, as the right defaults are already used. If you install PyTorch yourself from the official drops in https://download.pytorch.org/whl/torch/, make sure you use a version that matches the ROCm version in the container. If you use other frameworks, you can leverage the RCCL that comes with ROCm.
    - For performance, you then have to have the CXI plugin. This is a runtime dependency, so it only needs to be in your LD_LIBRARY_PATH or installed in a folder where your container's loader looks for dynamic libraries.
    - If you use the LUMI-provided container, the plugin is already installed for you. No special setup is needed.
    - `export NCCL_DEBUG=INFO` is your friend to check that the plugin is being picked up.

53. I would like to better understand the difference between RCCL (or NCCL) and the Slingshot network. Is the first related to communication between the GPUs on a node and the other to communication between the nodes?

    - Slingshot is the name for a piece of hardware. Like every piece of hardware, it comes with a driver and some user-level software components, e.g., the CXI provider which connects the driver to the standard libfabric library. RCCL/NCCL is a communication library that works at a slightly higher level in the communication stack. By default, RCCL cannot talk to libfabric, but only to UCX (yet another library, not supported on the Slingshot hardware and driver) or fall back to regular TCP/IP sockets, so it needs a plug-in to let it talk to libfabric, and that is where the AWS plugin comes into the story. RCCL is part of the ROCm software stack and mimics NCCL, which is a library in the NVIDIA CUDA stack. We need RCCL to connect to libfabric. libfabric is an open-source network library used by a lot of network vendors. libfabric then uses the CXI provider to talk to the Slingshot driver and hardware. CXI is proprietary, but is currently being open-sourced by HPE; there are still problems with the open-source build (and it also seems to be for a newer version of the network driver than we have). The C in CXI actually stands for Cassini, which was the code name for the network interface card during development.

54. For large/extreme scaling of models which require model/tensor/pipeline parallelism, what is the best way to check scaling efficiency (weak vs. strong?)

    - Strong scaling usually means your local batch size decreases proportionally with the number of GPUs, whereas weak scaling keeps the local batch size and has more data processed in a given time window. So the weak vs. strong setup has to be defined by the user according to the needs of their model. Then, to check the behaviour, I recommend using the PyTorch profiler infrastructure, which leverages the rocprof libraries under the hood - check the monitoring slides of lesson 4. The resulting file can then be loaded in https://ui.perfetto.dev. You can see things like the image below.
      The purple pieces with nccl in the name are communication, so you can see exactly the proportion of the time spent in communication. You can also do other things like zooming in and seeing if there are gaps with no activity, meaning that you might be bound by the CPU or even I/O. There are also ways to show summaries of the information.

      ![](https://md.sigma2.no/uploads/bd292583-0083-4ffa-8095-b88bee189b6c.png)

55. In the PyTorch exercise, how to enable compiling PyTorch during runtime? Does it require the MIOpen lib?

    - Can you clarify what you mean (or hope to achieve) with compiling PyTorch?
    - If you mean just-in-time compilation, that might happen at different levels. MIOpen uses that (and should be available for you when you install ROCm), but you can also see it through the Triton module and others.
    - I think it should work out of the box usually? It doesn't work for you?

56. I did not quite understand the `--cpu-bind=mask_cpu:0xfe000000000000,...`. How does the option change when we change the number of CPUs/GPUs?

    - The idea behind the setup of a mask is to guide SLURM to use a given set of CPUs for increasing ranks. As stated in the talk, frameworks make assumptions on which GPU to use for a local rank, so typically rank 0 uses GPU 0, rank 1 uses GPU 1, etc. Extrapolating beyond the node, you will get rank N using GPU N % Number_Of_GPUs_Per_Node. So, this is the assumption we start from. Now, we need to assign CPUs properly to match this assumption. For this we can use the reference image below. There we see that GPU 0 is connected to CPUs 48 to 55. We also know that LUMI reserves the first CPU of each L3 domain, so we are left with CPUs 49 to 55. SLURM can take this information as a bit mask, and that is what goes in the `--cpu-bind=mask_cpu:` option. In binary, the mask for our CPUs 49-55 would be `11111110000...0000` (with 49 zeros). We then use the hexadecimal representation of this bit mask in the option. We repeat this for every GPU, hence we will have 8 masks. SLURM uses this mask list in a round-robin fashion, so when you get to the second node (rank 8 onwards), it applies the mask list from the start.

      ![](https://md.sigma2.no/uploads/0041d40d-e859-431e-94d4-54740decf6ce.png)

57. Should we set up the cpu-bind mask if we are using fewer than 8 GPUs?

    - You cannot do proper binding if you are not on an exclusive node, as you don't know which cores and which GPUs you will get. Unfortunately, not doing proper binding also comes with a performance penalty. Ideally you would "right-size" the computer you use to the problem you're trying to solve, and LUMI nodes are just too big for users with smaller problems to be fully efficient. This talk tries to compress material that takes a lot more time in our introductory course, where all the details are explained... E.g., it is an important part of the ["Process and Thread Binding" talk in the introductory course](https://lumi-supercomputer.github.io/LUMI-training-materials/2day-20241210/M08-Binding/).

58. Can you provide some information on how to call the [wrapper script](https://github.com/Lumi-supercomputer/Getting_Started_with_AI_workshop/tree/main/09_Extreme_scale_AI/reference_solution) for the LLM example?
    - That is exactly what is shown in the exercises (https://github.com/Lumi-supercomputer/Getting_Started_with_AI_workshop/tree/main/09_Extreme_scale_AI):
    - Yes, you simply put the script before the actual command you want to execute:

      ```
      MASTER_ADDR=$(scontrol show hostname "$SLURM_NODELIST" | head -n1) \
      srun -N1 -n8 --gpus 8 \
          --cpu-bind=mask_cpu=0x00fe000000000000,0xfe00000000000000,0x0000000000fe0000,0x00000000fe000000,0x00000000000000fe,0x000000000000fe00,0x000000fe00000000,0x0000fe0000000000 \
          singularity exec \
              -B .:/workdir \
              /appl/local/containers/sif-images/lumi-pytorch-rocm-6.1.3-python-3.12-pytorch-v2.4.1.sif \
              /workdir/run.sh \
                  python -u /workdir/GPT-neo-IMDB-finetuning-mp.py \
                      --model-name gpt-imdb-model \
                      --output-path /workdir/train-output \
                      --logging-path /workdir/train-logging \
                      --num-workers 7
      ```

## Loading data on LUMI

59. Why is the LUMI-O access key only valid for 72 hours (if I recall correctly)? This is a strong cutoff that limits its use to manual operations, making automation impractical.

    - It is 168 hours, but it is limited for security reasons. If someone gets hold of your key, they can do anything with your storage until you revoke the key. You can extend the lifetime of a key, but this is indeed a manual process, as this is when you need to authenticate and prove that you are a valid LUMI user.
    - Have you considered bucket-level keys like with MinIO?
    - We are not responsible for LUMI-O, CSC is, so nobody here can answer that. However, given that the budget to manage the system and to support users is very low, I'm pretty sure they chose something that they can run with little administration work and not too much support work. E.g., it is a more limited solution than the object storage solution they have for Finnish users. LUMI-O is not meant to be a complete object storage solution for everybody with all possible bells and whistles, but really something that is mostly meant as a system for getting data in and out of LUMI, as some of the cloud storage tools perform better on high-latency, long-distance network connections than sftp or rsync over ssh.
    - That being said, I did once suggest extending the key lifetime to 192 hours so that you could book a fixed time in your agenda once a week to extend all the keys.

:::info
Don't be concerned about files/objects disappearing when a key expires. This does not happen. In fact, objects are not linked to the key you used, only to your project, and buckets and objects stay until 90 days after the end of your project.
:::

60. Is the stripe count specific to LUMI? Or could, for example, striping a zipped file (yet deflated) of images with a stripe count equal to #OSTs potentially be faster than 1000+ files, assuming also a parallel FS?

    - Stripe count is something that depends on how a file is being used. If you don't have multiple parallel reads or writes in a file from different threads or processes, a higher stripe count will not help you, as you will only be talking to a single OST at a time anyway.
    - And the zip file with 1000+ images is not to save work for the object storage servers, but for the metadata server, which cannot handle all those quick accesses.
    - But assuming multiple workers accessing the same file? It does not have to be ZIP, any packed data format.
    - Not sure about the tools used to access zip files, whether they would properly deal with that. But in "traditional" HPC, HDF5 and netCDF are popular formats for such work and they support parallel access well.
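    - For reference, striping is controlled per file or per directory with the Lustre `lfs` tool; a minimal sketch (the directory path is just a placeholder):

      ```
      # Files created in this directory afterwards will be striped over 8 OSTs.
      mkdir -p /scratch/project_465001707/striped-data
      lfs setstripe -c 8 /scratch/project_465001707/striped-data
      # Inspect the striping of a directory or an existing file:
      lfs getstripe /scratch/project_465001707/striped-data
      ```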
    - HDF5 is not efficiently applicable to variable-length datasets :( like untokenized text.

61. Is /tmp on the compute nodes physically on disk or mapped in memory? I guess a 1GB file in /tmp is accounted against my memory allocation. Can you confirm?

    - This was answered yesterday. Regular compute nodes are all diskless, as disks wear out quickly and cause node failures, and on a pre-exascale machine you want your nodes to be as reliable as possible. Apart from the fact that on shared nodes it would also be problematic to clean the disk properly.
    - Something to think about: if node failures were independent and one node had a failure once every three years, then the chance that a parallel job running on 1000 nodes runs for 24 hours without a crash is less than 40%... Just to show you that you need extreme node reliability on a machine like LUMI. I'm not a great fan of Elon Musk, but there is one thing that he says that is all too true for supercomputing: the best part is no part if you want reliability.
    - And of course it is counted against your memory allocation. Otherwise you could crash other jobs on the node by using all memory.

62. In "#files<#OSTs", what exactly are OSTs? The number of CPUs/GPUs trying to access data, or something else?

    - `#OSTs` refers to the number of actual storage targets in the filesystem. For LUMI, on /project, /scratch and the user directories this is 32 (cf. https://docs.lumi-supercomputer.eu/storage/parallel-filesystems/lumip/). For /flash, there are 72. - Lukas

63. Is the GPUDirect storage technology enabled by default on LUMI-F/P/O, or how can it be enabled?

    - Sorry, but LUMI is not an NVIDIA machine, so GPUDirect does not apply to LUMI. This is something that NVIDIA needs because they lack some elements that AMD has in their hardware and software stack. GPUDirect basically is extra software to realise things that are the default on LUMI. It is basically there to transfer data over DMA or RDMA to devices (like the network) without first copying to CPU memory, and instead exploit network cards that are close to the GPU (e.g., on the same PCIe switch). AMD does not need extra software for that.
    - Now specifically for the storage: I checked some GPUDirect documentation and as far as I can see, GPUDirect Storage is not compatible with Lustre, so on your NVIDIA cluster it will not work either with data on Lustre. It is meant to create a direct path to storage via DMA or RDMA, and that is not how Lustre works. It is meant to be used with local NVMe storage, or technologies such as NVMe-over-Ethernet, and this is a type of storage that we do not have on LUMI. So the topic is irrelevant for LUMI.

## Open Q&A

64. Did anyone manage to make the hands-on exercise for 09 Extreme scale AI work? I get 'python: can't open file '/workdir/GPT-neo-IMDB-finetuning-mp.py': [Errno 2] No such file or directory' with the reference solution.

    - Where do you have your files? What is your directory structure when running the example?
    - All (run.sh, GPT-neo-IMDB-finetuning.py, utils.py) in /projappl/project_465001707/hizlicag/Getting_Started_with_AI_workshop/09_Extreme_scale_AI
    - How do you run it after allocating the node? The correct file name is GPT-neo-IMDB-finetuning.py, not GPT-neo-IMDB-finetuning-mp.py in your case. You either need to rename the file or change the run command to use the correct file.
    - With the provided command for the single-node example.
    - Oh, the file names are different! Thanks!

65.
    About the hands-on exercise for 09 Extreme scale AI: after running the code on 2 nodes, how can I monitor the GPU utilization?

    - You can add `-w <target_node>` to the srun call that runs rocm-smi to get the GPU utilization for the individual nodes.
    - I cannot follow this. Can you provide an example command please?
    - Can you check Question 67 below? That appears to be asking the same question, so hopefully the answer there helps you as well. - Lukas

66. For my machine learning model, I need to use the mpi4py library. What is the best way to install mpi4py on LUMI? I tried following the instructions from the slides, but it does not seem to be working for me.

    - Installing `mpi4py` requires a C compilation step, which appears to fail because it doesn't find the proper compilers in the containers. There is a base container available that already includes `mpi4py` in `/appl/local/containers/sif-images/lumi-mpi4py-rocm-6.2.0-python-3.12-mpi4py-3.1.6.sif`. Could you try extending that for your needs? If you use `cotainr build`, you can replace the `--system=lumi-g` flag with `--base-image=/appl/local/containers/sif-images/lumi-mpi4py-rocm-6.2.0-python-3.12-mpi4py-3.1.6.sif`. - Lukas
    - Never mind, the mpi4py container doesn't actually work with cotainr :( - Lukas

67. Can you please provide an example command to monitor the GPU utilization for multi-node runs?

    - You can do the following: check from `squeue --me` which nodes the job is running on (the node ids are listed in the last column). Then you can use the command `srun --overlap --jobid <jobid> -w <nodeid> rocm-smi` to run `rocm-smi` on any of the nodes to check utilisation. Note that this will give you only an instantaneous snapshot of the `rocm-smi` output. You can use `srun --overlap --jobid <jobid> -w <nodeid> --pty bash` to open a terminal on the node and use `watch rocm-smi` to continuously monitor the GPU usage. Unfortunately, there is no handy tool to monitor GPU utilization across all nodes at once. See also https://docs.lumi-supercomputer.eu/runjobs/scheduled-jobs/interactive/#using-srun-to-check-running-jobs. - Lukas
    - Thanks!

## Feedback

* The time for the hands-on is far too short (+1)
  * Is that in general or for some particular lectures/exercises?
  * I guess the general problem is that when the lectures go over time (which they often do), that time is subtracted from the exercise time "to keep the schedule".
  * There is time now and tomorrow afternoon to catch up with any exercise that is interesting for your work and still receive help...
  * Did you manage to catch up with the exercises during the open session at the end?
* IMHO the AMD/profiling lecture could be moved to an advanced workshop, and then there could be more time for hands-on.
  * We get many tickets about "why is my AI running only that fast compared to this and that machine", and frankly proper profiling is the only way to find out. Given the amount of compute time that AI projects use on LUMI, learning to use that power efficiently is not an advanced topic but a very basic thing that everybody who does training at some scale should know. Users should realise that the cost of many AI projects is really the price of a house, and some of the bigger projects are literally worth millions of euros, and you don't just play with that taxpayer money. So this is why this presentation is in this course, and even fairly early on in the course.
  * IMO, the AMD lecture was useful for _monitoring_.
    Everyone on an HPC system should know how to monitor the hardware for their job to maximize hardware efficiency by increasing batch size or model complexity, for example via rocm-smi or nvtop(!). Indeed, profiling is maybe more advanced and could be in an advanced workshop.
* Yes, please add an option to `cotainr` to read pip environment.txt files (+1)
* The hands-on workshop is really great, with very valuable information, in particular for preparing a project. I can only guess how much work it was to prepare all the topics in such detail. In particular, providing the reference solutions is excellent for catching up later on. https://md.sigma2.no/lumi-ai-workshop-feb25?both#First-AI-training-job-on-LUMI
* You mention that the SIF files would be deleted from time to time. It would be great if the SIF files used in the workshop could be available for some time (if possible), as this workshop contains so much information that it is likely that I will need to come back to it at a later stage.
  * The project folders on the filesystem will be available for three more months, which includes the containers used in the exercises. If you need them after that, at least the cotainr build files for recreating the container used in Exercises 3 and 8 are available in the GitHub repository (in [/bonus_material/exercise_container_recipes/](https://github.com/Lumi-supercomputer/Getting_Started_with_AI_workshop/tree/main/bonus_material/exercise_container_recipes)). - Lukas
  * The containers in `/appl/local/containers/sif-images` tend to change or disappear from time to time. We realise that even a change without a change in name can be annoying, as it could break a virtual environment that you built on top of this container. However, for those containers for which we make a matching EasyBuild module, you'll find the container in `/appl/local/containers/easybuild-sif-images` and we keep those copies for as long as we keep the EasyConfigs, which is basically until we expect them to break because the ROCm driver on the system is too new.
* Small comment: I prefer the long command-line options (e.g. `--ntasks` rather than `-n` for `srun`) in documentation, as they are easier to read and to look up.
  * Good point! We try to always use the long versions, but sometimes forget.
* Thanks for a great workshop. I'm sure this has enabled us to better utilize the project resources we have been allocated, and finding all this stuff out on our own would have taken so, so long! Thanks again!
* Thanks for a great workshop. Super happy that I joined. Some minor points that could improve my understanding:
  * A good use case for the profiling. I understand that creating the profiling logs in itself is an important skill. However, I feel that interpreting them is also a necessary skill to improve general performance. (+1+1)
  * At times, the workshop felt a bit rushed. Maybe there are too many topics to cover or too little time for hands-on, ... But then, it was hard to do something useful in the hands-on or ask questions. (+1)
  * The multi-node lecture (Extreme scale AI) was hard to follow. I think the hands-on task is also pretty hard to do. Even with the reference solution, I struggled a bit to monitor the node activity. (+1)

:::info
**Please write your feedback above this note**
:::

###### EOF