# Moving your AI training jobs to LUMI

Workshop 29.--30.5.2024, 9:00--16:30 (CEST) / 10:00--17:30 (EEST), Copenhagen & online

:::danger
__This is the Q&A document of the May 2024 workshop.__ The current one is here: https://md.sigma2.no/lumi-ai-workshop-nov24?both
:::

---

[TOC]

## General Information

- Link to this document: [https://md.sigma2.no/lumi-ai-workshop-may24](https://md.sigma2.no/lumi-ai-workshop-may24?both)
- [Schedule](#Schedule)
- **Zoom link: https://cscfi.zoom.us/j/65207108811?pwd=Mm8wZGUyNW1DQzdwL0hSY1VIMDBLQT09**
- On-site at NORDUnet's offices in Copenhagen: Kastruplundgade 22, 1st floor, DK-2770 Kastrup, Denmark

## Schedule (all times in CEST)

### **Wednesday, 29.5.2024**

<details>
<summary>Click me</summary>

- __**09:00**__ -- Welcome, introduction to the course
- __**09:15**__ -- Introduction to LUMI & LUMI-G architecture for AI training
    - How LUMI differs from other clusters
    - AMD GPUs instead of NVIDIA
    - Slingshot interconnect
- __**09:45**__ -- Using the LUMI web interface
    - Introduction to the Open OnDemand web interface
    - Using PyTorch in JupyterLab on LUMI
    - Limitations of the interactive interface
- __**10:05**__ -- Hands-on: Run a simple PyTorch example notebook
- __**10:35**__ -- **Break**
- __**10:50**__ -- Your first AI training job on LUMI
    - Using LUMI via the command line
    - Submitting and running AI training jobs using the batch system
- __**11:30**__ -- Hands-on: Run a simple single-GPU PyTorch AI training job
- __**12:05**__ -- **Lunch break**
- __**12:50**__ -- Understanding GPU activity & checking jobs
    - Checking jobs with rocm-smi & rocprof
- __**13:15**__ -- Hands-on: Checking GPU usage interactively using rocm-smi
- __**13:30**__ -- Introduction to running Singularity containers on LUMI
    - What is a container and what can it do
    - How do you use containers
    - Mounting filesystems into containers
    - Official LUMI (FakeCPE) containers
- __**13:50**__ -- Hands-on: Pull and run a container
- __**14:10**__ -- **Break**
- __**14:25**__ -- Review of the last exercise
- __**14:30**__ -- Converting your conda/pip AI environment to a container using cotainr
    - Containers from conda/pip environments
    - Recipes for PyTorch, TensorFlow, and JAX/Flax on LUMI
- __**15:05**__ -- Hands-on: Creating a conda environment file and building a container using cotainr
- __**15:20**__ -- Extending containers with virtual environments for faster testing
- __**15:45**__ -- Getting started with your own project
    - Apply what you have learned to your own code
- __**16:45**__ -- **End of the course day**

</details>

### Thursday, 30.5.2024

<details>
<summary>Click me</summary>

- __09:00__ -- Scaling AI training to multiple GPUs
    - How to run on multiple GPUs on LUMI with distributed PyTorch and higher-level libraries like Hugging Face
    - Common issues and solutions
- __09:45__ -- Hands-on: Converting the PyTorch single-GPU AI training job to use all GPUs in a single node via DDP
- __10:10__ -- Hyper-parameter tuning using Ray on LUMI
- __10:25__ -- Hands-on: Hyper-parameter tuning the PyTorch model using Ray
- __10:45 -- Break__
- __11:05__ -- Extreme scale AI
    - Model parallelism on LUMI via FSDP or DeepSpeed, moving to multiple nodes and optimizing for the network
- __11:35__ -- Demo/Hands-on: Using multiple nodes
- __12:10 -- Lunch break__
- __13:00__ -- Loading training data from Lustre and LUMI-O
    - Filesystems and object storage, best practices for AI data
- __13:20__ -- Coupling machine learning with HPC simulation
    - Coupling workflows, interoperability, frameworks and SmartSim
    - Examples
- __13:30__ -- Advancing your own project
    - Bring your own code, apply what you have learned with support from instructors
- **Breaks as needed**
- __**16:00 -- End of the course day**__

</details>

## Events

### Next public HPC coffee break

**26.06.24, 13:00--13:45 (CEST), 14:00--14:45 (EEST)**

Meet the LUMI user support team, discuss problems, give feedback or suggestions on how to improve services, and get advice for your projects. Every last Wednesday of the month.

[Join via Zoom](https://cscfi.zoom.us/j/68857034104?pwd=UE9xV0FmemQ2QjZiQVFrbEpSSnVBQT09)

### Performance analysis and optimization

**11.--12.06.24, 9:00--16:30 (CEST)**

- 2-day workshop about finding and fixing performance bottlenecks. Participants are encouraged to bring their own workflows.
- Only on-site in Oslo (Norway)
- __[Register here](https://www.lumi-supercomputer.eu/events/performance-analysis-and-optimization-workshop-2024/) until 30.5.2024__

## Slides, exercises & recordings

The training material, including recordings, is published in the [LUMI Training Materials archive](https://lumi-supercomputer.github.io/LUMI-training-materials/ai-20240529/). These pages also contain pointers to the recordings and a permanent place where the slides will be stored, also after the training project ends.

The exercises can be found in this GitHub repository: https://github.com/Lumi-supercomputer/Getting_Started_with_AI_workshop

There are two Slurm reservations for the course, one for each day:

- First day: `AI_workshop` (on the small-g Slurm partition)
- Second day: `AI_workshop_2` (on the standard-g Slurm partition)

# Q&A of day 1

https://md.sigma2.no/lumi-ai-workshop-may24-archive?view

# Q&A of day 2

- How do we log the progress of multiple GPUs in wandb? We are used to logging training progress on a single GPU. However, when we use a whole node with 8 GPUs, the standard logging creates a different log for each GPU, i.e. it looks like 8 different runs.
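
No answer to the wandb question above was recorded in this document. One common pattern (an assumption here, not part of the workshop materials) is to initialize and log to Weights & Biases only from rank 0, so that one distributed job shows up as a single run; alternatively, `wandb.init()` accepts a `group` argument to group the per-rank runs together. A minimal, runnable sketch of the rank-0-only approach (the project and run names are made up):

```python
import os

import wandb

rank = int(os.environ.get("RANK", "0"))  # set by torchrun or by a Slurm wrapper script

if rank == 0:
    # Only rank 0 talks to wandb, so an 8-GPU training job appears as one run.
    run = wandb.init(project="lumi-ai-workshop", name="gpt-imdb-ddp")

for step in range(100):
    loss = 1.0 / (step + 1)  # dummy value standing in for the real training loss
    if rank == 0:
        wandb.log({"train/loss": loss, "step": step})

if rank == 0:
    run.finish()
```
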
## Icebreaker: What kind of training data do you use, and how much and how many files does it consist of?

- 4--8 TB climate dataset
- Image data
- 1st dataset: 1.5 TB, ~11K videos; 2nd dataset: 100 GB, 31K videos (downscaled)
- NLP for low-resource languages, so not much, by definition. But model sizes are large :-)
- On the order of 2--8 GB, usually in 1 file, sometimes a few (<10) separate files
- Apache Spark HTTP server log files which will be converted (manipulated) into a corpus of ~4.7 TB of language data from the historical newspaper collection provided by the National Library of Finland
- 10k--1M chemical formulas in 1 file
- Usually several files, up to hundreds of GB each
- Speech data for pre-training acoustic models: 130k hours of audio, 17M small files stored in 1 GB binary files, total size 12 TB
- 16 GB of images
- Materials data, three files totalling less than 2 GB. Some scripts may be less than 50 MB. The model will generate some files of around 5 GB.
- ~150 GB of 2D atmospheric data. 30 years of daily data, about 65K files, but compressed to about 6 zarr files.
- 400-ish 1 GB .json files

## Scaling AI training to multiple GPUs

1. Do we get some sort of certificate for this workshop that we can provide to our home institution to obtain some kind of ECTS credits?
    - Sure, you can get a certificate; please send us a request through our ticketing system: https://lumi-supercomputer.eu/user-support/need-help/general/
2. If I reserve one GCD and 14 CPU cores, will I only be billed for one GCD?
    - No, you will be billed for 2 GCDs, since your share of requested CPU cores corresponds to 14/56 = 1/4 of the node (2 CCDs out of 8). The same principle applies if you request more than 1/8 of the memory for a single GCD. More details on the [LUMI Docs billing policy page](https://docs.lumi-supercomputer.eu/runjobs/lumi_env/billing/#gpu-billing).
    - Basically the policy is that you get billed for everything that another user cannot use in a normal way because of the way you use the machine. So if you take a larger share of a particular resource (GCDs, CPU cores or CPU memory), that will be the basis on which you are billed, as you can no longer fill up the node with other users who only ask for a fair share of each resource.
3. If I use PyTorch DistributedDataParallel on LUMI, do I still need to specify NCCL and not RCCL as the backend? (The slide says `torch.distributed.init_process_group(backend='nccl')`)
    - Yes, PyTorch uses the NVIDIA terminology independently of whether you use AMD or NVIDIA GPUs. If your PyTorch has been built against ROCm instead of CUDA, setting `torch.distributed.init_process_group(backend='nccl')` results in RCCL being used for communication (see the sketch just below this item).
    - The underlying reason is that AMD could not exactly copy CUDA because that is proprietary and protected technology. However, they could legally make a set of libraries where function calls have a different name but the same functionality. This is how HIP came to be. It mimics a large part of the CUDA functionality with functions that map one-to-one onto CUDA functions, to the extent that you can still compile your HIP code for NVIDIA GPUs by just adding a header file that converts those function names back to the CUDA ones. Similarly, several libraries in the ROCm ecosystem just mimic NVIDIA libraries, and this is why PyTorch treats them the same.
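
As a quick reference for question 3 above, a minimal DDP setup as it would be launched with `torchrun`. This is only a sketch: a stand-in `torch.nn.Linear` replaces the real model, and the training loop is omitted.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# "nccl" is the right backend name on LUMI as well: a ROCm build of PyTorch
# transparently uses RCCL behind this name.
dist.init_process_group(backend="nccl")

local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun for each process
torch.cuda.set_device(local_rank)           # the "cuda" device API maps to the ROCm/HIP devices

model = torch.nn.Linear(16, 16).to(local_rank)   # stand-in for the real model
ddp_model = DDP(model, device_ids=[local_rank])  # gradients are synchronized via RCCL
```
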
4. In checking that we use all GPUs, do we primarily check that the power oscillates around the recommended value (300--500 W), or psutil? Maybe I am mixing things up...
    - I think the first step is to check the GPU utilization, the GPU% part. It should be higher than 0, and the GPU memory allocated should also be higher than zero. If those things are OK, then also check the power, as that is a better indication that the GPU is actually doing something useful.
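
For question 4, one way to check GPU activity of a running job interactively from a login node. This is a sketch: `<jobid>` is a placeholder, and the exact `rocm-smi` flags can differ a little between ROCm versions.

```bash
# Attach an extra interactive step to a job that is already running
# (replace <jobid> with the ID shown by squeue)
srun --overlap --pty --jobid=<jobid> bash

# Inside that shell, refresh the GPU status every 2 seconds.
# Plain `rocm-smi` prints GPU%, VRAM% and power in one table; the flags below
# ask for those fields explicitly.
watch -n 2 rocm-smi --showuse --showmemuse --showpower
```
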
### Hands-on

:::info
https://github.com/Lumi-supercomputer/Getting_Started_with_AI_workshop/tree/main/08_Scaling_to_multiple_GPUs

- Focus for now on changing the batch size in the script and changing the Slurm run script
- Because we don't have that much time, ignore the cosmetic part about output printing
- Focus on the following:
    - Exercise Part 1: Adjust the per_device_batch_size in the TrainingArguments to split the full batch over the 8 devices.
        - Ignore the parts about the device selection and output filtering
    - Exercise Part 2: Adjust the Slurm script (run.sh) to launch training on 8 GPUs (single node) with torchrun as shown in the slides.
- Use the reservation:
    - `#SBATCH --reservation=AI_workshop_2`
    - `#SBATCH --partition=standard-g`
:::

### Did you manage to finish the exercise? (Put an "x")

Yes: xxxxx
No : xxxxxxxxxxx
Partially: xxxxxxxxxxxxxxxx

I'm having problems:

- Training works, but I didn't have time to ensure that all GPU and CPU bindings match
- 62%|██████▎ | 5/8 [00:01<00:01, 2.99it/s]Memory access fault by GPU node-4 (Agent handle: 0x559149496f30) on address 0xb3a8c52df000. Reason: Unknown. [2024-05-30 11:07:00,623] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 41600 closing signal SIGTERM
- Since there was not enough time it was confusing, because the task deviated too much from the instructions on GitHub.
- I get an error (which kills the run) about the perplexity metric. Probably I could suppress it by telling it to ignore NaNs, but I'm not sure why PPL would be giving NaNs at all.
  ```
  RuntimeError: No best trial found for the given metric: perplexity.
  ```

## Hyper-parameter tuning using Ray

### Hands-on

:::info
https://github.com/Lumi-supercomputer/Getting_Started_with_AI_workshop/tree/main/09_Hyper-parameter_tuning_using_Ray_on_LUMI

Don't forget to use the reservation with `#SBATCH --reservation=AI_workshop_2` (works only on `standard-g`)
:::

### Did you manage to finish the exercise? (Put an "x")

Yes: xxxxxxx
No :
Partially: xxxxx

I'm having problems:

5. I ran into a problem with some parameters:
    ```
    Traceback (most recent call last):
      File "/pfs/lustrep4/scratch/project_465001063/palltaav/Getting_Started_with_AI_workshop/09_Hyper-parameter_tuning_using_Ray_on_LUMI/GPT-neo-ray-tune.py", line 199, in <module>
        best_trial = results.get_best_result("perplexity", "min", "last")
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/ray/tune/result_grid.py", line 161, in get_best_result
        raise RuntimeError(error_msg)
    RuntimeError: No best trial found for the given metric: perplexity. This means that no trial has reported this metric, or all values reported for this metric are NaN. To not ignore NaN values, you can set the `filter_nan_and_inf` arg to False.
    srun: error: nid005124: task 0: Exited with exit code 1
    ```
    - (Lukas) It seems something might have gone wrong in the actual hyper-parameter exploration jobs; can you check your logs for earlier messages that indicate a failure during training?
        - Indeed, it's pretty extensive; it ends with something like this:
          ```
            File "/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/datasets/load.py", line 2265, in load_dataset_builder
              builder_instance: DatasetBuilder = builder_cls(
                                                 ^^^^^^^^^^^^
            File "/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/datasets/packaged_modules/cache/cache.py", line 122, in __init__
              config_name, version, hash = _find_hash_in_cache(
                                           ^^^^^^^^^^^^^^^^^^^^
            File "/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/datasets/packaged_modules/cache/cache.py", line 48, in _find_hash_in_cache
              raise ValueError(
          ValueError: Couldn't find cache for imdb for config 'default'
          Available configs in the cache: ['plain_text']
          ```
    - (Lukas): Can you confirm that your `run.sh` still properly sets the cache paths for Hugging Face:
      ```
      SCRATCH="/scratch/${SLURM_JOB_ACCOUNT}"
      FLASH="/flash/${SLURM_JOB_ACCOUNT}"

      export TORCH_HOME=$SCRATCH/torch-cache
      export HF_HOME=$FLASH/hf-cache
      mkdir -p $TORCH_HOME $HF_HOME
      ```
      If they are correct, try just re-running the job; maybe there was some hiccup earlier.
        - Yes, these lines are still there; there seems to be something wrong with the data.
    - (Lukas): I just ran this using the same cached data and it worked. Can you try again?
        - Yes, I will try: now it seems to run alright!
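
For reference, the overall Ray Tune pattern used in this exercise looks roughly like the sketch below. It is a simplified stand-in, not the workshop's `GPT-neo-ray-tune.py`: the trainable only computes a dummy "perplexity" instead of fine-tuning the model, and the resource numbers are just an example.

```python
from ray import tune


def train_gpt(config):
    # Stand-in for the real fine-tuning loop; a real trainable would train the
    # model with these hyper-parameters and measure perplexity on validation data.
    perplexity = 10.0 + 1000.0 * config["lr"] / config["batch_size"]
    return {"perplexity": perplexity}  # reported as the trial's final result


search_space = {
    "lr": tune.loguniform(1e-5, 1e-3),
    "batch_size": tune.choice([8, 16, 32]),
}

tuner = tune.Tuner(
    tune.with_resources(train_gpt, {"cpu": 7, "gpu": 1}),  # e.g. one GCD + 7 cores per trial
    param_space=search_space,
    tune_config=tune.TuneConfig(num_samples=8, metric="perplexity", mode="min"),
)
results = tuner.fit()
best_trial = results.get_best_result("perplexity", "min", "last")
print(best_trial.config)
```

The `RuntimeError: No best trial found for the given metric` seen above is what `get_best_result` raises when no trial ever reported the metric, which is why a failure earlier in the trials only surfaces at that last call.
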
6. In `run.sh` in the reference solution, memory is specified to be 0 (`#SBATCH --mem=0`). Why does this work? Why do we not need to specify e.g. 480G?
    - (Gregor) `--mem=0` is just a shorter way of saying "use all the available memory on the node". `--mem=480G` works as well.
    - (Kurt) But it is actually better to specify that you want 480G, as there is a subtle difference: asking for 480G guarantees that you will get 480G. You can get a lot less with `--mem=0` if, e.g., due to a memory leak in the system software -- which has happened -- more memory is consumed by the OS. So asking for `--mem=480G` guarantees you a node that is healthy, at least in that respect.

## Extreme scale AI

:::info
There are some reasons why the situation with the binding is not better:

- Slurm GPU binding is broken and uses a technique that breaks communication with RCCL and GPU-aware MPI.
- One change that would offer some improvement is system-wide and would have a large impact on LUMI-C also, necessitating retraining all users.
- It is not easy either, because you can only do something with a few environment variables or a pre-made script if every user would nicely request 7 cores per GPU requested. On `small-g` you now basically have a lot of fragmentation. I'm not sure if "CPU" could be redefined in Slurm to mean "1 CCD", but that might very well be the change that sysadmins told me would also be in effect on LUMI-C, where it would not be appreciated by users.
:::

:::info
LUST is basically an L1 and basic L2 help desk. Optimisation questions are really something for advanced L3 support and not something that we can regularly deal with. With the size of the team we have and all the other tasks, the average amount of time that we can spend per project on support is less than 1 day. The EuroHPC EPICURE project (led by CSC) may help a bit, for now only for EuroHPC projects, and most people working on the project are also just experts in their respective machines without specific domain knowledge of AI (or most other fields). And EuroHPC is working on establishing an AI centre of excellence that should really build up the knowledge for such questions.
:::

### Hands-on

:::info
https://github.com/Lumi-supercomputer/Getting_Started_with_AI_workshop/tree/main/10_Extreme_scale_AI
:::

#### Did you manage to finish the exercise? (Put an "x")

Yes: xxxxx
No : xxxxxxxxxxx
Partially: xxxxxxxxxxxxxxxxx

I'm having problems:

7. When executing the command
    ```
    MASTER_ADDR=$(scontrol show hostname "$SLURM_NODELIST" | head -n1) \
    srun -N1 -n8 --gpus 8 \
        --cpu-bind=mask_cpu=0x00fe000000000000,0xfe00000000000000,0x0000000000fe0000,0x00000000fe000000,0x00000000000000fe,0x000000000000fe00,0x000000fe00000000,0x0000fe0000000000 \
        singularity exec \
        -B .:/workdir \
        /scratch/project_465001063/containers/pytorch_transformers.sif \
        /workdir/run.sh \
        python -u /workdir/GPT-neo-IMDB-finetuning-mp.py \
        --model-name gpt-imdb-model \
        --output-path /workdir/train-output \
        --logging-path /workdir/train-logging \
        --num-workers 7
    ```
    what should be in the `run.sh` script? It should be different than before, I guess, since we now call the Python script directly...
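
No answer to question 7 was recorded during the session. As a hedged sketch (an assumption, not the official exercise solution): because each of the 8 ranks is now started directly by `srun` rather than by `torchrun`, such a wrapper mainly has to translate Slurm's environment variables into the ones `torch.distributed` expects and then run the command it was given:

```bash
#!/bin/bash
# Hypothetical run.sh wrapper, NOT the official reference solution.
# MASTER_ADDR is already set on the srun command line shown above.

export RANK=$SLURM_PROCID          # global rank of this task
export LOCAL_RANK=$SLURM_LOCALID   # rank within the node
export WORLD_SIZE=$SLURM_NTASKS    # total number of tasks
export MASTER_PORT=29500           # any free port, same value for all ranks

# If the Python script does not pick its own device, each task could be limited
# to "its" GCD like this (assumption, depends on the exercise code):
# export ROCR_VISIBLE_DEVICES=$SLURM_LOCALID

exec "$@"                          # runs `python -u /workdir/GPT-neo-IMDB-finetuning-mp.py ...`
```

The reference solutions mentioned elsewhere in this document should contain the version that was actually used in the course.
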
## Loading training data from Lustre and LUMI-O

:::info
Two nice things to know about LUMI-O:

- We actually use it during this course to serve you the slides and the videos, though it is not meant to be a web server.
- As the LUMI-O software is made by a different team at CSC and not by HPE, it is often still up when LUMI is down. We cannot give a guarantee, but when a long downtime is announced, in the past LUMI-O was still available for almost the whole downtime. So you may still be able to access data on LUMI-O, but not on the Lustre file systems, when LUMI is down for maintenance.

But it is not meant for long-term data archiving. Storage on LUMI-O also disappears 90 days after your project ends. For long-term archiving and data publishing you need to use specialised services.
:::

:::info
Clean-up is not yet implemented on LUMI because until now there hasn't been a need for it, as the storage is empty enough. The limited size of /project is also because CSC wants to avoid LUMI being used for long-term data storage. The idea is indeed that data is stored long-term on LUMI-O and transported to /scratch or /flash as needed, as the assumption was that the whole dataset is rarely needed at the same time.

Note that asking for more quota doesn't make sense if your project doesn't have the necessary storage billing units. Storing 20 TB for one year on /scratch or /project would cost you 175,200 TB-hours (20 TB x 8,760 hours), so make sure you have enough storage billing units. There is enough storage on LUMI that resource allocators can grant decent amounts of storage, but it is not infinite. LUST cannot grant you storage billing units; that is something you need to negotiate with the instance that granted you your project on LUMI.
:::

## Coupling machine learning with HPC simulation

8. The last part about fluid dynamics eddies was very interesting. We are already looking into possible participants for a seminar, possibly arranged by CSC. Let's make contact and discuss afterwards.
    - Great to hear. Send us a ticket and we will make something happen.

## General Q&A

# Feedback: What was the most interesting thing you learned in this course?

- Maybe more introductory lectures (practical) addressing containerization and array sbatch jobs with multiple GPUs would have been interesting, along with their corresponding hands-on exercises, while we do some end-to-end training of DL-based (mini) projects. (+3)
- Coupling machine learning with HPC simulation. It might be nice to show some examples.
- It was nice to learn the specifics of LUMI (as opposed to other shared computers), which will mainly save me time when I transfer over to LUMI. Also, I really liked the profiling bit, scaling to multiple GPUs, and hyperparameter tuning. Also I'm VERY GLAD FOR all of the resources to take home for me! I'll certainly revisit them when I need them.
- Thanks to the team for this awesome workshop. The containerization techniques and advanced multi-GPU training are super beneficial for me. (+2)
- All of the lectures were super clear, well structured and relevant, and covered what seems to be every necessary aspect. The training materials are well structured too and also very, very useful! The most useful part for me was understanding how to increase the efficiency of code to, e.g., maximise the full potential of reserved resources. Thank you so much.
- Running AI jobs on LUMI and scaling to multiple GPUs.
- Nearly everything was so interesting that it is hard to pick out a single thing. Thank you for this excellent workshop -- it certainly did not feel like this was the first time it was being held.
- More lectures and hands-on exercises on how to benefit from parallelization on LUMI with asynchronous operation(s).
    - (Lukas) Can you elaborate on what kind of asynchronous operations you mean?
- The subject of classic simulation combined with AI was highly interesting. The subject is being discussed frequently at CSC and we plan to arrange a webinar. I should connect with the speaker on the subject.
- The course (on Zoom) was for me an introduction to supercomputing as a beginner. The feeling afterwards is that it is possible, with some self-study, to do everything taught in the course. My attention was unfortunately shared with other work (parallelism, haha), so of the exercises I managed only the first parts.
The course gave an understanding of what points need to be understood in order to be self-sufficient on a supercomputer. With the recipes and documentation presented, it feels possible to run a simulation+AI (CFD maybe) or an LLM tuning on LUMI with some more work investment. If the course format on Zoom is not given more interaction, I will recommend going to the course on-site. I hope I didn't steal too much GPU time from real research these two days! ;) Thanks to you all for arranging this.
    - We actually would like to have more people on-site and fewer online, as the contact with the experts is as important as the content of the course. It is not possible to simply give a Zoom course with the same level of interaction, and it is actually also more expensive, as more people are needed and you need to bring them together anyway, because coordinating answers between people over Zoom is simply not possible. Anybody who gave courses to a group of more than 10 or so people during COVID will know they really needed an assistant in the next room to pass on questions etc.
        - Yes, certainly your on-site-focused arrangement was understood from the start. No disappointment whatsoever! However, these complete courses would be great also with an online focus, either real-time or "continuously ongoing".
- I found the hands-on exercises to be really valuable; they make it much easier to get the important things right (especially in the case of an in-person workshop). It's easier to miss details when just reading through the docs. Hard to pinpoint specific things to highlight out of the course. For me personally, I am now sold on cotainr, while I was skeptical of it before.
- I really appreciated the course. Thank you all for arranging this, and thanks for all the help with my questions! I have spotted pitfalls and revised the problems I faced before and beyond. I will certainly use the examples and take-aways and revisit the slides in my practice. There are many things I used already but that were not clear to me before; now it is very clear to me why they are so. I found it quite helpful to attend this course in person. I hope this kind of course will happen more often in Copenhagen, or at least in Denmark. :-) Specifically, I have several comments, and rather compliments, as follows:
    - Personally, I found session 10, extreme scale AI, most pertinent and useful for my applications. I believe it would also be worthwhile in the future to extend it into an individual course, just to help users get their codes optimized for running at scale with the special skillset (I am definitely in need of this every now and then, and it is usually difficult to get sufficient help on this matter).
    - I realized I was running my jobs on LUMI in a sub-par manner because of a lot of misconceptions, such as each node having 8 GPUs rather than 4; even after reading the documentation, I guess certain things just slip away if one is not an expert. But this course certainly helps to clear that up.
    - I found the last session quite interesting. I guess it is very useful for machine learning in projects such as climate change or biology (not my fields).
- In general I found the course very useful. For me there was not enough time for the hands-on exercises (but that is also affected by the fact that this is my first time on LUMI, so I'm not used to navigating the system in general). But having the reference solutions is very useful. I think you should consider re-structuring the Python file of the case that you used.
Maybe clean away all model and dataset handling into other modules or functions. It could make the file more easily readable and keep the focus on the right parts. Maybe find inspiration in the boilerplate code PyTorch provides for their DDP tutorial (https://github.com/pytorch/examples/blob/main/distributed/ddp-tutorial-series/multigpu_torchrun.py); there the model and data are loaded on lines 78 and 79.

:::info
**Please write your feedback above this note**
:::

###### EOF