# LUMI Hackathon: SignLlava
Use this collaborative document to plan your steps, discuss questions and take notes.
[Overview page of markdown syntax and HedgeDoc features like embedding pictures, pdf, links](https://md.sigma2.no/features#Table-of-Contents)
## Perfetto locally
Pinned UI version: https://ui.perfetto.dev/v46.0-35b3d9845/#!/
If you want to open multiple logs at once, you can run one trace_processor per port and open multiple files.
#### ONLY WORKS IN CHROME
```
cd ~
mkdir -p Perfetto
cd Perfetto
# Download the trace_processor wrapper script
curl -LO https://get.perfetto.dev/trace_processor
chmod +x ./trace_processor
# Serve the trace over HTTP so the Perfetto UI can connect
./trace_processor --httpd <path_to_json_trace> --http-port 9002
```
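Then open the UI against that port (matching the `--http-port 9002` above, same URL pattern as in the multi-trace section below): https://ui.perfetto.dev/#!/?rpc_port=9002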
#### Multiple traces at once locally
```
./trace_processor --httpd --http-port 9001 trace1.pftrace
./trace_processor --httpd --http-port 9002 trace2.pftrace
./trace_processor --httpd --http-port 9003 trace3.pftrace
```
The URLs are as follows:
https://ui.perfetto.dev/#!/?rpc_port=9001
https://ui.perfetto.dev/#!/?rpc_port=9002
https://ui.perfetto.dev/#!/?rpc_port=9003
### Would be GREAT if that worked on LUMI
## Perfetto on LUMI
```
# On LUMI: log in to uan02
curl -LO https://commondatastorage.googleapis.com/perfetto-luci-artifacts/v32.0/linux-amd64/trace_processor_shell
chmod +x trace_processor_shell
# Find where the wrapper script expects the prebuilt: cd ~; find . -iname trace_processor_shell
# Replace that prebuilt with the downloaded binary
cp trace_processor_shell ./.local/share/perfetto/prebuilts/trace_processor_shell
./trace_processor --httpd --http-port=12000 results.json

# On your local machine: forward the ports
ssh -L 10000:localhost:10000 -L 9001:localhost:12000 -i <ssh_key> <username>@lumi-uan02.csc.fi
# Then open localhost:10000 in Chromium
```
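Given the `-L 9001:localhost:12000` forward above, the LUMI-side trace should also be reachable from the local UI at https://ui.perfetto.dev/#!/?rpc_port=9001 (an assumption based on the port mapping and the URL pattern above; opening localhost:10000 in Chromium is the route noted above).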
## XZ on LUMI
Compress:
`xz -T16 -9 --keep results.json`
Unpack:
`xz -d results.json.xz`
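To inspect a compressed file (size, compression ratio) before transferring it, standard `xz --list` works:
```
xz -l results.json.xz
```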
## Goals for Hackathon
1. ...
2. ...
## Steps and who's working on what
1. I built a new venv for SignLlava like this:
```
# salloc a job, then get an interactive shell on the node
srun --interactive --pty bash
# While in the venv directory, start a shell inside the container
singularity exec -B $(pwd):/workdir /appl/local/containers/sif-images/lumi-pytorch-rocm-6.1.3-python-3.12-pytorch-v2.4.1.sif bash
cd /workdir
python -m venv --system-site-packages signllava-venv
source signllava-venv/bin/activate
# Install all packages that aren't included in the provided container
pip install librosa wandb h5py opencv-python decord google
# There was an issue with deepspeed and transformers compatibility
# To solve this problem, I downgraded transformers from 4.45.1 to 4.44.2
# Since transformers is already installed in the non-writable container, pip installs the new version to the home directory by default
# Thus, we need to use --target option
pip install --target /workdir/signllava-venv/lib/python3.12/site-packages transformers==4.44.2
```
2. Llama3-70B with ZeRO3
- Launch
```bash
deepspeed llava/train/train_xformers.py --deepspeed ./scripts/zero3_modified.json --gradient_accumulation_steps 1 --report_to wandb --yaml_args /scratch/project_465000977/eleznyto/SignLLMpp/configs/lumi_hackaton/subset_70b.yaml
```
- What’s different?
- Larger model in yaml
- zero3 config instead of zero2
- Pay attention to `stage3_prefetch_bucket_size` variable
- Maximum number of parameter elements to fetch ahead of use.
- It is set to `"auto"` in our default config
- With `"auto"`, the value is calculated from the size of the model, not from the free GPU memory!
- The auto-calculation may produce a float instead of an int → error.
- This error still persists in the newest accelerate version (1.0.1)
- The accelerate default value is `"50000000"`
- This value is too high for our model; if the value is too high, the GPUs run out of memory
- Our training works with `"stage3_prefetch_bucket_size": 10000000` and `per_device_train_batch_size: 4` so far; these parameters should be tuned for optimal training (see the config sketch below)
- More information here:
https://huggingface.co/docs/accelerate/usage_guides/deepspeed
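For reference, a minimal sketch of the relevant part of `zero3_modified.json` with the working value hard-coded instead of `"auto"` (only `stage` and `stage3_prefetch_bucket_size` come from our runs; everything else in the file stays as in the default config):
```json
{
  "zero_optimization": {
    "stage": 3,
    "stage3_prefetch_bucket_size": 10000000
  }
}
```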
## Notes
- ...
- ...
Starting a container with a bind mount of the current directory:
```
# Interactive shell on a compute node
srun --interactive --pty bash
# Shell inside the container with the current directory mounted as /workdir
singularity exec -B $(pwd):/workdir /appl/local/containers/sif-images/lumi-pytorch-rocm-6.1.3-python-3.12-pytorch-v2.4.1.sif bash
# Create and activate the venv on top of the container's site-packages
$WITH_CONDA
python -m venv --system-site-packages signllava-venv
source signllava-venv/bin/activate
pip install wandb
# Non-interactive check that the venv is usable from inside the container
singularity run -B $(pwd):/workdir /appl/local/containers/sif-images/lumi-pytorch-rocm-6.1.3-python-3.12-pytorch-v2.4.1.sif bash -c '$WITH_CONDA; source /workdir/signllava-venv/bin/activate ; python -c "import wandb ; print(\"Hi!\")"'
# Alternative: point PYTHONPATH at the venv's site-packages instead of activating it
PYTHONPATH=/workdir/signllava-venv/lib/python3.12/site-packages singularity run -B $(pwd):/workdir -B /pfs/lustrep3/scratch/project_462000394:/pfs/lustrep3/scratch/project_462000394 /appl/local/containers/sif-images/lumi-pytorch-rocm-6.1.3-python-3.12-pytorch-v2.4.1.sif bash -c '$WITH_CONDA ; python -c "import wandb ; print(\"Hi!\")"'
```
Omnitrace environment (manual setup):
```
LD_LIBRARY_PATH=/pfs/lustrep3/scratch/project_462000394/amd-sw/omnitrace/1.12.0-rocm6.1.x/lib
PYTHONPATH=/pfs/lustrep3/scratch/project_462000394/amd-sw/omnitrace/1.12.0-rocm6.1.x/lib/python/site-packages
PATH=/pfs/lustrep3/scratch/project_462000394/amd-sw/omnitrace/1.12.0-rocm6.1.x/bin
```
Omnitrace via modules:
```
module use /appl/local/containers/test-modules/
module load omnitrace
```
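The `omnitrace-config.cfg` referenced by the job scripts below can be generated from omnitrace's defaults with `omnitrace-avail` and then edited (the output path here just mirrors the one used below):
```
omnitrace-avail -G /pfs/lustrep3/users/strakaja/Projects/configs/omnitrace-config.cfg
```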
### AI questionnaire
If you can spare 5 min here is the link to LUMI AI questionnaire: https://link.webropolsurveys.com/Participation/Public/c62ffb41-714a-4425-aa37-69634dc22419?displayId=Fin3151145
### SLURM Run - omnitrace profiling
run_job.sh
```
#!/usr/bin/env -S bash -e
#SBATCH --job-name=run_profile_node1
#SBATCH --nodes=1
#SBATCH --tasks-per-node=8
#SBATCH --cpus-per-task=7
#SBATCH --gpus-per-node=8
#SBATCH --mem=0
#SBATCH --output="/pfs/lustrep2/scratch/project_465000977/profile/logs_profile/log_%x_%j.txt"
#SBATCH --partition=standard-g
#SBATCH --time=1:00:00
#SBATCH --account=project_465001361
#SBATCH --reservation=lumi_hackathon
export EBU_USER_PREFIX=/project/project_465000977/EasyBuild
module load CrayEnv
SIF=/appl/local/containers/sif-images/lumi-pytorch-rocm-6.1.3-python-3.12-pytorch-v2.4.1.sif
PROJECT_PATH=/scratch/project_465000977
VENV_PATH=/scratch/project_465000977/eleznyto/venvs
HOME_PATH=/pfs/lustrep3/users/strakaja/Projects
OMNITRACE=/pfs/lustrep3/scratch/project_462000394
export MASTER_ADDR=$(scontrol show hostname "$SLURM_NODELIST" | head -n1)
export MASTER_PORT=50000
# LUMI-G CPU binding: one CCD (7 cores) per GCD, matching the GPU/NUMA topology
srun --cpu-bind=mask_cpu:0xfe000000000000,0xfe00000000000000,0xfe0000,0xfe000000,0xfe,0xfe00,0xfe00000000,0xfe0000000000 \
singularity exec -B $OMNITRACE,$HOME_PATH,${HOME_PATH}/Sign_LLaVA,$PROJECT_PATH --bind /pfs,/scratch,/projappl,/project,/flash,/appl -B $VENV_PATH:/venvs $SIF \
bash /pfs/lustrep2/scratch/project_465000977/profile/run/profile_runs/profile_train_python_128.sh
```
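Submit from a login node as usual:
```
sbatch run_job.sh
```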
profile_train_python_128.sh
```
$WITH_CONDA
source /venvs/signllava-venv/bin/activate
export MIOPEN_USER_DB_PATH=/tmp/${USER}-miopen-cache-${SLURM_JOB_ID}
export MIOPEN_CUSTOM_CACHE_DIR=${MIOPEN_USER_DB_PATH}
PROJECT_PATH=/scratch/project_465000977
OMNITRACE=/pfs/lustrep3/scratch/project_462000394
export PATH=${OMNITRACE}/amd-sw/omnitrace/1.12.0-rocm6.1.x/bin:${PATH}
export LD_LIBRARY_PATH=${OMNITRACE}/amd-sw/omnitrace/1.12.0-rocm6.1.x/lib:${LD_LIBRARY_PATH}
export PYTHONPATH=$PYTHONPATH:/venvs/signllava-venv/lib/python3.12/site-packages/
cd /pfs/lustrep3/users/strakaja/Projects/Sign_LLaVA
export PYTHONPATH=$PYTHONPATH:$(pwd)/
export LC_ALL=C
# export WANDB_PROJECT="LumiHackaton"
# export WANDB_API_KEY=
export RANK=$SLURM_PROCID
export LOCAL_RANK=$SLURM_LOCALID
export WORLD_SIZE=$SLURM_NPROCS
# Set interfaces to be used by RCCL.
export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3
export NCCL_NET_GDR_LEVEL=PHB
export OMNITRACE_CONFIG_FILE=/pfs/lustrep3/users/strakaja/Projects/configs/omnitrace-config.cfg
export OMNITRACE_OUTPUT_PATH="${PROJECT_PATH}/profile/outputs_profile/omnitrace/${SLURM_JOB_NAME}"
# First profiling pass: omnitrace wrapping the python training entry point
omnitrace-python -- -u llava/train/train_xformers.py \
--local_rank=$LOCAL_RANK \
--deepspeed ./scripts/zero2.json \
--gradient_accumulation_steps 1 \
--report_to none \
--yaml_args /pfs/lustrep3/users/strakaja/Projects/configs/subset_128.yaml
# Second profiling pass: rocprof HIP trace, with a per-node/per-rank temp dir
OUTPUT_PATH="${PROJECT_PATH}/profile/outputs_profile/rocprof/${SLURM_JOB_NAME}"
TEMP_PATH="${OUTPUT_PATH}/N${SLURM_NODEID}_L${SLURM_LOCALID}"
mkdir -p $OUTPUT_PATH
mkdir -p $TEMP_PATH
cd $TEMP_PATH
rocprof --hip-trace \
-d $OUTPUT_PATH \
python -u /pfs/lustrep3/users/strakaja/Projects/Sign_LLaVA/llava/train/train_xformers.py \
--local_rank=$LOCAL_RANK \
--deepspeed /pfs/lustrep3/users/strakaja/Projects/Sign_LLaVA/scripts/zero2.json \
--gradient_accumulation_steps 1 \
--report_to none \
--yaml_args /pfs/lustrep3/users/strakaja/Projects/configs/subset_128.yaml
```
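Omnitrace emits Perfetto-compatible traces (`perfetto-trace-*.proto`) under `OMNITRACE_OUTPUT_PATH`; assuming that layout, a trace can be served with the same trace_processor setup as at the top of this document:
```
./trace_processor --httpd --http-port 9001 <omnitrace_output_dir>/perfetto-trace-<pid>.proto
```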
### AMD memory alloc tracking
- Set the `AMD_LOG_LEVEL` variable
- `export AMD_LOG_LEVEL=3` for our case
- More information here:
https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/logging.html
- This will print out A LOT of AMD information
- For us, only the allocation messages are useful
- Redirect the script's output to a file
- We can use `python our_script.py >output_file.out`
- However, this captures only stdout, not stderr, which is where the AMD logging goes
- Thus, we also need to redirect stderr like this:
- `python our_script.py >output_file.out 2>&1`
- Now we have a file with both stdout and stderr. This file is usually very large
- Extract only the "alloc" lines from the output. We also want to keep some of the standard output to track when those allocations happen.
- We'll keep the lines containing "%" to track the iteration progress bar
- Save the result to the `output_file.processed` file
- `grep -E 'lloc|%' output_file.out > output_file.processed`
- It's good to check file sizes before opening them
- `ls -lh`
- Now we can see the individual allocations and how long they took; a consolidated sketch follows below
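Putting the steps above together, a minimal end-to-end sketch (`our_script.py` is the placeholder name from above):
```
# Level 3 includes memory allocation messages (printed to stderr)
export AMD_LOG_LEVEL=3
# Capture both stdout and stderr
python our_script.py >output_file.out 2>&1
# Keep only allocation lines plus the "%" progress-bar lines
grep -E 'lloc|%' output_file.out > output_file.processed
# Check file sizes before opening anything
ls -lh output_file.out output_file.processed
```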