# LUMI Hackathon: SignLlava
Use this collaborative document to plan your steps, discuss questions and take notes.
[Overview page of markdown syntax and HedgeDoc features like embedding pictures, pdf, links](https://md.sigma2.no/features#Table-of-Contents)
## Perfetto locally
Pinned UI version: https://ui.perfetto.dev/v46.0-35b3d9845/#!/
If you want to open multiple logs at once, you can run one trace_processor per port and open multiple files.
#### ONLY WORKS IN CHROME
```
cd ~
mkdir -p Perfetto
cd Perfetto
# Download the trace_processor wrapper script
curl -LO https://get.perfetto.dev/trace_processor
chmod +x ./trace_processor
# Serve the trace over HTTP so the Perfetto UI can connect
./trace_processor --httpd <path_to_json_trace> --http-port 9002
```
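Then open the UI against that port (matching the `--http-port 9002` above, same URL pattern as in the multi-trace section below): https://ui.perfetto.dev/#!/?rpc_port=9002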
#### Multiple traces at once locally
```
./trace_processor --httpd --http-port 9001 trace1.pftrace
./trace_processor --httpd --http-port 9002 trace2.pftrace
./trace_processor --httpd --http-port 9003 trace3.pftrace
```
The URLs are as follows:
https://ui.perfetto.dev/#!/?rpc_port=9001
https://ui.perfetto.dev/#!/?rpc_port=9002
https://ui.perfetto.dev/#!/?rpc_port=9003
### Would be GREAT if that worked on LUMI
## Perfetto on LUMI
```
# On LUMI: log in to uan02
curl -LO https://commondatastorage.googleapis.com/perfetto-luci-artifacts/v32.0/linux-amd64/trace_processor_shell
chmod +x trace_processor_shell
# Find where the wrapper script expects the prebuilt: cd ~; find . -iname trace_processor_shell
# Replace that prebuilt with the downloaded binary
cp trace_processor_shell ./.local/share/perfetto/prebuilts/trace_processor_shell
./trace_processor --httpd --http-port=12000 results.json

# On your local machine: forward the ports
ssh -L 10000:localhost:10000 -L 9001:localhost:12000 -i <ssh_key> <username>@lumi-uan02.csc.fi
# Then open localhost:10000 in Chromium
```
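Given the `-L 9001:localhost:12000` forward above, the LUMI-side trace should also be reachable from the local UI at https://ui.perfetto.dev/#!/?rpc_port=9001 (an assumption based on the port mapping and the URL pattern above; opening localhost:10000 in Chromium is the route noted above).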
## XZ on LUMI
Compress:
`xz -T16 -9 --keep results.json`
Unpack:
`xz -d results.json.xz`
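To inspect a compressed file (size, compression ratio) before transferring it, standard `xz --list` works:
```
xz -l results.json.xz
```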
## Goals for Hackathon
1. ...
2. ...
## Steps and who's working on what
1. I built a new venv for SignLlava like this:
```
# salloc a job, then get an interactive shell on the node
srun --interactive --pty bash
# While in the venv directory, start a shell inside the container
singularity exec -B $(pwd):/workdir /appl/local/containers/sif-images/lumi-pytorch-rocm-6.1.3-python-3.12-pytorch-v2.4.1.sif bash
cd /workdir
python -m venv --system-site-packages signllava-venv
source signllava-venv/bin/activate
# Install all packages that aren't included in the provided container
pip install librosa wandb h5py opencv-python decord google
# There was an issue with deepspeed and transformers compatibility
# To solve this problem, I downgraded transformers from 4.45.1 to 4.44.2
# Since transformers is already installed in the non-writable container, pip installs the new version to the home directory by default
# Thus, we need to use --target option
pip install --target /workdir/signllava-venv/lib/python3.12/site-packages transformers==4.44.2
```
2. Llama3-70B with ZeRO3
- Launch
```bash
deepspeed llava/train/train_xformers.py --deepspeed ./scripts/zero3_modified.json --gradient_accumulation_steps 1 --report_to wandb --yaml_args /scratch/project_465000977/eleznyto/SignLLMpp/configs/lumi_hackaton/subset_70b.yaml
```
- What’s different?
- Larger model in yaml
- zero3 config instead of zero2
- Pay attention to `stage3_prefetch_bucket_size` variable
- Maximum number of parameter elements to fetch ahead of use.
- It is set to `"auto"` in our default config
- With `"auto"`, the value is calculated from the size of the model, not from the free GPU memory!
- The auto-calculation may produce a float instead of an int → error.
- This error still persists in the newest accelerate version (1.0.1)
- The accelerate default value is `"50000000"`
- This value is too high for our model; if the value is too high, the GPUs run out of memory
- Our training works with `"stage3_prefetch_bucket_size": 10000000` and `per_device_train_batch_size: 4` so far; these parameters should be tuned for optimal training (see the config sketch below)
- More information here:
https://huggingface.co/docs/accelerate/usage_guides/deepspeed
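For reference, a minimal sketch of the relevant part of `zero3_modified.json` with the working value hard-coded instead of `"auto"` (only `stage` and `stage3_prefetch_bucket_size` come from our runs; everything else in the file stays as in the default config):
```json
{
  "zero_optimization": {
    "stage": 3,
    "stage3_prefetch_bucket_size": 10000000
  }
}
```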
## Notes
- ...
- ...
Starting a container with a bind mount of the current directory:
```
# Interactive shell on a compute node
srun --interactive --pty bash
# Shell inside the container with the current directory mounted as /workdir
singularity exec -B $(pwd):/workdir /appl/local/containers/sif-images/lumi-pytorch-rocm-6.1.3-python-3.12-pytorch-v2.4.1.sif bash
# Create and activate the venv on top of the container's site-packages
$WITH_CONDA
python -m venv --system-site-packages signllava-venv
source signllava-venv/bin/activate
pip install wandb
# Non-interactive check that the venv is usable from inside the container
singularity run -B $(pwd):/workdir /appl/local/containers/sif-images/lumi-pytorch-rocm-6.1.3-python-3.12-pytorch-v2.4.1.sif bash -c '$WITH_CONDA; source /workdir/signllava-venv/bin/activate ; python -c "import wandb ; print(\"Hi!\")"'
# Alternative: point PYTHONPATH at the venv's site-packages instead of activating it
PYTHONPATH=/workdir/signllava-venv/lib/python3.12/site-packages singularity run -B $(pwd):/workdir -B /pfs/lustrep3/scratch/project_462000394:/pfs/lustrep3/scratch/project_462000394 /appl/local/containers/sif-images/lumi-pytorch-rocm-6.1.3-python-3.12-pytorch-v2.4.1.sif bash -c '$WITH_CONDA ; python -c "import wandb ; print(\"Hi!\")"'
```
Omnitrace environment (manual setup):
```
LD_LIBRARY_PATH=/pfs/lustrep3/scratch/project_462000394/amd-sw/omnitrace/1.12.0-rocm6.1.x/lib
PYTHONPATH=/pfs/lustrep3/scratch/project_462000394/amd-sw/omnitrace/1.12.0-rocm6.1.x/lib/python/site-packages
PATH=/pfs/lustrep3/scratch/project_462000394/amd-sw/omnitrace/1.12.0-rocm6.1.x/bin
```
Omnitrace via modules:
```
module use /appl/local/containers/test-modules/
module load omnitrace
```
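The `omnitrace-config.cfg` referenced by the job scripts below can be generated from omnitrace's defaults with `omnitrace-avail` and then edited (the output path here just mirrors the one used below):
```
omnitrace-avail -G /pfs/lustrep3/users/strakaja/Projects/configs/omnitrace-config.cfg
```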
### AI questionnaire
If you can spare 5 min here is the link to LUMI AI questionnaire: https://link.webropolsurveys.com/Participation/Public/c62ffb41-714a-4425-aa37-69634dc22419?displayId=Fin3151145
### SLURM Run - omnitrace profiling
run_job.sh
```
#!/usr/bin/env -S bash -e
#SBATCH --job-name=run_profile_node1
#SBATCH --nodes=1
#SBATCH --tasks-per-node=8
#SBATCH --cpus-per-task=7
#SBATCH --gpus-per-node=8
#SBATCH --mem=0
#SBATCH --output="/pfs/lustrep2/scratch/project_465000977/profile/logs_profile/log_%x_%j.txt"
#SBATCH --partition=standard-g
#SBATCH --time=1:00:00
#SBATCH --account=project_465001361
#SBATCH --reservation=lumi_hackathon
export EBU_USER_PREFIX=/project/project_465000977/EasyBuild
module load CrayEnv
SIF=/appl/local/containers/sif-images/lumi-pytorch-rocm-6.1.3-python-3.12-pytorch-v2.4.1.sif
PROJECT_PATH=/scratch/project_465000977
VENV_PATH=/scratch/project_465000977/eleznyto/venvs
HOME_PATH=/pfs/lustrep3/users/strakaja/Projects
OMNITRACE=/pfs/lustrep3/scratch/project_462000394
export MASTER_ADDR=$(scontrol show hostname "$SLURM_NODELIST" | head -n1)
export MASTER_PORT=50000
# LUMI-G CPU binding: one CCD (7 cores) per GCD, matching the GPU/NUMA topology
srun --cpu-bind=mask_cpu:0xfe000000000000,0xfe00000000000000,0xfe0000,0xfe000000,0xfe,0xfe00,0xfe00000000,0xfe0000000000 \
singularity exec -B $OMNITRACE,$HOME_PATH,${HOME_PATH}/Sign_LLaVA,$PROJECT_PATH --bind /pfs,/scratch,/projappl,/project,/flash,/appl -B $VENV_PATH:/venvs $SIF \
bash /pfs/lustrep2/scratch/project_465000977/profile/run/profile_runs/profile_train_python_128.sh
```
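Submit from a login node as usual:
```
sbatch run_job.sh
```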
profile_train_python_128.sh
```
$WITH_CONDA
source /venvs/signllava-venv/bin/activate
export MIOPEN_USER_DB_PATH=/tmp/${USER}-miopen-cache-${SLURM_JOB_ID}
export MIOPEN_CUSTOM_CACHE_DIR=${MIOPEN_USER_DB_PATH}
PROJECT_PATH=/scratch/project_465000977
OMNITRACE=/pfs/lustrep3/scratch/project_462000394
export PATH=${OMNITRACE}/amd-sw/omnitrace/1.12.0-rocm6.1.x/bin:${PATH}
export LD_LIBRARY_PATH=${OMNITRACE}/amd-sw/omnitrace/1.12.0-rocm6.1.x/lib:${LD_LIBRARY_PATH}
export PYTHONPATH=$PYTHONPATH:/venvs/signllava-venv/lib/python3.12/site-packages/
cd /pfs/lustrep3/users/strakaja/Projects/Sign_LLaVA
export PYTHONPATH=$PYTHONPATH:$(pwd)/
export LC_ALL=C
# export WANDB_PROJECT="LumiHackaton"
# export WANDB_API_KEY=
export RANK=$SLURM_PROCID
export LOCAL_RANK=$SLURM_LOCALID
export WORLD_SIZE=$SLURM_NPROCS
# Set interfaces to be used by RCCL.
export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3
export NCCL_NET_GDR_LEVEL=PHB
export OMNITRACE_CONFIG_FILE=/pfs/lustrep3/users/strakaja/Projects/configs/omnitrace-config.cfg
export OMNITRACE_OUTPUT_PATH="${PROJECT_PATH}/profile/outputs_profile/omnitrace/${SLURM_JOB_NAME}"
# First profiling pass: omnitrace wrapping the python training entry point
omnitrace-python -- -u llava/train/train_xformers.py \
--local_rank=$LOCAL_RANK \
--deepspeed ./scripts/zero2.json \
--gradient_accumulation_steps 1 \
--report_to none \
--yaml_args /pfs/lustrep3/users/strakaja/Projects/configs/subset_128.yaml
# Second profiling pass: rocprof HIP trace, with a per-node/per-rank temp dir
OUTPUT_PATH="${PROJECT_PATH}/profile/outputs_profile/rocprof/${SLURM_JOB_NAME}"
TEMP_PATH="${OUTPUT_PATH}/N${SLURM_NODEID}_L${SLURM_LOCALID}"
mkdir -p $OUTPUT_PATH
mkdir -p $TEMP_PATH
cd $TEMP_PATH
rocprof --hip-trace \
-d $OUTPUT_PATH \
python -u /pfs/lustrep3/users/strakaja/Projects/Sign_LLaVA/llava/train/train_xformers.py \
--local_rank=$LOCAL_RANK \
--deepspeed /pfs/lustrep3/users/strakaja/Projects/Sign_LLaVA/scripts/zero2.json \
--gradient_accumulation_steps 1 \
--report_to none \
--yaml_args /pfs/lustrep3/users/strakaja/Projects/configs/subset_128.yaml
```
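Omnitrace emits Perfetto-compatible traces (`perfetto-trace-*.proto`) under `OMNITRACE_OUTPUT_PATH`; assuming that layout, a trace can be served with the same trace_processor setup as at the top of this document:
```
./trace_processor --httpd --http-port 9001 <omnitrace_output_dir>/perfetto-trace-<pid>.proto
```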
### AMD memory alloc tracking
- Set the `AMD_LOG_LEVEL` variable
- `export AMD_LOG_LEVEL=3` for our case
- More information here:
https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/logging.html
- This will print out A LOT of AMD information
- For us, only the allocation messages are useful
- Redirect the script's output to a file
- We can use `python our_script.py >output_file.out`
- However, this captures only stdout, not stderr, which is where the AMD logging goes
- Thus, we also need to redirect stderr like this:
- `python our_script.py >output_file.out 2>&1`
- Now we have a file with both stdout and stderr. This file is usually very large
- Extract only the "alloc" lines from the output. We also want to keep some of the standard output to track when those allocations happen.
- We'll keep the lines containing "%" to track the iteration progress bar
- Save the result to the `output_file.processed` file
- `grep -E 'lloc|%' output_file.out > output_file.processed`
- It's good to check file sizes before opening them
- `ls -lh`
- Now we can see the individual allocations and how long they took; a consolidated sketch follows below
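Putting the steps above together, a minimal end-to-end sketch (`our_script.py` is the placeholder name from above):
```
# Level 3 includes memory allocation messages (printed to stderr)
export AMD_LOG_LEVEL=3
# Capture both stdout and stderr
python our_script.py >output_file.out 2>&1
# Keep only allocation lines plus the "%" progress-bar lines
grep -E 'lloc|%' output_file.out > output_file.processed
# Check file sizes before opening anything
ls -lh output_file.out output_file.processed
```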