# Finetuning with Quantization
We support Quantized Parameter-Efficient Fine-Tuning (QPEFT) methods, including QNormBias and QNormBiasLoRA, which significantly reduce compute and memory demands. In QPEFT, we quantize the base model and keep only a small set of carefully selected parameters trainable.
- **QNormBias.** Only the bias terms and normalization weights receive gradient updates. The pretrained LLaMA2 weights are quantized and frozen.
- **QNormBiasLoRA.** The bias terms, LoRA weights, and normalization weights receive gradient updates. The pretrained LLaMA2 weights are quantized and frozen.
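Conceptually, both methods boil down to freezing the quantized base weights and enabling gradients only for a small allowlist of parameters. Below is a minimal PyTorch sketch of that selection; the name patterns (`bias`, `norm`, `lora`) follow common PEFT conventions and are assumptions, not the exact parameter names used in this repo.

```python
import torch.nn as nn

def mark_trainable(model: nn.Module, use_lora: bool = False) -> None:
    """Freeze everything except bias/norm (and optionally LoRA) parameters.

    Name patterns are assumed conventions; real code may instead flag
    these parameters directly on the corresponding modules.
    """
    keywords = ["bias", "norm"] + (["lora"] if use_lora else [])
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name.lower() for k in keywords)
```

With `use_lora=False` this corresponds to QNormBias; with `use_lora=True`, to QNormBiasLoRA.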
## Best Practice
```bash
# Enable quantization with the flags "--quant" and "--only_save_trainable"
torchrun <--some_flags> main_finetune.py <--some_flags> \
--quant --only_save_trainable
```
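As a rough mental model (a sketch, not the repo's actual implementation), `--only_save_trainable` amounts to checkpointing only the parameters that receive gradients, which keeps the saved file tiny compared to the full model:

```python
import os
import torch
import torch.nn as nn

def save_trainable(model: nn.Module, outdir: str) -> None:
    # Collect only the parameters left trainable by QPEFT (bias/norm/LoRA).
    trainable = {name: param.detach().cpu()
                 for name, param in model.named_parameters()
                 if param.requires_grad}
    # The filename is a placeholder, not the repo's exact layout.
    torch.save(trainable, os.path.join(outdir, "trainable_params.pth"))
```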
For more details, please check the following scripts:
| Method | Finetune Language-only LLaMA 2 | Finetune Multi-Modal LLaMA 2 |
|---|---|---|
| QNormBias | - | |
| QNormBiasLoRA | | |
## Comparison
Models can be loaded in the 4-bit NormalFloat (NF4) data format, which benefits both inference and training by significantly reducing VRAM demands. To assess the impact, we ran experiments on A100-80GB GPUs and obtained the following results. The quantization is implemented with bitsandbytes; check out the QLoRA paper to learn more.
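For illustration, here is roughly how a single linear layer is quantized to NF4 with bitsandbytes (a sketch assuming bitsandbytes >= 0.40; the training code wires this up internally when you pass `--quant`, so you do not need to do this yourself):

```python
import torch
import torch.nn as nn
import bitsandbytes as bnb

# Replace a full-precision nn.Linear with a 4-bit NF4 layer.
fp_layer = nn.Linear(4096, 4096, bias=False)

nf4_layer = bnb.nn.Linear4bit(
    4096, 4096, bias=False,
    compute_dtype=torch.bfloat16,  # matmuls run in BF16 after dequantization
    quant_type="nf4",              # 4-bit NormalFloat storage
)
nf4_layer.weight = bnb.nn.Params4bit(
    fp_layer.weight.data, requires_grad=False, quant_type="nf4"
)
nf4_layer = nf4_layer.cuda()  # weights are actually quantized on the device move
```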
*Batch size is fixed at 1 for fair comparison.*

| Model | Max Length | Task/Dataset | Precision | Batch Size | Inference VRAM | Training VRAM | Single GPU |
|---|---|---|---|---|---|---|---|
| LLaMA2-70B | 512 | Language-only/Alpaca | BF16 | 1 | 145 GB | 165 GB (NormBias) | ❌ |
| LLaMA2-70B | 512 | Language-only/Alpaca | NF4 | 1 | 36 GB | 46 GB (NormBias) | ✔ |
| LLaMA2-13B Q-Former | 512 | Multi-modal/LLaVA-Instruct-150K | BF16 | 1 | 31 GB | 38 GB (NormBiasLoRA) | ✔ |
| LLaMA2-13B Q-Former | 512 | Multi-modal/LLaVA-Instruct-150K | NF4 | 1 | 13 GB | 15 GB (NormBiasLoRA) | ✔ |
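A back-of-envelope check makes these numbers plausible: 70B parameters at 2 bytes each (BF16) is about 140 GB of weights alone, while NF4 stores roughly 0.5 byte per parameter, i.e. about 35 GB; the measured 145 GB and 36 GB add activations, the KV cache, and quantization statistics on top.

```python
# Rough weight-only memory estimate (ignores activations, KV cache,
# and NF4 quantization statistics).
params = 70e9
print(f"BF16: {params * 2 / 1e9:.0f} GB")    # -> 140 GB (measured: 145 GB)
print(f"NF4:  {params * 0.5 / 1e9:.0f} GB")  # -> 35 GB  (measured: 36 GB)
```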
**GPU hours of finetuning.** Note that we use 8x A100-80GB GPUs for finetuning. GPU hours refers to `number_of_cards * total_training_time`; for example, 8 cards running for 12.5h of wall-clock time is 100 GPU hours.
| Model | Task/Dataset | Samples | Epochs | Precision | GPU Hours | 8x A100 Training Time |
|---|---|---|---|---|---|---|
| LLaMA2-70B | Language-only/Alpaca | 52K | 4 | BF16 | 100h | 12.5h |
| LLaMA2-70B | Language-only/Alpaca | 52K | 4 | NF4 | 80h | 10h |
| LLaMA2-13B Q-Former | Multi-modal/LLaVA-Instruct-150K | 150K | 3 | BF16 | 170h | 20h |
| LLaMA2-13B Q-Former | Multi-modal/LLaVA-Instruct-150K | 150K | 3 | NF4 | 88h | 11h |
## Inference
When QPEFT finishes, the trainable weights are saved in `outdir`. Run inference with the following scripts:
**Language-only LLaMA2**

```bash
# if NormBias
peft_config=""
# elif NormBiasLoRA
peft_config="configs/model/finetune/sg/llamaPeft_normBiasLora.json"

torchrun --nproc-per-node=1 demos/single_turn.py \
--llama_type "llama_peft" \
--llama_config </path/to/params.json> $peft_config \
--tokenizer_path </path/to/tokenizer.model> \
--pretrained_path </path/to/llama> </path/to/trainable/params> \
--quant
```
**Multi-modal LLaMA2**

```bash
# if NormBias
peft_config=""
# elif NormBiasLoRA
peft_config="configs/model/finetune/sg/llamaPeft_normBiasLora.json"

torchrun --nproc-per-node=1 demos/single_turn_mm.py \
--llama_type "llama_qformerv2_peft" \
--llama_config </path/to/params.json> $peft_config \
--tokenizer_path </path/to/tokenizer.model> \
--pretrained_path </path/to/multi/modal/llama> </path/to/trainable/params> \
--quant
```
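Note that `--pretrained_path` takes two paths: the frozen base checkpoint and the small trainable-parameter file produced by QPEFT. Conceptually (a sketch, not the repo's exact loading logic), the second checkpoint is overlaid on the first:

```python
import torch
import torch.nn as nn

def load_base_plus_trainable(model: nn.Module, base_path: str, trainable_path: str):
    # Load the frozen base weights first ...
    state = torch.load(base_path, map_location="cpu")
    # ... then overlay the trainable bias/norm/LoRA weights saved after QPEFT.
    state.update(torch.load(trainable_path, map_location="cpu"))
    # strict=False: the trainable checkpoint covers only a subset of keys.
    return model.load_state_dict(state, strict=False)
```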
Check `inference.md` for more details.