Finetuning with Quantization#

We support Quantized Parameter-Efficient Fine-Tuning (QPEFT) methods including QNormBias and QNormBiasLoRA, which significantly reduce compute and memory demands. In QPEFT, the base model is quantized and frozen, while only a small set of carefully selected parameters remains trainable.

  • QNormBias. Only the bias terms and normalization weights receive gradient updates. The pretrained LLaMA2 weights are quantized and frozen.

  • QNormBiasLoRA. The bias terms, LoRA weights, and normalization weights receive gradient updates. The pretrained LLaMA2 weights are quantized and frozen. A minimal sketch of this parameter selection follows the list.
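
The snippet below is only an illustrative sketch of this parameter selection in PyTorch, not the repository's actual implementation; it assumes that normalization, bias, and LoRA parameters can be recognized by the substrings "norm", "bias", and "lora" in their names.

```python
import torch.nn as nn

def mark_trainable(model: nn.Module, use_lora: bool = False) -> None:
    """Freeze everything, then re-enable gradients only for the QPEFT parameters.

    Illustrative sketch: assumes norm/bias/LoRA parameters are identifiable by
    the substrings "norm", "bias", and "lora" in their parameter names.
    """
    trainable_keys = ["norm", "bias"] + (["lora"] if use_lora else [])
    for name, param in model.named_parameters():
        # The quantized base weights stay frozen; only matching params train.
        param.requires_grad = any(key in name.lower() for key in trainable_keys)
```

With use_lora=False this corresponds to QNormBias; with use_lora=True, to QNormBiasLoRA.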

Best Practice#

# Enable quantization with "--quant" and save only the trainable parameters with "--only_save_trainable"
torchrun <--some_flags> main_finetune.py <--some_flags> \
--quant --only_save_trainable
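
Conceptually, "--only_save_trainable" keeps checkpoints small by storing only the parameters that receive gradient updates (norm/bias/LoRA under QPEFT). Below is a rough, hypothetical sketch of that idea, not the repository's actual checkpointing code (the output filename is made up):

```python
import torch

def save_trainable_only(model, path="trainable_params.pth"):
    # Collect only parameters with requires_grad=True and write them to disk.
    trainable_state = {
        name: param.detach().cpu()
        for name, param in model.named_parameters()
        if param.requires_grad
    }
    torch.save(trainable_state, path)
```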

For more details, please check the following scripts:

| Method | Finetune Language-only LLaMA 2 | Finetune Multi-Modal LLaMA 2 |
| --- | --- | --- |
| QNormBias | alpaca_llamaPeft_normBias_QF.sh | - |
| QNormBiasLoRA | alpaca_llamaPeft_normBiasLora_QF.sh | alpacaLlava_llamaQformerv2Peft_QF_13B.sh |

Comparison#

Models can be loaded in the 4-bit NormalFloat (NF4) data format, which benefits both inference and training and significantly reduces VRAM demands. To assess the impact, we ran experiments on A100-80GB GPUs and obtained the following results. The quantization is implemented with bitsandbytes. Check out the paper to learn more.
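
As an illustration of how a single layer can be put into NF4 with bitsandbytes, here is a minimal sketch (placeholder layer sizes, not the repository's model code):

```python
import torch
import bitsandbytes as bnb

# A 4-bit NF4 linear layer: the weight is quantized block-wise to NF4 when the
# module is moved to the GPU, while matmuls are computed in BF16.
# The 4096x4096 shape is a placeholder.
nf4_linear = bnb.nn.Linear4bit(
    4096, 4096,
    bias=False,
    compute_dtype=torch.bfloat16,
    quant_type="nf4",
).to("cuda")

x = torch.randn(1, 4096, dtype=torch.bfloat16, device="cuda")
y = nf4_linear(x)  # weights are dequantized on the fly during the forward pass
```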

  • Batch size = 1 for a fair comparison

| Model | Max Length | Task/Dataset | Precision | Batch Size | Single-GPU Memory (Inference) | Single-GPU Memory (Training) |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA2-70B | 512 | Language-only/Alpaca | BF16 | 1 | 145 GB | 165 GB (NormBias) |
| LLaMA2-70B | 512 | Language-only/Alpaca | NF4 | 1 | 36 GB | 46 GB (NormBias) |
| LLaMA2-13B Q-Former | 512 | Multi-modal/LLaVA-Instruct-150K | BF16 | 1 | 31 GB | 38 GB (NormBiasLoRA) |
| LLaMA2-13B Q-Former | 512 | Multi-modal/LLaVA-Instruct-150K | NF4 | 1 | 13 GB | 15 GB (NormBiasLoRA) |

  • GPU hours of finetuning

Note that we use 8x A100-80GB GPU cards for finetuning. GPU hours are computed as number_of_cards * total_training_time; for example, the BF16 LLaMA2-70B run takes 8 * 12.5h = 100 GPU hours.

| Model | Task / Dataset | Samples | Epochs | Precision | GPU Hours | 8x A100 Training Time |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA2-70B | Language-only/Alpaca | 52K | 4 | BF16 | 100h | 12.5h |
| LLaMA2-70B | Language-only/Alpaca | 52K | 4 | NF4 | 80h | 10h |
| LLaMA2-13B Q-Former | Multi-modal/LLaVA-Instruct-150K | 150K | 3 | BF16 | 170h | 20h |
| LLaMA2-13B Q-Former | Multi-modal/LLaVA-Instruct-150K | 150K | 3 | NF4 | 88h | 11h |

Inference#

When QPEFT finishes, the trainable weights are saved in outdir. Run inference with the following scripts:

  • Language-only LLaMA2

# if NormBias
peft_config=""
# elif NormBiasLora
peft_config="configs/model/finetune/sg/llamaPeft_normBiasLora.json"

torchrun --nproc-per-node=1  demos/single_turn.py \
--llama_type "llama_peft"
--llama_config </path/to/params.json> $peft_config \
--tokenizer_path </path/to/tokenizer.model> \
--pretrained_path </path/to/llama>  </path/to/trainable/params> \
--quant
  • Multi-modal LLaMA2

# if NormBias
peft_config=""
# elif NormBiasLora
peft_config="configs/model/finetune/sg/llamaPeft_normBiasLora.json"

torchrun --nproc-per-node=1  demos/single_turn_mm.py \
--llama_type "llama_qformerv2_peft"
--llama_config </path/to/params.json> $peft_config \
--tokenizer_path </path/to/tokenizer.model> \
--pretrained_path </path/to/multi/modal/llama>  </path/to/trainable/params> \
--quant

Check inference.md for more details.