Finetuning with Quantization#

We support Quantized Parameter-Efficient Fine-Tuning (QPEFT) methods including QNormBias and QNormBiasLoRA, which significantly reduce compute and memory demands. In QPEFT, the base model is quantized and frozen, while only a small set of carefully selected parameters remains trainable.

  • QNormBias. Only the bias terms and normalization weights receive gradient updates. The pretrained LLaMA2 weights are quantized and frozen.

  • QNormBiasLoRA. The bias terms, LoRA weights, and normalization weights receive gradient updates. The pretrained LLaMA2 weights are quantized and frozen. A minimal sketch of this parameter selection follows the list.
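
The snippet below is only an illustrative sketch of this parameter selection in PyTorch, not the repository's actual implementation; it assumes that normalization, bias, and LoRA parameters can be recognized by the substrings "norm", "bias", and "lora" in their names.

```python
import torch.nn as nn

def mark_trainable(model: nn.Module, use_lora: bool = False) -> None:
    """Freeze everything, then re-enable gradients only for the QPEFT parameters.

    Illustrative sketch: assumes norm/bias/LoRA parameters are identifiable by
    the substrings "norm", "bias", and "lora" in their parameter names.
    """
    trainable_keys = ["norm", "bias"] + (["lora"] if use_lora else [])
    for name, param in model.named_parameters():
        # The quantized base weights stay frozen; only matching params train.
        param.requires_grad = any(key in name.lower() for key in trainable_keys)
```

With use_lora=False this corresponds to QNormBias; with use_lora=True, to QNormBiasLoRA.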

Best Practice#

# Enable quantization with "--quant" and save only the trainable parameters with "--only_save_trainable"
torchrun <--some_flags> main_finetune.py <--some_flags> \
--quant --only_save_trainable
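
Conceptually, "--only_save_trainable" keeps checkpoints small by storing only the parameters that receive gradient updates (norm/bias/LoRA under QPEFT). Below is a rough, hypothetical sketch of that idea, not the repository's actual checkpointing code (the output filename is made up):

```python
import torch

def save_trainable_only(model, path="trainable_params.pth"):
    # Collect only parameters with requires_grad=True and write them to disk.
    trainable_state = {
        name: param.detach().cpu()
        for name, param in model.named_parameters()
        if param.requires_grad
    }
    torch.save(trainable_state, path)
```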

For more details, please check the following scripts:

| Method | Finetune Language-only LLaMA 2 | Finetune Multi-Modal LLaMA 2 |
| --- | --- | --- |
| QNormBias | alpaca_llamaPeft_normBias_QF.sh | - |
| QNormBiasLoRA | alpaca_llamaPeft_normBiasLora_QF.sh | alpacaLlava_llamaQformerv2Peft_QF_13B.sh |

Comparison#

Models can be loaded in the 4-bit NormalFloat (NF4) data format, which benefits both inference and training and significantly reduces VRAM demands. To assess the impact, we ran experiments on A100-80GB GPUs and obtained the following results. The quantization is implemented with bitsandbytes. Check out the paper to learn more.
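
As an illustration of how a single layer can be put into NF4 with bitsandbytes, here is a minimal sketch (placeholder layer sizes, not the repository's model code):

```python
import torch
import bitsandbytes as bnb

# A 4-bit NF4 linear layer: the weight is quantized block-wise to NF4 when the
# module is moved to the GPU, while matmuls are computed in BF16.
# The 4096x4096 shape is a placeholder.
nf4_linear = bnb.nn.Linear4bit(
    4096, 4096,
    bias=False,
    compute_dtype=torch.bfloat16,
    quant_type="nf4",
).to("cuda")

x = torch.randn(1, 4096, dtype=torch.bfloat16, device="cuda")
y = nf4_linear(x)  # weights are dequantized on the fly during the forward pass
```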

  • Batch size = 1 for a fair comparison

| Model | Max Length | Task/Dataset | Precision | Batch Size | Single-GPU Memory (Inference) | Single-GPU Memory (Training) |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA2-70B | 512 | Language-only/Alpaca | BF16 | 1 | 145 GB | 165 GB (NormBias) |
| LLaMA2-70B | 512 | Language-only/Alpaca | NF4 | 1 | 36 GB | 46 GB (NormBias) |
| LLaMA2-13B Q-Former | 512 | Multi-modal/LLaVA-Instruct-150K | BF16 | 1 | 31 GB | 38 GB (NormBiasLoRA) |
| LLaMA2-13B Q-Former | 512 | Multi-modal/LLaVA-Instruct-150K | NF4 | 1 | 13 GB | 15 GB (NormBiasLoRA) |

  • GPU hours of finetuning

Note that we use 8x A100-80GB GPU cards for finetuning. GPU hours are computed as number_of_cards * total_training_time; for example, the BF16 LLaMA2-70B run takes 8 * 12.5h = 100 GPU hours.

| Model | Task / Dataset | Samples | Epochs | Precision | GPU Hours | 8x A100 Training Time |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA2-70B | Language-only/Alpaca | 52K | 4 | BF16 | 100h | 12.5h |
| LLaMA2-70B | Language-only/Alpaca | 52K | 4 | NF4 | 80h | 10h |
| LLaMA2-13B Q-Former | Multi-modal/LLaVA-Instruct-150K | 150K | 3 | BF16 | 170h | 20h |
| LLaMA2-13B Q-Former | Multi-modal/LLaVA-Instruct-150K | 150K | 3 | NF4 | 88h | 11h |

Inference#

When QPEFT finishes, the trainable weights are saved in outdir. Run inference with the following scripts:

  • Language-only LLaMA2

# if NormBias
peft_config=""
# elif NormBiasLora
peft_config="configs/model/finetune/sg/llamaPeft_normBiasLora.json"

torchrun --nproc-per-node=1  demos/single_turn.py \
--llama_type "llama_peft"
--llama_config </path/to/params.json> $peft_config \
--tokenizer_path </path/to/tokenizer.model> \
--pretrained_path </path/to/llama>  </path/to/trainable/params> \
--quant
  • Multi-modal LLaMA2

# if NormBias
peft_config=""
# elif NormBiasLora
peft_config="configs/model/finetune/sg/llamaPeft_normBiasLora.json"

torchrun --nproc-per-node=1  demos/single_turn_mm.py \
--llama_type "llama_qformerv2_peft"
--llama_config </path/to/params.json> $peft_config \
--tokenizer_path </path/to/tokenizer.model> \
--pretrained_path </path/to/multi/modal/llama>  </path/to/trainable/params> \
--quant

Check inference.md for more details.