Prerequisites
To run our provided experiment scripts on your own machine, please first adjust the following configurations:
Modify the value of the `pretrained_path` variable in the `.sh` file. This variable should point to the directory containing the checkpoints to finetune from.
If you finetune from the official LLaMA / LLaMA2 checkpoints released by Meta, the directory should look like:
pretrained_path
├── consolidated.00.pth
├── consolidated.01.pth
└── ...
and you should set `pretrained_type=meta_ori` in the `.sh` file.
Alternatively, you may finetune from checkpoints saved by LLaMA2-Accessory. In that case, the directory should look like:
pretrained_path
├── consolidated.00-of-**.model.pth
├── consolidated.01-of-**.model.pth
└── ...
and you should set `pretrained_type=consolidated` in the `.sh` file.
Point `llama_config` in the `.sh` scripts to the model configuration files (`*.json`) that specify the model size (7B, 13B, …) and other settings (if any). See here to learn more.
Point `tokenizer_path` in the `.sh` file to the tokenizer. See more here.
Point the `data_config` argument in the `.sh` file to a `.yaml` file defining the collection of finetuning datasets, each of which is identified by a `.json` meta file.
Set the model parallel size appropriately. The model parallel size specifies how the parameters of each complete model are split and distributed across multiple GPUs. Meta's official releases use a fixed mapping: 7B corresponds to a model parallel size of 1, 13B to 2, and 70B to 8. This keeps the load on each GPU roughly constant as the total number of model parameters grows. Following this mapping is a good choice in most situations; if you are familiar with the subject, you can also deviate from it.
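The two checkpoint layouts and the model parallel sizes listed above can be sanity-checked before launching a script. The following is a minimal sketch, not part of LLaMA2-Accessory: the helper names `guess_pretrained_type` and `count_shards` are hypothetical, and the regular expressions assume that the `**` placeholder in the file names stands for a run of digits.

```python
import re
from typing import Iterable

# File-name patterns derived from the two directory layouts described above.
# Assumption: the "**" placeholder in "consolidated.00-of-**.model.pth" is digits.
META_ORI_PATTERN = re.compile(r"^consolidated\.\d+\.pth$")
CONSOLIDATED_PATTERN = re.compile(r"^consolidated\.\d+-of-\d+\.model\.pth$")

# Model parallel sizes of Meta's official releases, as quoted in the text.
META_MODEL_PARALLEL_SIZE = {"7B": 1, "13B": 2, "70B": 8}


def guess_pretrained_type(filenames: Iterable[str]) -> str:
    """Return 'meta_ori' or 'consolidated' based on checkpoint file names."""
    names = list(filenames)
    if any(CONSOLIDATED_PATTERN.match(n) for n in names):
        return "consolidated"
    if any(META_ORI_PATTERN.match(n) for n in names):
        return "meta_ori"
    raise ValueError("no recognizable checkpoint shards found")


def count_shards(filenames: Iterable[str]) -> int:
    """Count the files that look like checkpoint shards in either layout."""
    return sum(
        1
        for n in filenames
        if META_ORI_PATTERN.match(n) or CONSOLIDATED_PATTERN.match(n)
    )
```

A typical use would be to pass `os.listdir(pretrained_path)` to both helpers, set `pretrained_type` in the `.sh` file to the returned value, and compare the shard count with the model parallel size you intend to use. Whether a mismatch is an error depends on the checkpoint and on how the script loads it, so treat the comparison as a hint rather than a rule.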
Important
LLaMA2-Accessory supports both model parallelism (which, within the current scope of LLaMA2-Accessory, is equivalent to tensor parallelism) and Fully Sharded Data Parallel (FSDP). Both partition the model, but they are very different techniques and are orthogonal, i.e., they can be used simultaneously. A basic understanding of the two helps you make better use of LLaMA2-Accessory. This blog from Microsoft is an excellent learning resource.
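To make the orthogonality concrete, the sketch below estimates how many parameters each GPU stores when the two techniques are combined. It is an idealized illustration under stated assumptions, not LLaMA2-Accessory code: the function `params_per_gpu` is hypothetical, tensor parallelism is assumed to split every parameter evenly across the model parallel group, FSDP is assumed to fully shard the remainder across the data parallel group, and activations, optimizer state, and replicated parameters are ignored.

```python
def params_per_gpu(
    total_params: float,
    model_parallel_size: int,
    world_size: int,
    use_fsdp: bool,
) -> float:
    """Idealized count of parameters stored on each GPU.

    Assumes an even tensor-parallel split and, if use_fsdp is True,
    full sharding across the data parallel ranks.
    """
    if world_size % model_parallel_size != 0:
        raise ValueError("world_size must be a multiple of model_parallel_size")
    # GPUs that hold the same tensor-parallel slice form the data parallel group.
    data_parallel_size = world_size // model_parallel_size
    # Tensor (model) parallelism: each GPU holds one slice of every layer.
    per_gpu = total_params / model_parallel_size
    if use_fsdp:
        # FSDP: that slice is further sharded across the data parallel ranks.
        per_gpu /= data_parallel_size
    return per_gpu
```

For example, a 13B model on 8 GPUs with a model parallel size of 2 gives a data parallel size of 4: without FSDP each GPU stores about 6.5B parameters, and with FSDP about 1.6B. The two settings act on different axes of the GPU grid, which is why they can be enabled together.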