Prerequisites

To run our provided experiment scripts on your own machine, please first adjust the following configurations:

  • Modify the value of the pretrained_path variable in the .sh file. This variable should point to the directory containing the checkpoints to finetune from. A consolidated sketch of these script variables is given after this list.

  • If you finetune from the official LLaMA / LLaMA2 checkpoints released by Meta, the directory should look like:

    pretrained_path
    ├── consolidated.00.pth
    ├── consolidated.01.pth
    └── ...
    

    and you should set pretrained_type=meta_ori in the .sh file.

    Alternatively, you may also finetune from checkpoints saved by LLaMA2-Accessory. In such cases, the directory should look like:

    pretrained_path
    ├── consolidated.00-of-**.model.pth
    ├── consolidated.01-of-**.model.pth
    └── ...
    

    and you should set pretrained_type=consolidated in the .sh file.

  • Point llama_config in the .sh file to the model configuration files (*.json) that specify the model size (7B, 13B, …) and other settings (if any). See here for more details.

  • Point tokenizer_path in the .sh file to the tokenizer. See here for more details.

  • Point the data_config argument in the .sh file to a .yaml file defining the collection of finetuning datasets, each of which is identified by a .json meta file. A minimal sketch of such a file is also given after this list.

  • Set the model parallel size appropriately. The model parallel size specifies how the parameters of each complete model are split and distributed across multiple GPUs. Meta provides a recommended mapping: the 7B model corresponds to a model parallel size of 1, 13B to 2, and 70B to 8. This keeps the load on each GPU roughly constant as the total number of model parameters increases. Following this guideline is generally a good choice; however, if you are very familiar with the subject, you can also try to break this binding.
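
For reference, the relevant variables in a finetuning .sh script are sketched below. The names pretrained_path, pretrained_type, llama_config, tokenizer_path, and data_config come from the items above; the paths are placeholders, and the model_parallel name used for the model parallel size is an assumption, so check the provided scripts for the exact name.

    #!/bin/bash
    # Minimal sketch of the configuration variables -- adapt the paths to your setup.

    # Directory containing the checkpoints to finetune from.
    pretrained_path=/path/to/llama2-7b
    # 'meta_ori' for official Meta checkpoints, 'consolidated' for checkpoints
    # saved by LLaMA2-Accessory.
    pretrained_type=meta_ori

    # Model configuration file (*.json) and tokenizer.
    llama_config=/path/to/params.json
    tokenizer_path=/path/to/tokenizer.model

    # .yaml file defining the collection of finetuning datasets.
    data_config=/path/to/data_config.yaml

    # Model parallel size: 1 for 7B, 2 for 13B, 8 for 70B (Meta's recommendation).
    # NOTE: the exact variable name may differ in the provided scripts.
    model_parallel=1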
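
The data_config file itself is a short .yaml listing the datasets. A minimal sketch is shown below, written as a shell heredoc so it can be pasted into a script; the META / path / type keys are assumptions modeled on the repository's example configs, so verify them against the actual examples shipped with LLaMA2-Accessory.

    # Hypothetical data_config.yaml -- the META / path / type keys are an
    # assumption; double-check against the example configs in the repository.
    cat > /path/to/data_config.yaml <<'EOF'
    META:
      - path: /path/to/alpaca.json   # .json meta file identifying one dataset
        type: alpaca                 # dataset format tag understood by the loader
    EOF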

Important

LLaMA2-Accessory supports both model parallelism (which, within the current scope of LLaMA2-Accessory, is equivalent to tensor parallelism) and Fully Sharded Data Parallel (FSDP). Both involve partitioning the model, but it is important to note that they are very different and orthogonal techniques, i.e., they can be used simultaneously. A basic understanding of both is very helpful for making better use of LLaMA2-Accessory; this blog from Microsoft is an excellent learning resource.
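
As a rough illustration of how the two compose (purely for explanation; the variable names below are not taken from the scripts): each model replica spans a number of GPUs equal to the model parallel size, and FSDP-style data parallelism then operates across the resulting replicas.

    # Illustration only: how model parallelism and FSDP-style data parallelism compose.
    GPUS_TOTAL=8            # total GPUs in the job
    MODEL_PARALLEL_SIZE=2   # e.g. the recommended value for a 13B model
    # Each model replica occupies MODEL_PARALLEL_SIZE GPUs; FSDP then shards
    # parameters, gradients, and optimizer state across the replicas.
    DATA_PARALLEL_SIZE=$((GPUS_TOTAL / MODEL_PARALLEL_SIZE))
    echo "${MODEL_PARALLEL_SIZE}-way model parallel x ${DATA_PARALLEL_SIZE}-way data parallel on ${GPUS_TOTAL} GPUs"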