Multimodal evaluation#

Environment settings#

Before running Light-eval, make sure you have correctly installed and configured all required environments according to the instructions in the Installation Document.

LLaVA-benchmark#

Prerequisites#

dataset

├── data
│   └── LLaVA-benchmark
│       ├── images
│       │   ├── 001.jpg
│       │   ├── 002.jpg
│       │   └── ...
│       ├── answers_gpt4.jsonl
│       ├── context.jsonl
│       └── ...
└── ...

Store the images folder in data according to the file structure shown above.

The dataset is available at 🤗Hugging Face/liuhaotian.
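For example, one way to fetch the data is to clone it from Hugging Face and copy the files into place. The repository id liuhaotian/llava-bench-in-the-wild below is an assumption; substitute the dataset linked above if it differs.

# Clone the LLaVA-Bench dataset (assumed repo id) and arrange it under data/LLaVA-benchmark
git lfs install
git clone https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild
mkdir -p data/LLaVA-benchmark
cp -r llava-bench-in-the-wild/images data/LLaVA-benchmark/
cp llava-bench-in-the-wild/*.jsonl data/LLaVA-benchmark/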

Evaluating#

Please make sure the dataset is stored according to the storage structure described above.

Change the parameters in scripts/run_llavabenchmark.sh: model_name, pretrained_path, llama_config, tokenizer_path, openai_key, and mode (see the example sketch after the list below).

mode settings:

  • inference: Get model answers.

  • eval: Use GPT4 to score the model’s answers against the GPT4 answers.

  • show: Output the scored results.

  • all: Run inference, score the answers, and output the results.
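A minimal sketch of how the variables in scripts/run_llavabenchmark.sh might be filled in. All paths and the model name below are placeholders for illustration; use the values that match your own setup.

model_name=your_model_name                 # placeholder model identifier
pretrained_path=/path/to/checkpoint        # placeholder checkpoint directory
llama_config=/path/to/params.json          # placeholder model config
tokenizer_path=/path/to/tokenizer.model    # placeholder tokenizer
openai_key=sk-...                          # your OpenAI API key, needed for eval/show/all
mode=all                                   # one of: inference, eval, show, all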

After changing the parameters, you can use the following script to run the LLaVA-benchmark evaluation code for your model.

script

sh scripts/run_llavabenchmark.sh

MM-Vet benchmark#

Prerequisites#

dataset

├── data
│   └── MM-Vet
│       ├── images
│       │   ├── v1_0.png
│       │   ├── v1_2.png
│       │   └── ...
│       ├── mm-vet.json
│       └── bard_set.json
└── ...

Store the images folder in data according to the file structure shown above.

Download the MM-Vet data (yuweihao/mm-vet.zip) and unzip it so that the files match the structure described above.
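A possible way to fetch and unpack the data from the shell. The download URL below is an assumption based on the yuweihao/MM-Vet release page, and the folder name inside the zip may differ; adjust as needed so the final layout matches the tree above.

wget https://github.com/yuweihao/MM-Vet/releases/download/v1/mm-vet.zip   # URL is an assumption
unzip mm-vet.zip
mkdir -p data/MM-Vet
mv mm-vet/* data/MM-Vet/   # arrange to match the tree above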

Evaluating#

Please make sure the dataset is stored according to the storage structure described above.

Change the parameters in scripts/run_mmvet.sh: model_name, pretrained_path, llama_config, tokenizer_path, openai_key, use_sub_set, and mode (see the example sketch after the lists below).

mode settings:

  • inference: Get model answers.

  • eval: Use GPT4 to score the model’s answers against the GPT4 answers.

  • all: Run inference and output the results for the model.

use_sub_set:

  • True: use the subset for evaluation.

  • False: use the full dataset for evaluation.
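A minimal sketch of how the variables in scripts/run_mmvet.sh might be filled in. As before, all paths and the model name are placeholders for illustration.

model_name=your_model_name                 # placeholder model identifier
pretrained_path=/path/to/checkpoint        # placeholder checkpoint directory
llama_config=/path/to/params.json          # placeholder model config
tokenizer_path=/path/to/tokenizer.model    # placeholder tokenizer
openai_key=sk-...                          # your OpenAI API key, needed for GPT4 scoring
use_sub_set=False                          # True: subset, False: full dataset
mode=all                                   # one of: inference, eval, all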

After changing the parameters, you can use the following script to run the MM-Vet benchmark evaluation code for your model.

script

sh scripts/run_mmvet.sh