@@ -94,4 +94,53 @@ For all these evaluations, a higher score is a better score. We chose these benc
In case you have customized the Llama model, for example a quantized version that is loaded differently from a normal HF model, you can follow [this guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md#external-library-usage) to add your model to `eval.py` and run the eval benchmarks.
-You can also find full task list [here](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks).
+You can also find the full task list [here](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks).
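+
+If `eval.py` also forwards the standard `--tasks` flag from `lm-evaluation-harness` (an assumption; check the script's argument parser), any individual task from that list can be run directly, for example:
+
+```bash
+# run a single task from the harness task list instead of the leaderboard set
+python eval.py --model hf --model_args "pretrained=meta-llama/Llama-2-7b-chat-hf" --tasks hellaswag --limit 100 --output_path ./results.json
+```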
+
+
+
+#### Multi-GPU Evaluation with Hugging Face `accelerate`
+
+Hugging Face's [accelerate 🚀](https://github.com/huggingface/accelerate) library is supported by `lm-evaluation-harness` and can be used for multi-GPU evaluation.
+
+To perform *data-parallel evaluation* (where each GPU loads a **separate full copy** of the model), leverage the `accelerate` launcher as follows:
+
+
+```bash
+accelerate config
+
+accelerate launch eval.py --model hf --model_args "pretrained=meta-llama/Llama-2-7b-chat-hf" --limit 100 --open-llm-leaderboard-tasks --output_path ./results.json --log_samples
+```
+
+In case your model can fit on a single GPU, this allows you to evaluate on K GPUs K times faster than on one.
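+
+If you prefer to skip the interactive `accelerate config` step, the same data-parallel run can also be expressed with explicit launcher flags (a sketch assuming a single node with 4 GPUs; adjust `--num_processes` to match your GPU count):
+
+```bash
+# one full copy of the model per GPU, 4 processes total
+accelerate launch --multi_gpu --num_processes 4 eval.py --model hf --model_args "pretrained=meta-llama/Llama-2-7b-chat-hf" --limit 100 --open-llm-leaderboard-tasks --output_path ./results.json --log_samples
+```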
+
+**WARNING**: This setup does not work with FSDP model sharding, so in `accelerate config` FSDP must be disabled, or the NO_SHARD FSDP option must be used.
+
+In case your model is *too large to fit on a single GPU*, run the library *outside of the `accelerate` launcher*, but pass `parallelize=True` to `--model_args` as follows:
+
+```bash
+python eval.py --model hf --model_args "pretrained=meta-llama/Llama-2-7b-chat-hf,parallelize=True" --limit 100 --open-llm-leaderboard-tasks --output_path ./results.json --log_samples
+```
+
+
+This means that your model's weights will be split across all available GPUs.
+
+For more advanced users or even larger models, the following arguments are also allowed when `parallelize=True` (see the example after this list):
+- `device_map_option`: How to split model weights across available GPUs. Defaults to "auto".
+- `max_memory_per_gpu`: The max GPU memory to use per GPU when loading the model.
+- `max_cpu_memory`: The max amount of CPU memory to use when offloading the model weights to RAM.
+- `offload_folder`: A folder where model weights will be offloaded to disk if needed.
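+
+For example, a hypothetical invocation combining these arguments might look like the following (the memory limits and offload folder are placeholders, not recommendations; the memory values assume Hugging Face accelerate's "GiB"-style strings):
+
+```bash
+# shard the weights across GPUs, cap per-GPU and CPU memory, and spill to disk if needed
+python eval.py --model hf --model_args "pretrained=meta-llama/Llama-2-7b-chat-hf,parallelize=True,device_map_option=auto,max_memory_per_gpu=38GiB,max_cpu_memory=64GiB,offload_folder=./offload" --limit 100 --open-llm-leaderboard-tasks --output_path ./results.json --log_samples
+```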
+
+These two options (`accelerate launch` and `parallelize=True`) are mutually exclusive.
+
+### Tensor + Data Parallel and Optimized Inference with `vLLM`
+
+`lm-evaluation-harness` also supports vLLM for faster inference on [supported model types](https://docs.vllm.ai/en/latest/models/supported_models.html), especially when splitting a model across multiple GPUs. For single-GPU or multi-GPU inference (tensor parallel, data parallel, or a combination of both), you can run, for example:
+
+```bash
+python eval.py --model vllm --model_args "pretrained=meta-llama/Llama-2-7b-chat-hf,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8,data_parallel_size=2" --limit 10 --open-llm-leaderboard-tasks --output_path ./results.json --log_samples --batch_size auto
+```
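+
+To instead shard a single copy of the model across two GPUs with tensor parallelism, the same command can be adapted as follows (a sketch; only `tensor_parallel_size` and `data_parallel_size` change relative to the example above):
+
+```bash
+# one model replica split over 2 GPUs via tensor parallelism
+python eval.py --model vllm --model_args "pretrained=meta-llama/Llama-2-7b-chat-hf,tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.8,data_parallel_size=1" --limit 10 --open-llm-leaderboard-tasks --output_path ./results.json --log_samples --batch_size auto
+```
+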
+For a full list of supported vLLM configurations, please see [here](https://github.com/EleutherAI/lm-evaluation-harness/blob/076372ee9ee81e25c4e2061256400570354a8d1a/lm_eval/models/vllm_causallms.py#L44-L62).
+
+**Note from `lm-evaluation-harness`**: vLLM occasionally differs in output from Hugging Face. We treat Hugging Face as the reference implementation, and provide a [script](./scripts/model_comparator.py) for checking the validity of vLLM results against HF.