@@ -9,7 +9,7 @@ In this folder we provide a series of benchmark scripts that apply a throughput

There are three major reasons we want to run these benchmarks and share them with our Llama community:

* Provide inference throughput analysis based on real-world situations to help you better select which service or deployment works best for your scenario

* Provide a baseline measurement for validating various optimization solutions on different backends, so we can provide guidance on which solution works best for your scenario

-* Encourge the community to develop benchmarks on top of our works, so we can better quantify the latest proposed solutions combined with current popular frameworks, especially in this crazy fast-moving area
+* Encourage the community to develop benchmarks on top of our work, so we can better quantify the latest proposed solutions combined with current popular frameworks, especially in this crazy fast-moving area

# Parameters

Here are the parameters (if applicable) that you can configure for running the benchmark:

@@ -22,7 +22,7 @@ Here are the parameters (if applicable) that you can configure for running the b

* **THRESHOLD_TPS** - Threshold TPS (threshold for tokens per second below which we deem the query to be slow)

* **TOKENIZER_PATH** - Tokenizer source

* **RANDOM_PROMPT_LENGTH** - Random prompt length (for pretrained models)

-* **NUM_GPU** - Number of GPUs for request dispatch among muiltiple containers
+* **NUM_GPU** - Number of GPUs for request dispatch among multiple containers

* **TEMPERATURE** - Temperature for inference

* **TOP_P** - Top_p for inference

* **MODEL_ENDPOINTS** - Container endpoints
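
For concreteness, here is a minimal sketch of how these parameters might fit together in a benchmark script. It is an illustration only: the values, the OpenAI-style request/response schema, the endpoint URLs, and the `query_endpoint` helper are assumptions made for this sketch, not the benchmark's actual code.

```python
import time

import requests  # assumed HTTP client for hitting the model containers

# Illustrative values only -- tune these for your own deployment.
THRESHOLD_TPS = 7             # tokens/sec below this marks a query as slow
TOKENIZER_PATH = "tokenizer"  # hypothetical local tokenizer source
RANDOM_PROMPT_LENGTH = 1000   # random prompt length for pretrained models
NUM_GPU = 8                   # requests are dispatched across this many containers
TEMPERATURE = 0.6
TOP_P = 0.9
# One endpoint per container; ports are made up for the example.
MODEL_ENDPOINTS = [f"http://localhost:{8000 + i}/v1/completions" for i in range(NUM_GPU)]


def query_endpoint(endpoint: str, prompt: str, max_tokens: int = 512) -> dict:
    """Send one request and report latency, tokens/sec, and whether it was slow."""
    start = time.time()
    resp = requests.post(endpoint, json={
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": TEMPERATURE,
        "top_p": TOP_P,
    })
    latency = time.time() - start
    # Crude token count by whitespace split; a real script would use the
    # tokenizer loaded from TOKENIZER_PATH for an exact count.
    num_tokens = len(resp.json()["choices"][0]["text"].split())
    tps = num_tokens / latency
    return {"latency": latency, "tps": tps, "slow": tps < THRESHOLD_TPS}
```

A driver would then generate prompts of RANDOM_PROMPT_LENGTH tokens and dispatch concurrent requests round-robin across MODEL_ENDPOINTS, aggregating the per-query results into overall throughput and slow-query statistics.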