@@ -1,12 +1,12 @@
-# Llama-On-Perm-Benchmark
-This folder contains code to run inference benchmark for Llama 2 models on-perm with popular serving frameworks.
+# Llama-On-Prem-Benchmark
+This folder contains code to run an inference benchmark for Llama 2 models on-prem with popular serving frameworks.
The benchmark will focus on overall inference **throughput** for running containers on one instance (single or multiple GPUs) that you can acquire from cloud service providers such as Azure and AWS. You can also run this benchmark on a local laptop or desktop.
We support benchmarking on these serving frameworks:
* [vLLM](https://github.com/vllm-project/vllm)
# vLLM - Getting Started
-To get started, we first need to deploy containers on-perm as a API host. Follow the guidance [here](https://github.com/facebookresearch/llama-recipes/blob/main/demo_apps/llama-on-prem.md#setting-up-vllm-with-llama-2) to deploy vLLM on-perm.
+To get started, we first need to deploy containers on-prem as an API host. Follow the guidance [here](https://github.com/facebookresearch/llama-recipes/blob/main/demo_apps/llama-on-prem.md#setting-up-vllm-with-llama-2) to deploy vLLM on-prem.
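Once the vLLM container is up, you can sanity-check the deployment by sending a request to its OpenAI-compatible completions endpoint. The snippet below is a minimal sketch, assuming the server listens on `localhost:8000` and was launched with the Llama 2 70B chat weights; adjust `API_URL` and `MODEL` to match your own deployment.

```python
import requests

# Assumed address of the vLLM OpenAI-compatible API server; change to match your deployment.
API_URL = "http://localhost:8000/v1/completions"
# Assumed model name, i.e. the value passed to vLLM's --model flag at launch time.
MODEL = "meta-llama/Llama-2-70b-chat-hf"

payload = {
    "model": MODEL,
    "prompt": "Briefly explain what an inference benchmark measures.",
    "max_tokens": 128,
    "temperature": 0.0,
}

response = requests.post(API_URL, json=payload, timeout=120)
response.raise_for_status()
print(response.json()["choices"][0]["text"])
```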
Note that in the common scenario where overall throughput is what matters, we suggest you prioritize deploying as many model replicas as possible to reach a higher overall throughput and requests-per-second (RPS), rather than deploying one model container across multiple GPUs for model parallelism. Additionally, when deploying multiple model replicas, a higher-level wrapper is needed to handle load balancing; this wrapper has been simulated in the benchmark scripts.
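The benchmark scripts take care of this load balancing themselves; purely as an illustration of what such a wrapper does, the sketch below distributes requests across replica endpoints in round-robin order. The endpoint list and the `send_request` helper are hypothetical placeholders, not the actual benchmark code.

```python
import itertools

import requests

# Hypothetical list of model replica endpoints, one vLLM container per GPU (or GPU group).
REPLICA_URLS = [
    "http://localhost:8000/v1/completions",
    "http://localhost:8001/v1/completions",
]

# Round-robin iterator that cycles through the replicas to spread the load.
_replicas = itertools.cycle(REPLICA_URLS)

def send_request(payload: dict) -> dict:
    """Dispatch one completion request to the next replica in round-robin order."""
    url = next(_replicas)
    response = requests.post(url, json=payload, timeout=120)
    response.raise_for_status()
    return response.json()
```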
For example, suppose we have an instance from Azure with 8x A100 80GB GPUs, and we want to deploy the Llama 2 70B chat model, which is around 140GB in FP16. For deployment we can do:
* 1x 70B model, model-parallel across 8 GPUs; each GPU uses around 17.5GB of RAM to load the model weights.