
Address comments

Chester Hu 1 year ago
commit 9f84f73420

+ 1 - 1
README.md

@@ -195,7 +195,7 @@ This folder contains a series of Llama2-powered apps:
 
 # Benchmarks
 This folder contains a series of benchmark scripts for Llama 2 models inference on various backends:
-1. On-perm - Popular serving framework and containers (i.e. vLLM)
+1. On-prem - Popular serving frameworks and containers (i.e. vLLM)
 2. (WIP) Cloud API - Popular API services (i.e. Azure Model-as-a-Service)
 3. (WIP) On-device - Popular on-device inference solutions on Android and iOS (i.e. mlc-llm, QNN)
 4. (WIP) Optimization - Popular optimization solutions for faster inference and quantization (i.e. AutoAWQ)

+ 6 - 6
benchmarks/inference_throughput/README.md

@@ -1,20 +1,20 @@
 # Inference Throughput Benchmarks
 In this folder we provide a series of benchmark scripts that apply a throughput analysis for Llama 2 models inference on various backends:
-* On-perm - Popular serving framework and containers (i.e. vLLM)
+* On-prem - Popular serving frameworks and containers (i.e. vLLM)
 * [**WIP**]Cloud API - Popular API services (i.e. Azure Model-as-a-Service)
 * [**WIP**]On-device - Popular on-device inference solutions on Android and iOS (i.e. mlc-llm, QNN)
 * [**WIP**]Optimization - Popular optimization solutions for faster inference and quantization (i.e. AutoAWQ)
 
 # Why
 There are three major reasons we want to run these benchmarks and share them with our Llama community:
-* Provide inference throughput analysis based on real world situation to help you better select which service or deployment works the best for your scenario
-* Provide a baseline measurement for validating various optimization solutions on different backends, so we can provide guidance on which solutions works the best for your scenario
+* Provide inference throughput analysis based on real-world situations to help you select the best service or deployment for your scenario
+* Provide a baseline measurement for validating various optimization solutions on different backends, so we can provide guidance on which solutions work best for your scenario
 * Encourage the community to develop benchmarks on top of our works, so we can better quantify the latest proposed solutions combined with current popular frameworks, especially in this crazy fast-moving area
 
 # Parameters
 Here are the parameters (if applicable) that you can configure for running the benchmark:
 * **PROMPT** - Prompt sent in for inference (configure the length of prompt, choose from 5, 25, 50, 100, 500, 1k and 2k)
-* **MAX_NEW_TOKEN** - Max token generated
+* **MAX_NEW_TOKENS** - Max number of tokens generated
 * **CONCURRENT_LEVELS** - Max number of concurrent requests
 * **MODEL_PATH** - Model source
 * **MODEL_HEADERS** - Request headers
@@ -26,7 +26,7 @@ Here are the parameters (if applicable) that you can configure for running the b
 * **TEMPERATURE** - Temperature for inference
 * **TOP_P** - Top_p for inference
 * **MODEL_ENDPOINTS** - Container endpoints
-* Model parallelism or model replicas
+* Model parallelism or model replicas - Load one model across multiple GPUs, or run multiple model replicas on one instance. More details can be found in the README files for the specific containers.
 
 You can also configure other model hyperparameters as part of the request payload.  
 All these parameters are stored in ```parameters.json``` and real prompts are stored in ```input.jsonl```. Running the script will load these configurations.
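
As a quick illustration of how the benchmark scripts consume these files, here is a minimal sketch. It assumes ```input.jsonl``` holds a JSON object keyed by prompt length (matching the ```prompt_data["1k"]``` lookup that appears in the scripts further down in this diff); the exact file layout may differ.
```
import json

# Load the benchmark configuration; key names follow the parameter list above.
with open("parameters.json") as f:
    params = json.load(f)

MAX_NEW_TOKENS = params["MAX_NEW_TOKENS"]        # max tokens generated per request
CONCURRENT_LEVELS = params["CONCURRENT_LEVELS"]  # e.g. [1, 2, 4, 8, ...]
MODEL_PATH = params["MODEL_PATH"]
MODEL_HEADERS = params["MODEL_HEADERS"]

# Load the real prompts. Assumption: one JSON object per line, keyed by prompt
# length ("5", "25", ..., "1k", "2k"), as suggested by prompt_data["1k"] in the scripts.
with open("input.jsonl") as f:
    prompt_data = json.loads(f.readline())

PROMPT = prompt_data["1k"]
```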
@@ -48,7 +48,7 @@ The benchmark will report these metrics per instance:
 We intend to add these metrics in the future:
 * Time to first token (TTFT)
   
-The benchmark result will be displayed in terminal output and saved as a CSV file (```performance_metrics.csv```) that you can export to spreadsheets.
+The benchmark results will be displayed in the terminal output and saved as a CSV file (```performance_metrics.csv```), which you can import into spreadsheets.
 
 # Getting Started
 Please follow the ```README.md``` in each subfolder for instructions on how to set up and run these benchmarks.
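
For a sense of how a sweep over ```CONCURRENT_LEVELS``` could produce ```performance_metrics.csv```, here is an illustrative sketch rather than the actual benchmark code: the request function is a stand-in and the column names are placeholders.
```
import csv
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Tuple

CONCURRENT_LEVELS = [1, 2, 4, 8]   # normally read from parameters.json
REQUESTS_PER_LEVEL = 32            # placeholder request count per level

def send_request() -> Tuple[int, float]:
    # Stand-in for a real inference request; pretend we got 100 tokens in ~50ms.
    start = time.time()
    time.sleep(0.05)
    return 100, time.time() - start

rows = []
for level in CONCURRENT_LEVELS:
    sweep_start = time.time()
    with ThreadPoolExecutor(max_workers=level) as pool:
        results = list(pool.map(lambda _: send_request(), range(REQUESTS_PER_LEVEL)))
    elapsed = time.time() - sweep_start
    total_tokens = sum(tokens for tokens, _ in results)
    rows.append({
        "concurrency": level,
        "requests_per_second": REQUESTS_PER_LEVEL / elapsed,
        "output_tokens_per_second": total_tokens / elapsed,
        "avg_latency_s": sum(lat for _, lat in results) / len(results),
    })

# Write the per-level metrics to the CSV mentioned above.
with open("performance_metrics.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
```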

+ 8 - 8
benchmarks/inference_throughput/on-perm/README.md

@@ -7,28 +7,28 @@ We support benchmark on these serving framework:
 
 # vLLM - Getting Started
 To get started, we first need to deploy containers on-prem as an API host. Follow the guidance [here](https://github.com/facebookresearch/llama-recipes/blob/main/demo_apps/llama-on-prem.md#setting-up-vllm-with-llama-2) to deploy vLLM on-prem.
-Note that depends on the number of GPUs and size of their VRAM you have on the instance or local machine. We suggest you prioritize deploying as many model replicas as possible to reach higher overall throughput and request-per-second (RPS), comparing to deploy one model container among multiple GPUs for model parallelism.  
-For example, we have an instance from Azure that has 8xA100 80G GPUs, and we want to deploy Llama 2 70B chat model. 70B chat model is around 130GB with FP16. So for deployment we can do:
-* 1x70B model parallel on 8 GPUs.
-* 2x70B models each use 4 GPUs.
-* 4x70B models each use 2 GPUs. (Preferred configuration for max overall throughput. Note that you will have 4 endpoints hosted on different ports and the benchmark script will route requests into each model equally)
+Note that in the common scenario where overall throughput is important, we suggest prioritizing deploying as many model replicas as possible to reach higher overall throughput and requests-per-second (RPS), compared to deploying one model container across multiple GPUs for model parallelism.  
+For example, suppose we have an instance from Azure that has 8xA100 80GB GPUs, and we want to deploy the Llama 2 70B chat model, which is around 130GB in FP16. For deployment we can do:
+* 1x70B model parallel on 8 GPUs, with each GPU holding around 16.25GB of model weights.
+* 2x70B models, each using 4 GPUs, with each GPU holding around 32.5GB of model weights.
+* 4x70B models, each using 2 GPUs, with each GPU holding around 65GB of model weights. (Preferred configuration for max overall throughput. Note that you will have 4 endpoints hosted on different ports and the benchmark script will route requests to each model equally; see the sketch after the 2x70B example below.)
 
 Here are examples for deploying 2x70B chat models over 8 GPUs with vLLM.
 ```
 CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server  --model meta-llama/Llama-2-70b-chat-hf --tensor-parallel-size 4 --disable-log-requests --port 8000 
 CUDA_VISIBLE_DEVICES=4,5,6,7 python -m vllm.entrypoints.openai.api_server  --model meta-llama/Llama-2-70b-chat-hf --tensor-parallel-size 4 --disable-log-requests --port 8001 
 ```
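
For the preferred 4x70B configuration, the same pattern extends to four 2-GPU replicas; a sketch, assuming ports 8000-8003 are free:
```
CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-70b-chat-hf --tensor-parallel-size 2 --disable-log-requests --port 8000
CUDA_VISIBLE_DEVICES=2,3 python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-70b-chat-hf --tensor-parallel-size 2 --disable-log-requests --port 8001
CUDA_VISIBLE_DEVICES=4,5 python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-70b-chat-hf --tensor-parallel-size 2 --disable-log-requests --port 8002
CUDA_VISIBLE_DEVICES=6,7 python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-70b-chat-hf --tensor-parallel-size 2 --disable-log-requests --port 8003
```
The benchmark would then list all four endpoints in ```MODEL_ENDPOINTS``` so that requests are spread across the replicas evenly.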
-Once you finished deployment, you can use the command below to run benchmark scripts in a separate terminal. 
+Once you have finished deployment, you can use the command below to run benchmark scripts in a separate terminal. 
 
 ```
 python chat_vllm_benchmark.py
 ```
-If you are going to use [Azure AI content check](https://azure.microsoft.com/en-us/products/ai-services/ai-content-safety), then you should install dependencies as below in your terminal:
+If you are going to use [Azure AI content check](https://azure.microsoft.com/en-us/products/ai-services/ai-content-safety), then you should install dependencies as shown below in your terminal:
 ```
 pip install azure-ai-contentsafety azure-core
 ```
 Besides chat models, we also provide benchmark scripts for running pretrained models for text generation tasks. To better simulate real traffic, we generate configurable random-token prompts as input. In this process, we select vocabulary entries that are longer than 2 tokens, so the generated words are closer to English rather than symbols.
-However, random token prompts can't be applied for chat model benchmarks, since the chat model was expecting a valid question. By feeding random prompts, chat models rarely provide answers that is meeting our ```MAX_NEW_TOKEN``` requirement. Defeating the purpose of running throughput benchmarks. Hence for chat models, the questions are copied over to form long inputs such as for 2k and 4k inputs.   
+However, random token prompts can't be applied to chat model benchmarks, since a chat model expects a valid question. When fed random prompts, chat models rarely provide answers that meet our ```MAX_NEW_TOKENS``` requirement, defeating the purpose of running throughput benchmarks. Hence, for chat models, the questions are copied over to form long inputs, such as the 2k and 4k inputs.  
 To run pretrained model benchmark, follow the command below.
 ```
 python pretrained_vllm_benchmark.py

+ 2 - 2
benchmarks/inference_throughput/on-perm/vllm/chat_vllm_benchmark.py

@@ -31,7 +31,7 @@ PROMPT = prompt_data["1k"]
 with open('parameters.json') as parameters:
     params = json.load(parameters)
 
-MAX_NEW_TOKEN = params["MAX_NEW_TOKEN"]
+MAX_NEW_TOKENS = params["MAX_NEW_TOKENS"]
 CONCURRENT_LEVELS = params["CONCURRENT_LEVELS"]
 # Replace with your own deployment
 MODEL_PATH = params["MODEL_PATH"]
@@ -108,7 +108,7 @@ def generate_text() -> Tuple[int, int]:
         "stream" : False,
         "temperature" : TEMPERATURE,
         "top_p" : TOP_P,
-        "max_tokens" : MAX_NEW_TOKEN
+        "max_tokens" : MAX_NEW_TOKENS
     }
 
     start_time = time.time()

+ 1 - 1
benchmarks/inference_throughput/on-perm/vllm/parameters.json

@@ -1,5 +1,5 @@
 {
-    "MAX_NEW_TOKEN" : 256,
+    "MAX_NEW_TOKENS" : 256,
     "CONCURRENT_LEVELS" : [1, 2, 4, 8, 16, 32, 64, 128, 256],
     "MODEL_PATH" : "meta-llama/Llama-2-7b-chat-hf",
     "MODEL_HEADERS" : {"Content-Type": "application/json"},

+ 2 - 2
benchmarks/inference_throughput/on-perm/vllm/pretrained_vllm_benchmark.py

@@ -28,7 +28,7 @@ with open('input.jsonl') as input:
 with open('parameters.json') as parameters:
     params = json.load(parameters)
 
-MAX_NEW_TOKEN = params["MAX_NEW_TOKEN"]
+MAX_NEW_TOKENS = params["MAX_NEW_TOKENS"]
 CONCURRENT_LEVELS = params["CONCURRENT_LEVELS"]
 # Replace with your own deployment
 MODEL_PATH = params["MODEL_PATH"]
@@ -121,7 +121,7 @@ def generate_text() -> Tuple[int, int]:
         "stream" : False,
         "temperature" : TEMPERATURE,
         "top_p" : TOP_P,
-        "max_tokens" : MAX_NEW_TOKEN
+        "max_tokens" : MAX_NEW_TOKENS
     }
 
     start_time = time.time()
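
For context on the payload above, here is a minimal sketch of what a single timed request against one of the vLLM OpenAI-compatible servers could look like. The endpoint path, prompt, and sampling values are assumptions; only the payload fields shown in this diff come from the actual scripts.
```
import json
import time

import requests

# Assumed endpoint of one vLLM OpenAI-compatible server (see the deployment commands above).
ENDPOINT = "http://localhost:8000/v1/completions"
HEADERS = {"Content-Type": "application/json"}

payload = {
    "model": "meta-llama/Llama-2-70b-chat-hf",
    "prompt": "Briefly explain what model parallelism is.",  # placeholder prompt
    "stream": False,
    "temperature": 0.6,   # placeholder TEMPERATURE
    "top_p": 0.9,         # placeholder TOP_P
    "max_tokens": 256,    # MAX_NEW_TOKENS
}

start_time = time.time()
response = requests.post(ENDPOINT, headers=HEADERS, data=json.dumps(payload))
latency = time.time() - start_time

body = response.json()
output_text = body["choices"][0]["text"]
# Non-streaming responses include token usage, which is handy for throughput math.
output_tokens = body["usage"]["completion_tokens"]
print(f"{output_tokens} tokens generated in {latency:.2f}s")
```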