
Address comments

Chester Hu committed 1 year ago
commit e80c2588a6

+ 6 - 6
benchmarks/inference_throughput/on-perm/README.md

@@ -7,11 +7,11 @@ We support benchmark on these serving framework:
 
 # vLLM - Getting Started
 To get started, we first need to deploy containers on-prem as an API host. Follow the guidance [here](https://github.com/facebookresearch/llama-recipes/blob/main/demo_apps/llama-on-prem.md#setting-up-vllm-with-llama-2) to deploy vLLM on-prem.
-Note that in common scenario which overall throughput is important, we suggest you prioritize deploying as many model replicas as possible to reach higher overall throughput and request-per-second (RPS), comparing to deploy one model container among multiple GPUs for model parallelism.  
-For example, we have an instance from Azure that has 8xA100 80G GPUs, and we want to deploy the Llama 2 70B chat model, which is around 130GB with FP16. So for deployment we can do:
-* 1x70B model parallel on 8 GPUs, each GPU RAM takes around 16.25GB for loading model weights.
-* 2x70B models each use 4 GPUs, each GPU RAM takes around 32.5GB for loading model weights.
-* 4x70B models each use 2 GPUs, each GPU RAM takes around 65GB for loading model weights. (Preferred configuration for max overall throughput. Note that you will have 4 endpoints hosted on different ports and the benchmark script will route requests into each model equally)
+Note that in the common scenario where overall throughput is what matters, we suggest prioritizing deploying as many model replicas as possible to reach higher overall throughput and requests-per-second (RPS), rather than deploying one model container across multiple GPUs for model parallelism. Additionally, when deploying multiple model replicas, a higher-level wrapper is needed to handle load balancing across them, which the benchmark scripts simulate here.  
+For example, suppose we have an Azure instance with 8xA100 80GB GPUs and want to deploy the Llama 2 70B chat model, which is around 140GB in FP16. For deployment we can do:
+* 1x70B model parallelized across 8 GPUs: each GPU holds around 17.5GB of model weights.
+* 2x70B models, each using 4 GPUs: each GPU holds around 35GB of model weights.
+* 4x70B models, each using 2 GPUs: each GPU holds around 70GB of model weights. (Preferred configuration for maximum overall throughput. Note that you will have 4 endpoints hosted on different ports, and the benchmark script will route requests to each model equally.)
 
 Here are examples for deploying 2x70B chat models over 8 GPUs with vLLM.
 ```
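# Illustrative sketch only (the exact commands are not shown in this diff hunk): one way to host
# two Llama 2 70B chat replicas is to pin each replica to 4 GPUs and a separate port via vLLM's
# OpenAI-compatible API server. The model name, ports, and flags below are assumptions.
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-70b-chat-hf --tensor-parallel-size 4 --port 8000 &
CUDA_VISIBLE_DEVICES=4,5,6,7 python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-70b-chat-hf --tensor-parallel-size 4 --port 8001 &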
@@ -29,7 +29,7 @@ If you are going to use [Azure AI content check](https://azure.microsoft.com/en-
 ```
 pip install azure-ai-contentsafety azure-core
 ```
-Besides chat models, we also provide benchmark scripts for running pretrained models for text generation tasks. To better simulate the real traffic, we generate configurable random token prompt as input. In this process, we select vocabulary that is longer than 2 tokens so the generated words are closer to the English, rather than symbols.
+Besides chat models, we also provide benchmark scripts for running pretrained models on text completion tasks. To better simulate real traffic, we generate configurable random-token prompts as input. In this process, we select vocabulary entries that are longer than 2 tokens so the generated words look closer to English rather than symbols.
 However, random-token prompts can't be applied to chat model benchmarks, since a chat model expects a valid question. When fed random prompts, chat models rarely produce answers that meet our ```MAX_NEW_TOKEN``` requirement, defeating the purpose of running throughput benchmarks. Hence, for chat models, the questions are duplicated to form long inputs, such as the 2k and 4k inputs.   
 To run the pretrained model benchmark, follow the command below.
 ```
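# Illustrative only (the exact command is not shown in this diff hunk); based on the script
# name further below, the benchmark is presumably launched directly, e.g.:
python pretrained_vllm_benchmark.py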

+ 6 - 1
benchmarks/inference_throughput/on-perm/vllm/chat_vllm_benchmark.py

@@ -26,6 +26,7 @@ with open('input.jsonl') as input:
     prompt_data = json.load(input)
 
 # Prompt data stored in json file. Choose from number of tokens - 5, 25, 50, 100, 500, 1k, 2k.
+# You can also add your own prompts to input.jsonl and select them here.
 PROMPT = prompt_data["1k"] 
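# Sketch of the assumed layout (not shown in this diff): the json.load above implies input.jsonl holds a
# single JSON object keyed by prompt size, e.g. {"5": "...", "25": "...", "1k": "...", "2k": "..."},
# so a custom prompt can be added as another key and selected via prompt_data["your_key"].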
 
 with open('parameters.json') as parameters:
@@ -43,7 +44,7 @@ THRESHOLD_TPS = params["THRESHOLD_TPS"]
 TOKENIZER_PATH = params["TOKENIZER_PATH"] 
 TEMPERATURE = params["TEMPERATURE"]
 TOP_P = params["TOP_P"]
-# Add your model endpoints here, specify the port number. 
+# Add your model endpoints here and specify the port numbers. You can obtain the endpoints when creating an on-prem server such as vLLM.
 # Group of model endpoints - Send balanced requests to each endpoint for batch maximization.  
 MODEL_ENDPOINTS = params["MODEL_ENDPOINTS"]
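For reference on the load balancing that the README says is simulated in the benchmark scripts: with several replicas listening on different ports, it is enough to spread requests evenly over MODEL_ENDPOINTS. A minimal round-robin sketch (the endpoint URLs are placeholders, not values from this commit, and the script's own routing may differ):
```
import itertools

# Placeholder replica endpoints; the benchmark reads these from parameters.json as MODEL_ENDPOINTS.
MODEL_ENDPOINTS = [
    "http://localhost:8000/v1/completions",
    "http://localhost:8001/v1/completions",
]

_endpoint_ids = itertools.cycle(range(len(MODEL_ENDPOINTS)))

def next_endpoint() -> str:
    """Pick replicas in round-robin order so each endpoint gets an equal share of requests."""
    return MODEL_ENDPOINTS[next(_endpoint_ids)]
```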
 
@@ -114,6 +115,8 @@ def generate_text() -> Tuple[int, int]:
     start_time = time.time()
 
     if(SAFE_CHECK):
+        # Send the prompt for a safety check. The request round-trip adds delay that counts towards the overall throughput measurement.
+        # This function is expected to return nothing. If you want to inspect the safety check results, print them within the function itself.
         analyze_prompt(PROMPT)
         # Or add delay simulation as below for real world situation
         # time.sleep(random.uniform(0.3, 0.4))
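The comments above describe analyze_prompt only by its side effects. A minimal sketch of what such a helper could look like with the Azure AI Content Safety SDK that the README installs (azure-ai-contentsafety); the endpoint and key values are assumptions, and results are printed rather than returned:
```
from azure.core.credentials import AzureKeyCredential
from azure.ai.contentsafety import ContentSafetyClient
from azure.ai.contentsafety.models import AnalyzeTextOptions

# Assumed configuration; in practice these come from your Azure Content Safety resource.
CONTENT_SAFETY_ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com/"
CONTENT_SAFETY_KEY = "<your-key>"

def analyze_prompt(text: str) -> None:
    """Send the text for a safety analysis and print the result; nothing is returned."""
    client = ContentSafetyClient(CONTENT_SAFETY_ENDPOINT, AzureKeyCredential(CONTENT_SAFETY_KEY))
    response = client.analyze_text(AnalyzeTextOptions(text=text))
    print(response)
```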
@@ -133,6 +136,8 @@ def generate_text() -> Tuple[int, int]:
     response = requests.post(MODEL_ENDPOINTS[endpoint_id], headers=headers, json=payload)
 
     if(SAFE_CHECK):
+        # Send the prompt for a safety check. The request round-trip adds delay that counts towards the overall throughput measurement.
+        # This function is expected to return nothing. If you want to inspect the safety check results, print them within the function itself.
         analyze_prompt(PROMPT)
         # Or add delay simulation as below for real world situation
         # time.sleep(random.uniform(0.3, 0.4))

+ 5 - 1
benchmarks/inference_throughput/on-perm/vllm/pretrained_vllm_benchmark.py

@@ -41,7 +41,7 @@ TOKENIZER_PATH = params["TOKENIZER_PATH"]
 RANDOM_PROMPT_LENGTH = params["RANDOM_PROMPT_LENGTH"]
 TEMPERATURE = params["TEMPERATURE"]
 TOP_P = params["TOP_P"]
-# Add your model endpoints here, specify the port number. 
+# Add your model endpoints here and specify the port numbers. You can obtain the endpoints when creating an on-prem server such as vLLM.
 # Group of model endpoints - Send balanced requests to each endpoint for batch maximization.  
 MODEL_ENDPOINTS = params["MODEL_ENDPOINTS"]
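Relating this section to the README's random-token prompts: TOKENIZER_PATH and RANDOM_PROMPT_LENGTH are the relevant knobs. A rough sketch of one way to build such a prompt (not the script's actual implementation; the filtering rule below is an assumption based on the README wording):
```
import random

from transformers import AutoTokenizer

def build_random_prompt(tokenizer_path: str, num_pieces: int) -> str:
    """Assemble a pseudo-English prompt by sampling longer pieces from the tokenizer vocabulary."""
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
    # Skip very short vocabulary pieces so the prompt reads as words rather than bare symbols.
    candidates = [piece for piece in tokenizer.get_vocab() if len(piece) > 2]
    return " ".join(random.choice(candidates) for _ in range(num_pieces))
```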
 
@@ -127,6 +127,8 @@ def generate_text() -> Tuple[int, int]:
     start_time = time.time()
 
     if(SAFE_CHECK):
+        # Send the prompt for a safety check. The request round-trip adds delay that counts towards the overall throughput measurement.
+        # This function is expected to return nothing. If you want to inspect the safety check results, print them within the function itself.
         analyze_prompt(PROMPT)
         # Or add delay simulation as below for real world situation
         # time.sleep(random.uniform(0.3, 0.4))
@@ -144,6 +146,8 @@ def generate_text() -> Tuple[int, int]:
     response = requests.post(MODEL_ENDPOINTS[endpoint_id], headers=headers, json=payload)
 
     if(SAFE_CHECK):
+        # Send the prompt for a safety check. The request round-trip adds delay that counts towards the overall throughput measurement.
+        # This function is expected to return nothing. If you want to inspect the safety check results, print them within the function itself.
         analyze_prompt(PROMPT)
         # Or add delay simulation as below for real world situation
         # time.sleep(random.uniform(0.3, 0.4))