# Inference Throughput Benchmarks

In this folder we provide a series of benchmark scripts that apply a throughput analysis for Llama 2 model inference on various backends:
* On-prem - Popular serving frameworks and containers (e.g. vLLM)
* [**WIP**] Cloud API - Popular API services (e.g. Azure Model-as-a-Service)
* [**WIP**] On-device - Popular on-device inference solutions on Android and iOS (e.g. mlc-llm, QNN)
* [**WIP**] Optimization - Popular optimization solutions for faster inference and quantization (e.g. AutoAWQ)

# Why

There are three major reasons we want to run these benchmarks and share them with our Llama community:
* Provide inference throughput analysis based on real-world situations to help you select the best service or deployment for your scenario
* Provide a baseline measurement for validating various optimization solutions on different backends, so we can provide guidance on which solutions work best for your scenario
* Encourage the community to develop benchmarks on top of our work, so we can better quantify the latest proposed solutions combined with current popular frameworks, especially in this fast-moving area

# Parameters

Here are the parameters (if applicable) that you can configure for running the benchmark:
* **PROMPT** - Prompt sent in for inference (configure the prompt length; choose from 5, 25, 50, 100, 500, 1k and 2k)
* **MAX_NEW_TOKENS** - Max number of tokens generated
* **CONCURRENT_LEVELS** - Max number of concurrent requests
* **MODEL_PATH** - Model source
* **MODEL_HEADERS** - Request headers
* **TEMPERATURE** - Temperature for inference
* **TOP_P** - Top_p for inference
* **MODEL_ENDPOINTS** - Container endpoints
* Model parallelism or model replicas - Load one model across multiple GPUs, or run multiple model replicas on one instance. See the README files for the specific containers for more detail.

You can also configure other model hyperparameters as part of the request payload.
All these parameters are stored in ```parameter.json``` and real prompts are stored in ```input.jsonl```. Running the script will load these configurations.
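
As a rough illustration, here is a minimal sketch of how a benchmark script might load these configurations and assemble a request payload. The exact JSON keys and prompt fields below are assumptions made for the example, not the actual schema of ```parameter.json``` or ```input.jsonl```; check the files in each subfolder for the real layout.

```python
import json

# Load benchmark parameters (the key names below are illustrative assumptions).
with open("parameter.json") as f:
    params = json.load(f)

# Load real prompts; input.jsonl is assumed to hold one JSON object per line.
with open("input.jsonl") as f:
    prompts = [json.loads(line) for line in f]

# Assemble a request payload in the style of an OpenAI-compatible endpoint (e.g. vLLM).
payload = {
    "model": params["MODEL_PATH"],
    "prompt": prompts[0]["prompt"],      # assumed field name
    "max_tokens": params["MAX_NEW_TOKENS"],
    "temperature": params["TEMPERATURE"],
    "top_p": params["TOP_P"],
}
headers = params["MODEL_HEADERS"]
endpoint = params["MODEL_ENDPOINTS"][0]  # first of the configured container endpoints
```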

# Metrics

We intend to add these metrics in the future:
* Time to first token (TTFT)

The benchmark result will be displayed in the terminal output and saved as a CSV file (```performance_metrics.csv```) which you can export to spreadsheets.
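
To give a sense of how such numbers can be produced, here is a loose sketch of measuring throughput at increasing concurrency levels and writing the results to ```performance_metrics.csv```. The endpoint URL, payload, and CSV columns are placeholder assumptions for the example, not the columns emitted by the actual scripts.

```python
import csv
import time
from concurrent.futures import ThreadPoolExecutor

import requests


def send_request(endpoint, headers, payload):
    # Time a single inference request end to end.
    start = time.time()
    requests.post(endpoint, headers=headers, json=payload, timeout=600)
    return time.time() - start


def run_level(concurrency, endpoint, headers, payload):
    # Fire `concurrency` requests at once and measure overall wall-clock time.
    start = time.time()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(
            lambda _: send_request(endpoint, headers, payload), range(concurrency)))
    elapsed = time.time() - start
    return {
        "concurrent_requests": concurrency,
        "avg_latency_s": sum(latencies) / len(latencies),
        "requests_per_s": concurrency / elapsed,
    }


# Placeholder endpoint and payload; in the real scripts these come from parameter.json.
results = [run_level(c, "http://localhost:8000/v1/completions", {}, {"prompt": "Hello"})
           for c in (1, 2, 4, 8)]

with open("performance_metrics.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(results[0].keys()))
    writer.writeheader()
    writer.writerows(results)
```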

# Getting Started

Please follow the ```README.md``` in each subfolder for instructions on how to set up and run these benchmarks.