# Inference Throughput Benchmarks

In this folder we provide a series of benchmark scripts that apply a throughput analysis to Llama 2 model inference on various backends:

* On-prem - Popular serving frameworks and containers (e.g. vLLM)
* [**WIP**] Cloud API - Popular API services (e.g. Azure Model-as-a-Service)
* [**WIP**] On-device - Popular on-device inference solutions on Android and iOS (e.g. mlc-llm, QNN)
* [**WIP**] Optimization - Popular optimization solutions for faster inference and quantization (e.g. AutoAWQ)

# Why

There are three major reasons we want to run these benchmarks and share them with our Llama community:

* Provide inference throughput analysis based on real-world scenarios, to help you better select which service or deployment works best for your use case
* Provide a baseline measurement for validating various optimization solutions on different backends, so we can offer guidance on which solution works best for your scenario
* Encourage the community to develop benchmarks on top of our work, so we can better quantify newly proposed solutions combined with currently popular frameworks, especially in this fast-moving area

# Parameters

Here are the parameters (if applicable) that you can configure for running the benchmark:

* **PROMPT** - Prompt sent in for inference (configure the length of the prompt, choosing from 5, 25, 50, 100, 500, 1k and 2k)
* **MAX_NEW_TOKEN** - Maximum number of tokens generated
* **CONCURRENT_LEVELS** - Maximum number of concurrent requests
* **MODEL_PATH** - Model source
* **MODEL_HEADERS** - Request headers
* **SAFE_CHECK** - Content safety check (either the Azure service or simulated latency)
* **THRESHOLD_TPS** - Threshold tokens per second below which we deem the query to be slow
* **TOKENIZER_PATH** - Tokenizer source
* **RANDOM_PROMPT_LENGTH** - Random prompt length (for pretrained models)
* **NUM_GPU** - Number of GPUs for dispatching requests among multiple containers
* **TEMPERATURE** - Temperature for inference
* **TOP_P** - Top_p for inference
* **MODEL_ENDPOINTS** - Container endpoints
* Model parallelism or model replicas

You can also configure other model hyperparameters as part of the request payload. All these parameters are stored in ```parameter.json``` and real prompts are stored in ```input.jsonl```. Running the script will load these configurations (see the loading sketch at the end of this README).

# Metrics

The benchmark reports these metrics per instance:

* Number of concurrent requests
* P50 latency (ms)
* P99 latency (ms)
* Requests per second (RPS)
* Output tokens per second
* Output tokens per second per GPU
* Input tokens per second
* Input tokens per second per GPU
* Average tokens per second per request

We intend to add these metrics in the future:

* Time to first token (TTFT)

The benchmark results are displayed in the terminal output and saved as a CSV file (```performance_metrics.csv```) that you can export to spreadsheets. A small sketch of how these metrics can be computed from raw measurements appears at the end of this README.

# Getting Started

Please follow the ```README.md``` in each subfolder for instructions on how to set up and run these benchmarks.
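As a rough illustration of how a benchmark script might read the configuration described under **Parameters**, here is a minimal sketch. The key names mirror the parameter names above, but the exact schema of ```parameter.json``` and ```input.jsonl``` may differ per subfolder, so treat this as an assumption rather than the actual loader.

```python
# Illustrative sketch of loading the benchmark configuration.
# The real parameter.json schema in each subfolder may differ from what is assumed here.
import json

with open("parameter.json") as f:
    params = json.load(f)

# Assumed keys, mirroring the parameter list above.
max_new_tokens = params["MAX_NEW_TOKEN"]
concurrent_levels = params["CONCURRENT_LEVELS"]   # e.g. [1, 2, 4, 8, ...]
model_endpoints = params["MODEL_ENDPOINTS"]
temperature = params["TEMPERATURE"]
top_p = params["TOP_P"]

# Real prompts are read from input.jsonl, assumed to be one JSON object per line.
with open("input.jsonl") as f:
    prompts = [json.loads(line) for line in f]

print(f"Loaded {len(prompts)} prompts; testing concurrency levels {concurrent_levels}")
```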
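And here is a minimal sketch of how the reported metrics could be derived from raw per-request measurements. The function and field names are illustrative assumptions, not the actual implementation used by the benchmark scripts.

```python
# Illustrative only: aggregating per-request measurements into the metrics listed
# under "Metrics". Names and structure are assumptions, not the benchmark's code.
import numpy as np

def summarize(latencies_ms, input_tokens, output_tokens, total_time_s, num_gpu):
    """Aggregate per-request measurements taken over one benchmark run."""
    latencies_ms = np.asarray(latencies_ms, dtype=float)
    total_output = sum(output_tokens)
    total_input = sum(input_tokens)
    return {
        "p50_latency_ms": float(np.percentile(latencies_ms, 50)),
        "p99_latency_ms": float(np.percentile(latencies_ms, 99)),
        "rps": len(latencies_ms) / total_time_s,
        "output_tokens_per_sec": total_output / total_time_s,
        "output_tokens_per_sec_per_gpu": total_output / total_time_s / num_gpu,
        "input_tokens_per_sec": total_input / total_time_s,
        "input_tokens_per_sec_per_gpu": total_input / total_time_s / num_gpu,
        # Per-request throughput: that request's output tokens divided by its latency.
        "avg_tokens_per_sec_per_request": float(
            np.mean([t / (l / 1000.0) for t, l in zip(output_tokens, latencies_ms)])
        ),
    }

# Example: 3 requests measured over a 2-second window on a single GPU.
print(summarize(
    latencies_ms=[820, 950, 1100],
    input_tokens=[100, 100, 100],
    output_tokens=[256, 256, 256],
    total_time_s=2.0,
    num_gpu=1,
))
```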