# Benchmark Llama models on AWS

The [`FMBench`](https://github.com/aws-samples/foundation-model-benchmarking-tool/tree/main) tool provides a quick and easy way to benchmark the Llama family of models for price and performance on any AWS service, including [`Amazon SageMaker`](https://aws.amazon.com/solutions/guidance/generative-ai-deployments-using-amazon-sagemaker-jumpstart/), [`Amazon Bedrock`](https://aws.amazon.com/bedrock/), or `Amazon EKS` and `Amazon EC2` as a `Bring your own endpoint` option.

## The need for benchmarking

Customers often ask which AWS service is best for running Llama models for _my specific use case_ and _my specific price performance requirements_. While model evaluation metrics are available on several leaderboards ([`HELM`](https://crfm.stanford.edu/helm/lite/latest/#/leaderboard), [`LMSys`](https://chat.lmsys.org/?leaderboard)), price performance comparisons are notoriously hard to find and even harder to trust. In such a scenario, it is best to run performance benchmarking yourself, either on your own dataset or on an open-source dataset that is similar in terms of task and prompt size, such as [`LongBench`](https://huggingface.co/datasets/THUDM/LongBench) or [`QMSum`](https://paperswithcode.com/dataset/qmsum). This is the problem that [`FMBench`](https://github.com/aws-samples/foundation-model-benchmarking-tool/tree/main) solves.

## [`FMBench`](https://github.com/aws-samples/foundation-model-benchmarking-tool/tree/main): an open-source Python package for FM benchmarking on AWS

`FMBench` runs inference requests against endpoints that are either deployed through `FMBench` itself (as in the case of SageMaker), available as a fully-managed endpoint (as in the case of Bedrock), or brought in as your own endpoint. Metrics such as inference latency, transactions per minute, error rate and cost per transaction are captured and presented in the form of a Markdown report containing explanatory text, tables and figures. The figures and tables in the report provide insights into what might be the best serving stack (instance type, inference container and configuration parameters) for a given Llama model and a given use case.

The following figure gives an example of the price performance numbers, including inference latency, transactions per minute and concurrency level, for running the `Llama2-13b` model on different instance types available on SageMaker, using prompts for a Q&A task created from the [`LongBench`](https://huggingface.co/datasets/THUDM/LongBench) dataset; these prompts are between 3000 and 3840 tokens in length. **_Note that the numbers are hidden in this figure but you would be able to see them when you run `FMBench` yourself_**.

![`Llama2-13b` on different instance types](./img/instances.png)

The following table (also included in the report) provides information about the best available instance type for that experiment<sup>1</sup>.

|Information |Value |
|--- |--- |
|experiment_name |llama2-13b-inf2.24xlarge |
|payload_file |payload_en_3000-3840.jsonl |
|instance_type |ml.inf2.24xlarge |
|concurrency |** |
|error_rate |** |
|prompt_token_count_mean |3394 |
|prompt_token_throughput |2400 |
|completion_token_count_mean |31 |
|completion_token_throughput |15 |
|latency_mean |** |
|latency_p50 |** |
|latency_p95 |** |
|latency_p99 |** |
|transactions_per_minute |** |
|price_per_txn |** |

<sup>1</sup> `**` represents values hidden on purpose; these are available when you run the tool yourself.
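To see how a metric like `price_per_txn` relates to the others in the table, here is a minimal sketch of the arithmetic, assuming an hourly on-demand instance price and a measured transactions-per-minute value; both numbers below are illustrative placeholders, not `FMBench` output.

```{.python}
# Illustrative arithmetic only: relate an hourly instance price and a measured
# transactions-per-minute value to per-transaction cost. Both inputs are
# placeholders, not numbers produced by FMBench.

hourly_instance_price_usd = 8.00   # assumed on-demand price for the instance
transactions_per_minute = 12       # assumed measured throughput

price_per_txn = (hourly_instance_price_usd / 60) / transactions_per_minute
price_per_10k_txn = price_per_txn * 10_000

print(f"price per transaction        : ${price_per_txn:.4f}")
print(f"price per 10,000 transactions: ${price_per_10k_txn:.2f}")
```

`FMBench` itself derives its cost figures from the instance and model prices listed in `pricing.yml`, which is described in the config file section below.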
The report also includes latency vs. prompt size charts for different concurrency levels. As expected, inference latency increases as prompt size increases, but what is interesting to note is that the increase is much steeper at higher concurrency levels (and this behavior varies with instance type).

![Effect of prompt size on inference latency for different concurrency levels](./img/latency_vs_tokens.png)

### How to get started with `FMBench`

The following steps provide a quick start guide for `FMBench`. For a more detailed DIY version, please see the [`FMBench Readme`](https://github.com/aws-samples/foundation-model-benchmarking-tool?tab=readme-ov-file#the-diy-version-with-gory-details).

1. Launch the AWS CloudFormation template included in this repository using one of the buttons from the table below. The CloudFormation template creates the following resources within your AWS account: Amazon S3 buckets, an IAM role and an Amazon SageMaker Notebook with this repository cloned. A read S3 bucket is created which contains all the files (configuration files, datasets) required to run `FMBench`, and a write S3 bucket is created which will hold the metrics and reports generated by `FMBench`. The CloudFormation stack takes about 5 minutes to create.

    |AWS Region | Link |
    |:------------------------:|:-----------:|
    |us-east-1 (N. Virginia) | [Launch stack](https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/new?stackName=fmbench&templateURL=https://aws-blogs-artifacts-public.s3.amazonaws.com/artifacts/ML-FMBT/template.yml) |
    |us-west-2 (Oregon) | [Launch stack](https://console.aws.amazon.com/cloudformation/home?region=us-west-2#/stacks/new?stackName=fmbench&templateURL=https://aws-blogs-artifacts-public.s3.amazonaws.com/artifacts/ML-FMBT/template.yml) |

1. Once the CloudFormation stack is created, navigate to SageMaker Notebooks and open the `fmbench-notebook`.

1. On the `fmbench-notebook` open a Terminal and run the following commands.

    ```{.bash}
    conda create --name fmbench_python311 -y python=3.11 ipykernel
    source activate fmbench_python311;
    pip install -U fmbench
    ```

1. Now you are ready to run `fmbench` with the following command line. We will use a sample config file placed in the S3 bucket by the CloudFormation stack for a quick first run.

    1. We benchmark performance for the `Llama2-7b` model on an `ml.g5.xlarge` and an `ml.g5.2xlarge` instance type, using the `huggingface-pytorch-tgi-inference` inference container. This test would take about 30 minutes to complete and cost about $0.20.

    1. It uses a simple relationship of 750 words equals 1000 tokens; to get a more accurate representation of token counts, use the `Llama2 tokenizer` (instructions are provided in the next section, and a short token-counting sketch follows this list). ***It is strongly recommended that for more accurate results on token throughput you use a tokenizer specific to the model you are testing rather than the default tokenizer. See instructions provided later in this document on how to use a custom tokenizer***.

    ```{.bash}
    account=`aws sts get-caller-identity | jq .Account | tr -d '"'`
    region=`aws configure get region`
    fmbench --config-file s3://sagemaker-fmbench-read-${region}-${account}/configs/config-llama2-7b-g5-quick.yml >> fmbench.log 2>&1
    ```

1. Open another terminal window and do a `tail -f` on the `fmbench.log` file to see all the traces being generated at runtime.

    ```{.bash}
    tail -f fmbench.log
    ```

1. The generated reports and metrics are available in the `sagemaker-fmbench-write-{region}-{account_id}` bucket.
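Here is the token-counting sketch referenced above: a minimal comparison of the 750-words-to-1000-tokens rule of thumb against a model-specific tokenizer. It assumes the `transformers` package is installed and that you have access to a Llama 2 tokenizer (the Hugging Face model id below is a gated repo and is shown purely as an example; a local copy of the tokenizer works as well). When running `FMBench` itself, you place the tokenizer files under the tokenizer prefix of the read bucket as described in the config file section below.

```{.python}
# Minimal sketch: words-based token estimate vs. an actual model tokenizer.
# Assumes `pip install transformers` and access to a Llama 2 tokenizer
# (the gated Hub repo below is an example; a local copy works as well).
from transformers import AutoTokenizer

text = "Benchmarking helps you pick the right serving stack for your use case. " * 50

# Rule of thumb used for the quick first run: 750 words ~= 1000 tokens
word_based_estimate = int(len(text.split()) * 1000 / 750)

# Model-specific count using the Llama 2 tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer_count = len(tokenizer.encode(text))

print(f"word-based estimate : {word_based_estimate}")
print(f"tokenizer count     : {tokenizer_count}")
```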
The metrics and report files are also downloaded locally into the `results` directory (created by `FMBench`), and the benchmarking report is available as a Markdown file called `report.md` in the `results` directory. You can view the rendered Markdown report in the SageMaker notebook itself or download the metrics and report files to your machine for offline analysis.

## The `config.yml` file

Each `FMBench` run works with a configuration file that contains the information about the model, the deployment steps, and the tests to run. A typical `FMBench` workflow involves either directly using an already provided config file from the [`configs`](https://github.com/aws-samples/foundation-model-benchmarking-tool/tree/main/src/fmbench/configs) folder in the `FMBench` GitHub repo, or editing an already provided config file as per your own requirements (say you want to try benchmarking on a different instance type, a different inference container and so on). A simple config file with some key parameters annotated is presented below; it benchmarks the performance of `Llama2-7b` on an `ml.g5.xlarge` instance and an `ml.g5.2xlarge` instance.

```{.yaml}
general:
  name: "llama2-7b-v1"
  model_name: "Llama2-7b"

# AWS and SageMaker settings
aws:
  # AWS region, this parameter is templatized, no need to change
  region: {region}
  # SageMaker execution role used to run FMBench, this parameter is templatized, no need to change
  sagemaker_execution_role: {role_arn}
  # S3 bucket to which metrics, plots and reports would be written to
  bucket: {write_bucket} ## add the name of your desired bucket

# directory paths in the write bucket, no need to change these
dir_paths:
  data_prefix: data
  prompts_prefix: prompts
  all_prompts_file: all_prompts.csv
  metrics_dir: metrics
  models_dir: models
  metadata_dir: metadata

# S3 information for reading datasets, scripts and tokenizer
s3_read_data:
  # read bucket name, templatized, if left unchanged will default to sagemaker-fmbench-read-{region}-{account_id}
  read_bucket: {read_bucket}

  # S3 prefix in the read bucket where deployment and inference scripts should be placed
  scripts_prefix: scripts

  # deployment and inference script files to be downloaded are placed in this list
  # only needed if you are creating a new deployment script or inference script
  # your HuggingFace token does need to be in this list and should be called "hf_token.txt"
  script_files:
  - hf_token.txt

  # configuration files (like this one) are placed in this prefix
  configs_prefix: configs

  # list of configuration files to download, for now only pricing.yml needs to be downloaded
  config_files:
  - pricing.yml

  # S3 prefix for the dataset files
  source_data_prefix: source_data
  # list of dataset files, the list below is from the LongBench dataset https://huggingface.co/datasets/THUDM/LongBench
  source_data_files:
  - 2wikimqa_e.jsonl
  - 2wikimqa.jsonl
  - hotpotqa_e.jsonl
  - hotpotqa.jsonl
  - narrativeqa.jsonl
  - triviaqa_e.jsonl
  - triviaqa.jsonl

  # S3 prefix for the tokenizer to be used with the models
  # NOTE 1: the same tokenizer is used with all the models being tested through a config file
  # NOTE 2: place your model specific tokenizers in a prefix named as <model_name>_tokenizer
  #         so the mistral tokenizer goes in mistral_tokenizer, Llama2 tokenizer goes in llama2_tokenizer
  tokenizer_prefix: tokenizer

  # S3 prefix for prompt templates
  prompt_template_dir: prompt_template

  # prompt template to use, NOTE: same prompt template gets used for all models being tested through a config file
  # the FMBench repo already contains a bunch of prompt templates so
  # review those first before creating a new one
  prompt_template_file: prompt_template_llama2.txt

# steps to run, usually all of these would be
# set to yes so nothing needs to change here
# you could, however, bypass some steps for example
# set the 2_deploy_model.ipynb to no if you are re-running
# the same config file and the model is already deployed
run_steps:
  0_setup.ipynb: yes
  1_generate_data.ipynb: yes
  2_deploy_model.ipynb: yes
  3_run_inference.ipynb: yes
  4_model_metric_analysis.ipynb: yes
  5_cleanup.ipynb: yes

# dataset related configuration
datasets:
  # Refer to the 1_generate_data.ipynb notebook
  # the dataset you use is expected to have the
  # columns you put in prompt_template_keys list
  # and your prompt template also needs to have
  # the same placeholders (refer to the prompt template folder)
  prompt_template_keys:
  - input
  - context

  # if your dataset has multiple languages and it has a language
  # field then you could filter it for a language. Similarly,
  # you can filter your dataset to only keep prompts between
  # a certain token length limit (the token length is determined
  # using the tokenizer you provide in the tokenizer_prefix prefix in the
  # read S3 bucket). Each of the array entries below creates a payload file
  # containing prompts matching the language and token length criteria.
  filters:
  - language: en
    min_length_in_tokens: 1
    max_length_in_tokens: 500
    payload_file: payload_en_1-500.jsonl
  - language: en
    min_length_in_tokens: 500
    max_length_in_tokens: 1000
    payload_file: payload_en_500-1000.jsonl
  - language: en
    min_length_in_tokens: 1000
    max_length_in_tokens: 2000
    payload_file: payload_en_1000-2000.jsonl
  - language: en
    min_length_in_tokens: 2000
    max_length_in_tokens: 3000
    payload_file: payload_en_2000-3000.jsonl
  - language: en
    min_length_in_tokens: 3000
    max_length_in_tokens: 3840
    payload_file: payload_en_3000-3840.jsonl

# While the tests run on all the datasets
# configured in the experiment entries below,
# the price:performance analysis is only done for one
# dataset, which is listed below as the dataset_of_interest
metrics:
  dataset_of_interest: en_2000-3000

# all pricing information is in the pricing.yml file
# this file is provided in the repo. You can add entries
# to this file for new instance types and new Bedrock models
pricing: pricing.yml

# inference parameters, these are added to the payload
# for each inference request. The list here is not static,
# any parameter supported by the inference container can be
# added to the list. Put the sagemaker parameters in the sagemaker
# section, bedrock parameters in the bedrock section (not shown here).
# Use the section name (sagemaker in this example) in the inference_spec.parameter_set
# section under experiments.
inference_parameters:
  sagemaker:
    do_sample: yes
    temperature: 0.1
    top_p: 0.92
    top_k: 120
    max_new_tokens: 100
    return_full_text: False

# Configuration for experiments to be run. The experiments section is an array
# so more than one experiment can be added, these could belong to the same model
# but different instance types, or different models, or even different hosting
# options (such as one experiment on SageMaker and the other on Bedrock).
experiments:
  - name: llama2-7b-g5.xlarge-huggingface-pytorch-tgi-inference-2.0.1-tgi1.1.0
    # model_id is interpreted in conjunction with the deployment_script, so if you
    # use a JumpStart model id then set the deployment_script to jumpstart.py.
    # if deploying directly from HuggingFace this would be a HuggingFace model id
    # see the DJL serving deployment script in the code repo for reference.
    model_id: meta-textgeneration-llama-2-7b-f
    model_version: "3.*"
    model_name: llama2-7b-f
    ep_name: llama-2-7b-g5xlarge
    instance_type: "ml.g5.xlarge"
    image_uri: '763104351884.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi1.1.0-gpu-py39-cu118-ubuntu20.04'
    deploy: yes
    instance_count: 1
    # FMBench comes packaged with multiple deployment scripts, such as scripts for JumpStart,
    # scripts for deploying using DJL DeepSpeed, TensorRT etc. You can also add your own.
    # See repo for details
    deployment_script: jumpstart.py
    # FMBench comes packaged with multiple inference scripts, such as scripts for SageMaker
    # and Bedrock. You can also add your own. See repo for details
    inference_script: sagemaker_predictor.py
    inference_spec:
      # this should match one of the sections in the inference_parameters section above
      parameter_set: sagemaker
    # runs are done for each combination of payload file and concurrency level
    payload_files:
    - payload_en_1-500.jsonl
    - payload_en_500-1000.jsonl
    - payload_en_1000-2000.jsonl
    - payload_en_2000-3000.jsonl
    # concurrency level refers to number of requests sent in parallel to an endpoint,
    # the next set of requests is sent once responses for all concurrent requests have
    # been received.
    concurrency_levels:
    - 1
    - 2
    - 4
    # Added for models that require accepting a EULA
    accept_eula: true
    # Environment variables to be passed to the container,
    # this is not a fixed list, you can add more parameters as applicable.
    env:
      SAGEMAKER_PROGRAM: "inference.py"
      ENDPOINT_SERVER_TIMEOUT: "3600"
      MODEL_CACHE_ROOT: "/opt/ml/model"
      SAGEMAKER_ENV: "1"
      HF_MODEL_ID: "/opt/ml/model"
      MAX_INPUT_LENGTH: "4095"
      MAX_TOTAL_TOKENS: "4096"
      SM_NUM_GPUS: "1"
      SAGEMAKER_MODEL_SERVER_WORKERS: "1"
  - name: llama2-7b-g5.2xlarge-huggingface-pytorch-tgi-inference-2.0.1-tgi1.1.0
    model_id: meta-textgeneration-llama-2-7b-f
    model_version: "3.*"
    model_name: llama2-7b-f
    ep_name: llama-2-7b-g5-2xlarge
    instance_type: "ml.g5.2xlarge"
    image_uri: '763104351884.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi1.1.0-gpu-py39-cu118-ubuntu20.04'
    deploy: yes
    instance_count: 1
    deployment_script: jumpstart.py
    inference_script: sagemaker_predictor.py
    inference_spec:
      parameter_set: sagemaker
    payload_files:
    - payload_en_1-500.jsonl
    - payload_en_500-1000.jsonl
    - payload_en_1000-2000.jsonl
    - payload_en_2000-3000.jsonl
    concurrency_levels:
    - 1
    - 2
    - 4
    accept_eula: true
    env:
      SAGEMAKER_PROGRAM: "inference.py"
      ENDPOINT_SERVER_TIMEOUT: "3600"
      MODEL_CACHE_ROOT: "/opt/ml/model"
      SAGEMAKER_ENV: "1"
      HF_MODEL_ID: "/opt/ml/model"
      MAX_INPUT_LENGTH: "4095"
      MAX_TOTAL_TOKENS: "4096"
      SM_NUM_GPUS: "1"
      SAGEMAKER_MODEL_SERVER_WORKERS: "1"

report:
  latency_budget: 2
  cost_per_10k_txn_budget: 20
  error_rate_budget: 0
  per_inference_request_file: per_inference_request_results.csv
  all_metrics_file: all_metrics.csv
  txn_count_for_showing_cost: 10000
  v_shift_w_single_instance: 0.025
  v_shift_w_gt_one_instance: 0.025
```
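To make the `concurrency_levels` setting above concrete: at a concurrency level of 4, requests are sent four at a time, and the next batch starts only once all four responses have come back. The sketch below illustrates that pattern with `boto3`; it is not `FMBench`'s own implementation, and the endpoint name and the payload shape (the `inputs`/`parameters` convention used by the Hugging Face TGI container) are taken from the example config purely for illustration.

```{.python}
# Illustrative sketch of the concurrency pattern described in the config:
# send `concurrency` requests in parallel, wait for all responses, then
# move on to the next batch. Not FMBench code; names are placeholders.
import json
from concurrent.futures import ThreadPoolExecutor

import boto3

sm_runtime = boto3.client("sagemaker-runtime")


def invoke_once(prompt: str) -> str:
    # Payload shape follows the Hugging Face TGI convention ("inputs" plus
    # a "parameters" dict); other containers may expect a different schema.
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": 100, "temperature": 0.1}}
    response = sm_runtime.invoke_endpoint(
        EndpointName="llama-2-7b-g5xlarge",  # ep_name from the example config
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    return response["Body"].read().decode("utf-8")


def run_at_concurrency(prompts: list[str], concurrency: int) -> list[str]:
    results: list[str] = []
    # Process prompts one batch at a time; a new batch starts only after
    # every request in the previous batch has returned.
    for start in range(0, len(prompts), concurrency):
        batch = prompts[start : start + concurrency]
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            results.extend(pool.map(invoke_once, batch))
    return results


if __name__ == "__main__":
    answers = run_at_concurrency(["What is price performance benchmarking?"] * 8, concurrency=4)
    print(f"received {len(answers)} responses")
```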
## 🚨 Benchmarking Llama3 on Amazon SageMaker 🚨

Llama3 is now available on SageMaker (read the [blog post](https://aws.amazon.com/blogs/machine-learning/meta-llama-3-models-are-now-available-in-amazon-sagemaker-jumpstart/)), and you can now benchmark it using `FMBench`. Here are the config files for benchmarking `Llama3-8b-instruct` and `Llama3-70b-instruct` on `ml.p4d.24xlarge` and `ml.g5.12xlarge` instances.

- [Config file](https://github.com/aws-samples/foundation-model-benchmarking-tool/blob/main/src/fmbench/configs/config-llama3-8b-instruct-g5-p4d.yml) for `Llama3-8b-instruct` on `ml.p4d.24xlarge` and `ml.g5.12xlarge` instances
- [Config file](https://github.com/aws-samples/foundation-model-benchmarking-tool/blob/main/src/fmbench/configs/config-llama3-70b-instruct-g5-p4d.yml) for `Llama3-70b-instruct` on `ml.p4d.24xlarge` and `ml.g5.12xlarge` instances

## Benchmarking Llama2 on Amazon SageMaker

Llama2 models are available through SageMaker JumpStart and can also be deployed directly from Hugging Face to a SageMaker endpoint. You can use `FMBench` to benchmark Llama2 on SageMaker for different combinations of instance types and inference containers.

- [Config file](https://github.com/aws-samples/foundation-model-benchmarking-tool/blob/main/src/fmbench/configs/config-llama2-7b-g5-quick.yml) for `Llama2-7b` on `ml.g5.xlarge` and `ml.g5.2xlarge` instances, using the [Hugging Face TGI container](763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi1.1.0-gpu-py39-cu118-ubuntu20.04).
- [Config file](https://github.com/aws-samples/foundation-model-benchmarking-tool/blob/main/src/fmbench/configs/config-llama2-7b-g4dn-g5-trt.yml) for `Llama2-7b` on an `ml.g4dn.12xlarge` instance, using the [Deep Java Library DeepSpeed container](763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.26.0-deepspeed0.12.6-cu121).
- [Config file](https://github.com/aws-samples/foundation-model-benchmarking-tool/blob/main/src/fmbench/configs/config-llama2-13b-inf2-g5-p4d.yml) for `Llama2-13b` on `ml.g5.12xlarge`, `ml.inf2.24xlarge` and `ml.p4d.24xlarge` instances, using the [Hugging Face TGI container](763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi1.1.0-gpu-py39-cu118-ubuntu20.04) and the [Deep Java Library & NeuronX container](763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.26.0-neuronx-sdk2.16.0).
- [Config file](https://github.com/aws-samples/foundation-model-benchmarking-tool/blob/main/src/fmbench/configs/config-llama2-70b-g5-p4d-trt.yml) for `Llama2-70b` on an `ml.p4d.24xlarge` instance, using the [Deep Java Library TensorRT container](763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.26.0-tensorrtllm0.7.1-cu122).
- [Config file](https://github.com/aws-samples/foundation-model-benchmarking-tool/blob/main/src/fmbench/configs/config-llama2-70b-inf2-g5.yml) for `Llama2-70b` on an `ml.inf2.48xlarge` instance, using the [Hugging Face TGI with Optimum NeuronX container](763104351884.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-tgi-inference:1.13.1-optimum0.0.17-neuronx-py310-ubuntu22.04).

## Benchmarking Llama2 on Amazon Bedrock

The Llama2-13b-chat and Llama2-70b-chat models are available on [Bedrock](https://aws.amazon.com/bedrock/llama/). You can use `FMBench` to benchmark Llama2 on Bedrock for both the on-demand throughput and provisioned throughput inference options.

- [Config file](https://github.com/aws-samples/foundation-model-benchmarking-tool/blob/main/src/fmbench/configs/config-bedrock.yml) for `Llama2-13b-chat` and `Llama2-70b-chat` on Bedrock for on-demand throughput.
- For testing provisioned throughput, simply replace the `ep_name` parameter in the `experiments` section of the config file with the ARN of your provisioned throughput.
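For orientation only, here is a minimal sketch of what an on-demand Bedrock invocation of `Llama2-13b-chat` looks like with `boto3`; this is not `FMBench` code, and the model id and request/response fields follow the Bedrock documentation for Meta Llama models at the time of writing, so verify them against the current documentation. For provisioned throughput you would pass the ARN of your provisioned model as the `modelId` instead of the on-demand model id.

```{.python}
# Illustrative sketch (not FMBench code): on-demand invocation of
# Llama2-13b-chat on Amazon Bedrock. Model id and request/response fields
# follow the documented Meta Llama schema; verify against current docs.
import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

request_body = {
    "prompt": "Explain what price performance benchmarking means.",
    "max_gen_len": 128,
    "temperature": 0.1,
    "top_p": 0.9,
}

response = bedrock_runtime.invoke_model(
    # For provisioned throughput, pass your provisioned model ARN here instead.
    modelId="meta.llama2-13b-chat-v1",
    contentType="application/json",
    accept="application/json",
    body=json.dumps(request_body),
)

print(json.loads(response["body"].read())["generation"])
```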