The FMBench tool provides a quick and easy way to benchmark the Llama family of models for price and performance on any AWS service, including Amazon SageMaker, Amazon Bedrock, or Amazon EKS or Amazon EC2 as a bring-your-own endpoint.
Customers often ask which AWS service is the best choice for running Llama models for their specific use case and price-performance requirements. While model evaluation metrics are available on several leaderboards (HELM, LMSys), price-performance comparisons can be notoriously hard to find and even harder to trust. In such a scenario, we think it is best to run performance benchmarking yourself, either on your own dataset or on a similar (in task and prompt size) open-source dataset such as LongBench or QMSum. This is the problem that FMBench solves.
FMBench: an open-source Python package for FM benchmarking on AWS
FMBench runs inference requests against endpoints that are either deployed through FMBench itself (as in the case of SageMaker), available as a fully managed endpoint (as in the case of Bedrock), or supplied as a bring-your-own endpoint. Metrics such as inference latency, transactions per minute, error rate and cost per transaction are captured and presented in a Markdown report containing explanatory text, tables and figures. The figures and tables in the report provide insight into what might be the best serving stack (instance type, inference container and configuration parameters) for a given Llama model and use case.
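To make these metrics concrete, here is a minimal sketch of how such a summary could be derived from raw request measurements. It is not FMBench's actual implementation: the function name `summarize_run`, its parameters, the hourly price and the formula used for cost per transaction are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean, quantiles


@dataclass
class RunSummary:
    latency_mean: float
    latency_p50: float
    latency_p95: float
    error_rate: float
    transactions_per_minute: float
    price_per_txn: float


def summarize_run(latencies_s, error_count, duration_s, hourly_price_usd):
    """Summarize one benchmark run (hypothetical helper, not an FMBench API).

    latencies_s: latencies of successful requests in seconds (at least two).
    error_count: number of failed requests.
    duration_s: wall-clock duration of the run in seconds.
    hourly_price_usd: assumed hourly price of the endpoint instance.
    """
    successes = len(latencies_s)
    # quantiles(..., n=100) returns 99 cut points; index 49 is p50, 94 is p95.
    percentiles = quantiles(latencies_s, n=100, method="inclusive")
    tpm = successes * 60 / duration_s
    return RunSummary(
        latency_mean=mean(latencies_s),
        latency_p50=percentiles[49],
        latency_p95=percentiles[94],
        error_rate=error_count / (successes + error_count),
        transactions_per_minute=tpm,
        # Assumed formula: one hour of cost spread over one hour of transactions.
        price_per_txn=hourly_price_usd / (tpm * 60),
    )


# Made-up numbers, for illustration only.
summary = summarize_run([1.0, 2.0, 3.0, 4.0], error_count=1,
                        duration_s=60, hourly_price_usd=6.0)
print(summary)
```

The sketch counts only successful requests as transactions; a real benchmark may define these quantities differently, so treat the numbers in an FMBench report as authoritative.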
The following figure gives an example of the price-performance numbers, including inference latency, transactions per minute and concurrency level, for running the Llama2-13b model on different instance types available on SageMaker. The prompts are for a Q&A task created from the LongBench dataset and are between 3,000 and 3,840 tokens in length. Note that the numbers are hidden in this figure, but you will be able to see them when you run FMBench yourself.
The following table (also included in the report) provides information about the best available instance type for that experiment¹.
Information | Value |
---|---|
experiment_name | llama2-13b-inf2.24xlarge |
payload_file | payload_en_3000-3840.jsonl |
instance_type | ml.inf2.24xlarge |
concurrency | ** |
error_rate | ** |
prompt_token_count_mean | 3394 |
prompt_token_throughput | 2400 |
completion_token_count_mean | 31 |
completion_token_throughput | 15 |
latency_mean | ** |
latency_p50 | ** |
latency_p95 | ** |
latency_p99 | ** |
transactions_per_minute | ** |
price_per_txn | ** |
¹ ** represents values hidden on purpose; these are available when you run the tool yourself.
The report also includes latency vs. prompt size charts for different concurrency levels. As expected, inference latency increases as prompt size increases, but it is interesting to note that the increase is much larger at higher concurrency levels (and this behavior varies with instance type).
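The data behind a latency vs. prompt size chart can be thought of as latencies grouped by concurrency level and prompt-size bucket. The following is a rough sketch of that grouping, not the code FMBench uses: the function name, the record layout and the bucket size of 1,000 tokens are hypothetical, and the sample measurements are made up.

```python
from collections import defaultdict
from statistics import mean


def mean_latency_by_bucket(records, bucket_size=1000):
    """Average latency per (concurrency, prompt-size bucket).

    records: iterable of (prompt_tokens, concurrency, latency_seconds) tuples.
    Returns a dict keyed by (concurrency, bucket_start_in_tokens).
    """
    grouped = defaultdict(list)
    for prompt_tokens, concurrency, latency_s in records:
        bucket_start = (prompt_tokens // bucket_size) * bucket_size
        grouped[(concurrency, bucket_start)].append(latency_s)
    return {key: mean(values) for key, values in sorted(grouped.items())}


# Made-up sample measurements, for illustration only.
sample = [
    (1200, 1, 1.0), (1800, 1, 2.0),
    (3400, 1, 3.0),
    (1500, 4, 2.0),
    (3300, 4, 6.0), (3900, 4, 8.0),
]
result = mean_latency_by_bucket(sample)
for (concurrency, bucket_start), latency in result.items():
    print(f"concurrency={concurrency} prompt>={bucket_start}: {latency:.2f}s")
```

Plotting one line per concurrency level from such a table makes it easy to see whether latency grows faster with prompt size when more requests run in parallel.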
FMBench
The following steps provide a quick start guide for FMBench. For a more detailed DIY version, please see the FMBench Readme.
Each FMBench run works with a configuration file that contains information about the model, the deployment steps, and the tests to run. A typical FMBench workflow involves either directly using a provided config file from the configs folder in the FMBench GitHub repo or editing a provided config file to suit your own requirements (say you want to benchmark on a different instance type or with a different inference container).
A simple config file with key parameters annotated is included in this repo; see config.yml. This file benchmarks the performance of Llama2-7b on an ml.g5.xlarge instance and an ml.g5.2xlarge instance. You can use this provided config file as is for this quick start.
Launch the AWS CloudFormation template included in this repository using one of the buttons from the table below. The CloudFormation template creates the following resources within your AWS account: Amazon S3 buckets, an AWS IAM role and an Amazon SageMaker Notebook with this repository cloned. A read S3 bucket is created that contains all the files (configuration files, datasets) required to run FMBench, and a write S3 bucket is created that will hold the metrics and reports generated by FMBench. The CloudFormation stack takes about 5 minutes to create.
| AWS Region | Link |
|:------------------------:|:-----------:|
| us-east-1 (N. Virginia) | |
| us-west-2 (Oregon) | |
Once the CloudFormation stack is created, navigate to SageMaker Notebooks and open the fmbench-notebook.
On the fmbench-notebook, open a Terminal and run the following commands.
conda create --name fmbench_python311 -y python=3.11 ipykernel
source activate fmbench_python311;
pip install -U fmbench
Now you are ready to run fmbench with the following command line. We will use a sample config file placed in the S3 bucket by the CloudFormation stack for a quick first run.
We benchmark performance for the Llama2-7b model on an ml.g5.xlarge and an ml.g5.2xlarge instance type, using the huggingface-pytorch-tgi-inference inference container. This test takes about 30 minutes to complete and costs about $0.20.
By default, this test estimates token counts using a simple relationship in which 750 words equal 1,000 tokens. To get a more accurate representation of token counts, use the Llama2 tokenizer. For more accurate results on token throughput, it is strongly recommended that you use a tokenizer specific to the model you are testing rather than the default tokenizer. See the instructions provided here on how to use a custom tokenizer.
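The word-based approximation mentioned above can be sketched in a few lines. This is an illustrative estimate only, not FMBench's own code: the function name `estimate_token_count`, whitespace-based word splitting and rounding up are all assumptions.

```python
import math

# Ratio taken from the text: 750 words correspond to 1,000 tokens.
WORDS_PER_1000_TOKENS = 750


def estimate_token_count(text: str) -> int:
    """Rough token estimate from a whitespace word count (hypothetical helper)."""
    word_count = len(text.split())
    # Rounding up is an assumption; a real tokenizer gives exact counts.
    return math.ceil(word_count * 1000 / WORDS_PER_1000_TOKENS)


prompt = "What is the capital of France?"
print(estimate_token_count(prompt))  # 6 words, so the estimate is 8 tokens
```

An estimate like this ignores how a real tokenizer splits punctuation, numbers and rare words, which is why a model-specific tokenizer is recommended for throughput figures.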
account=`aws sts get-caller-identity | jq .Account | tr -d '"'`
region=`aws configure get region`
fmbench --config-file s3://sagemaker-fmbench-read-${region}-${account}/configs/config-llama2-7b-g5-quick.yml >> fmbench.log 2>&1
Open another terminal window and run tail -f on the fmbench.log file to see all the traces being generated at runtime.
tail -f fmbench.log
The generated reports and metrics are available in the sagemaker-fmbench-write-<replace_w_your_aws_region>-<replace_w_your_aws_account_id> bucket. The metrics and report files are also downloaded locally into the results directory (created by FMBench), and the benchmarking report is available as a Markdown file called report.md in the results directory. You can view the rendered Markdown report in the SageMaker notebook itself or download the metrics and report files to your machine for offline analysis.
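The read and write bucket names used in this quick start follow a fixed pattern built from your AWS region and account ID. The sketch below only assembles those strings as they appear in the commands and text above; the helper names are hypothetical, and the account ID shown is a placeholder, not a real account.

```python
def read_bucket(region: str, account_id: str) -> str:
    """Bucket holding config files and datasets (name pattern from the quick start)."""
    return f"sagemaker-fmbench-read-{region}-{account_id}"


def write_bucket(region: str, account_id: str) -> str:
    """Bucket where metrics and reports are written."""
    return f"sagemaker-fmbench-write-{region}-{account_id}"


def config_uri(region: str, account_id: str,
               config_name: str = "config-llama2-7b-g5-quick.yml") -> str:
    """S3 URI of a config file under the configs/ prefix of the read bucket."""
    return f"s3://{read_bucket(region, account_id)}/configs/{config_name}"


# Placeholder values, for illustration only.
region, account_id = "us-east-1", "123456789012"
print(config_uri(region, account_id))
print(write_bucket(region, account_id))
```

Printing these values before a run is a quick way to confirm which bucket to open in the S3 console once the benchmark finishes.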
Llama3 is now available on Bedrock (read blog post), and you can now benchmark it using FMBench. Here is the config file for benchmarking Llama3-8b-instruct and Llama3-70b-instruct on Bedrock.
- Llama3-8b-instruct and Llama3-70b-instruct.
Llama3 is now available on SageMaker (read blog post), and you can now benchmark it using FMBench. Here are the config files for benchmarking Llama3-8b-instruct and Llama3-70b-instruct on ml.p4d.24xlarge and ml.g5.12xlarge instances.
- Llama3-8b-instruct on ml.p4d.24xlarge and ml.g5.12xlarge.
- Llama3-70b-instruct on ml.p4d.24xlarge and ml.g5.48xlarge.
Llama2 models are available through SageMaker JumpStart and can also be deployed directly from Hugging Face to a SageMaker endpoint. You can use FMBench to benchmark Llama2 on SageMaker for different combinations of instance types and inference containers.
- Llama2-7b on ml.g5.xlarge and ml.g5.2xlarge instances, using the Hugging Face TGI container.
- Llama2-7b on an ml.g4dn.12xlarge instance, using the Deep Java Library DeepSpeed container.
- Llama2-13b on ml.g5.12xlarge, ml.inf2.24xlarge and ml.p4d.24xlarge instances, using the Hugging Face TGI container and the Deep Java Library & NeuronX container.
- Llama2-70b on an ml.p4d.24xlarge instance, using the Deep Java Library TensorRT container.
- Llama2-70b on an ml.inf2.48xlarge instance, using the Hugging Face TGI with Optimum NeuronX container.
The Llama2-13b-chat and Llama2-70b-chat models are available on Bedrock. You can use FMBench to benchmark Llama2 on Bedrock for both the on-demand throughput and provisioned throughput inference options.
- Config file for Llama2-13b-chat and Llama2-70b-chat on Bedrock for on-demand throughput.
- To test provisioned throughput, simply replace the ep_name parameter in the experiments section of the config file with the ARN of your provisioned throughput.
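As a rough illustration of the ep_name change described above, the sketch below swaps the value in an in-memory config. Only the `experiments` section and the `ep_name` parameter come from the text; the list-of-dicts layout, the other keys, the helper name and the ARN are hypothetical placeholders, so check the actual config file for its real structure.

```python
import copy


def use_provisioned_throughput(config: dict, provisioned_arn: str) -> dict:
    """Return a copy of the config with every experiment's ep_name replaced.

    Assumes `experiments` is a list of dicts, which may not match the real file.
    """
    updated = copy.deepcopy(config)
    for experiment in updated["experiments"]:
        experiment["ep_name"] = provisioned_arn
    return updated


# Hypothetical, heavily trimmed config; a real config file has many more keys.
config = {
    "experiments": [
        {"name": "llama2-13b-chat", "ep_name": "on-demand-endpoint-placeholder"},
    ]
}
# Placeholder ARN, not a real resource.
placeholder_arn = "arn:aws:bedrock:REGION:ACCOUNT_ID:provisioned-model/EXAMPLE"

updated = use_provisioned_throughput(config, placeholder_arn)
print(updated["experiments"][0]["ep_name"])
```

Working on a copy keeps the on-demand config intact, which makes it easy to benchmark both inference options side by side.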
For bug reports, enhancement requests and any questions, please create a GitHub issue on the FMBench repo.