|
@@ -6,6 +6,10 @@ We'll use the Amazon EC2 instance running Ubuntu with an A10G 24GB GPU as an exa
|
|
|
|
|
|
The Colab notebook to connect via LangChain to Llama 2 hosted as the vLLM and TGI API services is [here](https://colab.research.google.com/drive/1rYWLdgTGIU1yCHmRpAOB2D-84fPzmOJg?usp=sharing), also shown in the sections below.
|
|
|
|
|
|
+This tutorial assumes that you have been granted access to Meta Llama 2 on Hugging Face - you can open the Hugging Face Meta model page [here](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) to confirm that you see "Gated model You have been granted access to this model". If you don't see the "granted access" message, simply follow the instructions under "Access Llama 2 on Hugging Face" on the page.
|
|
|
+
|
|
|
+You'll also need your Hugging Face access token, which you can get at your Settings page [here](https://huggingface.co/settings/tokens).
|
|
|
+
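+If you'd like to keep the token handy for the later steps, one option is to export it as an environment variable first - a minimal sketch; the `HUGGING_FACE_HUB_TOKEN` name below is the same one the TGI docker command later in this tutorial passes to its container:
+
+```
+# paste the token from https://huggingface.co/settings/tokens
+export HUGGING_FACE_HUB_TOKEN=<your Hugging Face access token>
+```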
|
|
|
## Setting up vLLM with Llama 2
|
|
|
|
|
|
On a terminal, run the following commands:
|
|
@@ -18,6 +22,8 @@ git clone https://github.com/vllm-project/vllm
|
|
|
cd vllm/vllm/entrypoints/
|
|
|
```
|
|
|
|
|
|
+Then run `huggingface-cli login` and paste your Hugging Face access token when prompted to complete the login.
|
|
|
+
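+For reference, the login step looks like this (a sketch; the `--token` flag needs a reasonably recent `huggingface_hub` and assumes you exported `HUGGING_FACE_HUB_TOKEN` as suggested earlier - omit it to be prompted for the token instead):
+
+```
+# interactive: prompts you to paste the token
+huggingface-cli login
+# or non-interactive, reusing the exported variable
+huggingface-cli login --token $HUGGING_FACE_HUB_TOKEN
+```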
|
|
|
There are two ways to deploy Llama 2 via vLLM: as a general API server or as an OpenAI-compatible server.
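+The sections below go through the details; as a rough sketch (assuming the scripts in the `entrypoints` directory you just changed into), the two servers are typically launched like this:
+
+```
+# general API server - see the next section for the exact flags used in this tutorial
+python api_server.py --model meta-llama/Llama-2-13b-chat-hf
+
+# OpenAI-compatible server
+python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-13b-chat-hf
+```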
|
|
|
|
|
|
### Deploying Llama 2 as an API Server
|
|
@@ -111,9 +117,7 @@ You can now use the Llama 2 instance `llm` created this way in any of the [Llama
|
|
|
|
|
|
## Setting Up TGI with Llama 2
|
|
|
|
|
|
-The easiest way to deploy Llama 2 with TGI is using TGI's official docker image. First, make sure you have been granted access to the Meta Llama 2 on Hugging Face by opening the Hugging Face Meta model page [here](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) and confirming you see "Gated model You have been granted access to this model". If you don't see the "granted access" message, simply follow the instructions under "Access Llama 2 on Hugging Face" in the page.
|
|
|
-
|
|
|
-Then copy your Hugging Face access token, which you can create for free at your [tokens page](https://huggingface.co/settings/tokens) and set it as the value of one of the three required shell variables:
|
|
|
+The easiest way to deploy Llama 2 with TGI is using TGI's official docker image. First, set the three required shell variables below, replacing `<your Hugging Face access token>` with your actual token (you may also replace the `model` value with another Llama 2 model):
|
|
|
|
|
|
```
|
|
|
model=meta-llama/Llama-2-13b-chat-hf
|
|
@@ -121,9 +125,7 @@ volume=$PWD/data
|
|
|
token=<your Hugging Face access token>
|
|
|
```
|
|
|
|
|
|
-You may replace the `model` value above with another Llama 2 model.
|
|
|
-
|
|
|
-Finally, run the command below to deploy a quantized version of the Llama 2 13b-chat model with TGI:
|
|
|
+Then run the command below to deploy a quantized version of the Llama 2 13b-chat model with TGI:
|
|
|
|
|
|
```
|
|
|
docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.0.2 --model-id $model --quantize bitsandbytes-nf4
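+
+# Once the container is up, you can smoke-test it from another terminal - a hedged example,
+# assuming TGI's standard /generate route on the host port mapped above (8080):
+curl 127.0.0.1:8080/generate -X POST \
+  -H 'Content-Type: application/json' \
+  -d '{"inputs": "What is good about San Francisco?", "parameters": {"max_new_tokens": 64}}'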
|