@@ -50,7 +50,7 @@ to send a query (prompt) to Llama 2 via vLLM and get Llama's response:

Now in your Llama client app, you can make an HTTP request as the `curl` command above to send a query to Llama and parse the response.

-If you add the port 5000 to your EC2 instance's security group's inbound rules, then you can run this on your Mac/Windows for test:
+If you add an inbound rule for TCP port 5000 to your EC2 instance's security group, then you can run the following on your Mac/Windows machine to test it:

```
curl http://<EC2_public_ip>:5000/generate -d '{
@@ -60,6 +60,12 @@ curl http://<EC2_public_ip>:5000/generate -d '{
}'
```
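As a reference for the client-app step above, here is a minimal sketch of sending the same request from Python with the `requests` library. It assumes the demo `api_server.py` started earlier is still listening on port 5000 and that its `/generate` endpoint returns JSON with a `text` field holding the completions; the prompt and sampling values below are placeholders, not values from this guide:

```
import requests

# Assumptions: the vLLM demo api_server.py started earlier is listening on port 5000
# of your EC2 instance, and its /generate endpoint responds with JSON shaped like
# {"text": [...]} containing the generated completion(s).
URL = "http://<EC2_public_ip>:5000/generate"  # replace with your instance's public IP

payload = {
    "prompt": "What is the capital of France?",  # placeholder prompt
    "max_tokens": 128,                           # example sampling parameters
    "temperature": 0.2,
}

resp = requests.post(URL, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["text"][0])  # first generated completion
```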
+Also, if you have multiple GPUs, you can add the `--tensor-parallel-size` argument when starting the server (see [here](https://vllm.readthedocs.io/en/latest/serving/distributed_serving.html) for more info). For example, the command below runs the Llama 2 13b-chat model on 4 GPUs:
+
+```
+python api_server.py --host 0.0.0.0 --port 5000 --model meta-llama/Llama-2-13b-chat-hf --tensor-parallel-size 4
+```
+
### Deploying Llama 2 as OpenAI-Compatible Server

You can also deploy the vLLM hosted Llama 2 as an OpenAI-Compatible service to easily replace code using OpenAI API. First, run the command below: