Enterprise customers may prefer to deploy Llama 2 on-prem and run it on their own servers. This tutorial shows how to use Llama 2 with vLLM and Hugging Face TGI, two leading open-source tools for deploying and serving LLMs, and how to connect to vLLM- and TGI-hosted Llama 2 instances with LangChain, the open-source LLM app development framework we used in our earlier demo apps, where Llama 2 ran on a local Mac or on the Replicate cloud.
We'll use an Amazon EC2 instance running Ubuntu with an A10G 24GB GPU as the example environment for running vLLM and TGI with Llama 2; you can substitute your own server to implement an on-prem Llama 2 deployment.
The Colab notebook that connects via LangChain to Llama 2 hosted as vLLM and TGI API services is here; its code is also shown in the sections below.
On a terminal, run the following commands:
conda create -n vllm python=3.8
conda activate vllm
pip install vllm
cd <your_work_folder>
git clone https://github.com/vllm-project/vllm
cd vllm/vllm/entrypoints/
There are two ways to deploy Llama 2 via vLLM: as a general API server or as an OpenAI-compatible server.
Run the command below to deploy vLLM as a general Llama 2 service:
python api_server.py --host 0.0.0.0 --port 5000 --model meta-llama/Llama-2-7b-chat-hf
Then on another terminal you can run:
curl http://localhost:5000/generate -d '{
"prompt": "Who wrote the book Innovators dilemma?",
"max_tokens": 300,
"temperature": 0
}'
to send a query (prompt) to Llama 2 via vLLM and get Llama's response:
{"text":["Who wrote the book Innovators dilemma?\n\nThe book \"Innovator's Dilemma\" was written by Clayton M. Christensen. It was first published in 1997 and has since become a classic in the field of business and innovation. In the book, Christensen argues that successful companies often struggle to adapt to disruptive technologies and new market entrants, and that this struggle can lead to their downfall. He also introduces the concept of the \"innovator's dilemma,\" which refers to the paradoxical situation in which a company's efforts to improve its existing products or services can actually lead to its own decline."]}
Now, in your Llama client app, you can make an HTTP request like the curl command above to send a query to Llama and parse the response.
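For example, here is a minimal sketch in Python using the requests package; replace <vllm_server_ip_address> with localhost, or with your server's public IP if the port is open:
import requests

# Query the general vLLM API server; it returns {"text": [...]} as shown above.
response = requests.post(
    "http://<vllm_server_ip_address>:5000/generate",
    json={
        "prompt": "Who wrote the book Innovators dilemma?",
        "max_tokens": 300,
        "temperature": 0,
    },
)
print(response.json()["text"][0])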
If you add port 5000 to the inbound rules of your EC2 instance's security group, you can run this on your Mac/Windows machine to test it:
curl http://<EC2_public_ip>:5000/generate -d '{
"prompt": "Who wrote the book godfather?",
"max_tokens": 300,
"temperature": 0
}'
You can also deploy the vLLM-hosted Llama 2 as an OpenAI-compatible service, making it easy to replace code that uses the OpenAI API. First, run the command below:
python openai/api_server.py --host 0.0.0.0 --port 5000 --model meta-llama/Llama-2-7b-chat-hf
Then on another terminal, run:
curl http://localhost:5000/v1/completions -H "Content-Type: application/json" -d '{
"model": "meta-llama/Llama-2-7b-chat-hf",
"prompt": "Who wrote the book Innovators dilemma?",
"max_tokens": 300,
"temperature": 0
}'
and you'll see the following result:
{"id":"cmpl-
3eae7061b2
","object":"text_completion","created":3616,"model":"meta-llama/Llama-2-7b-chat-hf","choices":[{"index":0,"text":"\n\nThe book \"Innovator's Dilemma\" was written by Clayton M. Christensen. It was first published in 1997 and has since become a classic in the field of business and innovation. In the book, Christensen argues that successful companies often struggle to adapt to disruptive technologies and new market entrants, and that this struggle can lead to their downfall. He also introduces the concept of the \"innovator's dilemma,\" which refers to the paradoxical situation in which a company's efforts to improve its existing products or services can actually lead to its own decline.","logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":13,"total_tokens":153,"completion_tokens":140}}
On a Google Colab notebook, first install two packages:
!pip install langchain openai
Note that we only need to install the openai package and use an EMPTY OpenAI API key to complete the LangChain integration with the OpenAI-compatible vLLM deployment of Llama 2; no real OpenAI key is required.
Then replace <vllm_server_ip_address> below with your server's public IP address and run the code:
from langchain.llms import VLLMOpenAI

llm = VLLMOpenAI(
    openai_api_key="EMPTY",
    openai_api_base="http://<vllm_server_ip_address>:5000/v1",
    model_name="meta-llama/Llama-2-7b-chat-hf",
    max_tokens=300,  # the OpenAI-compatible parameter is max_tokens, not max_new_token
)
print(llm("Who wrote the book godfather?"))
You'll see an answer like:
The book "The Godfather" was written by Mario Puzo. It was first published in 1969 and has since become a classic of American literature. The book was later adapted into a successful film directed by Francis Ford Coppola, which was released in 1972.
You can now use the Llama 2 instance llm created this way in any of the Llama demo apps, or in your own Llama apps, and integrate seamlessly with LangChain and LlamaIndex to build powerful on-prem Llama apps.
The easiest way to deploy Llama 2 with TGI is to use TGI's official Docker image. First, make sure you have been granted access to Meta Llama 2 on Hugging Face: open the Hugging Face Meta model page here and confirm that you see "Gated model: You have been granted access to this model". If you don't see that message, simply follow the instructions under "Access Llama 2 on Hugging Face" on the page.
Then copy your Hugging Face access token, which you can create for free on your tokens page, and set the three required shell variables below, using the token as the value of the third:
model=meta-llama/Llama-2-13b-chat-hf
volume=$PWD/data
token=<your Hugging Face access token>
You may replace the model value above with another Llama 2 model.
Finally, run the command below to deploy a quantized version of the Llama 2 13b-chat model with TGI:
docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.0.2 --model-id $model --quantize bitsandbytes-nf4
After this, you'll be able to run the command below on another terminal:
curl 127.0.0.1:8080/generate -X POST -H 'Content-Type: application/json' -d '{
"inputs": "Who wrote the book innovators dilemma?",
"parameters": {
"max_new_tokens":200
}
}'
and see the answer generated by Llama 2 via TGI:
{"generated_text":"\n\nThe book \"The Innovator's Dilemma\" was written by Clayton Christensen, a professor at Harvard Business School. It was first published in 1997 and has since become a widely recognized and influential book on the topic of disruptive innovation."}
Using LangChain to integrate with TGI-hosted Llama 2 is also straightforward. In the Colab above, first add a new code cell to install the Hugging Face text_generation package:
!pip install text_generation
Then add and run the code below:
from langchain.llms import HuggingFaceTextGenInference

llm = HuggingFaceTextGenInference(
    inference_server_url="http://<tgi_server_ip_address>:8080/",
    max_new_tokens=512,
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.01,
    repetition_penalty=1.03,
)
llm("What wrote the book godfather?")
With the Llama 2 instance llm created this way, you can integrate seamlessly with LangChain and LlamaIndex to build powerful on-prem Llama 2 apps such as the Llama demo apps.
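For example, here is a minimal sketch of plugging the llm instance into a LangChain chain; the prompt template is just an illustration, not taken from the demo apps:
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Wrap the TGI-backed llm in a simple prompt-template chain.
prompt = PromptTemplate.from_template("Answer in one sentence: {question}")
chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run(question="Who wrote the book The Godfather?"))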