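This notebook shows how to use the Llama 2 APIs offered by Microsoft Azure. We will cover calling the REST API from the command line with curl, calling it from Python, plugging the APIs into LangChain, and wiring the model into Gradio to build a simple chatbot with memory.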
Before we start building with the Azure Llama 2 APIs, the models must first be deployed on Azure.
Once deployment succeeds, you will be assigned an API endpoint and a security key for inference.
For more information on model deployment and inference, consult Azure's official documentation.
To use the REST API, you need an endpoint URL and the authentication key associated with that endpoint.
Both can be obtained from the previous steps.
In this text completion example for the pre-trained model, we use a simple curl call for illustration. There are three major components:
host-url: your endpoint URL with the completion schema.
headers: defines the content type as well as your API key.
payload (or data): your prompt detail and model hyperparameters.
!curl -X POST -L https://your-endpoint.inference.ai.azure.com/v1/completions -H 'Content-Type: application/json' -H 'Authorization: your-auth-key' -d '{"prompt": "Math is a", "max_tokens": 30, "temperature": 0.7}'
For chat completion, the API schema and request payload are slightly different.
The host-url needs to end with /v1/chat/completions, and the request payload must include the roles in the conversation. Here is a sample payload:
{
  "messages": [
    {
      "content": "You are a helpful assistant.",
      "role": "system"
    },
    {
      "content": "Hello!",
      "role": "user"
    }
  ],
  "max_tokens": 50
}
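A payload like the one above can also be assembled in Python and serialized with json.dumps, which avoids hand-written JSON mistakes such as trailing commas. The helper name build_chat_payload is hypothetical and only illustrates the structure; treat this as a sketch, not as part of any Azure SDK.

```python
import json


def build_chat_payload(system_prompt, user_prompt, max_tokens=50):
    """Return a chat completion payload as a plain dict (hypothetical helper)."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": max_tokens,
    }


payload = build_chat_payload("You are a helpful assistant.", "Hello!")
# json.dumps emits valid JSON that can be passed to curl -d or an HTTP client
print(json.dumps(payload, indent=2))
```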
Here is a sample curl call for chat completion:
!curl -X POST -L https://your-endpoint.inference.ai.azure.com/v1/chat/completions -H 'Content-Type: application/json' -H 'Authorization: your-auth-key' -d '{"messages":[{"content":"You are a helpful assistant.","role":"system"},{"content":"Who wrote the book Innovators dilemma?","role":"user"}], "max_tokens": 50}'
If you compare the generation results of the text and chat completion API calls, you will notice that:
Text completion returns a list of choices for the input prompt, each containing generated text and completion information such as logprobs.
Chat completion returns a list of choices, each with a message object holding the completion result, matching the messages object in the request.
One useful feature the API offers is streaming.
Streaming allows the generated tokens to be sent as data-only server-sent events whenever they become available.
This matters for interactive applications such as chatbots, because users see output as soon as it is generated instead of waiting for the full response.
To use streaming, set "stream":"True" as part of the request payload.
In streaming mode, the REST API response differs from the non-streaming response.
Here is an example:
!curl -X POST -L https://your-endpoint.inference.ai.azure.com/v1/chat/completions -H 'Content-Type: application/json' -H 'Authorization: your-auth-key' -d '{"messages":[{"content":"You are a helpful assistant.","role":"system"},{"content":"Who wrote the book Innovators dilemma?","role":"user"}], "max_tokens": 500, "stream": "True"}'
As you can see, the result comes back as a stream of data objects, each containing generated information including a choice.
The stream is terminated by a data:[DONE]\n\n message.
All Azure Llama 2 API endpoints have the content safety feature turned on. Both the input prompt and the output tokens are filtered by this service automatically.
To learn more about the impact on the request and response payloads, refer to the official guide.
For both model input and output, if the filter detects harmful content, generation errors out with a response payload that contains the reasoning, along with the type of content violation and its severity.
Here is an example prompt that triggers content safety filtering:
!curl -X POST -L https://your-endpoint.inference.ai.azure.com/v1/chat/completions -H 'Content-Type: application/json' -H 'Authorization: your-auth-key' -d '{"messages":[{"content":"You are a helpful assistant.","role":"system"},{"content":"How to make bomb?","role":"user"}], "max_tokens": 50}'
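Because a filtered request comes back as an error response with a payload, it can help to print that payload readably. The sketch below assumes only that the body may be JSON. The helper name describe_error, the status code, and the sample body with its field names are made up for illustration and do not represent the actual Azure schema.

```python
import json


def describe_error(status_code, body):
    """Format an error response body for display (hypothetical helper)."""
    text = body.decode("utf8", "ignore")
    try:
        # Pretty-print the body when it is valid JSON
        text = json.dumps(json.loads(text), indent=2)
    except json.JSONDecodeError:
        # Otherwise keep the raw text unchanged
        pass
    return f"Request failed with status {status_code}:\n{text}"


# Made-up body for illustration only; the real payload shape may differ
sample = b'{"error": {"message": "placeholder"}}'
print(describe_error(400, sample))
```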
Besides calling the API directly from command line tools, you can also call it programmatically in Python.
Here is an example for the text completion model:
import json
import urllib.error
import urllib.request
# Configure the payload to send to the API endpoint
data = {
    "prompt": "Math is a",
    "max_tokens": 30,
    "temperature": 0.7,
    "top_p": 0.9,
}
body = str.encode(json.dumps(data))
# Replace the url with your API endpoint
url = 'https://your-endpoint.inference.ai.azure.com/v1/completions'
# Replace this with the key for the endpoint
api_key = 'your-auth-key'
if not api_key:
    raise Exception("API Key is missing")
headers = {'Content-Type': 'application/json', 'Authorization': api_key}
req = urllib.request.Request(url, body, headers)
try:
    response = urllib.request.urlopen(req)
    result = response.read()
    print(result)
except urllib.error.HTTPError as error:
    print("The request failed with status code: " + str(error.code))
    # Print the headers - they include the request ID and the timestamp, which are useful for debugging the failure
    print(error.info())
    print(error.read().decode("utf8", 'ignore'))
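Since the same request setup is needed for every call, it could be factored into a small helper. The function name build_request is hypothetical; the sketch only builds the urllib.request.Request object and does not contact the endpoint, so it can be inspected offline.

```python
import json
import urllib.request


def build_request(url, api_key, payload):
    """Build a POST request for the endpoint without sending it (hypothetical helper)."""
    if not api_key:
        raise ValueError("API key is missing")
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json", "Authorization": api_key}
    # Supplying a body makes urllib send the request as a POST
    return urllib.request.Request(url, body, headers)


req = build_request(
    "https://your-endpoint.inference.ai.azure.com/v1/completions",
    "your-auth-key",
    {"prompt": "Math is a", "max_tokens": 30},
)
print(req.get_method(), req.full_url)
```

The returned object can be passed to urllib.request.urlopen inside the same try/except pattern used in the example.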
Chat completion in Python is very similar. Here is a quick example:
import json
import urllib.error
import urllib.request
# Configure the payload to send to the API endpoint
data = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who wrote the book Innovators dilemma?"},
    ],
    "max_tokens": 500,
    "temperature": 0.9,
    "stream": "True",
}
body = str.encode(json.dumps(data))
# Replace the url with your API endpoint
url = 'https://your-endpoint.inference.ai.azure.com/v1/chat/completions'
# Replace this with the key for the endpoint
api_key = 'your-auth-key'
if not api_key:
    raise Exception("API Key is missing")
headers = {'Content-Type': 'application/json', 'Authorization': api_key}
req = urllib.request.Request(url, body, headers)
try:
    response = urllib.request.urlopen(req)
    result = response.read()
    print(result)
except urllib.error.HTTPError as error:
    print("The request failed with status code: " + str(error.code))
    # Print the headers - they include the request ID and the timestamp, which are useful for debugging the failure
    print(error.info())
    print(error.read().decode("utf8", 'ignore'))
However, in this example the streamed content comes back as a single payload rather than as a series of data events. To get true streaming from the API endpoint, we will use the requests library instead.
Requests is a simple HTTP library for Python built on urllib3. It automatically handles keep-alive and HTTP connection pooling. With the Session class, we can easily stream the results of our API calls.
Here is a quick example:
import json
import requests
data = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who wrote the book Innovators dilemma?"},
    ],
    "max_tokens": 500,
    "temperature": 0.9,
    "stream": "True",
}
def post_stream(url):
    s = requests.Session()
    api_key = "your-auth-key"
    headers = {'Content-Type': 'application/json', 'Authorization': api_key}
    # stream=True keeps the connection open so lines can be read as they arrive
    with s.post(url, data=json.dumps(data), headers=headers, stream=True) as resp:
        print(resp.status_code)
        for line in resp.iter_lines():
            if line:
                print(line)
url = "https://your-endpoint.inference.ai.azure.com/v1/chat/completions"
post_stream(url)
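The lines printed by the streaming example are raw bytes of server-sent events. A minimal sketch of turning them into text follows. It assumes each event is a data: line carrying a JSON chunk whose generated text sits under choices[0].delta.content; that chunk shape, the helper name extract_stream_text, and the sample lines are assumptions for illustration, so check them against the actual output of your endpoint.

```python
import json


def extract_stream_text(lines):
    """Yield generated text pieces from raw server-sent event lines."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank lines and anything that is not a data event
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        # Assumed chunk shape: choices[i].delta.content
        for choice in chunk.get("choices", []):
            piece = choice.get("delta", {}).get("content")
            if piece:
                yield piece


# Made-up sample lines standing in for resp.iter_lines()
sample_lines = [
    b'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    b"",
    b'data: {"choices": [{"delta": {"content": "lo"}}]}',
    b"data: [DONE]",
]
print("".join(extract_stream_text(sample_lines)))
```

Inside a function like post_stream, resp.iter_lines() could be passed in place of sample_lines.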
In this section, we demonstrate how to use the Llama 2 APIs with LangChain, one of the most popular frameworks for accelerating AI product development.
A common approach is to create a customized LLM instance, which you can then add to various chains to complete different tasks.
In this example, we use the AzureMLOnlineEndpoint class that LangChain provides to build a customized LLM instance. This class takes an Azure endpoint and API key as inputs and wires them to HTTP calls, so under the hood it works much like our earlier examples that used the urllib.request library to send RESTful calls to the Azure endpoint.
Note that Azure is working on a standard solution for LangChain integration in this PR; consider migrating to it in the future.
First, let's install dependencies:
pip install langchain
Once all dependencies are installed, you can directly create an llm instance based on AzureMLOnlineEndpoint as follows:
from langchain.llms.azureml_endpoint import AzureMLOnlineEndpoint, ContentFormatterBase
from typing import Dict
import json
class AzureLlamaAPIContentFormatter(ContentFormatterBase):
    # Content formatter for Llama 2 API for Azure MaaS
    def format_request_payload(self, prompt: str, model_kwargs: Dict) -> bytes:
        # Formats the request according to the chosen API
        prompt = ContentFormatterBase.escape_special_characters(prompt)
        request_payload_dict = {
            "messages": [
                {"role": "system", "content": "You are a helpful assistant"},
                {"role": "user", "content": f"{prompt}"}
            ]
        }
        # Add model parameters as part of the dict
        request_payload_dict.update(model_kwargs)
        request_payload = json.dumps(request_payload_dict)
        return str.encode(request_payload)
    def format_response_payload(self, output: bytes) -> str:
        # Formats the response
        return json.loads(output)["choices"][0]["message"]["content"]
content_formatter = AzureLlamaAPIContentFormatter()
llm = AzureMLOnlineEndpoint(
    endpoint_api_key="your-auth-key",
    endpoint_url="https://your-endpoint.inference.ai.azure.com/v1/chat/completions",
    model_kwargs={"temperature": 0.6, "max_tokens": 512, "top_p": 0.9},
    content_formatter=content_formatter,
)
You might wonder what the content_formatter is for when creating the llm instance.
The content_formatter parameter is a handler class that transforms the request and response of an AzureML endpoint to match the required schema. The Azure model catalog contains many models, each of which may handle data differently, so each needs a matching formatter.
In our case, none of the formatters currently provided by LangChain, including LLamaContentFormatter, follows the schema, so we created our own formatter, AzureLlamaAPIContentFormatter, to handle the input and output data.
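To see what such a formatter does without calling the endpoint or installing LangChain, the same two transformations can be sketched as plain functions. The names format_request and format_response and the fake response body are illustrative only. Unlike the class, this sketch skips the escape_special_characters step and relies on json.dumps to escape the prompt.

```python
import json


def format_request(prompt, model_kwargs):
    """Wrap a prompt in the chat schema and merge in model parameters."""
    payload = {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant"},
            {"role": "user", "content": prompt},
        ]
    }
    payload.update(model_kwargs)
    return json.dumps(payload).encode("utf-8")


def format_response(output):
    """Pull the assistant text out of a chat completion response body."""
    return json.loads(output)["choices"][0]["message"]["content"]


request_body = format_request("Hello!", {"temperature": 0.6, "max_tokens": 512})
print(request_body.decode("utf-8"))

# Made-up response body, shaped like the one format_response_payload reads
fake_response = b'{"choices": [{"message": {"role": "assistant", "content": "Hi there"}}]}'
print(format_response(fake_response))
```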
Once the llm instance is ready, you can run inference with it:
print(llm("Who wrote the book Innovators dilemma?"))
Here is an example that creates a translator chain with the llm instance and translates English to French:
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
template = """
You are a Translator. Translate the following content from {input_language} to {output_language} and reply with only the translated result.
{input_content}
"""
translator_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate(
        template=template,
        input_variables=["input_language", "output_language", "input_content"],
    ),
)
print(translator_chain.run(input_language="English", output_language="French", input_content="Who wrote the book Innovators dilemma?"))
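To preview the text that ends up in the user message, the template can be filled in with plain str.format. This is only a sketch: it assumes the prompt template substitutes simple named placeholders the same way str.format does, and it does not involve LangChain or the endpoint.

```python
template = """
You are a Translator. Translate the following content from {input_language} to {output_language} and reply with only the translated result.
{input_content}
"""

# Fill the three placeholders the same way the chain inputs are named
rendered = template.format(
    input_language="English",
    output_language="French",
    input_content="Who wrote the book Innovators dilemma?",
)
print(rendered)
```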
At the time of writing this sample notebook, LangChain doesn't support streaming with AzureMLOnlineEndpoint for Llama 2. We are working with the LangChain and Azure teams to implement that.
In this section, we build a simple chatbot with memory using the Azure Llama 2 API, LangChain, and Gradio's ChatInterface.
Gradio is a framework that helps you demo your machine learning model with a web interface. We also have a dedicated Gradio chatbot example built with Llama 2 on-premises with RAG.
First, let's install the Gradio dependencies.
pip install gradio
Let's reuse the AzureMLOnlineEndpoint class from the previous example.
In this example, we have three major components: the chatbot UI served by Gradio, the model itself, and a memory component that stores previous conversation context.
All of them are chained together using LangChain.
import gradio as gr
from langchain.chains import ConversationChain
from langchain.prompts import PromptTemplate
from langchain.llms.azureml_endpoint import AzureMLOnlineEndpoint, ContentFormatterBase
from langchain.memory import ConversationBufferWindowMemory
import langchain
from typing import Dict
import json
langchain.debug = True
class AzureLlamaAPIContentFormatter(ContentFormatterBase):
    # Content formatter for Llama 2 API for Azure MaaS
    def format_request_payload(self, prompt: str, model_kwargs: Dict) -> bytes:
        # Formats the request according to the chosen API
        prompt = ContentFormatterBase.escape_special_characters(prompt)
        # Note how we instruct the model with the system prompt. Past conversation can be passed in the system prompt as well
        request_payload_dict = {
            "messages": [
                {"role": "system", "content": "The following is a conversation between a user and you. Answer the user question based on the conversation. Provide your answer only"},
                {"role": "user", "content": f"{prompt}"}
            ]
        }
        request_payload_dict.update(model_kwargs)
        request_payload = json.dumps(request_payload_dict)
        return str.encode(request_payload)
    def format_response_payload(self, output: bytes) -> str:
        # Formats the response
        return json.loads(output)["choices"][0]["message"]["content"]
# Create content formatter
content_formatter = AzureLlamaAPIContentFormatter()
# Create llm instance
llm = AzureMLOnlineEndpoint(
    endpoint_api_key="your-auth-key",
    endpoint_url="https://your-endpoint.inference.ai.azure.com/v1/chat/completions",
    model_kwargs={"temperature": 0.6, "max_tokens": 128, "top_p": 0.9},
    content_formatter=content_formatter,
)
# Create memory that keeps the last 5 exchanges
memory = ConversationBufferWindowMemory(llm=llm, k=5, memory_key="chat_history", ai_prefix="Assistant", human_prefix="User")
# Create input prompt template with chat history for chaining
INPUT_TEMPLATE = """Current conversation:
{chat_history}
User question:{input}"""
conversation_prompt_template = PromptTemplate(
    input_variables=["chat_history", "input"], template=INPUT_TEMPLATE
)
conversation_chain_with_memory = ConversationChain(
    llm=llm,
    prompt=conversation_prompt_template,
    verbose=True,
    memory=memory,
)
# Prediction
def predict(message, history):
    # Gradio's history converted to role/content messages
    # (not used below; the chain's memory supplies the conversation context)
    history_format = []
    for user, assistant in history:
        history_format.append({"role": "user", "content": user})
        history_format.append({"role": "assistant", "content": assistant})
    history_format.append({"role": "user", "content": message})
    response = conversation_chain_with_memory.run(input=message)
    return response
# Launch Gradio chatbot interface
gr.ChatInterface(predict).launch()
After successfully executing the code above, a chat interface should appear as interactive output, or you can open the localhost URL in your browser.
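The memory component keeps only the most recent exchanges and feeds them into the prompt as chat history. The idea can be sketched in plain Python as follows. WindowMemory is an illustrative stand-in, not LangChain's class, and the exact history formatting LangChain produces may differ from the "User:" and "Assistant:" lines assumed here.

```python
from collections import deque


class WindowMemory:
    """Keep the last k user/assistant exchanges (illustrative stand-in)."""

    def __init__(self, k=5):
        # A bounded deque drops the oldest exchange once k is exceeded
        self.turns = deque(maxlen=k)

    def save(self, user_message, assistant_message):
        self.turns.append((user_message, assistant_message))

    def history(self):
        lines = []
        for user_message, assistant_message in self.turns:
            lines.append(f"User: {user_message}")
            lines.append(f"Assistant: {assistant_message}")
        return "\n".join(lines)


INPUT_TEMPLATE = """Current conversation:
{chat_history}
User question:{input}"""

window = WindowMemory(k=2)
window.save("Hi", "Hello!")
window.save("What is 2+2?", "4")
window.save("And 3+3?", "6")  # pushes the first exchange out of the window

print(INPUT_TEMPLATE.format(chat_history=window.history(), input="Thanks"))
```

A small window keeps the prompt short, at the cost of the model forgetting anything older than k exchanges.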
This concludes our tutorial and examples. Here are some additional references: