|
@@ -0,0 +1,610 @@
|
|
|
+{
|
|
|
+ "cells": [
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "# Use Azure API with Llama 2\n",
|
|
|
+ "\n",
|
|
|
+ "This notebook shows examples of how to use Llama 2 APIs offered by Microsoft Azure. We will cover: \n",
|
|
|
+ "* HTTP requests API usage for Llama 2 70B pretrained and chat models in CLI\n",
|
|
|
+ "* HTTP requests API usage for Llama 2 70B pretrained and chat models in Python\n",
|
|
|
+ "* Plug the APIs into LangChain\n",
|
|
|
+ "* Wire the model with Gradio to build a simple chatbot with memory\n",
|
|
|
+ "\n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "## Prerequisite\n",
|
|
|
+ "\n",
|
|
|
+ "Before we start building with Azure Llama 2 APIs, there are certain steps we need to take to deploy the models:\n",
|
|
|
+ "\n",
|
|
|
+ "* Register for a valid Azure account with subscription \n",
|
|
|
+ "* Make sure you have access to [Azure AI Studio](https://learn.microsoft.com/en-us/azure/ai-studio/what-is-ai-studio?tabs=home)\n",
|
|
|
+ "* Create a project and resource group\n",
|
|
|
+ "* Select Llama models from Model catalog\n",
|
|
|
+ "* Deploy with \"Pay-as-you-go\"\n",
|
|
|
+ "\n",
|
|
|
+ "Once deployed successfully, you should be assigned for an API endpoint and a security key for inference. You can also deploy the model by using Azure ML Python SDK. \n",
|
|
|
+ "\n",
|
|
|
+ "For more information, you should consult Azure's official documentation [here](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/deploy-models-llama?tabs=azure-studio) for model deployment and inference."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "## HTTP Requests API Usage in CLI\n",
|
|
|
+ "\n",
|
|
|
+ "### Basics\n",
|
|
|
+ "\n",
|
|
|
+ "For using the REST API, You will need to have a Endpoint url and Authentication Key associated with that endpoint. \n",
|
|
|
+ "This can be acquired from previous steps. \n",
|
|
|
+ "\n",
|
|
|
+ "In this text completion example for 70B pre-trained model, we use a simple curl call for illustration. There are three major components: \n",
|
|
|
+ "\n",
|
|
|
+ "* The `host-url` is your endpoint url with completion schema. \n",
|
|
|
+ "* The `headers` defines the content type as well as your api key. \n",
|
|
|
+ "* The `payload` or `data`, which is your prompt detail and model hyper parameters."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "!curl -X POST -L https://your-endpoint.inference.ai.azure.com/v1/completions -H 'Content-Type: application/json' -H 'Authorization: your-auth-key' -d '{\"prompt\": \"Math is a\", \"max_tokens\": 30, \"temperature\": 0.7}' "
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "For chat completion, the API schema and request payload are slightly different.\n",
|
|
|
+ "\n",
|
|
|
+ "For `host-url` the path changed to `/v1/chat/completions` and the request payload also changed to include roles in conversations. Here is a sample payload: \n",
|
|
|
+ "\n",
|
|
|
+ "```\n",
|
|
|
+ "{ \n",
|
|
|
+ " \"messages\": [ \n",
|
|
|
+ " { \n",
|
|
|
+ " \"content\": \"You are a helpful assistant.\", \n",
|
|
|
+ " \"role\": \"system\" \n",
|
|
|
+ "}, \n",
|
|
|
+ " { \n",
|
|
|
+ " \"content\": \"Hello!\", \n",
|
|
|
+ " \"role\": \"user\" \n",
|
|
|
+ " } \n",
|
|
|
+ " ], \n",
|
|
|
+ " \"max_tokens\": 50, \n",
|
|
|
+ "} \n",
|
|
|
+ "```\n",
|
|
|
+ "\n",
|
|
|
+ "Here is a sample curl call for chat completion"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "!curl -X POST -L https://your-endpoint.inference.ai.azure.com/v1/chat/completions -H 'Content-Type: application/json' -H 'Authorization: your-auth-key' -d '{\"messages\":[{\"content\":\"You are a helpful assistant.\",\"role\":\"system\"},{\"content\":\"What is good about Wuhan?\",\"role\":\"user\"}], \"max_tokens\": 50}'"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "If you compare the generation result for both text and chat completion API calls, you will notice that: \n",
|
|
|
+ "\n",
|
|
|
+ "* Text completion returns a list of `choices` for the input prompt, each contains generated text and completion information such as `logprobs`.\n",
|
|
|
+ "* Chat completion returns a list of `cnoices` each has a `message` object with completion result and using the same `message` object in the request. \n",
|
|
|
+ "\n",
|
|
|
+ "\n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "### Streaming\n",
|
|
|
+ "\n",
|
|
|
+ "One fantastic feature the API offered is the streaming capability. \n",
|
|
|
+ "Streaming allows the generated tokens to be sent as data-only server-sent events whenever they become available. \n",
|
|
|
+ "This is extremely important for interactive applications such as chatbots, so the user is always engaged. \n",
|
|
|
+ "\n",
|
|
|
+ "To use streaming, simply set `\"stream\":\"True\"` as part of the request payload. \n",
|
|
|
+ "In the streaming mode, the REST API response will be different from non-streaming mode.\n",
|
|
|
+ "\n",
|
|
|
+ "Here is an example: "
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "!curl -X POST -L https://your-endpoint.inference.ai.azure.com/v1/chat/completions -H 'Content-Type: application/json' -H 'Authorization: your-auth-key' -d '{\"messages\":[{\"content\":\"You are a helpful assistant.\",\"role\":\"system\"},{\"content\":\"What is good about Wuhan?\",\"role\":\"user\"}], \"max_tokens\": 500, \"stream\": \"True\"}'"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "As you can see the result comes back as a stream of `data` objects, each contains generated information including a `choice`. \n",
|
|
|
+ "The stream terminated by a `data:[DONE]\\n\\n` message."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "### Content Safety Filtering\n",
|
|
|
+ "\n",
|
|
|
+ "All Azure Llama 2 API endpoint will have content safety feature turned on. Both input prompt and output tokens are filtered by this service automatically. \n",
|
|
|
+ "To know more about the impact to the request/response payload, please refer to official guide [here](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter?tabs=python). \n",
|
|
|
+ "\n",
|
|
|
+ "For model input and output, if the filter detected there is harmful content. The generation will error out with reponse payload containing the reasoning, along with which type of content violation it is and severity. \n",
|
|
|
+ "\n",
|
|
|
+ "Here is an example prompt that triggered content safety filtering:\n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "!curl -X POST -L https://your-endpoint.inference.ai.azure.com/v1/chat/completions -H 'Content-Type: application/json' -H 'Authorization: your-auth-key' -d '{\"messages\":[{\"content\":\"You are a helpful assistant.\",\"role\":\"system\"},{\"content\":\"How to make bomb?\",\"role\":\"user\"}], \"max_tokens\": 50}'"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "## HTTP Requests API Usage in Python\n",
|
|
|
+ "\n",
|
|
|
+ "Besides calling the API directly from command line tools. You can also programatically call them in Python. \n",
|
|
|
+ "\n",
|
|
|
+ "Here is an example for text completion model:\n",
|
|
|
+ "\n",
|
|
|
+ "\n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "import urllib.request\n",
|
|
|
+ "import json\n",
|
|
|
+ "\n",
|
|
|
+ "#Configure payload data sending to API endpoint\n",
|
|
|
+ "data = {\"prompt\": \"Math is a\", \n",
|
|
|
+ " \"max_tokens\": 30, \n",
|
|
|
+ " \"temperature\": 0.7,\n",
|
|
|
+ " \"top_p\": 0.9, \n",
|
|
|
+ "}\n",
|
|
|
+ "\n",
|
|
|
+ "body = str.encode(json.dumps(data))\n",
|
|
|
+ "\n",
|
|
|
+ "#Replace the url with your API endpoint\n",
|
|
|
+ "url = 'https://your-endpoint.inference.ai.azure.com/v1/completions'\n",
|
|
|
+ "\n",
|
|
|
+ "#Replace this with the key for the endpoint\n",
|
|
|
+ "api_key = 'your-auth-key'\n",
|
|
|
+ "if not api_key:\n",
|
|
|
+ " raise Exception(\"API Key is missing\")\n",
|
|
|
+ "\n",
|
|
|
+ "headers = {'Content-Type':'application/json', 'Authorization':(api_key)}\n",
|
|
|
+ "req = urllib.request.Request(url, body, headers)\n",
|
|
|
+ "\n",
|
|
|
+ "try:\n",
|
|
|
+ " response = urllib.request.urlopen(req)\n",
|
|
|
+ " result = response.read()\n",
|
|
|
+ " print(result)\n",
|
|
|
+ "except urllib.error.HTTPError as error:\n",
|
|
|
+ " print(\"The request failed with status code: \" + str(error.code))\n",
|
|
|
+ " # Print the headers - they include the requert ID and the timestamp, which are useful for debugging the failure\n",
|
|
|
+ " print(error.info())\n",
|
|
|
+ " print(error.read().decode(\"utf8\", 'ignore'))\n"
|
|
|
+ ]
|
|
|
+ },
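|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "The raw `result` above is a JSON byte string. Here is a minimal sketch of extracting just the generated text from it, assuming the OpenAI-style response schema illustrated earlier:"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "import json\n",
|
|
|
+ "\n",
|
|
|
+ "#Assumption: the response follows an OpenAI-style schema with a \"choices\" list\n",
|
|
|
+ "parsed = json.loads(result)\n",
|
|
|
+ "print(parsed[\"choices\"][0][\"text\"])"
|
|
|
+ ]
|
|
|
+ },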
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Chat completion in Python is very similar, here is a quick example:"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "import urllib.request\n",
|
|
|
+ "import json\n",
|
|
|
+ "\n",
|
|
|
+ "#Configure payload data sending to API endpoint\n",
|
|
|
+ "data = {\"messages\":[\n",
|
|
|
+ " {\"role\":\"system\", \"content\":\"You are a helpful assistant.\"},\n",
|
|
|
+ " {\"role\":\"user\", \"content\":\"What is good about Wuhan?\"}], \n",
|
|
|
+ " \"max_tokens\": 500,\n",
|
|
|
+ " \"temperature\": 0.9,\n",
|
|
|
+ " \"stream\": \"True\",\n",
|
|
|
+ "}\n",
|
|
|
+ "\n",
|
|
|
+ "body = str.encode(json.dumps(data))\n",
|
|
|
+ "\n",
|
|
|
+ "#Replace the url with your API endpoint\n",
|
|
|
+ "url = 'https://your-endpoint.inference.ai.azure.com/v1/chat/completions'\n",
|
|
|
+ "\n",
|
|
|
+ "#Replace this with the key for the endpoint\n",
|
|
|
+ "api_key = 'your-auth-key'\n",
|
|
|
+ "if not api_key:\n",
|
|
|
+ " raise Exception(\"API Key is missing\")\n",
|
|
|
+ "\n",
|
|
|
+ "headers = {'Content-Type':'application/json', 'Authorization':(api_key)}\n",
|
|
|
+ "\n",
|
|
|
+ "req = urllib.request.Request(url, body, headers)\n",
|
|
|
+ "\n",
|
|
|
+ "try:\n",
|
|
|
+ " response = urllib.request.urlopen(req)\n",
|
|
|
+ " result = response.read()\n",
|
|
|
+ " print(result)\n",
|
|
|
+ "except urllib.error.HTTPError as error:\n",
|
|
|
+ " print(\"The request failed with status code: \" + str(error.code))\n",
|
|
|
+ " # Print the headers - they include the requert ID and the timestamp, which are useful for debugging the failure\n",
|
|
|
+ " print(error.info())\n",
|
|
|
+ " print(error.read().decode(\"utf8\", 'ignore'))\n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "However in this example, the streamed data content returns back as a single payload. It didn't stream as a serial of data events as we wished. To build true streaming capabilities utilizing the API endpoint, we will utilize [`requests`](https://requests.readthedocs.io/en/latest/) library instead."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "### Streaming in Python\n",
|
|
|
+ "\n",
|
|
|
+ "`Requests` library is a simple HTTP library for Python built with [`urllib3`](https://github.com/urllib3/urllib3). It automatically maintains the keep-alive and HTTP connection pooling. With the `Session` class, we can easily stream the result from our API calls. \n",
|
|
|
+ "\n",
|
|
|
+ "Here is a quick example:"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "import json\n",
|
|
|
+ "import requests\n",
|
|
|
+ "\n",
|
|
|
+ "data = {\"messages\":[\n",
|
|
|
+ " {\"role\":\"system\", \"content\":\"You are a helpful assistant.\"},\n",
|
|
|
+ " {\"role\":\"user\", \"content\":\"What is good about Wuhan?\"}],\n",
|
|
|
+ " \"max_tokens\": 500,\n",
|
|
|
+ " \"temperature\": 0.9,\n",
|
|
|
+ " \"stream\": \"True\"\n",
|
|
|
+ "}\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "def post_stream(url):\n",
|
|
|
+ " s = requests.Session()\n",
|
|
|
+ " api_key = \"your-auth-key\"\n",
|
|
|
+ " headers = {'Content-Type':'application/json', 'Authorization':(api_key)}\n",
|
|
|
+ "\n",
|
|
|
+ " with s.post(url, data=json.dumps(data), headers=headers, stream=True) as resp:\n",
|
|
|
+ " print(resp.status_code)\n",
|
|
|
+ " for line in resp.iter_lines():\n",
|
|
|
+ " if line:\n",
|
|
|
+ " print(line)\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "url = \"https://your-endpoint.inference.ai.azure.com/v1/chat/completions\"\n",
|
|
|
+ "post_stream(url)"
|
|
|
+ ]
|
|
|
+ },
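|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "The loop above prints the raw event lines. Here is a minimal sketch of extracting just the generated tokens, assuming each event is a `data: {...}` line carrying an OpenAI-style chunk with a `choices[0][\"delta\"][\"content\"]` field:"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "import json\n",
|
|
|
+ "\n",
|
|
|
+ "def extract_token(line):\n",
|
|
|
+ "    #Each event line looks like b'data: {...}'; the stream ends with b'data: [DONE]'\n",
|
|
|
+ "    #Assumption: chunks follow an OpenAI-style streaming schema\n",
|
|
|
+ "    payload = line.decode(\"utf-8\").removeprefix(\"data:\").strip()\n",
|
|
|
+ "    if not payload or payload == \"[DONE]\":\n",
|
|
|
+ "        return \"\"\n",
|
|
|
+ "    chunk = json.loads(payload)\n",
|
|
|
+ "    return chunk[\"choices\"][0].get(\"delta\", {}).get(\"content\", \"\")\n",
|
|
|
+ "\n",
|
|
|
+ "#Use extract_token(line) in place of print(line) inside the iter_lines() loop above\n",
|
|
|
+ "print(extract_token(b'data: {\"choices\": [{\"delta\": {\"content\": \" Wuhan\"}, \"index\": 0}]}'))"
|
|
|
+ ]
|
|
|
+ },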
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "## Use Llama 2 API with LangChain\n",
|
|
|
+ "\n",
|
|
|
+ "In this section, we will demonstrate how to use Llama 2 APIs with LangChain, one of the most popoular framework to accelerate building your AI product. \n",
|
|
|
+ "One common solution here is to create your customized LLM instance, so you can add it to various chains to complete different tasks. \n",
|
|
|
+ "In this example, we will use `AzureMLOnlineEndpoint` class LangChain provided to build this customized LLM instance. This particular class is designed to take in Azure endpoint and API keys as inputs and wired it with HTTP calls. So the underlying of it is very similar to how we used `urllib.request` library to send RESTful calls in previous examples to Azure Endpoint. \n",
|
|
|
+ "\n",
|
|
|
+ "Note Azure is working on a standard solution for LangChain integration in this [PR](https://github.com/langchain-ai/langchain/pull/14481), you should consider migrating to that in the future. \n",
|
|
|
+ "\n",
|
|
|
+ "First, let's install dependencies: \n",
|
|
|
+ "\n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "pip install langchain"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Once all dependencies installed, you can directly create a `llm` instance based on `AzureMLOnlineEndpoint` as follow: "
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "from langchain.llms.azureml_endpoint import AzureMLOnlineEndpoint, ContentFormatterBase\n",
|
|
|
+ "from typing import Dict\n",
|
|
|
+ "import json\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "class AzureLlamaAPIContentFormatter(ContentFormatterBase):\n",
|
|
|
+ "#Content formatter for Llama 2 API for Azure MaaS\n",
|
|
|
+ "\n",
|
|
|
+ " def format_request_payload(self, prompt: str, model_kwargs: Dict) -> bytes:\n",
|
|
|
+ " #Formats the request according to the chosen api\n",
|
|
|
+ " prompt = ContentFormatterBase.escape_special_characters(prompt)\n",
|
|
|
+ " request_payload_dict = {\n",
|
|
|
+ " \"messages\": [\n",
|
|
|
+ " {\"role\":\"system\", \"content\":\"You are a helpful assistant\"},\n",
|
|
|
+ " {\"role\":\"user\", \"content\":f\"{prompt}\"}\n",
|
|
|
+ " ] \n",
|
|
|
+ " }\n",
|
|
|
+ " #Add model parameters as part of the dict\n",
|
|
|
+ " request_payload_dict.update(model_kwargs)\n",
|
|
|
+ " request_payload = json.dumps(request_payload_dict)\n",
|
|
|
+ " return str.encode(request_payload)\n",
|
|
|
+ "\n",
|
|
|
+ " def format_response_payload(self, output: bytes) -> str:\n",
|
|
|
+ " #Formats response\n",
|
|
|
+ " return json.loads(output)[\"choices\"][0][\"message\"][\"content\"]\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "content_formatter = AzureLlamaAPIContentFormatter()\n",
|
|
|
+ "\n",
|
|
|
+ "llm = AzureMLOnlineEndpoint(\n",
|
|
|
+ " endpoint_api_key=\"your-auth-key\",\n",
|
|
|
+ " endpoint_url=\"https://your-endpoint.inference.ai.azure.com/v1/chat/completions\",\n",
|
|
|
+ " model_kwargs={\"temperature\": 0.6, \"max_tokens\": 512, \"top_p\": 0.9},\n",
|
|
|
+ " content_formatter=content_formatter,\n",
|
|
|
+ ")"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "However, you might wonder what is the `content_formatter` in the context when creating the `llm` instance? \n",
|
|
|
+ "The `content_formatter` parameter is a [handler class](https://python.langchain.com/docs/integrations/llms/azure_ml#content-formatter) for transforming the request and response of an AzureML endpoint to match with required schema. Since there are various models in the Azure model catalog, each of which needs to handle the data accordingly. \n",
|
|
|
+ "In our case, all current formatters provided by Langchain including `LLamaContentFormatter` don't follow the schema. So we created our own customized formatter called `AzureLlamaAPIContentFormatter` to handle the input and output data. \n",
|
|
|
+ "\n",
|
|
|
+ "Once you have the `llm` ready, you can simple inference it by:"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "print(llm(\"What is good about Wuhan?\"))"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Here is an example that you can create a translator chain with the `llm` instance and translate English to French:"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "from langchain.chains import LLMChain\n",
|
|
|
+ "from langchain.prompts import PromptTemplate\n",
|
|
|
+ "\n",
|
|
|
+ "template = \"\"\"\n",
|
|
|
+ "You are a Translator. Translate the following content from {input_language} to {output_language} and reply with only the translated result.\n",
|
|
|
+ "{input_content}\n",
|
|
|
+ "\"\"\"\n",
|
|
|
+ "\n",
|
|
|
+ "translator_chain = LLMChain(\n",
|
|
|
+ " llm = llm,\n",
|
|
|
+ " prompt = PromptTemplate(\n",
|
|
|
+ " template=template,\n",
|
|
|
+ " input_variables=[\"input_language\", \"output_language\", \"input_content\"],\n",
|
|
|
+ " ),\n",
|
|
|
+ ")\n",
|
|
|
+ "\n",
|
|
|
+ "print(translator_chain.run(input_language=\"English\", output_language=\"French\", input_content=\"What is good about Wuhan?\"))\n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "At the time of writing this sample notebook, LangChain doesn't support streaming with `AzureMLOnlineEndpoint` for Llama 2. We are working with LangChain and Azure team to implement that."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "## Build a chatbot with Llama 2 API\n",
|
|
|
+ "\n",
|
|
|
+ "In this section, we will build a simple chatbot using Azure Llama 2 API, LangChain and [Gradio](https://www.gradio.app/)'s `ChatInterface` with memory capability.\n",
|
|
|
+ "\n",
|
|
|
+ "Gradio is a framework to help demo your machine learning model with a web interface. We also have a dedicated Gradio chatbot [example](https://github.com/facebookresearch/llama-recipes/tree/main/demo_apps/RAG_Chatbot_example) built with Llama 2 on-premises with RAG. \n",
|
|
|
+ "\n",
|
|
|
+ "First, let's install Gradio dependencies.\n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "\n",
|
|
|
+ "pip install gradio"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Let's use `AzureMLOnlineEndpoint` class from the previous example. \n",
|
|
|
+ "In this example, we have three major components: \n",
|
|
|
+ "1. Chatbot UI hosted as web interface by Gradio. These are the UI logics render our model predictions.\n",
|
|
|
+ "2. Model itself, which is the core component that ingest prompts and return an answer back.\n",
|
|
|
+ "3. Memory component, which stores previous conversation context. In this example, we will use [conversation window buffer](https://python.langchain.com/docs/modules/memory/types/buffer_window) which only logs context in certain time window in the past. \n",
|
|
|
+ "\n",
|
|
|
+ "All of them are chained together using LangChain."
|
|
|
+ ]
|
|
|
+ },
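|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Before wiring everything together, here is a quick standalone sketch of the window memory behavior: with `k=2`, only the last two exchanges are kept and older turns are dropped."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "from langchain.memory import ConversationBufferWindowMemory\n",
|
|
|
+ "\n",
|
|
|
+ "#Keep only the last k=2 exchanges; older turns are dropped from the buffer\n",
|
|
|
+ "memory_demo = ConversationBufferWindowMemory(k=2, memory_key=\"chat_history\", ai_prefix=\"Assistant\", human_prefix=\"User\")\n",
|
|
|
+ "memory_demo.save_context({\"input\": \"Hi\"}, {\"output\": \"Hello! How can I help?\"})\n",
|
|
|
+ "memory_demo.save_context({\"input\": \"What is Gradio?\"}, {\"output\": \"A framework for building ML demos.\"})\n",
|
|
|
+ "memory_demo.save_context({\"input\": \"And LangChain?\"}, {\"output\": \"A framework for building LLM apps.\"})\n",
|
|
|
+ "\n",
|
|
|
+ "#Only the two most recent exchanges remain in chat_history\n",
|
|
|
+ "print(memory_demo.load_memory_variables({})[\"chat_history\"])"
|
|
|
+ ]
|
|
|
+ },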
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "import gradio as gr\n",
|
|
|
+ "from langchain.chains import ConversationChain\n",
|
|
|
+ "from langchain.prompts import PromptTemplate\n",
|
|
|
+ "from langchain.llms.azureml_endpoint import AzureMLOnlineEndpoint, ContentFormatterBase\n",
|
|
|
+ "from langchain.memory import ConversationBufferWindowMemory\n",
|
|
|
+ "\n",
|
|
|
+ "import langchain\n",
|
|
|
+ "from typing import Dict\n",
|
|
|
+ "import json\n",
|
|
|
+ "\n",
|
|
|
+ "langchain.debug=True\n",
|
|
|
+ "\n",
|
|
|
+ "class AzureLlamaAPIContentFormatter(ContentFormatterBase):\n",
|
|
|
+ "#Content formatter for Llama 2 API for Azure MaaS\n",
|
|
|
+ "\n",
|
|
|
+ " def format_request_payload(self, prompt: str, model_kwargs: Dict) -> bytes:\n",
|
|
|
+ " #Formats the request according to the chosen api\n",
|
|
|
+ " prompt = ContentFormatterBase.escape_special_characters(prompt)\n",
|
|
|
+ "\n",
|
|
|
+ " #Note how we instructed the model with system prompts. Past conversation can be past as in system prompt as well\n",
|
|
|
+ " request_payload_dict = {\n",
|
|
|
+ " \"messages\": [\n",
|
|
|
+ " {\"role\":\"system\", \"content\":\"The following is a conversation between a user and you. Answer the user question based on the conversation. Provide your answer only\"},\n",
|
|
|
+ " {\"role\":\"user\", \"content\":f\"{prompt}\"}\n",
|
|
|
+ " ] \n",
|
|
|
+ " }\n",
|
|
|
+ " request_payload_dict.update(model_kwargs)\n",
|
|
|
+ " request_payload = json.dumps(request_payload_dict)\n",
|
|
|
+ " return str.encode(request_payload)\n",
|
|
|
+ "\n",
|
|
|
+ " def format_response_payload(self, output: bytes) -> str:\n",
|
|
|
+ " #Formats response\n",
|
|
|
+ " return json.loads(output)[\"choices\"][0][\"message\"][\"content\"]\n",
|
|
|
+ "\n",
|
|
|
+ "#Create content fomartter\n",
|
|
|
+ "content_formatter = AzureLlamaAPIContentFormatter()\n",
|
|
|
+ "\n",
|
|
|
+ "#Create llm instance\n",
|
|
|
+ "llm = AzureMLOnlineEndpoint(\n",
|
|
|
+ " endpoint_api_key=\"your-auth-key\",\n",
|
|
|
+ " endpoint_url=\"https://your-endpoint.inference.ai.azure.com/v1/chat/completions\",\n",
|
|
|
+ " model_kwargs={\"temperature\": 0.6, \"max_tokens\": 128, \"top_p\": 0.9},\n",
|
|
|
+ " content_formatter=content_formatter,\n",
|
|
|
+ ")\n",
|
|
|
+ "\n",
|
|
|
+ "#Create memory\n",
|
|
|
+ "memory = ConversationBufferWindowMemory(llm=llm, k=5, memory_key=\"chat_history\", ai_prefix=\"Assistant\", human_prefix=\"User\")\n",
|
|
|
+ "\n",
|
|
|
+ "#Create input prompt template with chat history for chaining\n",
|
|
|
+ "INPUT_TEMPLATE = \"\"\"Current conversation:\n",
|
|
|
+ "{chat_history}\n",
|
|
|
+ "\n",
|
|
|
+ "User question:{input}\"\"\"\n",
|
|
|
+ "\n",
|
|
|
+ "conversation_prompt_template = PromptTemplate(\n",
|
|
|
+ " input_variables=[\"chat_history\", \"input\"], template=INPUT_TEMPLATE\n",
|
|
|
+ ")\n",
|
|
|
+ "\n",
|
|
|
+ "conversation_chain_with_memory = ConversationChain(\n",
|
|
|
+ " llm = llm,\n",
|
|
|
+ " prompt = conversation_prompt_template,\n",
|
|
|
+ " verbose = True,\n",
|
|
|
+ " memory = memory,\n",
|
|
|
+ ")\n",
|
|
|
+ "\n",
|
|
|
+ "#Prediction\n",
|
|
|
+ "def predict(message, history):\n",
|
|
|
+ " history_format = []\n",
|
|
|
+ " for user, assistant in history:\n",
|
|
|
+ " history_format.append({\"role\": \"user\", \"content\": user })\n",
|
|
|
+ " history_format.append({\"role\": \"assistant\", \"content\":assistant})\n",
|
|
|
+ " history_format.append({\"role\": \"user\", \"content\": message})\n",
|
|
|
+ " response = conversation_chain_with_memory.run(input=message)\n",
|
|
|
+ " return response\n",
|
|
|
+ "\n",
|
|
|
+ "#Launch Gradio chatbot interface\n",
|
|
|
+ "gr.ChatInterface(predict).launch()"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "After successfully execute the code above, a chat interface should appear as the interactive output or you can open the localhost url in your selected browser window. \n",
|
|
|
+ "\n",
|
|
|
+ "This concludes our tutorial and examples. Here are some additional reference: \n",
|
|
|
+ "* [Fine-tune Llama](https://learn.microsoft.com/azure/ai-studio/how-to/fine-tune-model-llama)\n",
|
|
|
+ "* [Plan and manage costs (marketplace)](https://learn.microsoft.com/azure/ai-studio/how-to/costs-plan-manage#monitor-costs-for-models-offered-through-the-azure-marketplace)\n"
|
|
|
+ ]
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "metadata": {
|
|
|
+ "kernelspec": {
|
|
|
+ "display_name": "Python 3",
|
|
|
+ "language": "python",
|
|
|
+ "name": "python3"
|
|
|
+ },
|
|
|
+ "language_info": {
|
|
|
+ "codemirror_mode": {
|
|
|
+ "name": "ipython",
|
|
|
+ "version": 3
|
|
|
+ },
|
|
|
+ "file_extension": ".py",
|
|
|
+ "mimetype": "text/x-python",
|
|
|
+ "name": "python",
|
|
|
+ "nbconvert_exporter": "python",
|
|
|
+ "pygments_lexer": "ipython3",
|
|
|
+ "version": "3.10.10"
|
|
|
+ }
|
|
|
+ },
|
|
|
+ "nbformat": 4,
|
|
|
+ "nbformat_minor": 2
|
|
|
+}
|