|
@@ -4,12 +4,12 @@
|
|
|
"cell_type": "markdown",
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
- "# Building a Llama 2 chatbot with RAG\n",
|
|
|
+ "# Building a Llama 2 chatbot with Retrieval Augmented Generation (RAG)\n",
|
|
|
"\n",
|
|
|
- "This notebook shows a complete example of how to build a Llama 2 chatbot hosted on your Browser that can answer questions based on your own data. We'll cover:\n",
|
|
|
- "* The deployment process of Llama 2 7B with Text-generation-inference framework as a API server\n",
|
|
|
- "* A chatbot example built with Gradio and wired to the server\n",
|
|
|
- "* Add RAG capability with Llama 2 specific knowledge based on our Getting Started [guide](https://ai.meta.com/llama/get-started/)"
|
|
|
+ "This notebook shows a complete example of how to build a Llama 2 chatbot hosted on your browser that can answer questions based on your own data. We'll cover:\n",
|
|
|
+ "* The deployment process of Llama 2 7B with the [Text-generation-inference](https://github.com/huggingface/text-generation-inference) framework as an API server\n",
|
|
|
+ "* A chatbot example built with [Gradio](https://github.com/gradio-app/gradio) and wired to the server\n",
|
|
|
+ "* Adding RAG capability with Llama 2 specific knowledge based on our Getting Started [guide](https://ai.meta.com/llama/get-started/)"
|
|
|
]
|
|
|
},
|
|
|
{
|
|
@@ -18,11 +18,17 @@
|
|
|
"source": [
|
|
|
"## RAG Architecture\n",
|
|
|
"\n",
|
|
|
- "RAG (Retrieval Augmented Generation) is a method that:\n",
|
|
|
+ "LLMs have unprecedented capabilities in NLU (Natural Language Understanding) & NLG (Natural Language Generation), but they have a knowledge cutoff date, and are only trained on publicly available data before that date.\n",
|
|
|
+ "\n",
|
|
|
+ "RAG, invented by [Meta](https://ai.meta.com/blog/retrieval-augmented-generation-streamlining-the-creation-of-intelligent-natural-language-processing-models/) in 2020, is one of the most popular methods to augment LLMs. RAG allows enterprises to keep sensitive data on-prem and get more relevant answers from generic models without fine-tuning models for specific roles.\n",
|
|
|
+ "\n",
|
|
|
+ "RAG is a method that:\n",
|
|
|
"* Retrieves data from outside a foundation model\n",
|
|
|
"* Augments your questions or prompts to LLMs by adding the retrieved relevant data as context\n",
|
|
|
"* Allows LLMs to answer questions about your own data, or data not publicly available when LLMs were trained\n",
|
|
|
- "* Greatly reduce the hallucination in model generation"
|
|
|
+ "* Greatly reduces the hallucination in model's response generation\n",
|
|
|
+ "\n",
|
|
|
+ "The following diagram shows the general RAG components and process:"
|
|
|
]
|
|
|
},
|
|
|
{
|
|
@@ -41,27 +47,16 @@
|
|
|
"cell_type": "markdown",
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
- "LLMs have unprecedented capabilities in NLU (Natural Language Understanding) & NLG (Natural Language Generation), but they have a knowledge cutoff date, and are only trained on publicly available data before that date.\n",
|
|
|
- "\n",
|
|
|
- "RAG, invented by [Meta](https://ai.meta.com/blog/retrieval-augmented-generation-streamlining-the-creation-of-intelligent-natural-language-processing-models/) in 2020, is one of the most popular methods to augment LLMs. RAG allows enterprises to keep sensitive data on-prem and get more relevant answers from generic models without fine-tuning models for specific roles."
|
|
|
- ]
|
|
|
- },
|
|
|
- {
|
|
|
- "cell_type": "markdown",
|
|
|
- "metadata": {},
|
|
|
- "source": [
|
|
|
- "## How to Develop A RAG Powered Llama 2 Chatbot\n",
|
|
|
+ "## How to Develop a RAG Powered Llama 2 Chatbot\n",
|
|
|
"\n",
|
|
|
- "The easiest way to develop RAG-powered Llama 2 chatbots is to use frameworks such as **LangChain** and **LlamaIndex**, two leading open-source frameworks for building LLM apps. \n",
|
|
|
- "Both offer convenient APIs for implementing RAG with Llama 2 and here are some steps we will take:\n",
|
|
|
+ "The easiest way to develop RAG-powered Llama 2 chatbots is to use frameworks such as [**LangChain**](https://www.langchain.com/) and [**LlamaIndex**](https://www.llamaindex.ai/), two leading open-source frameworks for building LLM apps. Both offer convenient APIs for implementing RAG with Llama 2 including:\n",
|
|
|
"\n",
|
|
|
"* Load and split documents\n",
|
|
|
"* Embed and store document splits\n",
|
|
|
"* Retrieve the relevant context based on the user query\n",
|
|
|
"* Call Llama 2 with query and context to generate the answer\n",
|
|
|
"\n",
|
|
|
- "LangChain is a more general purpose and flexible framework for developing LLM apps with RAG capabilities, while LlamaIndex as data framework focus on connecting custom data sources to LLMs. \n",
|
|
|
- "The integration of the two may provide the best performant and effective solution to building real world RAG apps. \n",
|
|
|
+ "LangChain is a more general purpose and flexible framework for developing LLM apps with RAG capabilities, while LlamaIndex as a data framework focuses on connecting custom data sources to LLMs. The integration of the two may provide the best performant and effective solution to building real world RAG apps. \n",
|
|
|
"In our example, for simplicifty, we will use LangChain alone with locally stored PDF data."
|
|
|
]
|
|
|
},
|
|
@@ -71,9 +66,10 @@
|
|
|
"source": [
|
|
|
"### Install Dependencies\n",
|
|
|
"\n",
|
|
|
- "For this demo, we will be using the [Gradio](https://www.gradio.app/) for chatbot UI, [Text-generation-inference](https://github.com/huggingface/text-generation-inference) framework for model serving. \n",
|
|
|
+ "For this demo, we will be using the Gradio for chatbot UI, Text-generation-inference framework for model serving. \n",
|
|
|
"For vector storage and similarity search, we will be using [FAISS](https://github.com/facebookresearch/faiss). \n",
|
|
|
- "In this example, we will be running everything in a AWS EC2 instance (i.e. g5.2xlarge).\n",
|
|
|
+ "In this example, we will be running everything in a AWS EC2 instance (i.e. [g5.2xlarge]( https://aws.amazon.com/ec2/instance-types/g5/)). g5.2xlarge features one A10G GPU. We recommend running this notebook with at least one GPU equivalent to A10G with at least 16GB video memory. \n",
|
|
|
+ "There are certain techniques to downsize the Llama 2 7B model, so it can fit into smaller GPUs. But it is out of scope here.\n",
|
|
|
"\n",
|
|
|
"First, let's install all dependencies with PIP. We also recommend you start a dedicated Conda environment for better package management"
|
|
|
]
|
|
@@ -94,12 +90,12 @@
|
|
|
"### Data Processing\n",
|
|
|
"\n",
|
|
|
"First run all the imports and define the path of the data and vector storage after processing. \n",
|
|
|
- "For the data, we will be using a raw pdf crawled from Llama 2 Getting Started guide on Meta AI website."
|
|
|
+ "For the data, we will be using a raw pdf crawled from Llama 2 Getting Started guide on [Meta AI website](https://ai.meta.com/llama/)."
|
|
|
]
|
|
|
},
|
|
|
{
|
|
|
"cell_type": "code",
|
|
|
- "execution_count": 9,
|
|
|
+ "execution_count": 5,
|
|
|
"metadata": {},
|
|
|
"outputs": [],
|
|
|
"source": [
|
|
@@ -108,7 +104,7 @@
|
|
|
"from langchain.document_loaders import PyPDFDirectoryLoader\n",
|
|
|
"from langchain.text_splitter import RecursiveCharacterTextSplitter \n",
|
|
|
"\n",
|
|
|
- "DATA_PATH = '/data' #Your root data folder path\n",
|
|
|
+ "DATA_PATH = 'data' #Your root data folder path\n",
|
|
|
"DB_FAISS_PATH = 'vectorstore/db_faiss'"
|
|
|
]
|
|
|
},
|
|
@@ -121,7 +117,7 @@
|
|
|
},
|
|
|
{
|
|
|
"cell_type": "code",
|
|
|
- "execution_count": 45,
|
|
|
+ "execution_count": 6,
|
|
|
"metadata": {},
|
|
|
"outputs": [],
|
|
|
"source": [
|
|
@@ -133,7 +129,7 @@
|
|
|
"cell_type": "markdown",
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
- "Check the length and content of the doc to ensure we have loaded the right document with number of pages."
|
|
|
+ "Check the length and content of the doc to ensure we have loaded the right document with number of pages as 37."
|
|
|
]
|
|
|
},
|
|
|
{
|
|
@@ -150,7 +146,7 @@
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
"Split the loaded documents into smaller chunks. \n",
|
|
|
- "`RecursiveCharacterTextSplitter` is one common splitter that splits long pieces of text into smaller, semantically meaningful chunk. \n",
|
|
|
+ "[`RecursiveCharacterTextSplitter`](https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.RecursiveCharacterTextSplitter.html) is one common splitter that splits long pieces of text into smaller, semantically meaningful chunks. \n",
|
|
|
"Other splitters include:\n",
|
|
|
"* SpacyTextSplitter\n",
|
|
|
"* NLTKTextSplitter\n",
|
|
@@ -199,13 +195,13 @@
|
|
|
"cell_type": "markdown",
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
- "Lasty, with splits and choice of embedding ready. We want to index them and store all the split chunks as embeddings into the vector storage. \n",
|
|
|
+ "Lastly, with splits and choice of the embedding model ready, we want to index them and store all the split chunks as embeddings into the vector storage. \n",
|
|
|
"\n",
|
|
|
"Vector stores are databases storing embeddings. There're at least 60 [vector stores](https://python.langchain.com/docs/integrations/vectorstores) supported by LangChain, and two of the most popular open source ones are:\n",
|
|
|
"* [Chroma](https://www.trychroma.com/): a light-weight and in memory so it's easy to get started with and use for **local development**.\n",
|
|
|
"* [FAISS](https://python.langchain.com/docs/integrations/vectorstores/faiss) (Facebook AI Similarity Search): a vector store that supports search in vectors that may not fit in RAM and is appropriate for **production use**. \n",
|
|
|
"\n",
|
|
|
- "Since we are running on a EC2 instance with abundant CPU resources and RAM, we will use FAISS in this example."
|
|
|
+ "Since we are running on a EC2 instance with abundant CPU resources and RAM, we will use FAISS in this example. Note that FAISS can also run on GPUs, where some of the most useful algorithms are implemented there. In that case, install `faiss-gpu` package with PIP instead."
|
|
|
]
|
|
|
},
|
|
|
{
|
|
@@ -222,7 +218,7 @@
|
|
|
"cell_type": "markdown",
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
- "Once you saved database into local path. You can found them as `index.faiss` and `index.pkl`. In the chatbot example, you can then load this database from local and plug it into our retrival process."
|
|
|
+ "Once you saved database into local path. You can find them as `index.faiss` and `index.pkl`. In the chatbot example, you can then load this database from local and plug it into our retrival process."
|
|
|
]
|
|
|
},
|
|
|
{
|
|
@@ -231,7 +227,7 @@
|
|
|
"source": [
|
|
|
"### Model Serving\n",
|
|
|
"\n",
|
|
|
- "In this example, we will be deploying a Llama 2 7B chat HuggingFace model with Text-generation-inference framework on-permises. \n",
|
|
|
+ "In this example, we will be deploying a Llama 2 7B chat HuggingFace model with the Text-generation-inference framework on-permises. \n",
|
|
|
"This would allow us to directly wire the API server with our chatbot. \n",
|
|
|
"There are alternative solutions to deploy Llama 2 models on-permises as your local API server. \n",
|
|
|
"You can find our complete guide [here](https://github.com/facebookresearch/llama-recipes/blob/main/demo_apps/llama-on-prem.md)."
|
|
@@ -241,7 +237,7 @@
|
|
|
"cell_type": "markdown",
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
- "In a **separate terminal**, run commands below to launch a API server with TGI. This will download model artifacts and store them locally, while launching at the desire port on your localhost. In our case, this is port 8080"
|
|
|
+ "In a **separate terminal**, run commands below to launch an API server with TGI. This will download model artifacts and store them locally, while launching at the desire port on your localhost. In our case, this is port 8080"
|
|
|
]
|
|
|
},
|
|
|
{
|
|
@@ -252,7 +248,7 @@
|
|
|
"source": [
|
|
|
"model = meta-llama/Llama-2-7b-chat-hf \n",
|
|
|
"volume = $PWD/data \n",
|
|
|
- "token = Your own HF tokens \n",
|
|
|
+ "token = #Your own HF tokens \n",
|
|
|
"docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.1.0 --model-id $model"
|
|
|
]
|
|
|
},
|
|
@@ -260,7 +256,7 @@
|
|
|
"cell_type": "markdown",
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
- "Once we have the API server up and running. We can run simple `curl` command to validate our model is working as expected."
|
|
|
+ "Once we have the API server up and running, we can run a simple `curl` command to validate our model is working as expected."
|
|
|
]
|
|
|
},
|
|
|
{
|
|
@@ -278,21 +274,18 @@
|
|
|
"source": [
|
|
|
"### Building the Chatbot UI\n",
|
|
|
"\n",
|
|
|
- "Now we are ready to build the chatbot UI to wire up RAG data and API server. \n",
|
|
|
- "In our example we will be using Gradio to build the Chatbot UI. \n",
|
|
|
- "Gradio is an open-source Python library that is used to build machine learning and data science demos and web applications. \n",
|
|
|
- "It had been widely used by the community and HuggingFace also used Gradio to build their Chatbots. \n",
|
|
|
- "Other alternatives are: \n",
|
|
|
- "* Streamlit\n",
|
|
|
- "* Dash\n",
|
|
|
- "* Flask"
|
|
|
+ "Now we are ready to build the chatbot UI to wire up RAG data and API server. In our example we will be using Gradio to build the Chatbot UI. \n",
|
|
|
+ "Gradio is an open-source Python library that is used to build machine learning and data science demos and web applications. It had been widely used by the community and HuggingFace also used Gradio to build their Chatbots. Other alternatives are: \n",
|
|
|
+ "* [Streamlit](https://streamlit.io/)\n",
|
|
|
+ "* [Dash](https://plotly.com/dash/)\n",
|
|
|
+ "* [Flask](https://flask.palletsprojects.com/en/3.0.x/)"
|
|
|
]
|
|
|
},
|
|
|
{
|
|
|
"cell_type": "markdown",
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
- "Again, we start by adding all the imports, paths, constants and set langchain in debug mode, so it shows clear actions within the chain process."
|
|
|
+ "Again, we start by adding all the imports, paths, constants and set LangChain in debug mode, so it shows clear actions within the chain process."
|
|
|
]
|
|
|
},
|
|
|
{
|
|
@@ -353,7 +346,7 @@
|
|
|
"cell_type": "markdown",
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
- "Now we create a TGI llm instance with default hyperparameters and wire to the API serving port on localhost"
|
|
|
+ "Now we create a TGI llm instance and wire to the API serving port on localhost"
|
|
|
]
|
|
|
},
|
|
|
{
|
|
@@ -380,7 +373,7 @@
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
"Next, we define the retriever and template for our RetrivalQA chain. For each call of the RetrievalQA, LangChain performs a semantic similarity search of the query in the vector database, then passes the search results as the context to Llama to answer the query about the data stored in the verctor database. \n",
|
|
|
- "Whereas for the template, this defines the format of the question along with context that we will be sent into modes for generation. In general, Llama 2 has special prompt format to handle special tokens. In some cases, the serving framework might already have taken care of it. Otherwise, you will need to write customized template to properly handle that.\n"
|
|
|
+ "Whereas for the template, this defines the format of the question along with context that we will be sent into Llama for generation. In general, Llama 2 has special prompt format to handle special tokens. In some cases, the serving framework might already have taken care of it. Otherwise, you will need to write customized template to properly handle that.\n"
|
|
|
]
|
|
|
},
|
|
|
{
|
|
@@ -446,8 +439,8 @@
|
|
|
"cell_type": "markdown",
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
- "After confirming the validity, we can start building the UI. Before we define the gradio blocks, let's first define the callback streams that we will use later for streaming feature. \n",
|
|
|
- "This callback handler will put streaming LLM responses to a queue for gradio UI to render on the flight. "
|
|
|
+ "After confirming the validity, we can start building the UI. Before we define the gradio [blocks](https://www.gradio.app/docs/blocks), let's first define the callback streams that we will use later for the streaming feature. \n",
|
|
|
+ "This callback handler will put streaming LLM responses to a queue for gradio UI to render on the fly. "
|
|
|
]
|
|
|
},
|
|
|
{
|
|
@@ -492,12 +485,46 @@
|
|
|
" with gr.Column(scale=1):\n",
|
|
|
" with gr.Row():\n",
|
|
|
" #model selection\n",
|
|
|
- " model_selector = gr.Dropdown(list(model_dict.keys()), value=\"7b-chat\", label=\"Model\", info=\"Select the model\", interactive = True, scale=1)\n",
|
|
|
- " max_new_tokens_selector = gr.Number(value=512, precision=0, label=\"Max new tokens\", info=\"Adjust max_new_tokens\",interactive = True, minimum=1, maximum=1024, scale=1)\n",
|
|
|
+ " model_selector = gr.Dropdown(\n",
|
|
|
+ " list(model_dict.keys()), \n",
|
|
|
+ " value=\"7b-chat\", \n",
|
|
|
+ " label=\"Model\", \n",
|
|
|
+ " info=\"Select the model\", \n",
|
|
|
+ " interactive = True, \n",
|
|
|
+ " scale=1\n",
|
|
|
+ " )\n",
|
|
|
+ " max_new_tokens_selector = gr.Number(\n",
|
|
|
+ " value=512, \n",
|
|
|
+ " precision=0, \n",
|
|
|
+ " label=\"Max new tokens\", \n",
|
|
|
+ " info=\"Adjust max_new_tokens\",\n",
|
|
|
+ " interactive = True, \n",
|
|
|
+ " minimum=1, \n",
|
|
|
+ " maximum=1024, \n",
|
|
|
+ " scale=1\n",
|
|
|
+ " )\n",
|
|
|
" with gr.Row():\n",
|
|
|
" #hyperparameter selection\n",
|
|
|
- " temperature_selector = gr.Slider(value=0.6, label=\"Temperature\", info=\"Range 0-2. Controls the creativity of the generated text.\",interactive = True, minimum=0.01, maximum=2, step=0.01, scale=1)\n",
|
|
|
- " top_p_selector = gr.Slider(value=0.9, label=\"Top_p\", info=\"Range 0-1. Nucleus sampling. \",interactive = True, minimum=0.01, maximum=0.99, step=0.01, scale=1)\n",
|
|
|
+ " temperature_selector = gr.Slider(\n",
|
|
|
+ " value=0.6, \n",
|
|
|
+ " label=\"Temperature\", \n",
|
|
|
+ " info=\"Range 0-2. Controls the creativity of the generated text.\",\n",
|
|
|
+ " interactive = True, \n",
|
|
|
+ " minimum=0.01, \n",
|
|
|
+ " maximum=2, \n",
|
|
|
+ " step=0.01, \n",
|
|
|
+ " scale=1\n",
|
|
|
+ " )\n",
|
|
|
+ " top_p_selector = gr.Slider(\n",
|
|
|
+ " value=0.9, \n",
|
|
|
+ " label=\"Top_p\", \n",
|
|
|
+ " info=\"Range 0-1. Nucleus sampling.\",\n",
|
|
|
+ " interactive = True, \n",
|
|
|
+ " minimum=0.01, \n",
|
|
|
+ " maximum=0.99, \n",
|
|
|
+ " step=0.01, \n",
|
|
|
+ " scale=1\n",
|
|
|
+ " )\n",
|
|
|
" with gr.Column(scale=2):\n",
|
|
|
" #user input prompt text field\n",
|
|
|
" user_prompt_message = gr.Textbox(placeholder=\"Please add user prompt here\", label=\"User prompt\")\n",
|
|
@@ -515,7 +542,7 @@
|
|
|
" else:\n",
|
|
|
" return history + [[\"Invalid prompts - user prompt cannot be empty\", None]]\n",
|
|
|
"\n",
|
|
|
- " #chatbot logics for configuration, sending the prompts, rendering the streamed back genereations etc\n",
|
|
|
+ " #chatbot logic for configuration, sending the prompts, rendering the streamed back genereations etc\n",
|
|
|
" def bot(model_selector, temperature_selector, top_p_selector, max_new_tokens_selector, user_prompt_message, history, messages_history):\n",
|
|
|
" dialog = []\n",
|
|
|
" bot_message = \"\"\n",
|
|
@@ -561,17 +588,40 @@
|
|
|
" def input_cleanup():\n",
|
|
|
" return \"\"\n",
|
|
|
"\n",
|
|
|
- " #when user click Enter and user message are submitted\n",
|
|
|
- " user_prompt_message.submit(user, [user_prompt_message, chatbot], [chatbot], queue=False).then(\n",
|
|
|
- " bot, [model_selector, temperature_selector, top_p_selector, max_new_tokens_selector, user_prompt_message, chatbot, state], [chatbot, state]\n",
|
|
|
- " ).then(input_cleanup, [], [user_prompt_message], queue=False)\n",
|
|
|
- "\n",
|
|
|
- " #when user click the submit button\n",
|
|
|
- " submitBtn.click(user, [user_prompt_message, chatbot], [chatbot], queue=False).then(\n",
|
|
|
- " bot, [model_selector, temperature_selector, top_p_selector, max_new_tokens_selector, user_prompt_message, chatbot, state], [chatbot, state]\n",
|
|
|
- " ).then(input_cleanup, [], [user_prompt_message], queue=False)\n",
|
|
|
+ " #when the user clicks Enter and the user message is submitted\n",
|
|
|
+ " user_prompt_message.submit(\n",
|
|
|
+ " user, \n",
|
|
|
+ " [user_prompt_message, chatbot], \n",
|
|
|
+ " [chatbot], \n",
|
|
|
+ " queue=False\n",
|
|
|
+ " ).then(\n",
|
|
|
+ " bot, \n",
|
|
|
+ " [model_selector, temperature_selector, top_p_selector, max_new_tokens_selector, user_prompt_message, chatbot, state], \n",
|
|
|
+ " [chatbot, state]\n",
|
|
|
+ " ).then(input_cleanup, \n",
|
|
|
+ " [], \n",
|
|
|
+ " [user_prompt_message], \n",
|
|
|
+ " queue=False\n",
|
|
|
+ " )\n",
|
|
|
+ "\n",
|
|
|
+ " #when the user clicks the submit button\n",
|
|
|
+ " submitBtn.click(\n",
|
|
|
+ " user, \n",
|
|
|
+ " [user_prompt_message, chatbot], \n",
|
|
|
+ " [chatbot], \n",
|
|
|
+ " queue=False\n",
|
|
|
+ " ).then(\n",
|
|
|
+ " bot, \n",
|
|
|
+ " [model_selector, temperature_selector, top_p_selector, max_new_tokens_selector, user_prompt_message, chatbot, state], \n",
|
|
|
+ " [chatbot, state]\n",
|
|
|
+ " ).then(\n",
|
|
|
+ " input_cleanup, \n",
|
|
|
+ " [], \n",
|
|
|
+ " [user_prompt_message], \n",
|
|
|
+ " queue=False\n",
|
|
|
+ " )\n",
|
|
|
" \n",
|
|
|
- " #when user click the clear button\n",
|
|
|
+ " #when the user clicks the clear button\n",
|
|
|
" clear.click(lambda: None, None, chatbot, queue=False).success(init_history, [state], [state])"
|
|
|
]
|
|
|
},
|
|
@@ -579,13 +629,7 @@
|
|
|
"cell_type": "markdown",
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
- "Lastly, we can launch this demo on our localhost. Make sure you have access to the ports. \n",
|
|
|
- "In the notebook or a browser with URL http://0.0.0.0:7860 you should see the UI. \n",
|
|
|
- "Things to try in the chatbot demo: \n",
|
|
|
- "* Specific questions relates Llama 2 get started\n",
|
|
|
- "* Streaming\n",
|
|
|
- "* Adjust hyperparameters such as max new token generated\n",
|
|
|
- "* Switch models with another container launched in a separate terminal"
|
|
|
+ "Lastly, we can launch this demo on our localhost with the command below. "
|
|
|
]
|
|
|
},
|
|
|
{
|
|
@@ -597,6 +641,21 @@
|
|
|
"demo.queue().launch(server_name=\"0.0.0.0\")"
|
|
|
]
|
|
|
},
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Gradio will default the launch port to 7860. You can select which port it should launch on as needed. \n",
|
|
|
+ "Once launched, in the notebook or a browser with URL http://0.0.0.0:7860, you should see the UI. \n",
|
|
|
+ "Things to try in the chatbot demo: \n",
|
|
|
+ "* Asking specific questions related to the Llama 2 Getting Started Guide\n",
|
|
|
+ "* Streaming\n",
|
|
|
+ "* Adjust parameters such as max new token generated\n",
|
|
|
+ "* Switching to another Llama model with another container launched in a separate terminal\n",
|
|
|
+ "\n",
|
|
|
+ "Once finished testing, make sure you close the demo by running the command below to release the port."
|
|
|
+ ]
|
|
|
+ },
|
|
|
{
|
|
|
"cell_type": "code",
|
|
|
"execution_count": null,
|
|
@@ -623,7 +682,7 @@
|
|
|
"name": "python",
|
|
|
"nbconvert_exporter": "python",
|
|
|
"pygments_lexer": "ipython3",
|
|
|
- "version": "3.10.10"
|
|
|
+ "version": "3.9.6"
|
|
|
}
|
|
|
},
|
|
|
"nbformat": 4,
|