
Merge branch 'main' into grad_clip

gaopengzhi 1 year ago
commit c7d410725b

+ 10 - 12
README.md

@@ -1,12 +1,10 @@
 # Llama 2 Fine-tuning / Inference Recipes, Examples and Demo Apps
 
-**[Update Oct. 20, 2023] We have just released a series of Llama 2 demo apps [here](./demo_apps). These apps show how to run Llama 2 locally and in the cloud to chat about data (PDF, DB, or live) and generate video summary.**
-
+**[Update Nov. 14, 2023] We recently released a series of Llama 2 demo apps [here](./demo_apps). These apps show how to run Llama 2 locally, in the cloud, on-prem, or with WhatsApp, and how to ask Llama 2 questions in general and about custom data (PDF, DB, or live).**
 
 The 'llama-recipes' repository is a companion to the [Llama 2 model](https://github.com/facebookresearch/llama). The goal of this repository is to provide examples to quickly get started with fine-tuning for domain adaptation and how to run inference for the fine-tuned models. For ease of use, the examples use Hugging Face converted versions of the models. See steps for conversion of the model [here](#model-conversion-to-hugging-face).
 
-In addition, we also provide a number of demo apps, to showcase the Llama2 usage along with other ecosystem solutions to run Llama2 locally on your mac and on cloud.
-
+In addition, we provide a number of demo apps to showcase Llama 2 usage along with other ecosystem solutions to run Llama 2 locally, in the cloud, and on-prem.
 
 Llama 2 is a new technology that carries potential risks with use. Testing conducted to date has not — and could not — cover all scenarios. In order to help developers address these risks, we have created the [Responsible Use Guide](https://github.com/facebookresearch/llama/blob/main/Responsible-Use-Guide.pdf). More details can be found in our research paper as well. For downloading the models, follow the instructions on [Llama 2 repo](https://github.com/facebookresearch/llama).
 
@@ -23,8 +21,6 @@ Llama 2 is a new technology that carries potential risks with use. Testing condu
 6. [Repository Organization](#repository-organization)
 7. [License and Acceptable Use Policy](#license)
 
-
-
 # Quick Start
 
 [Llama 2 Jupyter Notebook](./examples/quickstart.ipynb): This Jupyter notebook steps you through how to finetune a Llama 2 model on a text summarization task using the [samsum](https://huggingface.co/datasets/samsum) dataset. The notebook uses parameter-efficient finetuning (PEFT) and int8 quantization to finetune a 7B model on a single GPU, such as an A10 with 24GB of GPU memory.
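For readers who want the shape of that recipe without opening the notebook, here is a minimal sketch of int8 loading plus LoRA adapters, assuming the Hugging Face `transformers`, `peft`, and `bitsandbytes` stack; the model path and LoRA hyperparameters are illustrative, not the notebook's exact values:

```python
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative Hugging Face converted checkpoint

tokenizer = LlamaTokenizer.from_pretrained(model_id)
# load_in_8bit keeps the frozen base weights in int8 so a 7B model fits on a 24GB GPU
model = LlamaForCausalLM.from_pretrained(model_id, load_in_8bit=True, device_map="auto")

# Train only small LoRA adapter matrices on top of the quantized base model
model = prepare_model_for_int8_training(model)
peft_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # typically well under 1% of the 7B weights
```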
@@ -184,14 +180,16 @@ You can read more about our fine-tuning strategies [here](./docs/LLM_finetuning.
 # Demo Apps
 This folder contains a series of Llama 2-powered apps:
 * Quickstart Llama deployments and basic interactions with Llama
-  1. Llama on your Mac and ask Llama general questions
-  2. Llama on Google Colab
-  3. Llama on Cloud and ask Llama questions about unstructured data in a PDF
+1. Run Llama on your Mac and ask Llama general questions
+2. Run Llama on Google Colab
+3. Run Llama in the cloud and ask Llama questions about unstructured data in a PDF
+4. Run Llama on-prem with vLLM and TGI
 
 * Specialized Llama use cases:
-  1. Ask Llama to summarize a video content
-  2. Ask Llama questions about structured data in a DB
-  3. Ask Llama questions about live data on the web
+1. Ask Llama to summarize video content
+2. Ask Llama questions about structured data in a DB
+3. Ask Llama questions about live data on the web
+4. Build a Llama-enabled WhatsApp chatbot
 
 # Repository Organization
 This repository is organized in the following way:

File diff suppressed because it is too large
+ 13 - 9
demo_apps/README.md


File diff suppressed because it is too large
+ 184 - 0
demo_apps/llama-on-prem.md


+ 61 - 0
demo_apps/llama_chatbot.py

@@ -0,0 +1,61 @@
+from langchain.llms import Replicate
+
+from flask import Flask
+from flask import request
+import os
+import requests
+
+# Thin client for sending messages via the WhatsApp Business Cloud API (Graph API v17.0)
+class WhatsAppClient:
+
+    API_URL = "https://graph.facebook.com/v17.0/"
+    WHATSAPP_API_TOKEN = "<Temporary access token from your WhatsApp API Setup>"
+    WHATSAPP_CLOUD_NUMBER_ID = "<Phone number ID from your WhatsApp API Setup>"
+
+    def __init__(self):
+        self.headers = {
+            "Authorization": f"Bearer {self.WHATSAPP_API_TOKEN}",
+            "Content-Type": "application/json",
+        }
+        self.API_URL = self.API_URL + self.WHATSAPP_CLOUD_NUMBER_ID
+
+    def send_text_message(self, message, phone_number):
+        payload = {
+            "messaging_product": 'whatsapp',
+            "to": phone_number,
+            "type": "text",
+            "text": {
+                "preview_url": False,
+                "body": message
+            }
+        }
+        response = requests.post(f"{self.API_URL}/messages", json=payload, headers=self.headers)
+        print(response.status_code)
+        assert response.status_code == 200, "Error sending message"
+        return response.status_code
+
+# Set your Replicate API token; the version hash pins a specific Llama 2 13B chat build
+os.environ["REPLICATE_API_TOKEN"] = "<your replicate api token>"
+llama2_13b_chat = "meta/llama-2-13b-chat:f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d"
+
+llm = Replicate(
+    model=llama2_13b_chat,
+    model_kwargs={"temperature": 0.01, "top_p": 1, "max_new_tokens":500}
+)
+client = WhatsAppClient()
+app = Flask(__name__)
+
+@app.route("/")
+def hello_llama():
+    return "<p>Hello Llama 2</p>"
+
+@app.route('/msgrcvd', methods=['POST', 'GET'])
+def msgrcvd():
+    # WhatsApp webhook callback: the user's message arrives as a query parameter
+    message = request.args.get('message')
+    answer = llm(message)
+    print(message)
+    print(answer)
+    # Reuse the generated answer rather than invoking the model a second time
+    client.send_text_message(answer, "14086745477")
+    return message + "<p/>" + answer
+
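A quick way to exercise the `/msgrcvd` route above without the WhatsApp webhook in the loop is to send a test message to the running Flask app; this sketch assumes the default Flask development server on port 5000, and note that the route will also try to deliver the answer over WhatsApp, so the placeholder tokens need to be filled in first:

```python
# Hypothetical local test for the /msgrcvd route defined above.
# Start the app first, e.g.: flask --app llama_chatbot run
import requests

resp = requests.get(
    "http://localhost:5000/msgrcvd",
    params={"message": "Who wrote the book Innovator's Dilemma?"},
)
print(resp.status_code)
print(resp.text)  # the echoed question followed by Llama 2's answer
```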

BIN
demo_apps/whatsapp_dashboard.jpg


File diff suppressed because it is too large
+ 160 - 0
demo_apps/whatsapp_llama2.md


BIN
demo_apps/whatsapp_llama_arch.jpg


+ 2 - 0
docs/inference.md

@@ -144,3 +144,5 @@ python examples/vllm/inference.py --model_name <PATH/TO/MODEL/7B>
 ```
 
 [**TGI**](https://github.com/huggingface/text-generation-inference): Text Generation Inference (TGI) is another inference option available to you. For more information on how to set up and use TGI see [here](../examples/hf_text_generation_inference/README.md).
+
+[Here](../demo_apps/llama-on-prem.md) is a complete tutorial on how to use vLLM and TGI to deploy Llama 2 on-prem and interact with the Llama API services.
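As a taste of what the tutorial covers, here is a minimal vLLM generation sketch in Python (the model path is a placeholder for your converted checkpoint, and the sampling values are illustrative):

```python
# Minimal vLLM sketch; replace the model path with your converted Llama 2 weights.
from vllm import LLM, SamplingParams

llm = LLM(model="<PATH/TO/MODEL/7B>")
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)
outputs = llm.generate(["What is llama-recipes for?"], sampling)
print(outputs[0].outputs[0].text)
```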

+ 31 - 1
scripts/spellcheck_conf/wordlist.txt

@@ -1184,4 +1184,34 @@ pdf
 quantized
 serarch
 streamlit
-
+prem
+Prem
+OpenAI
+TCP
+ba
+llm
+logprobs
+openai
+rohit
+tgi
+Axios
+Chatbot
+WHATSAPP
+Webhooks
+WhatsApp
+WhatsAppClient
+adffb
+axios
+baba
+chatbot
+chatbots
+de
+eeeb
+gunicorn
+knowledgable
+msgrcvd
+venv
+webhook
+webhook's
+whatsapp

+ 2 - 2
src/llama_recipes/utils/train_utils.py

@@ -19,7 +19,7 @@ from transformers import LlamaTokenizer
 
 
 from llama_recipes.model_checkpointing import save_model_checkpoint, save_model_and_optimizer_sharded, save_optimizer_checkpoint
-from llama_recipes.policies import fpSixteen,bfSixteen_mixed, get_llama_wrapper
+from llama_recipes.policies import fpSixteen,bfSixteen, get_llama_wrapper
 from llama_recipes.utils.memory_utils import MemoryTrace
 
 
@@ -367,7 +367,7 @@ def get_policies(cfg, rank):
         bf16_ready = verify_bfloat_support
 
         if bf16_ready and not cfg.use_fp16:
-            mixed_precision_policy = bfSixteen_mixed
+            mixed_precision_policy = bfSixteen
             if rank == 0:
                 print(f"bFloat16 enabled for mixed precision - using bfSixteen policy")
         elif cfg.use_fp16:
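For context on the policy swap above, an FSDP bf16 policy along the lines of `bfSixteen` keeps parameters, gradient reduction, and buffers all in bfloat16; a minimal sketch follows (an assumed definition, not necessarily the repository's exact one):

```python
import torch
from torch.distributed.fsdp import MixedPrecision

# Sketch of a bfSixteen-style policy: everything in bfloat16
bfSixteen = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,   # gradient reduce-scatter/all-reduce in bf16
    buffer_dtype=torch.bfloat16,
)
```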