1 year ago · 43a1e5cdb0
--- a/UPDATES.md
+++ b/UPDATES.md
@@ -16,4 +16,4 @@ As noted in the documentation, these strings are required to use the fine-tuned
 ### Updated approach
 We recommend sanitizing [these strings](https://github.com/meta-llama/llama?tab=readme-ov-file#fine-tuned-chat-models) from any user provided prompts. Sanitization of user prompts mitigates malicious or accidental abuse of these strings. The provided scripts have been updated to do this. 
 
-Note: even with this update safety classifiers should still be applied to catch unsafe behaviors or content produced by the model. An [example](https://github.com/meta-llama/llama-recipes/blob/main/recipes/inference/local_inference/inference.py) of how to deploy such a classifier can be found in the llama-recipes repository.
+Note: even with this update safety classifiers should still be applied to catch unsafe behaviors or content produced by the model. An [example](./recipes/inference/local_inference/inference.py) of how to deploy such a classifier can be found in the llama-recipes repository.
--- a/recipes/README.md
+++ b/recipes/README.md
--- a/recipes/benchmarks/inference/on-prem/README.md
+++ b/recipes/benchmarks/inference/on-prem/README.md
@@ -6,7 +6,9 @@ We support benchmark on these serving framework:
 
 
 # vLLM - Getting Started
-To get started, we first need to deploy containers on-prem as a API host. Follow the guidance [here](https://github.com/meta-llama/llama-recipes/blob/main/recipes/inference/model_servers/llama-on-prem.md#setting-up-vllm-with-llama-2) to deploy vLLM on-prem.
+
+To get started, we first need to deploy containers on-prem as a API host. Follow the guidance [here](../../../inference/model_servers/llama-on-prem.md#setting-up-vllm-with-llama-2) to deploy vLLM on-prem.
+
 Note that in common scenario which overall throughput is important, we suggest you prioritize deploying as many model replicas as possible to reach higher overall throughput and request-per-second (RPS), comparing to deploy one model container among multiple GPUs for model parallelism. Additionally, as deploying multiple model replicas, there is a need for a higher level wrapper to handle the load balancing which here has been simulated in the benchmark scripts.  
 For example, we have an instance from Azure that has 8xA100 80G GPUs, and we want to deploy the Llama 2 70B chat model, which is around 140GB with FP16. So for deployment we can do:
 * 1x70B model parallel on 8 GPUs, each GPU RAM takes around 17.5GB for loading model weights.
--- a/recipes/inference/model_servers/llama-on-prem.md
+++ b/recipes/inference/model_servers/llama-on-prem.md
--- a/recipes/llama_api_providers/Azure_API_example/azure_api_example.ipynb
+++ b/recipes/llama_api_providers/Azure_API_example/azure_api_example.ipynb
@@ -461,7 +461,7 @@
     "\n",
     "In this section, we will build a simple chatbot using Azure Llama 2 API, LangChain and [Gradio](https://www.gradio.app/)'s `ChatInterface` with memory capability.\n",
     "\n",
-    "Gradio is a framework to help demo your machine learning model with a web interface. We also have a dedicated Gradio chatbot [example](https://github.com/facebookresearch/llama-recipes/tree/main/demo_apps/RAG_Chatbot_example) built with Llama 2 on-premises with RAG.   \n",
+    "Gradio is a framework to help demo your machine learning model with a web interface. We also have a dedicated Gradio chatbot [example](https://github.com/meta-llama/llama-recipes/blob/main/recipes/use_cases/chatbots/RAG_chatbot/RAG_Chatbot_Example.ipynb) built with Llama 2 on-premises with RAG.   \n",
     "\n",
     "First, let's install Gradio dependencies.\n"
    ]
--- a/recipes/responsible_ai/Purple_Llama_Anyscale.ipynb
+++ b/recipes/responsible_ai/Purple_Llama_Anyscale.ipynb
@@ -326,7 +326,7 @@
         "- [Llama 2](https://ai.meta.com/llama/)\n",
         "- [Getting Started Guide - Llama 2](https://ai.meta.com/llama/get-started/)\n",
         "- [GitHub - Llama 2](https://github.com/facebookresearch/llama)\n",
-        "- [Github - LLama 2 Recipes](https://github.com/facebookresearch/llama-recipes) and [Llama 2 Demo Apps](https://github.com/meta-llama/llama-recipes/tree/main/recipes)\n",
+        "- [Github - LLama 2 Recipes](https://github.com/facebookresearch/llama-recipes)\n",
         "- [Research Paper](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/)\n",
         "- [Model Card](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md)\n",
         "- [Responsible Use Guide](https://ai.meta.com/llama/responsible-use-guide/)\n",
--- a/recipes/responsible_ai/Purple_Llama_OctoAI.ipynb
+++ b/recipes/responsible_ai/Purple_Llama_OctoAI.ipynb
@@ -237,7 +237,7 @@
     "- [Llama 2](https://ai.meta.com/llama/)\n",
     "- [Getting Started Guide - Llama 2](https://ai.meta.com/llama/get-started/)\n",
     "- [GitHub - Llama 2](https://github.com/facebookresearch/llama)\n",
-    "- [Github - LLama 2 Recipes](https://github.com/facebookresearch/llama-recipes) and [Llama 2 Demo Apps](https://github.com/facebookresearch/llama-recipes/tree/main/demo_apps)\n",
+    "- [Github - LLama 2 Recipes](https://github.com/facebookresearch/llama-recipes)\n",
     "- [Research Paper](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/)\n",
     "- [Model Card](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md)\n",
     "- [Responsible Use Guide](https://ai.meta.com/llama/responsible-use-guide/)\n",
--- a/recipes/use_cases/chatbots/RAG_chatbot/RAG_Chatbot_Example.ipynb
+++ b/recipes/use_cases/chatbots/RAG_chatbot/RAG_Chatbot_Example.ipynb
@@ -230,7 +230,7 @@
     "In this example, we will be deploying a Llama 2 7B chat HuggingFace model with the Text-generation-inference framework on-permises.  \n",
     "This would allow us to directly wire the API server with our chatbot.  \n",
     "There are alternative solutions to deploy Llama 2 models on-permises as your local API server.  \n",
-    "You can find our complete guide [here](https://github.com/facebookresearch/llama-recipes/blob/main/demo_apps/llama-on-prem.md)."
+    "You can find our complete guide [here](https://github.com/meta-llama/llama-recipes/blob/main/recipes/inference/model_servers/llama-on-prem.md)."
    ]
   },
   {
--- a/recipes/use_cases/chatbots/messenger_llama/messenger_llama2.md
+++ b/recipes/use_cases/chatbots/messenger_llama/messenger_llama2.md
--- a/recipes/use_cases/chatbots/whatsapp_llama/whatsapp_llama2.md
+++ b/recipes/use_cases/chatbots/whatsapp_llama/whatsapp_llama2.md