Adding the new llama inference code into the relevant readmes

Beto, 1 year ago
Commit fe40d7f60e
4 changed files with 20 additions and 4 deletions
  1. docs/inference.md (+14 -2)
  2. examples/README.md (+2 -0)
  3. examples/llama_guard/README.md (+3 -1)
  4. examples/llama_guard/inference.py (+1 -1)

+ 14 - 2
docs/inference.md

@@ -41,14 +41,14 @@ model.resize_token_embeddings(model.config.vocab_size + 1)
 ```
 Padding would be required for batch inference. In this [example](../examples/inference.py), the batch size is 1, so padding is essentially not required; however, we added the code pointer as an example for the batch inference case.
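 
 As a minimal sketch of what batched inference with padding could look like (illustrative only; the model path is a placeholder, and the `<PAD>` token setup follows the snippet above):
 
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
 tokenizer = AutoTokenizer.from_pretrained("PATH/TO/MODEL/7B/")
 tokenizer.add_special_tokens({"pad_token": "<PAD>"})
 tokenizer.padding_side = "left"  # left-pad prompts for decoder-only generation
 
 model = AutoModelForCausalLM.from_pretrained("PATH/TO/MODEL/7B/")
 model.resize_token_embeddings(model.config.vocab_size + 1)
 
 prompts = ["Tell me a joke.", "Summarize Hamlet in one sentence."]
 batch = tokenizer(prompts, padding=True, return_tensors="pt")
 outputs = model.generate(**batch, max_new_tokens=64, pad_token_id=tokenizer.pad_token_id)
 print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
 ```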
 
-**Chat completion**
+### Chat completion
 The inference folder also includes a chat completion example that applies the built-in safety features of fine-tuned models to the prompt tokens. To run the example:
 
 ```bash
 python examples/chat_completion/chat_completion.py --model_name "PATH/TO/MODEL/7B/" --prompt_file examples/chat_completion/chats.json  --quantization --use_auditnlg
 
 ```
-**Code Llama**
+### Code Llama
 
 Code Llama was recently released with three flavors: a base model that supports multiple programming languages, a Python fine-tuned model, and an instruction fine-tuned and aligned variation of Code Llama; please read more [here](https://ai.meta.com/blog/code-llama-large-language-model-coding/). Also note that the Python fine-tuned model and the 34B models are not trained on the infilling objective and hence cannot be used for the infilling use case.
 
@@ -80,6 +80,18 @@ python examples/code_llama/code_infilling_example.py --model_name MODEL_NAME --p
 
 ```
 
+### Llama Guard
+
+Llama Guard is a new experimental model that provides input and output guardrails for LLM deployments. For more details, please visit the main [repository](https://github.com/facebookresearch/PurpleLlama/tree/main/Llama-Guard).
+
+Find the inference script for Llama Guard [here](../examples/llama_guard/).
+
+**Note** Please find the correct model on the Hugging Face Hub [here](https://huggingface.co/meta-llama/LlamaGuard-7b).
+
+Edit [inference.py](../examples/llama_guard/inference.py) to add test prompts for Llama Guard and execute it with this command:
+
+`python examples/llama_guard/inference.py`
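+
+As a rough illustration, a test entry in the script's prompts list might look like the sketch below (hypothetical; the exact structure and the `AgentType` enum are defined in the script itself):
+
+```python
+from enum import Enum
+from typing import List, Tuple
+
+class AgentType(Enum):  # mirrors the enum in examples/llama_guard/inference.py
+    USER = "User"
+    AGENT = "Agent"
+
+# Each entry pairs one or more turns with the role Llama Guard should classify.
+prompts: List[Tuple[List[str], AgentType]] = [
+    (["<your test user prompt>"], AgentType.USER),
+    (["<user prompt>", "<agent response to check>"], AgentType.AGENT),
+]
+```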
+
 ## Flash Attention and Xformer Memory Efficient Kernels
 
 Setting `use_fast_kernels` will enable the use of Flash Attention or Xformers memory-efficient kernels, based on the hardware being used. This can speed up inference for batched inputs. It has been enabled in the `optimum` library from Hugging Face as a one-liner API; please read more [here](https://pytorch.org/blog/out-of-the-box-acceleration/).
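 
 As a sketch of the one-liner being referred to (illustrative; the recipes wire this up for you when `use_fast_kernels` is set, and the model path is a placeholder):
 
 ```python
 from optimum.bettertransformer import BetterTransformer
 from transformers import AutoModelForCausalLM
 
 model = AutoModelForCausalLM.from_pretrained("PATH/TO/MODEL/7B/")
 model = BetterTransformer.transform(model)  # swaps in memory-efficient attention kernels
 ```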

+ 2 - 0
examples/README.md

@@ -28,6 +28,8 @@ So far, we have provided the following inference examples:
 
 6. The [Purple Llama Using Anyscale](./Purple_Llama_Anyscale.ipynb) is a notebook that shows how to use Anyscale hosted Llama Guard model to classify user inputs as safe or unsafe.
 
+7. [Llama Guard](./llama_guard/) inference example and [safety_checker](../src/llama_recipes/inference/safety_utils.py) for the main [inference](./inference.py) script. The standalone script allows testing Llama Guard on user input, or on user input and agent response pairs. The safety_checker integration provides a way to run Llama Guard on every inference execution, for both the user input and the model output.
+
 For more in depth information on inference including inference safety checks and examples, see the inference documentation [here](../docs/inference.md).
 
 **Note** The [sensitive topics safety checker](../src/llama_recipes/inference/safety_utils.py) utilizes AuditNLG which is an optional dependency. Please refer to installation section of the main [README.md](../README.md#install-with-optional-dependencies) for details.

+ 3 - 1
examples/llama_guard/README.md

@@ -27,7 +27,7 @@ For testing, you can add User or User/Agent interactions into the prompts list a
 
     ]
 ```
-The complete prompt is built with the `build_prompt` function, defined in [prompt_format.py](../../src/llama_recipes/inference/prompt_format.py#L110). The file contains the default Llama Guard  categories. These categories can adjusted and new ones can be added, as described in the [research paper](https://ai.meta.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/), on section 4.5 Studying the adaptability of the model.
+The complete prompt is built with the `build_prompt` function, defined in [prompt_format.py](../../src/llama_recipes/inference/prompt_format.py). The file contains the default Llama Guard categories. These categories can be adjusted and new ones can be added, as described in the [research paper](https://ai.meta.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/), in section 4.5, Studying the adaptability of the model.
 <!-- markdown-link-check-enable -->
 
 To run the samples, with all the dependencies installed, execute this command:
@@ -37,6 +37,8 @@ To run the samples, with all the dependencies installed, execute this command:
 ## Inference Safety Checker
 When running the regular inference script with prompts, Llama Guard will be used as a safety checker on the user prompt and the model output. If both are safe, the result will be shown; otherwise, an error message will be shown containing the word unsafe and a comma-separated list of the infringed categories. Llama Guard is always loaded quantized, using the Hugging Face Transformers library.
 
+In this case, the default categories are applied by the tokenizer, using the `apply_chat_template` method.
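+
+As an illustrative sketch (based on the public model card, not the safety checker's actual code), loading the model quantized and applying the default categories through the chat template could look like this:
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_id = "meta-llama/LlamaGuard-7b"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(model_id, load_in_8bit=True, device_map="auto")
+
+chat = [{"role": "user", "content": "How do I bake bread?"}]
+input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
+output = model.generate(input_ids=input_ids, max_new_tokens=32)
+print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
+```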
+
 Use this command for testing with a quantized Llama model, modifying the values accordingly:
 
 `python examples/inference.py --model_name <path_to_regular_llama_model> --prompt_file <path_to_prompt_file> --quantization --enable_llamaguard_content_safety`

+ 1 - 1
examples/llama_guard/inference.py

@@ -2,7 +2,7 @@ import fire
 from transformers import AutoTokenizer, AutoModelForCausalLM
 
 
-from llama_recipes.inference.prompt_format import build_prompt, create_conversation, LLAMA_GUARD_CATEGORY
+from llama_recipes.inference.prompt_format_utils import build_prompt, create_conversation, LLAMA_GUARD_CATEGORY
 from typing import List, Tuple
 from enum import Enum