```

Padding would be required for batch inference. In this [example](../examples/inference.py), the batch size is 1, so padding is essentially not required. However, we added the code pointer as an example in case of batch inference.
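
If you do run batched inference, a minimal sketch of left-padded batching with the Hugging Face `transformers` API is shown below. The model path is a placeholder and the pad-token handling mirrors the snippet above; treat this as an illustration rather than the exact code in the example script.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "PATH/TO/MODEL/7B/"  # placeholder path, as in the commands below
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# Add a pad token and resize the embeddings, as shown above.
tokenizer.add_special_tokens({"pad_token": "<PAD>"})
model.resize_token_embeddings(model.config.vocab_size + 1)

# Decoder-only models should be padded on the left so that generation
# continues from real tokens rather than from padding.
tokenizer.padding_side = "left"

prompts = ["Tell me a joke.", "Summarize the plot of Hamlet in one sentence."]
batch = tokenizer(prompts, padding=True, return_tensors="pt").to(model.device)

with torch.inference_mode():
    outputs = model.generate(**batch, max_new_tokens=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```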

### Chat completion

The inference folder also includes a chat completion example that adds the built-in safety features of fine-tuned models to the prompt tokens. To run the example:

```bash
python examples/chat_completion/chat_completion.py --model_name "PATH/TO/MODEL/7B/" --prompt_file examples/chat_completion/chats.json --quantization --use_auditnlg
```
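
For reference, Llama 2 chat models expect each dialog turn to be wrapped in the `[INST]`/`<<SYS>>` chat format. The sketch below shows the general single-turn template; the exact formatting and tokenization are handled inside the example script, so treat this as an illustration only.

```python
# Illustrative sketch of the Llama 2 chat prompt format (single turn).
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def format_single_turn(system_prompt: str, user_message: str) -> str:
    # The tokenizer adds the BOS token (<s>) when encoding the string.
    return f"{B_INST} {B_SYS}{system_prompt}{E_SYS}{user_message.strip()} {E_INST}"

prompt = format_single_turn(
    "You are a helpful, respectful and honest assistant.",
    "What is the capital of France?",
)
print(prompt)
```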

### Code Llama

Code Llama was recently released in three flavors: a base model that supports multiple programming languages, a Python fine-tuned model, and an instruction fine-tuned and aligned variation of Code Llama; please read more [here](https://ai.meta.com/blog/code-llama-large-language-model-coding/). Also note that the Python fine-tuned model and the 34B models are not trained on the infilling objective and hence cannot be used for infilling use cases.

To run the infilling example, use [code_infilling_example.py](../examples/code_llama/code_infilling_example.py) with `--model_name` pointing at a base Code Llama checkpoint (see the script for the full set of flags).
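
For illustration, infilling through the Hugging Face `transformers` API might look like the sketch below. It assumes a `transformers` release with Code Llama support (v4.33 or later, where the tokenizer understands the `<FILL_ME>` placeholder); the Hub model name is an example and is distinct from any local checkpoint path.

```python
import torch
from transformers import AutoModelForCausalLM, CodeLlamaTokenizer

# Base (non-Python, non-34B) checkpoints support infilling via <FILL_ME>.
model_name = "codellama/CodeLlama-7b-hf"
tokenizer = CodeLlamaTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

prompt = '''def remove_non_ascii(s: str) -> str:
    """ <FILL_ME>
    return result
'''
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=128)

# Decode only the generated middle section and splice it back into the prompt.
filling = tokenizer.batch_decode(
    output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(prompt.replace("<FILL_ME>", filling))
```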

### Llama Guard

Llama Guard is a new experimental model that provides input and output guardrails for LLM deployments. For more details, please visit the main [repository](https://github.com/facebookresearch/PurpleLlama/tree/main/Llama-Guard).

Find the inference script for Llama Guard [here](../examples/llama_guard/).

**Note:** Please find the corresponding model on the Hugging Face Hub [here](https://huggingface.co/meta-llama/LlamaGuard-7b).

Edit [inference.py](../examples/llama_guard/inference.py) to add test prompts for Llama Guard, then execute it with this command:

`python examples/llama_guard/inference.py`
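
As a rough sketch of what a standalone check could look like, the snippet below queries the Hugging Face checkpoint directly. It assumes the `meta-llama/LlamaGuard-7b` tokenizer ships a chat template that encodes the Llama Guard safety prompt; the example script in this repository may structure things differently.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/LlamaGuard-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def moderate(chat):
    # The chat template (assumed) wraps the conversation in the Llama Guard
    # taxonomy prompt; the model replies with "safe" or "unsafe" plus categories.
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=32, pad_token_id=0)
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)

print(moderate([{"role": "user", "content": "How do I bake a chocolate cake?"}]))
```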

## Flash Attention and Xformer Memory Efficient Kernels

Setting `use_fast_kernels` will enable the use of Flash Attention or Xformer memory-efficient kernels based on the hardware being used. This speeds up inference for batched inputs. It has been enabled in the `optimum` library from HuggingFace as a one-liner API; please read more [here](https://pytorch.org/blog/out-of-the-box-acceleration/).
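
The one-liner referenced above corresponds to `optimum`'s `BetterTransformer` API; a minimal sketch, with a placeholder model path, might look like:

```python
from optimum.bettertransformer import BetterTransformer
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("PATH/TO/MODEL/7B/")
# Swap supported modules for fused / scaled-dot-product-attention implementations.
model = BetterTransformer.transform(model)
```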