|
@@ -0,0 +1,304 @@
|
|
|
+{
|
|
|
+ "cells": [
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "## Running Llama2 on Google Colab using Hugging Face transformers library\n",
|
|
|
+ "This notebook goes over how you can set up and run Llama2 using Hugging Face transformers library"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "### Steps at a glance:\n",
|
|
|
+ "This demo showcases how to run the example with already converted Llama 2 weights on [Hugging Face](https://huggingface.co/meta-llama). Please Note: To use the downloads on Hugging Face, you must first request a download as shown in the steps below making sure that you are using the same email address as your Hugging Face account.\n",
|
|
|
+ "\n",
|
|
|
+ "To use already converted weights, start here:\n",
|
|
|
+ "1. Request download of model weights from the Llama website\n",
|
|
|
+ "2. Prepare the script\n",
|
|
|
+ "3. Run the example\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "Else, if you'd like to download the models locally and convert them to the HF format, follow the steps below to convert the weights:\n",
|
|
|
+ "1. Request download of model weights from the Llama website\n",
|
|
|
+ "2. Clone the llama repo and get the weights\n",
|
|
|
+ "3. Convert the model weights\n",
|
|
|
+ "4. Prepare the script\n",
|
|
|
+ "5. Run the example"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "### Using already converted weights"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "#### 1. Request download of model weights from the Llama website\n",
|
|
|
+ "Request download of model weights from the Llama website\n",
|
|
|
+ "Before you can run the model locally, you will need to get the model weights. To get the model weights, visit the [Llama website](https://llama.meta.com/) and click on “download models”. \n",
|
|
|
+ "\n",
|
|
|
+ "Fill the required information, select the models “Llama 2 & Llama Chat” and accept the terms & conditions. You will receive a URL in your email in a short time."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "#### 2. Prepare the script\n",
|
|
|
+ "\n",
|
|
|
+ "We will install the Transformers library and Accelerate library for our demo.\n",
|
|
|
+ "\n",
|
|
|
+ "The `Transformers` library provides many models to perform tasks on texts such as classification, question answering, text generation, etc.\n",
|
|
|
+ "The `accelerate` library enables the same PyTorch code to be run across any distributed configuration of GPUs and CPUs.\n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "!pip install transformers\n",
|
|
|
+ "!pip install accelerate"
|
|
|
+ ]
|
|
|
+ },
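+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The converted Llama 2 checkpoints hosted on Hugging Face are gated, so you will most likely need to authenticate with the Hugging Face account that was granted access before the weights can be downloaded. The cell below is a minimal sketch using `huggingface_hub`, which is installed as a dependency of transformers."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Authenticate with Hugging Face so the gated meta-llama checkpoints can be downloaded.\n",
+ "# Create an access token at https://huggingface.co/settings/tokens and paste it when prompted.\n",
+ "from huggingface_hub import notebook_login\n",
+ "\n",
+ "notebook_login()"
+ ]
+ },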
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Next, we will import AutoTokenizer, which is a class from the transformers library that automatically chooses the correct tokenizer for a given pre-trained model, import transformers library and torch for PyTorch.\n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "from transformers import AutoTokenizer\n",
|
|
|
+ "import transformers\n",
|
|
|
+ "import torch"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Then, we will set the model variable to a specific model we’d like to use. In this demo, we will use the 7b chat model `meta-llama/Llama-2-7b-chat-hf`."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "model = \"meta-llama/Llama-2-7b-chat-hf\"\n",
|
|
|
+ "tokenizer = AutoTokenizer.from_pretrained(model)"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Now, we will use the `from_pretrained` method of `AutoTokenizer` to create a tokenizer. This will download and cache the pre-trained tokenizer and return an instance of the appropriate tokenizer class.\n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "pipeline = transformers.pipeline(\n",
|
|
|
+ "\"text-generation\",\n",
|
|
|
+ " model=model,\n",
|
|
|
+ " torch_dtype=torch.float16,\n",
|
|
|
+ " device_map=\"auto\",\n",
|
|
|
+ ")"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "#### 3. Run the example\n",
|
|
|
+ "\n",
|
|
|
+ "Now, let’s create the pipeline for text generation. We’ll also set the device_map argument to `auto`, which means the pipeline will automatically use a GPU if one is available.\n",
|
|
|
+ "\n",
|
|
|
+ "Let’s also generate a text sequence based on the input that we provide. "
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "sequences = pipeline(\n",
|
|
|
+ " 'I have tomatoes, basil and cheese at home. What can I cook for dinner?\\n',\n",
|
|
|
+ " do_sample=True,\n",
|
|
|
+ " top_k=10,\n",
|
|
|
+ " num_return_sequences=1,\n",
|
|
|
+ " eos_token_id=tokenizer.eos_token_id,\n",
|
|
|
+ " truncation = True,\n",
|
|
|
+ " max_length=400,\n",
|
|
|
+ ")\n",
|
|
|
+ "\n",
|
|
|
+ "for seq in sequences:\n",
|
|
|
+ " print(f\"Result: {seq['generated_text']}\")"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "<br>\n",
|
|
|
+ "\n",
|
|
|
+ "### Downloading and converting weights to Hugging Face format"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "#### 1. Request download of model weights from the Llama website\n",
|
|
|
+ "Request download of model weights from the Llama website\n",
|
|
|
+ "Before you can run the model locally, you will need to get the model weights. To get the model weights, visit the [Llama website](https://llama.meta.com/) and click on “download models”. \n",
|
|
|
+ "\n",
|
|
|
+ "Fill the required information, select the models “Llama 2 & Llama Chat” and accept the terms & conditions. You will receive a URL in your email in a short time.\n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "#### 2. Clone the llama repo and get the weights\n",
|
|
|
+ "Git clone the [Llama repo](https://github.com/facebookresearch/llama.git). Enter the URL and get 7B-chat weights. This will download the tokenizer.model, and a directory llama-2-7b-chat with the weights in it.\n",
|
|
|
+ "\n",
|
|
|
+ "This example demonstrates a llama2 model with 7B-chat parameters, but the steps we follow would be similar for other llama models, as well as for other parameter models.\n",
|
|
|
+ "\n"
|
|
|
+ ]
|
|
|
+ },
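+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The cell below sketches these shell commands, assuming a Colab-style environment where `!` commands can read interactive input; the repo’s `download.sh` script prompts for the presigned URL from your email and for the model sizes to fetch. If the prompts don’t work in your environment, run the same commands in a terminal instead."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Clone the Llama repo and run its download script.\n",
+ "# When prompted, paste the URL from your email and choose the 7B-chat weights.\n",
+ "!git clone https://github.com/facebookresearch/llama.git\n",
+ "%cd llama\n",
+ "!bash download.sh"
+ ]
+ },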
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "#### 3. Convert the model weights\n",
|
|
|
+ "\n",
|
|
|
+ "* Create a link to the tokenizer:\n",
|
|
|
+ "Run `ln -h ./tokenizer.model ./llama-2-7b-chat/tokenizer.model` \n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "* Convert the model weights to run with Hugging Face:``TRANSFORM=`python -c \"import transformers;print('/'.join(transformers.__file__.split('/')[:-1])+'/models/llama/convert_llama_weights_to_hf.py')\"``\n",
|
|
|
+ "\n",
|
|
|
+ "* Then run: `pip install protobuf && python $TRANSFORM --input_dir ./llama-2-7b-chat --model_size 7B --output_dir ./llama-2-7b-chat-hf`\n"
|
|
|
+ ]
|
|
|
+ },
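+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "If you’d like to run the conversion from the notebook rather than a shell, the cell below sketches the same steps. It assumes the downloaded weights live in `./llama-2-7b-chat` and that `tokenizer.model` sits next to that directory in the current working directory."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Link the tokenizer into the checkpoint directory so the converter can find it\n",
+ "!ln ./tokenizer.model ./llama-2-7b-chat/tokenizer.model\n",
+ "\n",
+ "# Locate the conversion script that ships with the installed transformers package\n",
+ "TRANSFORM = !python -c \"import transformers;print('/'.join(transformers.__file__.split('/')[:-1])+'/models/llama/convert_llama_weights_to_hf.py')\"\n",
+ "\n",
+ "# Run the conversion; the Hugging Face checkpoint is written to ./llama-2-7b-chat-hf\n",
+ "!pip install protobuf\n",
+ "!python {TRANSFORM[0]} --input_dir ./llama-2-7b-chat --model_size 7B --output_dir ./llama-2-7b-chat-hf"
+ ]
+ },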
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "\n",
|
|
|
+ "#### 4. Prepare the script\n",
|
|
|
+ "Import the following necessary modules in your script: \n",
|
|
|
+ "* `LlamaForCausalLM` is the Llama 2 model class\n",
|
|
|
+ "* `LlamaTokenizer` prepares your prompt for the model to process\n",
|
|
|
+ "* `pipeline` is an abstraction to generate model outputs\n",
|
|
|
+ "* `torch` allows us to use PyTorch and specify the datatype we’d like to use."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "import torch\n",
|
|
|
+ "import transformers\n",
|
|
|
+ "from transformers import LlamaForCausalLM, LlamaTokenizer\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "model_dir = \"./llama-2-7b-chat-hf\"\n",
|
|
|
+ "model = LlamaForCausalLM.from_pretrained(model_dir)\n",
|
|
|
+ "\n",
|
|
|
+ "tokenizer = LlamaTokenizer.from_pretrained(model_dir)\n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "We need a way to use our model for inference. Pipeline allows us to specify which type of task the pipeline needs to run (`text-generation`), specify the model that the pipeline should use to make predictions (`model`), define the precision to use this model (`torch.float16`), device on which the pipeline should run (`device_map`) among various other options. \n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "pipeline = transformers.pipeline(\n",
|
|
|
+ " \"text-generation\",\n",
|
|
|
+ " model=model,\n",
|
|
|
+ " tokenizer=tokenizer,\n",
|
|
|
+ " torch_dtype=torch.float16,\n",
|
|
|
+ " device_map=\"auto\",\n",
|
|
|
+ ")"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Now we have our pipeline defined, and we need to provide some text prompts as inputs to our pipeline to use when it runs to generate responses (`sequences`). The pipeline shown in the example below sets `do_sample` to True, which allows us to specify the decoding strategy we’d like to use to select the next token from the probability distribution over the entire vocabulary. In our example, we are using top_k sampling. \n",
|
|
|
+ "\n",
|
|
|
+ "By changing `max_length`, you can specify how long you’d like the generated response to be. \n",
|
|
|
+ "Setting the `num_return_sequences` parameter to greater than one will let you generate more than one output.\n",
|
|
|
+ "\n",
|
|
|
+ "In your script, add the following to provide input, and information on how to run the pipeline:\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "#### 5. Run the example"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "sequences = pipeline(\n",
|
|
|
+ " 'I have tomatoes, basil and cheese at home. What can I cook for dinner?\\n',\n",
|
|
|
+ " do_sample=True,\n",
|
|
|
+ " top_k=10,\n",
|
|
|
+ " num_return_sequences=1,\n",
|
|
|
+ " eos_token_id=tokenizer.eos_token_id,\n",
|
|
|
+ " max_length=400,\n",
|
|
|
+ ")\n",
|
|
|
+ "for seq in sequences:\n",
|
|
|
+ " print(f\"{seq['generated_text']}\")\n"
|
|
|
+ ]
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "metadata": {
|
|
|
+ "kernelspec": {
|
|
|
+ "display_name": "Python 3",
|
|
|
+ "language": "python",
|
|
|
+ "name": "python3"
|
|
|
+ },
|
|
|
+ "language_info": {
|
|
|
+ "name": "python",
|
|
|
+ "version": "3.8.3"
|
|
|
+ }
|
|
|
+ },
|
|
|
+ "nbformat": 4,
|
|
|
+ "nbformat_minor": 2
|
|
|
+}
|