
Merge pull request #1 from abhilash1910/main

Upstream from main
Abhilash Majumder 1 year ago
parent commit 6e65d151e2
64 changed files with 609 additions and 309 deletions
  1. CONTRIBUTING.md (+29, -1)
  2. README.md (+54, -24)
  3. UPDATES.md (+1, -1)
  4. dev_requirements.txt (+3, -0)
  5. docs/Dataset.md (+6, -6)
  6. docs/FAQ.md (+2, -2)
  7. docs/inference.md (+34, -22)
  8. docs/multi_gpu.md (+19, -25)
  9. docs/single_gpu.md (+15, -21)
  10. examples/README.md (+34, -0)
  11. inference/chat_completion.py (+9, -10)
  12. examples/chat_completion/chats.json (+0, -0)
  13. inference/code-llama/code_completion_example.py (+5, -5)
  14. examples/code_llama/code_completion_prompt.txt (+0, -0)
  15. inference/code-llama/code_infilling_example.py (+3, -4)
  16. examples/code_llama/code_infilling_prompt.txt (+0, -0)
  17. configs/__init__.py (+5, -3)
  18. inference/hf-text-generation-inference/README.md (+3, -3)
  19. examples/hf_text_generation_inference/merge_lora_weights.py (+0, -0)
  20. inference/inference.py (+6, -5)
  21. multi_node.slurm (+1, -1)
  22. examples/quickstart.ipynb (+0, -0)
  23. examples/samsum_prompt.txt (+0, -0)
  24. inference/vLLM_inference.py (+2, -9)
  25. ft_datasets/__init__.py (+0, -6)
  26. inference/README.md (+0, -11)
  27. policies/__init__.py (+0, -7)
  28. pyproject.toml (+41, -0)
  29. requirements.txt (+4, -5)
  30. scripts/spellcheck_conf/wordlist.txt (+27, -0)
  31. src/llama_recipes/configs/__init__.py (+6, -0)
  32. configs/datasets.py (+3, -3)
  33. configs/fsdp.py (+2, -2)
  34. configs/peft.py (+1, -1)
  35. configs/training.py (+2, -2)
  36. src/llama_recipes/datasets/__init__.py (+6, -0)
  37. ft_datasets/alpaca_dataset.py (+2, -4)
  38. ft_datasets/grammar_dataset/__init__.py (+0, -1)
  39. ft_datasets/grammar_dataset/grammar_dataset.py (+2, -18)
  40. src/llama_recipes/datasets/grammar_dataset/grammar_dataset_process.ipynb (+0, -0)
  41. ft_datasets/samsum_dataset.py (+2, -1)
  42. ft_datasets/utils.py (+1, -0)
  43. llama_finetuning.py (+17, -16)
  44. src/llama_recipes/inference/__init__.py (+2, -0)
  45. inference/chat_utils.py (+2, -1)
  46. inference/checkpoint_converter_fsdp_hf.py (+4, -2)
  47. src/llama_recipes/inference/model_utils.py (+0, -0)
  48. inference/safety_utils.py (+0, -2)
  49. model_checkpointing/__init__.py (+1, -1)
  50. src/llama_recipes/model_checkpointing/checkpoint_handler.py (+0, -0)
  51. src/llama_recipes/policies/__init__.py (+7, -0)
  52. policies/activation_checkpointing_functions.py (+2, -6)
  53. src/llama_recipes/policies/anyprecision_optimizer.py (+0, -0)
  54. policies/mixed_precision.py (+0, -4)
  55. policies/wrapping.py (+1, -15)
  56. src/llama_recipes/utils/__init__.py (+7, -0)
  57. utils/config_utils.py (+3, -3)
  58. utils/dataset_utils.py (+3, -4)
  59. utils/fsdp_utils.py (+0, -3)
  60. src/llama_recipes/utils/memory_utils.py (+81, -0)
  61. utils/train_utils.py (+29, -42)
  62. tests/test_finetuning.py (+72, -0)
  63. tests/test_train_utils.py (+48, -0)
  64. utils/__init__.py (+0, -7)

+ 29 - 1
CONTRIBUTING.md

@@ -28,4 +28,32 @@ outlined on that page and do not file a public issue.
 
 ## License
 By contributing to llama-recipes, you agree that your contributions will be licensed
-under the LICENSE file in the root directory of this source tree.
+under the LICENSE file in the root directory of this source tree.
+
+## Tests
+Llama-recipes currently comes with a basic set of unit tests (covering the parts of the main training script and training loop) but we strive to increase our test coverage in the future in order to mitigate silent errors.
+When submitting a new feature PR please make sure to cover the newly added code with a unit test.
+Run the tests locally to ensure the new feature does not break an old one.
+We use **pytest** for our unit tests and to run them locally you need to install llama-recipes with optional [tests] dependencies enabled:
+```
+pip install --extra-index-url https://download.pytorch.org/whl/test/cu118 llama-recipes[tests]
+```
+For development and contributing to llama-recipes please install from source with all optional dependencies:
+```
+pip install -U pip setuptools
+pip install --extra-index-url https://download.pytorch.org/whl/test/cu118 -e .[tests,auditnlg,vllm]
+```
+The unit tests can be found in the [tests](./tests/) folder and you can run them from the main directory using:
+```
+python -m pytest tests/
+```
+To run all tests of a single file you can give the filename directly:
+```
+python -m pytest tests/test_finetuning.py
+```
+To run a specific test you can filter for its name with
+```
+python -m pytest tests/test_finetuning.py -k test_finetuning_peft
+```
+To add a new test simply create a new test file under the tests folder (filename has to start with `test_`).
+Group tests spanning the same feature in the same file and create a subfolder if the tests are very extensive.
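
As an illustration of these conventions, here is a minimal sketch of a new test file; the file name, the function under test, and its behaviour are hypothetical (real tests import code from the llama_recipes package):

```python
# tests/test_example.py -- hypothetical file name following the test_* convention


def count_tokens(text):
    # Hypothetical stand-in for a function under test; in practice you would
    # import the code under test from the llama_recipes package instead.
    return len(text.split())


def test_count_tokens():
    assert count_tokens("fine tune llama") == 3


def test_with_mocker(mocker):
    # pytest-mock (installed via the [tests] extra) provides the `mocker` fixture.
    fake_model = mocker.MagicMock()
    fake_model.generate.return_value = "ok"
    assert fake_model.generate() == "ok"
```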

+ 54 - 24
README.md

@@ -20,9 +20,45 @@ Llama 2 is a new technology that carries potential risks with use. Testing condu
 
 # Quick Start
 
-[Llama 2 Jupyter Notebook](quickstart.ipynb): This jupyter notebook steps you through how to finetune a Llama 2 model on the text summarization task using the [samsum](https://huggingface.co/datasets/samsum). The notebook uses parameter efficient finetuning (PEFT) and int8 quantization to finetune a 7B on a single GPU like an A10 with 24GB gpu memory.
+[Llama 2 Jupyter Notebook](./examples/quickstart.ipynb): This jupyter notebook steps you through how to finetune a Llama 2 model on the text summarization task using the [samsum](https://huggingface.co/datasets/samsum). The notebook uses parameter efficient finetuning (PEFT) and int8 quantization to finetune a 7B on a single GPU like an A10 with 24GB gpu memory.
 
-**Note** All the setting defined in [config files](./configs/) can be passed as args through CLI when running the script, there is no need to change from config files directly.
+# Installation
+Llama-recipes provides a pip distribution for easy install and usage in other projects. Alternatively, it can be installed from source.
+
+## Install with pip
+```
+pip install --extra-index-url https://download.pytorch.org/whl/test/cu118 llama-recipes
+```
+## Install from source
+To install from source e.g. for development use this command. We're using hatchling as our build backend which requires an up-to-date pip as well as setuptools package.
+```
+pip install -U pip setuptools
+pip install --extra-index-url https://download.pytorch.org/whl/test/cu118 -e .
+```
+For development and contributing to llama-recipes please install all optional dependencies:
+```
+pip install -U pip setuptools
+pip install --extra-index-url https://download.pytorch.org/whl/test/cu118 -e .[tests,auditnlg,vllm]
+```
+## Install with optional dependencies
+Llama-recipes offers the installation of optional packages. There are three optional dependency groups.
+To run the unit tests we can install the required dependencies with:
+```
+pip install --extra-index-url https://download.pytorch.org/whl/test/cu118 llama-recipes[tests]
+```
+For the vLLM example we need additional requirements that can be installed with:
+```
+pip install --extra-index-url https://download.pytorch.org/whl/test/cu118 llama-recipes[vllm]
+```
+To use the sensitive topics safety checker install with:
+```
+pip install --extra-index-url https://download.pytorch.org/whl/test/cu118 llama-recipes[auditnlg]
+```
+Optional dependencies can also be combines with [option1,option2].
+
+⚠️ **Note** ⚠️  Some features (especially fine-tuning with FSDP + PEFT) currently require PyTorch nightlies to be installed. Please make sure to install the nightlies if you're using these features following [this guide](https://pytorch.org/get-started/locally/).
+
+**Note** All the setting defined in [config files](src/llama_recipes/configs/) can be passed as args through CLI when running the script, there is no need to change from config files directly.
 
 **Note** In case need to run PEFT model with FSDP, please make sure to use the PyTorch Nightlies.
 
@@ -35,16 +71,9 @@ Llama 2 is a new technology that carries potential risks with use. Testing condu
 * [Inference](./docs/inference.md)
 * [FAQs](./docs/FAQ.md)
 
-## Requirements
-To run the examples, make sure to install the requirements using
-
-```bash
-# python 3.9 or higher recommended
-pip install -r requirements.txt
-
-```
+# Where to find the models?
 
-**Please note that the above requirements.txt will install PyTorch 2.0.1 version, in case you want to run FSDP + PEFT, please make sure to install PyTorch nightlies.**
+You can find llama v2 models on HuggingFace hub [here](https://huggingface.co/meta-llama), where models with `hf` in the name are already converted to HuggingFace checkpoints so no further conversion is needed. The conversion step below is only for original model weights from Meta that are hosted on HuggingFace model hub as well.
 
 # Model conversion to Hugging Face
 The recipes and notebooks in this folder are using the Llama 2 model definition provided by Hugging Face's transformers library.
@@ -76,7 +105,7 @@ All the parameters in the examples and recipes below need to be further tuned to
 
 * Default dataset and other LORA config has been set to `samsum_dataset`.
 
-* Make sure to set the right path to the model in the [training config](./configs/training.py).
+* Make sure to set the right path to the model in the [training config](src/llama_recipes/configs/training.py).
 
 ### Single GPU:
 
@@ -84,7 +113,7 @@ All the parameters in the examples and recipes below need to be further tuned to
 #if running on multi-gpu machine
 export CUDA_VISIBLE_DEVICES=0
 
-python llama_finetuning.py  --use_peft --peft_method lora --quantization --model_name /patht_of_model_folder/7B --output_dir Path/to/save/PEFT/model
+python -m llama_recipes.finetuning  --use_peft --peft_method lora --quantization --model_name /patht_of_model_folder/7B --output_dir Path/to/save/PEFT/model
 
 ```
 
@@ -92,7 +121,7 @@ Here we make use of Parameter Efficient Methods (PEFT) as described in the next
 
 **Note** if you are running on a machine with multiple GPUs please make sure to only make one of them visible using `export CUDA_VISIBLE_DEVICES=GPU:id`
 
-**Make sure you set [save_model](configs/training.py) in [training.py](configs/training.py) to save the model. Be sure to check the other training settings in [train config](configs/training.py) as well as others in the config folder as needed or they can be passed as args to the training script as well.**
+**Make sure you set `save_model` parameter to save the model. Be sure to check the other training parameter in [train config](src/llama_recipes/configs/training.py) as well as others in the config folder as needed. All parameter can be passed as args to the training script. No need to alter the config files.**
 
 
 ### Multiple GPUs One Node:
@@ -101,7 +130,7 @@ Here we make use of Parameter Efficient Methods (PEFT) as described in the next
 
 ```bash
 
-torchrun --nnodes 1 --nproc_per_node 4  llama_finetuning.py --enable_fsdp --use_peft --peft_method lora --model_name /patht_of_model_folder/7B --pure_bf16 --output_dir Path/to/save/PEFT/model
+torchrun --nnodes 1 --nproc_per_node 4  examples/finetuning.py --enable_fsdp --use_peft --peft_method lora --model_name /patht_of_model_folder/7B --pure_bf16 --output_dir Path/to/save/PEFT/model
 
 ```
 
@@ -112,7 +141,7 @@ Here we use FSDP as discussed in the next section which can be used along with P
 Setting `use_fast_kernels` will enable using of Flash Attention or Xformer memory-efficient kernels based on the hardware being used. This would speed up the fine-tuning job. This has been enabled in `optimum` library from HuggingFace as a one-liner API, please read more [here](https://pytorch.org/blog/out-of-the-box-acceleration/).
 
 ```bash
-torchrun --nnodes 1 --nproc_per_node 4  llama_finetuning.py --enable_fsdp --use_peft --peft_method lora --model_name /patht_of_model_folder/7B --pure_bf16 --output_dir Path/to/save/PEFT/model --use_fast_kernels
+torchrun --nnodes 1 --nproc_per_node 4  examples/finetuning.py --enable_fsdp --use_peft --peft_method lora --model_name /patht_of_model_folder/7B --pure_bf16 --output_dir Path/to/save/PEFT/model --use_fast_kernels
 ```
 
 ### Fine-tuning using FSDP Only
@@ -121,7 +150,7 @@ If you are interested in running full parameter fine-tuning without making use o
 
 ```bash
 
-torchrun --nnodes 1 --nproc_per_node 8  llama_finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --use_fast_kernels
+torchrun --nnodes 1 --nproc_per_node 8  examples/finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --use_fast_kernels
 
 ```
 
@@ -131,7 +160,7 @@ If you are interested in running full parameter fine-tuning on the 70B model, yo
 
 ```bash
 
-torchrun --nnodes 1 --nproc_per_node 8 llama_finetuning.py --enable_fsdp --low_cpu_fsdp --pure_bf16 --model_name /patht_of_model_folder/70B --batch_size_training 1 --micro_batch_size 1 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned
+torchrun --nnodes 1 --nproc_per_node 8 examples/finetuning.py --enable_fsdp --low_cpu_fsdp --pure_bf16 --model_name /patht_of_model_folder/70B --batch_size_training 1 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned
 
 ```
 
@@ -149,20 +178,21 @@ You can read more about our fine-tuning strategies [here](./docs/LLM_finetuning.
 # Repository Organization
 This repository is organized in the following way:
 
-[configs](configs/): Contains the configuration files for PEFT methods, FSDP, Datasets.
+[configs](src/llama_recipes/configs/): Contains the configuration files for PEFT methods, FSDP, Datasets.
 
 [docs](docs/): Example recipes for single and multi-gpu fine-tuning recipes.
 
-[ft_datasets](ft_datasets/): Contains individual scripts for each dataset to download and process. Note: Use of any of the datasets should be in compliance with the dataset's underlying licenses (including but not limited to non-commercial uses)
+[datasets](src/llama_recipes/datasets/): Contains individual scripts for each dataset to download and process. Note: Use of any of the datasets should be in compliance with the dataset's underlying licenses (including but not limited to non-commercial uses)
 
+[examples](./examples/): Contains examples script for finetuning and inference of the Llama 2 model as well as how to use them safely.
 
-[inference](inference/): Includes examples for inference for the fine-tuned models and how to use them safely.
+[inference](src/llama_recipes/inference/): Includes modules for inference for the fine-tuned models.
 
-[model_checkpointing](model_checkpointing/): Contains FSDP checkpoint handlers.
+[model_checkpointing](src/llama_recipes/model_checkpointing/): Contains FSDP checkpoint handlers.
 
-[policies](policies/): Contains FSDP scripts to provide different policies, such as mixed precision, transformer wrapping policy and activation checkpointing along with any precision optimizer (used for running FSDP with pure bf16 mode).
+[policies](src/llama_recipes/policies/): Contains FSDP scripts to provide different policies, such as mixed precision, transformer wrapping policy and activation checkpointing along with any precision optimizer (used for running FSDP with pure bf16 mode).
 
-[utils](utils/): Utility files for:
+[utils](src/llama_recipes/utils/): Utility files for:
 
 - `train_utils.py` provides training/eval loop and more train utils.
 

+ 1 - 1
UPDATES.md

@@ -16,4 +16,4 @@ As noted in the documentation, these strings are required to use the fine-tuned
 ### Updated approach
 We recommend sanitizing [these strings](https://github.com/facebookresearch/llama#fine-tuned-chat-models) from any user provided prompts. Sanitization of user prompts mitigates malicious or accidental abuse of these strings. The provided scripts have been updated to do this. 
 
-Note: even with this update safety classifiers should still be applied to catch unsafe behaviors or content produced by the model. An [example](https://github.com/facebookresearch/llama-recipes/blob/main/inference/inference.py) of how to deploy such a classifier can be found in the llama-recipes repository.
+Note: even with this update safety classifiers should still be applied to catch unsafe behaviors or content produced by the model. An [example](https://github.com/facebookresearch/llama-recipes/blob/main/examples/inference.py) of how to deploy such a classifier can be found in the llama-recipes repository.

+ 3 - 0
dev_requirements.txt

@@ -0,0 +1,3 @@
+vllm
+pytest-mock
+auditnlg

+ 6 - 6
docs/Dataset.md

@@ -1,6 +1,6 @@
 # Datasets and Evaluation Metrics
 
-The provided fine tuning script allows you to select between three datasets by passing the `dataset` arg to the `llama_finetuning.py` script. The current options are `grammar_dataset`, `alpaca_dataset`and `samsum_dataset`. Note: Use of any of the datasets should be in compliance with the dataset's underlying licenses (including but not limited to non-commercial uses)
+The provided fine tuning script allows you to select between three datasets by passing the `dataset` arg to the `llama_recipes.finetuning` module or `examples/finetuning.py` script. The current options are `grammar_dataset`, `alpaca_dataset`and `samsum_dataset`. Note: Use of any of the datasets should be in compliance with the dataset's underlying licenses (including but not limited to non-commercial uses)
 
 * [grammar_dataset](https://huggingface.co/datasets/jfleg) contains 150K pairs of english sentences and possible corrections.
 * [alpaca_dataset](https://github.com/tatsu-lab/stanford_alpaca) provides 52K instruction-response pairs as generated by `text-davinci-003`.
@@ -10,18 +10,18 @@ The provided fine tuning script allows you to select between three datasets by p
 
 The list of available datasets can easily be extended with custom datasets by following these instructions.
 
-Each dataset has a corresponding configuration (dataclass) in [configs/datasets.py](../configs/datasets.py) which contains the dataset name, training/validation split names, as well as optional parameters like datafiles etc.
+Each dataset has a corresponding configuration (dataclass) in [configs/datasets.py](../src/llama_recipes/configs/datasets.py) which contains the dataset name, training/validation split names, as well as optional parameters like datafiles etc.
 
-Additionally, there is a preprocessing function for each dataset in the [ft_datasets](../ft_datasets) folder.
+Additionally, there is a preprocessing function for each dataset in the [datasets](../src/llama_recipes/datasets) folder.
 The returned data of the dataset needs to be consumable by the forward method of the fine-tuned model by calling ```model(**data)```.
 For CausalLM models this usually means that the data needs to be in the form of a dictionary with "input_ids", "attention_mask" and "labels" fields.
 
 To add a custom dataset the following steps need to be performed.
 
-1. Create a dataset configuration after the schema described above. Examples can be found in [configs/datasets.py](../configs/datasets.py).
+1. Create a dataset configuration after the schema described above. Examples can be found in [configs/datasets.py](../src/llama_recipes/configs/datasets.py).
 2. Create a preprocessing routine which loads the data and returns a PyTorch style dataset. The signature for the preprocessing function needs to be (dataset_config, tokenizer, split_name) where split_name will be the string for train/validation split as defined in the dataclass.
-3. Register the dataset name and preprocessing function by inserting it as key and value into the DATASET_PREPROC dictionary in [utils/dataset_utils.py](../utils/dataset_utils.py)
-4. Set dataset field in training config to dataset name or use --dataset option of the llama_finetuning.py training script.
+3. Register the dataset name and preprocessing function by inserting it as key and value into the DATASET_PREPROC dictionary in [utils/dataset_utils.py](../src/llama_recipes/utils/dataset_utils.py)
+4. Set dataset field in training config to dataset name or use --dataset option of the `llama_recipes.finetuning` module or examples/finetuning.py training script.
 
 ## Application
 Below we list other datasets and their main use cases that can be used for fine tuning.
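
To make the four steps above concrete, here is a minimal sketch of a custom dataset; every name in it (`my_custom_dataset`, `get_my_custom_dataset`, the inline sample texts) is hypothetical and only illustrates the config/preprocessing shape described in the diff:

```python
# Hypothetical custom dataset sketch: config dataclass plus preprocessing function.
from dataclasses import dataclass

import torch
from torch.utils.data import Dataset


@dataclass
class my_custom_dataset:
    dataset: str = "my_custom_dataset"
    train_split: str = "train"
    test_split: str = "validation"
    data_path: str = "data/my_data.json"  # optional datafile-style parameter


class _MyCustomDataset(Dataset):
    def __init__(self, dataset_config, tokenizer, split_name):
        # Toy data; a real implementation would load dataset_config.data_path
        # and select the split named by split_name.
        texts = ["example input one", "example input two"]
        self.encodings = [tokenizer(t) for t in texts]

    def __len__(self):
        return len(self.encodings)

    def __getitem__(self, idx):
        input_ids = torch.tensor(self.encodings[idx]["input_ids"])
        return {
            "input_ids": input_ids,
            "attention_mask": torch.tensor(self.encodings[idx]["attention_mask"]),
            "labels": input_ids.clone(),  # causal LM: labels mirror input_ids
        }


def get_my_custom_dataset(dataset_config, tokenizer, split_name):
    # Matches the (dataset_config, tokenizer, split_name) signature and would be
    # registered in the DATASET_PREPROC dict in utils/dataset_utils.py.
    return _MyCustomDataset(dataset_config, tokenizer, split_name)
```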

+ 2 - 2
docs/FAQ.md

@@ -34,8 +34,8 @@ Here we discuss frequently asked questions that may occur and we found useful al
 os.environ['PYTORCH_CUDA_ALLOC_CONF']='expandable_segments:True'
 
 ```
-We also added this enviroment variable in `setup_environ_flags` of the [train_utils.py](../utils/train_utils.py), feel free to uncomment it if required.
+We also added this enviroment variable in `setup_environ_flags` of the [train_utils.py](../src/llama_recipes/utils/train_utils.py), feel free to uncomment it if required.
 
 8. Additional debugging flags? the environment variable `TORCH_DISTRIBUTED_DEBUG` can be used to trigger additional useful logging and collective synchronization checks to ensure all ranks are synchronized appropriately. `TORCH_DISTRIBUTED_DEBUG` can be set to either OFF (default), INFO, or DETAIL depending on the debugging level required. Please note that the most verbose option, DETAIL may impact the application performance and thus should only be used when debugging issues.
 
-We also added this enviroment variable in `setup_environ_flags` of the [train_utils.py](../utils/train_utils.py), feel free to uncomment it if required.
+We also added this enviroment variable in `setup_environ_flags` of the [train_utils.py](../src/llama_recipes/utils/train_utils.py), feel free to uncomment it if required.

+ 34 - 22
docs/inference.md

@@ -1,6 +1,6 @@
 # Inference
 
-For inference we have provided an [inference script](../inference/inference.py). Depending on the type of finetuning performed during training the [inference script](../inference/inference.py) takes different arguments.
+For inference we have provided an [inference script](../examples/inference.py). Depending on the type of finetuning performed during training the [inference script](../examples/inference.py) takes different arguments.
 To finetune all model parameters the output dir of the training has to be given as --model_name argument.
 In the case of a parameter efficient method like lora the base model has to be given as --model_name and the output dir of the training has to be given as --peft_model argument.
 Additionally, a prompt for the model in the form of a text file has to be provided. The prompt file can either be piped through standard input or given as --prompt_file parameter.
@@ -15,15 +15,15 @@ Examples:
 
  ```bash
 # Full finetuning of all parameters
-cat <test_prompt_file> | python inference/inference.py --model_name <training_config.output_dir> --use_auditnlg
+cat <test_prompt_file> | python examples/inference.py --model_name <training_config.output_dir> --use_auditnlg
 # PEFT method
-cat <test_prompt_file> | python inference/inference.py --model_name <training_config.model_name> --peft_model <training_config.output_dir> --use_auditnlg
+cat <test_prompt_file> | python examples/inference.py --model_name <training_config.model_name> --peft_model <training_config.output_dir> --use_auditnlg
 # prompt as parameter
-python inference/inference.py --model_name <training_config.output_dir> --prompt_file <test_prompt_file> --use_auditnlg
+python examples/inference.py --model_name <training_config.output_dir> --prompt_file <test_prompt_file> --use_auditnlg
  ```
-The inference folder contains test prompts for summarization use-case:
+The example folder contains test prompts for summarization use-case:
 ```
-inference/samsum_prompt.txt
+examples/samsum_prompt.txt
 ...
 ```
 
@@ -33,26 +33,26 @@ Currently pad token by default in [HuggingFace Tokenizer is `None`](https://gith
 ```python
 tokenizer.add_special_tokens(
         {
-         
+
             "pad_token": "<PAD>",
         }
     )
-model.resize_token_embeddings(model.config.vocab_size + 1) 
+model.resize_token_embeddings(model.config.vocab_size + 1)
 ```
-Padding would be required for batch inference. In this this [example](../inference/inference.py), batch size = 1 so essentially padding is not required. However,We added the code pointer as an example in case of batch inference.
+Padding would be required for batch inference. In this this [example](../examples/inference.py), batch size = 1 so essentially padding is not required. However,We added the code pointer as an example in case of batch inference.
 
 **Chat completion**
 The inference folder also includes a chat completion example, that adds built-in safety features in fine-tuned models to the prompt tokens. To run the example:
 
 ```bash
-python inference/chat_completion.py --model_name "PATH/TO/MODEL/7B/" --prompt_file inference/chats.json  --quantization --use_auditnlg
+python examples/chat_completion/chat_completion.py --model_name "PATH/TO/MODEL/7B/" --prompt_file examples/chat_completion/chats.json  --quantization --use_auditnlg
 
 ```
 **Code Llama**
 
 Code llama was recently released with three flavors, base-model that support multiple programming languages, Python fine-tuned model and an instruction fine-tuned and aligned variation of Code Llama, please read more [here](https://ai.meta.com/blog/code-llama-large-language-model-coding/). Also note that the Python fine-tuned model and 34B models are not trained on infilling objective, hence can not be used for infilling use-case.
 
-Find the scripts to run Code Llama [here](../inference/code-llama/), where there are two examples of running code completion and infilling.
+Find the scripts to run Code Llama [here](../examples/code_llama/), where there are two examples of running code completion and infilling.
 
 **Note** Please find the right model on HF side [here](https://huggingface.co/codellama). 
 
@@ -68,7 +68,7 @@ To run the code completion example:
 
 ```bash
 
-python code_completion_example.py --model_name MODEL_NAME  --prompt_file code_completion_prompt.txt --temperature 0.2 --top_p 0.9
+python examples/code_llama/code_completion_example.py --model_name MODEL_NAME  --prompt_file examples/code_llama/code_completion_prompt.txt --temperature 0.2 --top_p 0.9
 
 ```
 
@@ -76,7 +76,7 @@ To run the code infilling example:
 
 ```bash
 
-python code_infilling_example.py --model_name MODEL_NAME --prompt_file code_infilling_prompt.txt --temperature 0.2 --top_p 0.9
+python examples/code_llama/code_infilling_example.py --model_name MODEL_NAME --prompt_file examples/code_llama/code_infilling_prompt.txt --temperature 0.2 --top_p 0.9
 
 ```
 
@@ -85,25 +85,25 @@ python code_infilling_example.py --model_name MODEL_NAME --prompt_file code_infi
 Setting `use_fast_kernels` will enable using of Flash Attention or Xformer memory-efficient kernels based on the hardware being used. This would speed up inference when used for batched inputs. This has been enabled in `optimum` library from HuggingFace as a one-liner API, please read more [here](https://pytorch.org/blog/out-of-the-box-acceleration/).
 
 ```bash
-python inference/chat_completion.py --model_name "PATH/TO/MODEL/7B/" --prompt_file inference/chats.json  --quantization --use_auditnlg --use_fast_kernels
+python examples/chat_completion/chat_completion.py --model_name "PATH/TO/MODEL/7B/" --prompt_file examples/chat_completion/chats.json  --quantization --use_auditnlg --use_fast_kernels
 
-python inference/inference.py --model_name <training_config.output_dir> --peft_model <training_config.output_dir> --prompt_file <test_prompt_file> --use_auditnlg --use_fast_kernels
+python examples/inference.py --model_name <training_config.output_dir> --peft_model <training_config.output_dir> --prompt_file <test_prompt_file> --use_auditnlg --use_fast_kernels
 
 ```
 
 ## Loading back FSDP checkpoints
 
-In case you have fine-tuned your model with pure FSDP and saved the checkpoints with "SHARDED_STATE_DICT" as shown [here](../configs/fsdp.py), you can use this converter script to convert the FSDP Sharded checkpoints into HuggingFace checkpoints. This enables you to use the inference script normally as mentioned above.
+In case you have fine-tuned your model with pure FSDP and saved the checkpoints with "SHARDED_STATE_DICT" as shown [here](../src/llama_recipes/configs/fsdp.py), you can use this converter script to convert the FSDP Sharded checkpoints into HuggingFace checkpoints. This enables you to use the inference script normally as mentioned above.
 **To convert the checkpoint use the following command**:
 
 This is helpful if you have fine-tuned you model using FSDP only as follows:
 
 ```bash
-torchrun --nnodes 1 --nproc_per_node 8  llama_finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 
+torchrun --nnodes 1 --nproc_per_node 8  examples/finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16
 ```
 Then convert your FSDP checkpoint to HuggingFace checkpoints using:
 ```bash
- python inference/checkpoint_converter_fsdp_hf.py --fsdp_checkpoint_path  PATH/to/FSDP/Checkpoints --consolidated_model_path PATH/to/save/checkpoints --HF_model_path_or_name PATH/or/HF/model_name
+ python -m llama_recipes.inference.checkpoint_converter_fsdp_hf --fsdp_checkpoint_path  PATH/to/FSDP/Checkpoints --consolidated_model_path PATH/to/save/checkpoints --HF_model_path_or_name PATH/or/HF/model_name
 
  # --HF_model_path_or_name specifies the HF Llama model name or path where it has config.json and tokenizer.json
  ```
@@ -112,10 +112,22 @@ By default, training parameter are saved in `train_params.yaml` in the path wher
 Then run inference using:
 
 ```bash
-python inference/inference.py --model_name <training_config.output_dir> --prompt_file <test_prompt_file> 
+python examples/inference.py --model_name <training_config.output_dir> --prompt_file <test_prompt_file> 
 
 ```
 
+## Prompt Llama 2
+
+As outlined by [this blog by Hugging Face](https://huggingface.co/blog/llama2#how-to-prompt-llama-2), you can use the template below to prompt Llama 2 chat models. Review the [blog article](https://huggingface.co/blog/llama2#how-to-prompt-llama-2) for more information.
+
+```
+<s>[INST] <<SYS>>
+{{ system_prompt }}
+<</SYS>>
+
+{{ user_message }} [/INST]
+
+```
 
 ## Other Inference Options
 
@@ -123,12 +135,12 @@ Alternate inference options include:
 
 [**vLLM**](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html):
 To use vLLM you will need to install it using the instructions [here](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#installation).
-Once installed, you can use the vLLM_ineference.py script provided [here](../inference/vLLM_inference.py).
+Once installed, you can use the vllm/inference.py script provided [here](../examples/vllm/inference.py).
 
 Below is an example of how to run the vLLM_inference.py script found within the inference folder.
 
 ``` bash
-python vLLM_inference.py --model_name <PATH/TO/MODEL/7B>
+python examples/vllm/inference.py --model_name <PATH/TO/MODEL/7B>
 ```
 
-[**TGI**](https://github.com/huggingface/text-generation-inference): Text Generation Inference (TGI) is another inference option available to you. For more information on how to set up and use TGI see [here](../inference/hf-text-generation-inference/README.md).
+[**TGI**](https://github.com/huggingface/text-generation-inference): Text Generation Inference (TGI) is another inference option available to you. For more information on how to set up and use TGI see [here](../examples/hf_text_generation_inference/README.md).
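
For convenience, the chat template quoted above can be filled in with a few lines of Python; the function and variable names here are illustrative and not part of the repository:

```python
# Builds a single-turn Llama 2 chat prompt following the template quoted above.
# Note: depending on tokenizer settings, the leading <s> token may be added automatically.
def build_llama2_prompt(system_prompt, user_message):
    return (
        "<s>[INST] <<SYS>>\n"
        f"{system_prompt}\n"
        "<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )


print(build_llama2_prompt(
    system_prompt="You are a helpful assistant.",
    user_message="Summarize the following dialog: ...",
))
```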

+ 19 - 25
docs/multi_gpu.md

@@ -9,15 +9,9 @@ To run fine-tuning on multi-GPUs, we will  make use of two packages:
 Given the combination of PEFT and FSDP, we would be able to fine tune a Llama 2 model on multiple GPUs in one node or multi-node.
 
 ## Requirements 
-To run the examples, make sure to install the requirements using 
+To run the examples, make sure to install the llama-recipes package and clone the github repository in order to use the provided [`examples/finetuning.py`](../examples/finetuning.py) script with torchrun (See [README.md](../README.md) for details).
 
-```bash
-
-pip install -r requirements.txt
-
-```
-
-**Please note that the above requirements.txt will install PyTorch 2.0.1 version, in case you want to run FSDP + PEFT, please make sure to install PyTorch nightlies.**
+**Please note that the llama_recipes package will install PyTorch 2.0.1 version, in case you want to run FSDP + PEFT, please make sure to install PyTorch nightlies.**
 
 ## How to run it
 
@@ -30,7 +24,7 @@ This runs with the `samsum_dataset` for summarization application by default.
 
 ```bash
 
-torchrun --nnodes 1 --nproc_per_node 4  ../llama_finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --use_peft --peft_method lora --output_dir Path/to/save/PEFT/model
+torchrun --nnodes 1 --nproc_per_node 4  examples/finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --use_peft --peft_method lora --output_dir Path/to/save/PEFT/model
 
 ```
 
@@ -49,7 +43,7 @@ We use `torchrun` here to spawn multiple processes for FSDP.
 Setting `use_fast_kernels` will enable using of Flash Attention or Xformer memory-efficient kernels based on the hardware being used. This would speed up the fine-tuning job. This has been enabled in `optimum` library from HuggingFace as a one-liner API, please read more [here](https://pytorch.org/blog/out-of-the-box-acceleration/).
 
 ```bash
-torchrun --nnodes 1 --nproc_per_node 4  ../llama_finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --use_peft --peft_method lora --output_dir Path/to/save/PEFT/model --use_fast_kernels
+torchrun --nnodes 1 --nproc_per_node 4  examples/finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --use_peft --peft_method lora --output_dir Path/to/save/PEFT/model --use_fast_kernels
 ```
 
 ### Fine-tuning using FSDP Only
@@ -58,7 +52,7 @@ If interested in running full parameter finetuning without making use of PEFT me
 
 ```bash
 
-torchrun --nnodes 1 --nproc_per_node 8  llama_finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --use_fast_kernels
+torchrun --nnodes 1 --nproc_per_node 8  examples/finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --use_fast_kernels
 
 ```
 
@@ -68,7 +62,7 @@ If you are interested in running full parameter fine-tuning on the 70B model, yo
 
 ```bash
 
-torchrun --nnodes 1 --nproc_per_node 8 llama_finetuning.py --enable_fsdp --low_cpu_fsdp --pure_bf16 --model_name /patht_of_model_folder/70B --batch_size_training 1 --micro_batch_size 1 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned
+torchrun --nnodes 1 --nproc_per_node 8 examples/finetuning.py --enable_fsdp --low_cpu_fsdp --pure_bf16 --model_name /patht_of_model_folder/70B --batch_size_training 1 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned
 
 ```
 
@@ -78,21 +72,21 @@ Here we use a slurm script to schedule a job with slurm over multiple nodes.
 
 ```bash
 
-sbatch multi_node.slurm
+sbatch examples/multi_node.slurm
 # Change the num nodes and GPU per nodes in the script before running.
 
 ```
 
 ## How to run with different datasets?
 
-Currently 4 datasets are supported that can be found in [Datasets config file](../configs/datasets.py).
+Currently 4 datasets are supported that can be found in [Datasets config file](../src/llama_recipes/configs/datasets.py).
 
-* `grammar_dataset` : use this [notebook](../ft_datasets/grammar_dataset/grammar_dataset_process.ipynb) to pull and process theJfleg and C4 200M datasets for grammar checking.
+* `grammar_dataset` : use this [notebook](../src/llama_recipes/datasets/grammar_dataset/grammar_dataset_process.ipynb) to pull and process theJfleg and C4 200M datasets for grammar checking.
 
-* `alpaca_dataset` : to get this open source data please download the `aplaca.json` to `ft_dataset` folder.
+* `alpaca_dataset` : to get this open source data please download the `aplaca.json` to `dataset` folder.
 
 ```bash
-wget -P ft_datasets https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json
+wget -P src/llama_recipes/datasets https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json
 ```
 
 * `samsum_dataset`
@@ -101,22 +95,22 @@ To run with each of the datasets set the `dataset` flag in the command as shown
 
 ```bash
 # grammer_dataset
-torchrun --nnodes 1 --nproc_per_node 4  ../llama_finetuning.py --enable_fsdp  --model_name /patht_of_model_folder/7B --use_peft --peft_method lora --dataset grammar_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned  --pure_bf16 --output_dir Path/to/save/PEFT/model
+torchrun --nnodes 1 --nproc_per_node 4  examples/finetuning.py --enable_fsdp  --model_name /patht_of_model_folder/7B --use_peft --peft_method lora --dataset grammar_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned  --pure_bf16 --output_dir Path/to/save/PEFT/model
 
 # alpaca_dataset
 
-torchrun --nnodes 1 --nproc_per_node 4  ../llama_finetuning.py --enable_fsdp  --model_name /patht_of_model_folder/7B --use_peft --peft_method lora --dataset alpaca_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
+torchrun --nnodes 1 --nproc_per_node 4  examples/finetuning.py --enable_fsdp  --model_name /patht_of_model_folder/7B --use_peft --peft_method lora --dataset alpaca_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
 
 
 # samsum_dataset
 
-torchrun --nnodes 1 --nproc_per_node 4  ../llama_finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --use_peft --peft_method lora --dataset samsum_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
+torchrun --nnodes 1 --nproc_per_node 4  examples/finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --use_peft --peft_method lora --dataset samsum_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
 
 ```
 
 ## Where to configure settings?
 
-* [Training config file](../configs/training.py) is the main config file that helps to specify the settings for our run and can be found in [configs folder](../configs/)
+* [Training config file](../src/llama_recipes/configs/training.py) is the main config file that helps to specify the settings for our run and can be found in [configs folder](../src/llama_recipes/configs/)
 
 It lets us specify the training settings for everything from `model_name` to `dataset_name`, `batch_size` and so on. Below is the list of supported settings:
 
@@ -126,6 +120,7 @@ model_name: str="PATH/to/LLAMA 2/7B"
 enable_fsdp: bool= False
 run_validation: bool=True
 batch_size_training: int=4
+gradient_accumulation_steps: int=1
 num_epochs: int=3
 num_workers_dataloader: int=2
 lr: float=2e-4
@@ -135,7 +130,6 @@ use_fp16: bool=False
 mixed_precision: bool=True
 val_batch_size: int=4
 dataset = "samsum_dataset" # alpaca_dataset, grammar_dataset
-micro_batch_size: int=1
 peft_method: str = "lora" # None , llama_adapter, prefix
 use_peft: bool=False
 output_dir: str = "./ft-output"
@@ -149,11 +143,11 @@ save_optimizer: bool=False
 
 ```
 
-* [Datasets config file](../configs/datasets.py) provides the available options for datasets.
+* [Datasets config file](../src/llama_recipes/configs/datasets.py) provides the available options for datasets.
 
-* [peft config file](../configs/peft.py) provides the supported PEFT methods and respective settings that can be modified.
+* [peft config file](../src/llama_recipes/configs/peft.py) provides the supported PEFT methods and respective settings that can be modified.
 
-* [FSDP config file](../configs/fsdp.py) provides FSDP settings such as:
+* [FSDP config file](../src/llama_recipes/configs/fsdp.py) provides FSDP settings such as:
 
     * `mixed_precision` boolean flag to specify using mixed precision, defatults to true.
 

+ 15 - 21
docs/single_gpu.md

@@ -4,29 +4,23 @@ To run fine-tuning on a single GPU, we will  make use of two packages
 
 1- [PEFT](https://huggingface.co/blog/peft) methods and in specific using HuggingFace [PEFT](https://github.com/huggingface/peft)library.
 
-2- [BitandBytes](https://github.com/TimDettmers/bitsandbytes) int8 quantization.
+2- [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) int8 quantization.
 
 Given combination of PEFT and Int8 quantization, we would be able to fine_tune a Llama 2 7B model on one consumer grade GPU such as A10.
 
 ## Requirements 
-To run the examples, make sure to install the requirements using 
+To run the examples, make sure to install the llama-recipes package (See [README.md](../README.md) for details).
 
-```bash
-
-pip install -r requirements.txt
-
-```
-
-**Please note that the above requirements.txt will install PyTorch 2.0.1 version, in case you want to run FSDP + PEFT, please make sure to install PyTorch nightlies.**
+**Please note that the llama-recipes package will install PyTorch 2.0.1 version, in case you want to run FSDP + PEFT, please make sure to install PyTorch nightlies.**
 
 ## How to run it?
 
-Get access to a machine with one GPU or if using a multi-GPU macine please make sure to only make one of them visible using `export CUDA_VISIBLE_DEVICES=GPU:id` and run the following. It runs by default with `samsum_dataset` for summarization application.
+Get access to a machine with one GPU or if using a multi-GPU machine please make sure to only make one of them visible using `export CUDA_VISIBLE_DEVICES=GPU:id` and run the following. It runs by default with `samsum_dataset` for summarization application.
 
 
 ```bash
 
-python ../llama_finetuning.py  --use_peft --peft_method lora --quantization --use_fp16 --model_name /patht_of_model_folder/7B --output_dir Path/to/save/PEFT/model
+python -m llama_recipes.finetuning  --use_peft --peft_method lora --quantization --use_fp16 --model_name /patht_of_model_folder/7B --output_dir Path/to/save/PEFT/model
 
 ```
 The args used in the command above are:
@@ -40,14 +34,14 @@ The args used in the command above are:
 
 ## How to run with different datasets?
 
-Currently 4 datasets are supported that can be found in [Datasets config file](../configs/datasets.py).
+Currently 4 datasets are supported that can be found in [Datasets config file](../src/llama_recipes/configs/datasets.py).
 
-* `grammar_dataset` : use this [notebook](../ft_datasets/grammar_dataset/grammar_dataset_process.ipynb) to pull and process theJfleg and C4 200M datasets for grammar checking.
+* `grammar_dataset` : use this [notebook](../src/llama_recipes/datasets/grammar_dataset/grammar_dataset_process.ipynb) to pull and process theJfleg and C4 200M datasets for grammar checking.
 
 * `alpaca_dataset` : to get this open source data please download the `aplaca.json` to `ft_dataset` folder.
 
 ```bash
-wget -P ft_datasets https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json
+wget -P src/llama_recipes/datasets https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json
 ```
 
 * `samsum_dataset`
@@ -57,22 +51,22 @@ to run with each of the datasets set the `dataset` flag in the command as shown
 ```bash
 # grammer_dataset
 
-python ../llama_finetuning.py  --use_peft --peft_method lora --quantization  --dataset grammar_dataset --model_name /patht_of_model_folder/7B --output_dir Path/to/save/PEFT/model
+python -m llama_recipes.finetuning  --use_peft --peft_method lora --quantization  --dataset grammar_dataset --model_name /patht_of_model_folder/7B --output_dir Path/to/save/PEFT/model
 
 # alpaca_dataset
 
-python ../llama_finetuning.py  --use_peft --peft_method lora --quantization  --dataset alpaca_dataset --model_name /patht_of_model_folder/7B --output_dir Path/to/save/PEFT/model
+python -m llama_recipes.finetuning  --use_peft --peft_method lora --quantization  --dataset alpaca_dataset --model_name /patht_of_model_folder/7B --output_dir Path/to/save/PEFT/model
 
 
 # samsum_dataset
 
-python ../llama_finetuning.py  --use_peft --peft_method lora --quantization  --dataset samsum_dataset --model_name /patht_of_model_folder/7B --output_dir Path/to/save/PEFT/model
+python -m llama_recipes.finetuning  --use_peft --peft_method lora --quantization  --dataset samsum_dataset --model_name /patht_of_model_folder/7B --output_dir Path/to/save/PEFT/model
 
 ```
 
 ## Where to configure settings?
 
-* [Training config file](../configs/training.py) is the main config file that help to specify the settings for our run can be found in
+* [Training config file](../src/llama_recipes/configs/training.py) is the main config file that help to specify the settings for our run can be found in
 
 It let us specify the training settings, everything from `model_name` to `dataset_name`, `batch_size` etc. can be set here. Below is the list of supported settings:
 
@@ -82,6 +76,7 @@ model_name: str="PATH/to/LLAMA 2/7B"
 enable_fsdp: bool= False
 run_validation: bool=True
 batch_size_training: int=4
+gradient_accumulation_steps: int=1
 num_epochs: int=3
 num_workers_dataloader: int=2
 lr: float=2e-4
@@ -91,7 +86,6 @@ use_fp16: bool=False
 mixed_precision: bool=True
 val_batch_size: int=4
 dataset = "samsum_dataset" # alpaca_dataset,grammar_dataset
-micro_batch_size: int=1
 peft_method: str = "lora" # None , llama_adapter, prefix
 use_peft: bool=False
 output_dir: str = "./ft-output"
@@ -106,6 +100,6 @@ save_optimizer: bool=False
 
 ```
 
-* [Datasets config file](../configs/datasets.py) provides the available options for datasets.
+* [Datasets config file](../src/llama_recipes/configs/datasets.py) provides the available options for datasets.
 
-* [peft config file](../configs/peft.py) provides the supported PEFT methods and respective settings that can be modified.
+* [peft config file](../src/llama_recipes/configs/peft.py) provides the supported PEFT methods and respective settings that can be modified.
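
The config listings above replace `micro_batch_size` with `gradient_accumulation_steps`. As a generic illustration of what that parameter controls (this is not the repository's training loop), gradient accumulation scales the loss and only steps the optimizer every N micro-batches:

```python
# Generic PyTorch gradient-accumulation pattern; model, optimizer and dataloader
# are assumed to exist, with batches being dicts accepted by a HF causal LM.
def train_one_epoch(model, optimizer, dataloader, gradient_accumulation_steps=1):
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(dataloader):
        loss = model(**batch).loss / gradient_accumulation_steps  # average over micro-batches
        loss.backward()
        if (step + 1) % gradient_accumulation_steps == 0:
            optimizer.step()       # weight update only every N micro-batches
            optimizer.zero_grad()
```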

+ 34 - 0
examples/README.md

@@ -0,0 +1,34 @@
+# Examples
+
+This folder contains finetuning and inference examples for Llama 2.
+For the full documentation on these examples please refer to [docs/inference.md](../docs/inference.md)
+
+## Finetuning
+
+Please refer to the main [README.md](../README.md) for information on how to use the [finetuning.py](./finetuning.py) script.
+After installing the llama-recipes package through [pip](../README.md#installation) you can also invoke the finetuning in two ways:
+```
+python -m llama_recipes.finetuning <parameters>
+
+python examnples/finetuning.py <parameters>
+```
+Please see [README.md](../README.md) for details.
+
+## Inference 
+So far, we have provide the following inference examples:
+
+1. [inference script](./inference.py) script provides support for Hugging Face accelerate, PEFT and FSDP fine tuned models. It also demonstrates safety features to protect the user from toxic or harmful content.
+
+2. [vllm/inference.py](./vllm/inference.py) script takes advantage of vLLM's paged attention concept for low latency.
+
+3. The [hf_text_generation_inference](./hf_text_generation_inference/README.md) folder contains information on Hugging Face Text Generation Inference (TGI).
+
+4. A [chat completion](./chat_completion/chat_completion.py) example highlighting the handling of chat dialogs.
+
+5. [Code Llama](./code_llama/) folder which provides examples for [code completion](./code_llama/code_completion_example.py) and [code infilling](./code_llama/code_infilling_example.py).
+
+For more in depth information on inference including inference safety checks and examples, see the inference documentation [here](../docs/inference.md).
+
+**Note** The [sensitive topics safety checker](../src/llama_recipes/inference/safety_utils.py) utilizes AuditNLG which is an optional dependency. Please refer to installation section of the main [README.md](../README.md#install-with-optional-dependencies) for details.
+
+**Note** The **vLLM** example requires additional dependencies. Please refer to installation section of the main [README.md](../README.md#install-with-optional-dependencies) for details.

+ 9 - 10
inference/chat_completion.py

@@ -2,18 +2,17 @@
 # This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
 
 # from accelerate import init_empty_weights, load_checkpoint_and_dispatch
+
 import fire
-import torch
 import os
 import sys
-import warnings
-from typing import List
-
-from peft import PeftModel, PeftConfig
-from transformers import LlamaConfig, LlamaTokenizer, LlamaForCausalLM
-from safety_utils import get_safety_checker
-from model_utils import load_model, load_peft_model
-from chat_utils import read_dialogs_from_file, format_tokens
+
+import torch
+from transformers import LlamaTokenizer
+
+from llama_recipes.inference.chat_utils import read_dialogs_from_file, format_tokens
+from llama_recipes.inference.model_utils import load_model, load_peft_model
+from llama_recipes.inference.safety_utils import get_safety_checker
 from accelerate.utils import is_xpu_available
 
 def main(
@@ -114,7 +113,7 @@ def main(
             else:
                 tokens= tokens.to("cuda:0")
             outputs = model.generate(
-                tokens,
+                input_ids=tokens,
                 max_new_tokens=max_new_tokens,
                 do_sample=do_sample,
                 top_p=top_p,

inference/chats.json → examples/chat_completion/chats.json


+ 5 - 5
inference/code-llama/code_completion_example.py

@@ -4,16 +4,16 @@
 # from accelerate import init_empty_weights, load_checkpoint_and_dispatch
 
 import fire
-import torch
 import os
 import sys
 import time
-from typing import List
 
+import torch
 from transformers import AutoTokenizer
-sys.path.append("..")
-from safety_utils import get_safety_checker
-from model_utils import load_model, load_peft_model, load_llama_from_config
+
+from llama_recipes.inference.safety_utils import get_safety_checker
+from llama_recipes.inference.model_utils import load_model, load_peft_model
+
 
 def main(
     model_name,

inference/code-llama/code_completion_prompt.txt → examples/code_llama/code_completion_prompt.txt


+ 3 - 4
inference/code-llama/code_infilling_example.py

@@ -8,12 +8,11 @@ import torch
 import os
 import sys
 import time
-from typing import List
 
 from transformers import AutoTokenizer
-sys.path.append("..")
-from safety_utils import get_safety_checker
-from model_utils import load_model, load_peft_model, load_llama_from_config
+
+from llama_recipes.inference.safety_utils import get_safety_checker
+from llama_recipes.inference.model_utils import load_model, load_peft_model
 
 def main(
     model_name,

inference/code-llama/code_infilling_prompt.txt → examples/code_llama/code_infilling_prompt.txt


+ 5 - 3
configs/__init__.py

@@ -1,6 +1,8 @@
 # Copyright (c) Meta Platforms, Inc. and affiliates.
 # This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
 
-from .peft import lora_config, llama_adapter_config, prefix_config
-from .fsdp import fsdp_config
-from .training import train_config
+import fire
+from llama_recipes.finetuning import main
+
+if __name__ == "__main__":
+    fire.Fire(main)

+ 3 - 3
inference/hf-text-generation-inference/README.md

@@ -1,6 +1,6 @@
-# Serving a fine tuned LLaMA model with HuggingFace text-generation-inference server
+# Serving a fine tuned Llama model with HuggingFace text-generation-inference server
 
-This document shows how to serve a fine tuned LLaMA mode with HuggingFace's text-generation-inference server. This option is currently only available for models that were trained using the LoRA method or without using the `--use_peft` argument.
+This document shows how to serve a fine tuned Llama mode with HuggingFace's text-generation-inference server. This option is currently only available for models that were trained using the LoRA method or without using the `--use_peft` argument.
 
 ## Step 0: Merging the weights (Only required if LoRA method was used) 
 
@@ -9,7 +9,7 @@ In case the model was fine tuned with LoRA method we need to merge the weights o
 The script takes the base model, the peft weight folder as well as an output as arguments:
 
 ```
-python inference/hf-text-generation-inference/merge_lora_weights.py --base_model llama-7B --peft_model ft_output --output_dir data/merged_model_output
+python -m llama_recipes.inference.hf_text_generation_inference.merge_lora_weights --base_model llama-7B --peft_model ft_output --output_dir data/merged_model_output
 ```
 
 ## Step 1: Serving the model

inference/hf-text-generation-inference/merge_lora_weights.py → examples/hf_text_generation_inference/merge_lora_weights.py


+ 6 - 5
inference/inference.py

@@ -4,15 +4,16 @@
 # from accelerate import init_empty_weights, load_checkpoint_and_dispatch
 
 import fire
-import torch
 import os
 import sys
 import time
-from typing import List
 
+import torch
 from transformers import LlamaTokenizer
-from safety_utils import get_safety_checker
-from model_utils import load_model, load_peft_model, load_llama_from_config
+
+from llama_recipes.inference.safety_utils import get_safety_checker
+from llama_recipes.inference.model_utils import load_model, load_peft_model
+
 from accelerate.utils import is_xpu_available
 
 def main(
@@ -108,7 +109,7 @@ def main(
         batch = {k: v.to("xpu") for k, v in batch.items()}
     else:
         batch = {k: v.to("cuda") for k, v in batch.items()}
-   
+
     start = time.perf_counter()
     with torch.no_grad():
         outputs = model.generate(

+ 1 - 1
multi_node.slurm

@@ -32,5 +32,5 @@ export CUDA_LAUNCH_BLOCKING=0
 export NCCL_SOCKET_IFNAME="ens"
 export FI_EFA_USE_DEVICE_RDMA=1
 
-srun  torchrun --nproc_per_node 4 --rdzv_id $RANDOM --rdzv_backend c10d --rdzv_endpoint $head_node_ip:29500 llama_finetuning.py  --enable_fsdp --use_peft --peft_method lora
+srun  torchrun --nproc_per_node 4 --rdzv_id $RANDOM --rdzv_backend c10d --rdzv_endpoint $head_node_ip:29500 examples/finetuning.py  --enable_fsdp --use_peft --peft_method lora
 

quickstart.ipynb → examples/quickstart.ipynb


inference/samsum_prompt.txt → examples/samsum_prompt.txt


+ 2 - 9
inference/vLLM_inference.py

@@ -1,17 +1,9 @@
 # Copyright (c) Meta Platforms, Inc. and affiliates.
 # This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
 
-from accelerate import init_empty_weights, load_checkpoint_and_dispatch
 import fire
+
 import torch
-import os
-import sys
-from peft import PeftModel, PeftConfig
-from transformers import (
-    LlamaConfig,
-    LlamaTokenizer,
-    LlamaForCausalLM
-)
 from vllm import LLM
 from vllm import LLM, SamplingParams
 from accelerate.utils import is_xpu_available
@@ -20,6 +12,7 @@ if is_xpu_available():
     torch.xpu.manual_seed(42)
 else:
     torch.cuda.manual_seed(42)
+
 torch.manual_seed(42)
 
 def load_model(model_name, tp_size=1):
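
For reference, a minimal, generic vLLM usage sketch matching the imports above; the model path is a placeholder and this is not the repository's exact script:

```python
from vllm import LLM, SamplingParams

# Placeholder path; point this at a converted Hugging Face Llama 2 checkpoint.
llm = LLM(model="PATH/TO/MODEL/7B", tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.8, top_p=0.9, max_tokens=128)

outputs = llm.generate(["Summarize the following dialog: ..."], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```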

+ 0 - 6
ft_datasets/__init__.py

@@ -1,6 +0,0 @@
-# Copyright (c) Meta Platforms, Inc. and affiliates.
-# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
-
-from .grammar_dataset import get_dataset as get_grammar_dataset
-from .alpaca_dataset import InstructionDataset as get_alpaca_dataset
-from .samsum_dataset import get_preprocessed_samsum as get_samsum_dataset

+ 0 - 11
inference/README.md

@@ -1,11 +0,0 @@
-# Inference
-
-This folder contains inference examples for Llama 2. So far, we have provided support for three methods of inference:
-
-1. [inference script](inference.py) script provides support for Hugging Face accelerate, PEFT and FSDP fine tuned models.
-
-2. [vLLM_inference.py](vLLM_inference.py) script takes advantage of vLLM's paged attention concept for low latency.
-
-3. The [hf-text-generation-inference](hf-text-generation-inference/README.md) folder contains information on Hugging Face Text Generation Inference (TGI).
-
-For more in depth information on inference including inference safety checks and examples, see the inference documentation [here](../docs/inference.md).

+ 0 - 7
policies/__init__.py

@@ -1,7 +0,0 @@
-# Copyright (c) Meta Platforms, Inc. and affiliates.
-# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
-
-from .mixed_precision import *
-from .wrapping import *
-from .activation_checkpointing_functions import apply_fsdp_checkpointing
-from .anyprecision_optimizer import AnyPrecisionAdamW

+ 41 - 0
pyproject.toml

@@ -0,0 +1,41 @@
+[build-system]
+requires = ["hatchling", "hatch-requirements-txt"]
+build-backend = "hatchling.build"
+
+[project]
+name = "llama-recipes"
+version = "0.0.1"
+authors = [
+  { name="Hamid Shojanazeri", email="hamidnazeri@meta.com" },
+  { name="Matthias Reso", email="mreso@meta.com" },
+  { name="Geeta Chauhan", email="gchauhan@meta.com" },
+]
+description = "Llama-recipes is a companion project to the Llama 2 model. It's goal is to provide examples to quickly get started with fine-tuning for domain adaptation and how to run inference for the fine-tuned models. "
+readme = "README.md"
+requires-python = ">=3.8"
+classifiers = [
+    "Programming Language :: Python :: 3",
+    "License :: Other/Proprietary License",
+    "Operating System :: OS Independent",
+]
+dynamic = ["dependencies"]
+
+[project.optional-dependencies]
+vllm = ["vllm"]
+tests = ["pytest-mock"]
+auditnlg = ["auditnlg"]
+
+[project.urls]
+"Homepage" = "https://github.com/facebookresearch/llama-recipes/"
+"Bug Tracker" = "https://github.com/facebookresearch/llama-recipes/issues"
+
+[tool.hatch.build]
+exclude = [
+  "dist/*",
+]
+
+[tool.hatch.build.targets.wheel]
+packages = ["src/llama_recipes"]
+
+[tool.hatch.metadata.hooks.requirements_txt]
+files = ["requirements.txt"]

+ 4 - 5
requirements.txt

@@ -1,16 +1,15 @@
--f https://download.pytorch.org/whl/torch_stable.html 
-torch==2.0.1+cu118
+torch>=2.0.1
 accelerate
 appdirs
 loralib
-bitsandbytes==0.39.1
+bitsandbytes
 black
 black[jupyter]
 datasets
 fire
-git+https://github.com/huggingface/peft.git
+peft
 transformers>=4.31.0
 sentencepiece
 py7zr
 scipy
-optimum
+optimum

+ 27 - 0
scripts/spellcheck_conf/wordlist.txt

@@ -1076,6 +1076,7 @@ lora
 peft
 samsum
 vLLM
+vllm
 TGI
 vLLM
 vLLM's
@@ -1121,3 +1122,29 @@ summarization
 xA
 Sanitization
 tokenization
+hatchling
+setuptools
+BoolQ
+CausalLM
+Dyck
+GSM
+HellaSwag
+HumanEval
+MMLU
+NarrativeQA
+NaturalQuestions
+OpenbookQA
+PREPROC
+QuAC
+TruthfulQA
+WinoGender
+bAbI
+dataclass
+datafiles
+davinci
+GPU's
+HuggingFace's
+LoRA
+bitsandbytes
+CLA
+dialogs

+ 6 - 0
src/llama_recipes/configs/__init__.py

@@ -0,0 +1,6 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
+
+from llama_recipes.configs.peft import lora_config, llama_adapter_config, prefix_config
+from llama_recipes.configs.fsdp import fsdp_config
+from llama_recipes.configs.training import train_config

+ 3 - 3
configs/datasets.py

@@ -15,8 +15,8 @@ class samsum_dataset:
 @dataclass
 class grammar_dataset:
     dataset: str = "grammar_dataset"
-    train_split: str = "ft_datasets/grammar_dataset/gtrain_10k.csv" 
-    test_split: str = "ft_datasets/grammar_dataset/grammar_validation.csv"
+    train_split: str = "src/llama_recipes/datasets/grammar_dataset/gtrain_10k.csv" 
+    test_split: str = "src/llama_recipes/datasets/grammar_dataset/grammar_validation.csv"
     input_length: int = 2048
 
     
@@ -25,4 +25,4 @@ class alpaca_dataset:
     dataset: str = "alpaca_dataset"
     train_split: str = "train"
     test_split: str = "val"
-    data_path: str = "ft_datasets/alpaca_data.json"
+    data_path: str = "src/llama_recipes/datasets/alpaca_data.json"

+ 2 - 2
configs/fsdp.py

@@ -1,8 +1,8 @@
 # Copyright (c) Meta Platforms, Inc. and affiliates.
 # This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
 
-from dataclasses import dataclass, field
-from typing import ClassVar
+from dataclasses import dataclass
+
 from torch.distributed.fsdp import ShardingStrategy
 from torch.distributed.fsdp.fully_sharded_data_parallel import StateDictType
 

+ 1 - 1
configs/peft.py

@@ -1,7 +1,7 @@
 # Copyright (c) Meta Platforms, Inc. and affiliates.
 # This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
 
-from dataclasses import dataclass, field
+from dataclasses import dataclass
 from typing import ClassVar, List
 
 @dataclass

+ 2 - 2
configs/training.py

@@ -1,7 +1,7 @@
 # Copyright (c) Meta Platforms, Inc. and affiliates.
 # This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
+
 from dataclasses import dataclass
-from typing import ClassVar
 
 
 @dataclass
@@ -11,6 +11,7 @@ class train_config:
     low_cpu_fsdp: bool=False
     run_validation: bool=True
     batch_size_training: int=4
+    gradient_accumulation_steps: int=1
     num_epochs: int=3
     num_workers_dataloader: int=1
     lr: float=1e-4
@@ -21,7 +22,6 @@ class train_config:
     mixed_precision: bool=True
     val_batch_size: int=1
     dataset = "samsum_dataset"
-    micro_batch_size: int=4
     peft_method: str = "lora" # None , llama_adapter, prefix
     use_peft: bool=False
     output_dir: str = "PATH/to/save/PEFT/model"

+ 6 - 0
src/llama_recipes/datasets/__init__.py

@@ -0,0 +1,6 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
+
+from llama_recipes.datasets.grammar_dataset.grammar_dataset import get_dataset as get_grammar_dataset
+from llama_recipes.datasets.alpaca_dataset import InstructionDataset as get_alpaca_dataset
+from llama_recipes.datasets.samsum_dataset import get_preprocessed_samsum as get_samsum_dataset

+ 2 - 4
ft_datasets/alpaca_dataset.py

@@ -5,12 +5,10 @@
 
 import copy
 import json
-import os
-import torch
 
-from sentencepiece import SentencePieceProcessor
+import torch
 from torch.utils.data import Dataset
-from typing import List
+
 
 PROMPT_DICT = {
     "prompt_input": (

+ 0 - 1
ft_datasets/grammar_dataset/__init__.py

@@ -1,4 +1,3 @@
 # Copyright (c) Meta Platforms, Inc. and affiliates.
 # This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
 
-from .grammar_dataset import get_dataset

+ 2 - 18
ft_datasets/grammar_dataset/grammar_dataset.py

@@ -4,29 +4,13 @@
 # For dataset details visit: https://huggingface.co/datasets/jfleg
 # For download and preparation see: src/llama_recipes/datasets/grammar_dataset/grammar_dataset_process.ipynb
 
-import argparse
-import csv
-import glob
-import os
-import json
-import time
-import logging
-import random
-import re
-from itertools import chain
-from string import punctuation
-
-
-import pandas as pd
-import numpy as np
-import torch
-from torch.utils.data import Dataset
 
 from datasets import load_dataset
 from pathlib import Path
 
-from ft_datasets.utils import ConcatDataset
+from torch.utils.data import Dataset
 
+from llama_recipes.datasets.utils import ConcatDataset
 
 
 class grammar(Dataset):

ft_datasets/grammar_dataset/grammar_dataset_process.ipynb → src/llama_recipes/datasets/grammar_dataset/grammar_dataset_process.ipynb


+ 2 - 1
ft_datasets/samsum_dataset.py

@@ -4,7 +4,8 @@
 # For dataset details visit: https://huggingface.co/datasets/samsum
 
 import datasets
-from .utils import Concatenator
+
+from llama_recipes.datasets.utils import Concatenator
 
 def get_preprocessed_samsum(dataset_config, tokenizer, split):
     dataset = datasets.load_dataset("samsum", split=split)

+ 1 - 0
ft_datasets/utils.py

@@ -3,6 +3,7 @@
 
 from tqdm import tqdm
 from itertools import chain
+
 from torch.utils.data import Dataset
 
 class Concatenator(object):

+ 17 - 16
llama_finetuning.py

@@ -2,13 +2,13 @@
 # This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
 
 import os
+from pkg_resources import packaging
 
 import fire
 import torch
 import torch.distributed as dist
 import torch.optim as optim
 from peft import get_peft_model, prepare_model_for_int8_training
-from pkg_resources import packaging
 from torch.distributed.fsdp import (
     FullyShardedDataParallel as FSDP,
 )
@@ -22,19 +22,18 @@ from transformers import (
 )
 from transformers.models.llama.modeling_llama import LlamaDecoderLayer
 
-import policies
-from configs import fsdp_config, train_config
-from policies import AnyPrecisionAdamW
+from llama_recipes.configs import fsdp_config, train_config
+from llama_recipes.policies import AnyPrecisionAdamW, apply_fsdp_checkpointing
 
-from utils import fsdp_auto_wrap_policy
-from utils.config_utils import (
+from llama_recipes.utils import fsdp_auto_wrap_policy
+from llama_recipes.utils.config_utils import (
     update_config,
     generate_peft_config,
     generate_dataset_config,
 )
-from utils.dataset_utils import get_preprocessed_dataset
+from llama_recipes.utils.dataset_utils import get_preprocessed_dataset
 
-from utils.train_utils import (
+from llama_recipes.utils.train_utils import (
     train,
     freeze_transformer_layers,
     setup,
@@ -65,16 +64,14 @@ def main(**kwargs):
 
     if torch.distributed.is_initialized():
         if is_xpu_available():
-            torch.xpu.set_device(rank)
+            torch.xpu.set_device(local_rank)
         else:
-            torch.cuda.set_device(rank)
-        clear_gpu_cache(rank)
+            torch.cuda.set_device(local_rank)
+        clear_gpu_cache(local_rank)
         setup_environ_flags(rank)
 
-    # Calculate gradient accumulation steps
-    gradient_accumulation_steps = train_config.batch_size_training // train_config.micro_batch_size
-
     # Load the pre-trained model and setup its configuration
+    use_cache = False if train_config.enable_fsdp else None
     if train_config.enable_fsdp and train_config.low_cpu_fsdp:
         """
         for FSDP, we can save cpu memory by loading pretrained model on rank0 only.
@@ -92,9 +89,11 @@ def main(**kwargs):
                 train_config.model_name,
                 load_in_8bit=True if train_config.quantization else None,
                 device_map="auto" if train_config.quantization else None,
+                use_cache=use_cache,
             )
         else:
             llama_config = LlamaConfig.from_pretrained(train_config.model_name)
+            llama_config.use_cache = use_cache
             with torch.device("meta"):
                 model = LlamaForCausalLM(llama_config)
 
@@ -103,6 +102,7 @@ def main(**kwargs):
             train_config.model_name,
             load_in_8bit=True if train_config.quantization else None,
             device_map="auto" if train_config.quantization else None,
+            use_cache=use_cache,
         )
     if train_config.enable_fsdp and train_config.use_fast_kernels:
         """
@@ -159,7 +159,7 @@ def main(**kwargs):
             if train_config.low_cpu_fsdp and rank != 0 else None,
         )
         if fsdp_config.fsdp_activation_checkpointing:
-            policies.apply_fsdp_checkpointing(model)
+            apply_fsdp_checkpointing(model)
     elif not train_config.quantization and not train_config.enable_fsdp:
         if is_xpu_available():
             model.to("xpu:0")
@@ -213,6 +213,7 @@ def main(**kwargs):
         collate_fn=default_data_collator,
     )
 
+    eval_dataloader = None
     if train_config.run_validation:
         eval_dataloader = torch.utils.data.DataLoader(
             dataset_val,
@@ -249,7 +250,7 @@ def main(**kwargs):
         tokenizer,
         optimizer,
         scheduler,
-        gradient_accumulation_steps,
+        train_config.gradient_accumulation_steps,
         train_config,
         fsdp_config if train_config.enable_fsdp else None,
         local_rank if train_config.enable_fsdp else None,

+ 2 - 0
src/llama_recipes/inference/__init__.py

@@ -0,0 +1,2 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.

+ 2 - 1
inference/chat_utils.py

@@ -1,8 +1,9 @@
 # Copyright (c) Meta Platforms, Inc. and affiliates.
 # This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
 
-from typing import List, Literal, Optional, Tuple, TypedDict, Union
 import json
+from typing import List, Literal, TypedDict
+
 
 Role = Literal["user", "assistant"]
 

+ 4 - 2
inference/checkpoint_converter_fsdp_hf.py

@@ -4,12 +4,14 @@
 # from accelerate import init_empty_weights, load_checkpoint_and_dispatch
 
 import fire
-import torch
 import os
 import sys
 import yaml
+
 from transformers import LlamaTokenizer
-from model_utils import  load_llama_from_config
+
+from llama_recipes.inference.model_utils import  load_llama_from_config
+
 # Get the current file's directory
 current_directory = os.path.dirname(os.path.abspath(__file__))
 

inference/model_utils.py → src/llama_recipes/inference/model_utils.py


+ 0 - 2
inference/safety_utils.py

@@ -5,8 +5,6 @@ import os
 import torch
 import warnings
 
-from peft import PeftConfig
-from transformers import LlamaConfig, LlamaTokenizer, LlamaForCausalLM
 
 # Class for performing safety checks using AuditNLG library
 class AuditNLGSensitiveTopics(object):

+ 1 - 1
model_checkpointing/__init__.py

@@ -1,7 +1,7 @@
 # Copyright (c) Meta Platforms, Inc. and affiliates.
 # This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
 
-from .checkpoint_handler import (
+from llama_recipes.model_checkpointing.checkpoint_handler import (
     load_model_checkpoint,
     save_model_checkpoint,
     load_optimizer_checkpoint,

model_checkpointing/checkpoint_handler.py → src/llama_recipes/model_checkpointing/checkpoint_handler.py


+ 7 - 0
src/llama_recipes/policies/__init__.py

@@ -0,0 +1,7 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
+
+from llama_recipes.policies.mixed_precision import *
+from llama_recipes.policies.wrapping import *
+from llama_recipes.policies.activation_checkpointing_functions import apply_fsdp_checkpointing
+from llama_recipes.policies.anyprecision_optimizer import AnyPrecisionAdamW

+ 2 - 6
policies/activation_checkpointing_functions.py

@@ -1,18 +1,14 @@
 # Copyright (c) Meta Platforms, Inc. and affiliates.
 # This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
 
-import torch
-import os
-import torch.distributed as dist
+from functools import partial
+
 from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
     checkpoint_wrapper,
     CheckpointImpl,
     apply_activation_checkpointing,
 )
-
-from transformers.models.t5.modeling_t5 import T5Block
 from transformers.models.llama.modeling_llama import LlamaDecoderLayer
-from functools import partial
 
 non_reentrant_wrapper = partial(
     checkpoint_wrapper,

policies/anyprecision_optimizer.py → src/llama_recipes/policies/anyprecision_optimizer.py


+ 0 - 4
policies/mixed_precision.py

@@ -4,11 +4,7 @@
 import torch
 
 from torch.distributed.fsdp import (
-    # FullyShardedDataParallel as FSDP,
-    # CPUOffload,
     MixedPrecision,
-    # BackwardPrefetch,
-    # ShardingStrategy,
 )
 
 # requires grad scaler in main loop

+ 1 - 15
policies/wrapping.py

@@ -1,28 +1,14 @@
 # Copyright (c) Meta Platforms, Inc. and affiliates.
 # This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
 
-import torch.distributed as dist
-import torch.nn as nn
-import torch
+import functools
 
 from transformers.models.llama.modeling_llama import LlamaDecoderLayer
-
-from torch.distributed.fsdp.fully_sharded_data_parallel import (
-    FullyShardedDataParallel as FSDP,
-    CPUOffload,
-    BackwardPrefetch,
-    MixedPrecision,
-)
 from torch.distributed.fsdp.wrap import (
     transformer_auto_wrap_policy,
     size_based_auto_wrap_policy,
-    enable_wrap,
-    wrap,
 )
 
-import functools
-from typing import Type
-
 
 def get_size_policy(min_params=1e8):
     num_wrap_policy = functools.partial(

+ 7 - 0
src/llama_recipes/utils/__init__.py

@@ -0,0 +1,7 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
+
+from llama_recipes.utils.memory_utils import MemoryTrace
+from llama_recipes.utils.dataset_utils import *
+from llama_recipes.utils.fsdp_utils import fsdp_auto_wrap_policy
+from llama_recipes.utils.train_utils import *

+ 3 - 3
utils/config_utils.py

@@ -3,15 +3,15 @@
 
 import inspect
 from dataclasses import fields
+
 from peft import (
     LoraConfig,
     AdaptionPromptConfig,
     PrefixTuningConfig,
 )
 
-import configs.datasets as datasets
-from configs import lora_config, llama_adapter_config, prefix_config, train_config
-from .dataset_utils import DATASET_PREPROC
+from llama_recipes.configs import datasets, lora_config, llama_adapter_config, prefix_config, train_config
+from llama_recipes.utils.dataset_utils import DATASET_PREPROC
 
 
 def update_config(config, **kwargs):

+ 3 - 4
utils/dataset_utils.py

@@ -1,16 +1,15 @@
 # Copyright (c) Meta Platforms, Inc. and affiliates.
 # This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
 
-import torch
-
 from functools import partial
 
-from ft_datasets import (
+import torch
+
+from llama_recipes.datasets import (
     get_grammar_dataset,
     get_alpaca_dataset,
     get_samsum_dataset,
 )
-from typing import Optional
 
 
 DATASET_PREPROC = {

+ 0 - 3
utils/fsdp_utils.py

@@ -3,10 +3,7 @@
 
 def fsdp_auto_wrap_policy(model, transformer_layer_name):
     import functools
-    import os
 
-    from accelerate import FullyShardedDataParallelPlugin
-    from transformers.models.t5.modeling_t5 import T5Block
     from torch.distributed.fsdp.wrap import _or_policy, lambda_auto_wrap_policy, transformer_auto_wrap_policy
 
     from peft.tuners import PrefixEncoder, PromptEmbedding, PromptEncoder

+ 81 - 0
src/llama_recipes/utils/memory_utils.py

@@ -0,0 +1,81 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
+
+import gc
+import psutil
+import threading
+
+import torch
+from accelerate.utils import is_xpu_available
+
+def byte2gb(x):
+    return int(x / 2**30)
+# This context manager is used to track the peak memory usage of the process
+class MemoryTrace:
+    def __enter__(self):
+        gc.collect()
+        if is_xpu_available():
+            torch.xpu.empty_cache()
+            torch.xpu.reset_max_memory_allocated()   # reset the peak gauge to zero
+            self.begin = byte2gb(torch.xpu.memory_allocated())
+        elif torch.cuda.is_available():
+            torch.cuda.empty_cache()
+            torch.cuda.reset_max_memory_allocated()  # reset the peak gauge to zero
+            self.begin = byte2gb(torch.cuda.memory_allocated())
+        self.process = psutil.Process()
+        self.cpu_begin = byte2gb(self.cpu_mem_used())
+        self.peak_monitoring = True
+        peak_monitor_thread = threading.Thread(target=self.peak_monitor_func)
+        peak_monitor_thread.daemon = True
+        peak_monitor_thread.start()
+        return self
+
+    def cpu_mem_used(self):
+        """get resident set size memory for the current process"""
+        return self.process.memory_info().rss
+
+    def peak_monitor_func(self):
+        self.cpu_peak = -1
+
+        while True:
+            self.cpu_peak = max(self.cpu_mem_used(), self.cpu_peak)
+
+            # can't sleep or will not catch the peak right (this comment is here on purpose)
+            # time.sleep(0.001) # 1msec
+
+            if not self.peak_monitoring:
+                break
+
+    def __exit__(self, *exc):
+        self.peak_monitoring = False
+
+        gc.collect()
+        if is_xpu_available():
+            torch.xpu.empty_cache()
+            self.end = byte2gb(torch.xpu.memory_allocated())
+            self.peak = byte2gb(torch.xpu.max_memory_allocated())
+            xpu_info = torch.xpu.memory_stats()
+            self.peak_active_gb = byte2gb(xpu_info["active_bytes.all.peak"])
+            self.xpu_malloc_retires = xpu_info.get("num_alloc_retries", 0)
+            self.m_xpu_ooms = xpu_info.get("num_ooms", 0)
+            self.used = byte2gb(self.end - self.begin)
+            self.peaked = byte2gb(self.peak - self.begin)
+            self.max_reserved = byte2gb(torch.xpu.max_memory_reserved())
+        else:
+            torch.cuda.empty_cache()
+            self.end = byte2gb(torch.cuda.memory_allocated())
+            self.peak = byte2gb(torch.cuda.max_memory_allocated())
+            cuda_info = torch.cuda.memory_stats()
+            self.peak_active_gb = byte2gb(cuda_info["active_bytes.all.peak"])
+            self.cuda_malloc_retires = cuda_info.get("num_alloc_retries", 0)
+            self.m_cuda_ooms = cuda_info.get("num_ooms", 0)
+            self.used = byte2gb(self.end - self.begin)
+            self.peaked = byte2gb(self.peak - self.begin)
+            self.max_reserved = byte2gb(torch.cuda.max_memory_reserved())
+
+        self.cpu_end = self.cpu_mem_used()
+        self.cpu_used = byte2gb(self.cpu_end - self.cpu_begin)
+        self.cpu_peaked = byte2gb(self.cpu_peak - self.cpu_begin)
+        # print(f"delta used/peak {self.used:4d}/{self.peaked:4d}")

+ 29 - 42
utils/train_utils.py

@@ -2,40 +2,26 @@
 # This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
 
 import os
-import sys
-from typing import List
-import yaml
 import time
+import yaml
+from pathlib import Path
+from pkg_resources import packaging
+
 
-import fire
 import torch
-import transformers
-from datasets import load_dataset
-from tqdm import tqdm
-"""
-Unused imports:
-import torch.nn as nn
-import bitsandbytes as bnb
-"""
-from torch.nn import functional as F
-from peft import (
-    LoraConfig,
-    get_peft_model,
-    get_peft_model_state_dict,
-    prepare_model_for_int8_training,
-    set_peft_model_state_dict,
-)
-from transformers import LlamaForCausalLM, LlamaTokenizer
-from torch.distributed.fsdp import StateDictType
-import torch.distributed as dist
-from pkg_resources import packaging
-from .memory_utils import MemoryTrace
-import model_checkpointing
 import torch.cuda.nccl as nccl
+import torch.distributed as dist
+from torch.distributed.fsdp import StateDictType
 from torch.distributed.fsdp.sharded_grad_scaler import ShardedGradScaler
-from pathlib import Path
-sys.path.append(str(Path(__file__).resolve().parent.parent))
-from policies import bfSixteen, fpSixteen,bfSixteen_mixed, get_llama_wrapper
+from tqdm import tqdm
+from transformers import LlamaTokenizer
+
+
+from llama_recipes.model_checkpointing import save_model_checkpoint, save_model_and_optimizer_sharded, save_optimizer_checkpoint
+from llama_recipes.policies import fpSixteen,bfSixteen_mixed, get_llama_wrapper
+from llama_recipes.utils.memory_utils import MemoryTrace
+
 from accelerate.utils import is_xpu_available, is_ccl_available
 
 def set_tokenizer_params(tokenizer: LlamaTokenizer):
@@ -84,7 +70,9 @@ def train(model, train_dataloader,eval_dataloader, tokenizer, optimizer, lr_sche
         with MemoryTrace() as memtrace:  # track the memory usage
             model.train()
             total_loss = 0.0
-            for step, batch in enumerate(tqdm(train_dataloader,colour="blue", desc=f"Training Epoch{epoch}")):
+            total_length = len(train_dataloader)//gradient_accumulation_steps
+            pbar = tqdm(colour="blue", desc=f"Training Epoch: {epoch}", total=total_length)
+            for step, batch in enumerate(train_dataloader):
                 for key in batch.keys():
                     if train_config.enable_fsdp:
                         batch[key] = batch[key].to(local_rank)
@@ -103,17 +91,17 @@ def train(model, train_dataloader,eval_dataloader, tokenizer, optimizer, lr_sche
                         scaler.step(optimizer)
                         scaler.update()
                         optimizer.zero_grad()
+                        pbar.update(1)
                 else:
                     # regular backpropagation when fp16 is not used
                     loss.backward()
                     if (step + 1) % gradient_accumulation_steps == 0 or step == len(train_dataloader) - 1:
                         optimizer.step()
                         optimizer.zero_grad()
-                if train_config.enable_fsdp:
-                    if rank==0:       
-                        print(f"\n step {step} is completed and loss is {loss.detach().float()}")
-                else:
-                    print(f"\n step {step} is completed and loss is {loss.detach().float()}")
+                        pbar.update(1)
+                
+                pbar.set_description(f"Training Epoch: {epoch}/{train_config.num_epochs}, step {step}/{len(train_dataloader)} completed (loss: {loss.detach().float()})")
+                
         epoch_end_time = time.perf_counter()-epoch_start_time
         epoch_times.append(epoch_end_time)    
         # Reducing total_loss across all devices if there's more than one CUDA device
@@ -180,21 +168,21 @@ def train(model, train_dataloader,eval_dataloader, tokenizer, optimizer, lr_sche
                 else:
                     if not train_config.use_peft and fsdp_config.checkpoint_type == StateDictType.FULL_STATE_DICT:
                         
-                        model_checkpointing.save_model_checkpoint(
+                        save_model_checkpoint(
                             model, optimizer, rank, train_config, epoch=epoch
                         )
                     elif not train_config.use_peft and fsdp_config.checkpoint_type == StateDictType.SHARDED_STATE_DICT:
                         print(" Saving the FSDP model checkpoints using SHARDED_STATE_DICT")
                         print("=====================================================")
                         
-                        model_checkpointing.save_model_and_optimizer_sharded(model, rank, train_config)
+                        save_model_and_optimizer_sharded(model, rank, train_config)
                         if train_config.save_optimizer:
-                            model_checkpointing.save_model_and_optimizer_sharded(model, rank, train_config, optim=optimizer)
+                            save_model_and_optimizer_sharded(model, rank, train_config, optim=optimizer)
                             print(" Saving the FSDP model checkpoints and optimizer using SHARDED_STATE_DICT")
                             print("=====================================================")
 
                     if not train_config.use_peft and  train_config.save_optimizer:
-                        model_checkpointing.save_optimizer_checkpoint(
+                        save_optimizer_checkpoint(
                             model, optimizer, rank, train_config, epoch=epoch
                         )
                         print(" Saving the FSDP model checkpoints and optimizer using FULL_STATE_DICT")
@@ -212,14 +200,13 @@ def train(model, train_dataloader,eval_dataloader, tokenizer, optimizer, lr_sche
                     print(f"best eval loss on epoch {epoch} is {best_val_loss}")
             val_loss.append(best_val_loss)
             val_prep.append(eval_ppl)
-        
         if train_config.enable_fsdp:
             if rank==0:
                 print(f"Epoch {epoch+1}: train_perplexity={train_perplexity:.4f}, train_epoch_loss={train_epoch_loss:.4f}, epcoh time {epoch_end_time}s")
         else:
             print(f"Epoch {epoch+1}: train_perplexity={train_perplexity:.4f}, train_epoch_loss={train_epoch_loss:.4f}, epcoh time {epoch_end_time}s")
-    avg_epoch_time = sum(epoch_times)/ len(epoch_times) 
-    avg_checkpoint_time = sum(checkpoint_times)/ len(checkpoint_times)   
+    avg_epoch_time = sum(epoch_times)/ len(epoch_times)
+    avg_checkpoint_time = sum(checkpoint_times)/ len(checkpoint_times) if len(checkpoint_times) > 0 else 0
     avg_train_prep = sum(train_prep)/len(train_prep)
     avg_train_loss = sum(train_loss)/len(train_loss)
     if train_config.run_validation:
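The loop above follows the standard gradient-accumulation pattern: scale the loss by `1/gradient_accumulation_steps`, accumulate gradients across batches, and step the optimizer every N batches or on the last batch. A minimal, self-contained sketch of that pattern with a toy model (all values illustrative):

```python
# Toy gradient-accumulation loop mirroring the stepping condition used above.
import torch

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
train_dataloader = [(torch.randn(2, 4), torch.randn(2, 1)) for _ in range(5)]
gradient_accumulation_steps = 2

for step, (x, y) in enumerate(train_dataloader):
    # scale so that accumulated gradients approximate the full-batch average
    loss = torch.nn.functional.mse_loss(model(x), y) / gradient_accumulation_steps
    loss.backward()
    if (step + 1) % gradient_accumulation_steps == 0 or step == len(train_dataloader) - 1:
        optimizer.step()
        optimizer.zero_grad()
```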

+ 72 - 0
tests/test_finetuning.py

@@ -0,0 +1,72 @@
+from unittest.mock import patch
+import importlib
+
+from torch.utils.data.dataloader import DataLoader
+
+from llama_recipes.finetuning import main
+
+@patch('llama_recipes.finetuning.train')
+@patch('llama_recipes.finetuning.LlamaForCausalLM.from_pretrained')
+@patch('llama_recipes.finetuning.LlamaTokenizer.from_pretrained')
+@patch('llama_recipes.finetuning.get_preprocessed_dataset')
+@patch('llama_recipes.finetuning.optim.AdamW')
+@patch('llama_recipes.finetuning.StepLR')
+def test_finetuning_no_validation(step_lr, optimizer, get_dataset, tokenizer, get_model, train):
+    kwargs = {"run_validation": False}
+    
+    get_dataset.return_value = [1]
+    
+    main(**kwargs)
+    
+    assert train.call_count == 1
+    
+    args, kwargs = train.call_args
+    train_dataloader = args[1]
+    eval_dataloader = args[2]
+    
+    assert isinstance(train_dataloader, DataLoader)
+    assert eval_dataloader is None
+    
+    assert get_model.return_value.to.call_args.args[0] == "cuda"
+    
+    
+@patch('llama_recipes.finetuning.train')
+@patch('llama_recipes.finetuning.LlamaForCausalLM.from_pretrained')
+@patch('llama_recipes.finetuning.LlamaTokenizer.from_pretrained')
+@patch('llama_recipes.finetuning.get_preprocessed_dataset')
+@patch('llama_recipes.finetuning.optim.AdamW')
+@patch('llama_recipes.finetuning.StepLR')
+def test_finetuning_with_validation(step_lr, optimizer, get_dataset, tokenizer, get_model, train):
+    kwargs = {"run_validation": True}
+    get_dataset.return_value = [1]
+    
+    main(**kwargs)
+    
+    assert train.call_count == 1
+    
+    args, kwargs = train.call_args
+    train_dataloader = args[1]
+    eval_dataloader = args[2]
+    assert isinstance(train_dataloader, DataLoader)
+    assert isinstance(eval_dataloader, DataLoader)
+    
+    assert get_model.return_value.to.call_args.args[0] == "cuda"
+    
+    
+@patch('llama_recipes.finetuning.train')
+@patch('llama_recipes.finetuning.LlamaForCausalLM.from_pretrained')
+@patch('llama_recipes.finetuning.LlamaTokenizer.from_pretrained')
+@patch('llama_recipes.finetuning.get_preprocessed_dataset')
+@patch('llama_recipes.finetuning.generate_peft_config')
+@patch('llama_recipes.finetuning.get_peft_model')
+@patch('llama_recipes.finetuning.optim.AdamW')
+@patch('llama_recipes.finetuning.StepLR')
+def test_finetuning_peft(step_lr, optimizer, get_peft_model, gen_peft_config, get_dataset, tokenizer, get_model, train):
+    kwargs = {"use_peft": True}
+    
+    get_dataset.return_value = [1]
+    
+    main(**kwargs)
+    
+    assert get_peft_model.return_value.to.call_args.args[0] == "cuda"
+    assert get_peft_model.return_value.print_trainable_parameters.call_count == 1

+ 48 - 0
tests/test_train_utils.py

@@ -0,0 +1,48 @@
+import torch
+
+from llama_recipes.utils.train_utils import train
+
+def test_gradient_accumulation(mocker):
+    
+    model = mocker.MagicMock(name="model")
+    model().loss.__truediv__().detach.return_value = torch.tensor(1)
+    batch = {"input": torch.zeros(1)}
+    train_dataloader = [batch, batch, batch, batch, batch]
+    eval_dataloader = None
+    tokenizer = mocker.MagicMock()
+    optimizer = mocker.MagicMock()
+    lr_scheduler = mocker.MagicMock()
+    gradient_accumulation_steps = 1
+    train_config = mocker.MagicMock()
+    train_config.enable_fsdp = False
+    train_config.use_fp16 = False
+    train_config.run_validation = False
+    
+    train(
+        model,
+        train_dataloader,
+        eval_dataloader,
+        tokenizer,
+        optimizer,
+        lr_scheduler,
+        gradient_accumulation_steps,
+        train_config,
+    )
+    
+    assert optimizer.zero_grad.call_count == 5
+    optimizer.zero_grad.reset_mock()
+    
+    gradient_accumulation_steps = 2
+    train(
+        model,
+        train_dataloader,
+        eval_dataloader,
+        tokenizer,
+        optimizer,
+        lr_scheduler,
+        gradient_accumulation_steps,
+        train_config,
+    )
+    assert optimizer.zero_grad.call_count == 3
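For reference, the asserted call counts follow directly from the stepping condition in the training loop; a small illustrative check (not part of the test suite):

```python
# With 5 batches, the optimizer steps when (step + 1) % gas == 0 or on the last batch.
batches = 5

def expected_optimizer_steps(gradient_accumulation_steps):
    return sum(
        1
        for step in range(batches)
        if (step + 1) % gradient_accumulation_steps == 0 or step == batches - 1
    )

assert expected_optimizer_steps(1) == 5
assert expected_optimizer_steps(2) == 3   # after batches 2 and 4, plus the final batch
```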

+ 0 - 7
utils/__init__.py

@@ -1,7 +0,0 @@
-# Copyright (c) Meta Platforms, Inc. and affiliates.
-# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
-
-from .memory_utils import MemoryTrace
-from .dataset_utils import *
-from .fsdp_utils import fsdp_auto_wrap_policy
-from .train_utils import *