
update to main

Hamid Shojanazeri, 1 year ago
commit a7fe0396c5

+ 4 - 4
README.md

@@ -11,7 +11,7 @@ Llama 2 is a new technology that carries potential risks with use. Testing condu
     - [Single GPU](#single-gpu)
     - [Multi GPU One Node](#multiple-gpus-one-node)
     - [Multi GPU Multi Node](#multi-gpu-multi-node)
-3. [Inference](./inference/inference.md)
+3. [Inference](./docs/inference.md)
 4. [Model Conversion](#model-conversion-to-hugging-face)
 5. [Repository Organization](#repository-organization)
 6. [License and Acceptable Use Policy](#license)
@@ -22,7 +22,7 @@ Llama 2 is a new technology that carries potential risks with use. Testing condu
 
 [Llama 2 Jupyter Notebook](quickstart.ipynb): This Jupyter notebook steps you through how to finetune a Llama 2 model on the text summarization task using the [samsum](https://huggingface.co/datasets/samsum) dataset. The notebook uses parameter-efficient finetuning (PEFT) and int8 quantization to finetune a 7B model on a single GPU such as an A10 with 24GB of GPU memory.
 
-**Note** All the setting defined in [config files](./configs/) can be passed as args through CLI when running the sctipt, there is no need to change from config files directly.
+**Note** All the settings defined in the [config files](./configs/) can be passed as args through the CLI when running the script; there is no need to change the config files directly.
 
 **Note** If you need to run a PEFT model with FSDP, please make sure to use the PyTorch Nightlies.
 
@@ -32,7 +32,7 @@ Llama 2 is a new technology that carries potential risks with use. Testing condu
 * [Multi-GPU Fine-tuning](./docs/mutli_gpu.md)
 * [LLM Fine-tuning](./docs/LLM_finetuning.md)
 * [Adding custom datasets](./docs/Dataset.md)
-* [Inference](./inference/inference.md)
+* [Inference](./docs/inference.md)
 * [FAQs](./docs/FAQ.md)
 
 ## Requirements
@@ -62,7 +62,7 @@ All the parameters in the examples and recipes below need to be further tuned to
 
 * Make sure to set the right path to the model in the [training config](./configs/training.py).
 
-### Single GPU :
+### Single GPU:
 
 ```bash
 #if running on multi-gpu machine

+ 1 - 1
docs/Dataset.md

@@ -62,7 +62,7 @@ Below we list other datasets and their main use cases that can be used for fine
 
 ### Bias evaluation
 - [Crows_pair](https://huggingface.co/datasets/crows_pairs) gender bias
-- [WinoGender] gender bias
+- WinoGender gender bias
 
 ### Useful Links
 More information on evaluation datasets can be found in [HELM](https://crfm.stanford.edu/helm/latest/).

+ 10 - 1
docs/FAQ.md

@@ -12,8 +12,17 @@ Here we discuss frequently asked questions that may occur and we found useful al
 
 3. How do PEFT methods work with FSDP in terms of grad requirements/layer freezing?
 
-    We wrap the PEFT modules separate from the transfromer layer in auto_wrapping policy, that would result in PEFT models having `require_grad=True` while the rest of the model is  `require_grad=False`.
+    We wrap the PEFT modules separately from the transformer layers in the auto-wrapping policy, which results in the PEFT modules having `requires_grad=True` while the rest of the model has `requires_grad=False` (a minimal sketch of such a wrapping policy appears after question 6 below).
 
 4. Can I add custom datasets?
 
     Yes, you can find more information on how to do that [here](Dataset.md).
+
+5. What are the hardware SKU requirements for deploying these models?
+
+    Hardware requirements vary based on latency, throughput and cost constraints. For good latency, the models were split across multiple GPUs with tensor parallelism in a machine with NVIDIA A100s or H100s. But TPUs, other types of GPUs like A10G, T4, L4, or even commodity hardware can also be used to deploy these models (e.g. https://github.com/ggerganov/llama.cpp).
+    If working on a CPU, it is worth looking at this [blog post](https://www.intel.com/content/www/us/en/developer/articles/news/llama2.html) from Intel for an idea of Llama 2's performance on a CPU.
+
+6. What are the hardware SKU requirements for fine-tuning Llama pre-trained models?
+
+    Fine-tuning requirements vary based on the amount of data, the time to complete fine-tuning, and cost constraints. To fine-tune these models we have generally used multiple NVIDIA A100 machines, with data parallelism across nodes and a mix of data and tensor parallelism within each node. But using a single machine, or other GPU types such as the NVIDIA A10G or H100, is definitely possible (e.g. the alpaca models are trained on a single RTX 4090: https://github.com/tloen/alpaca-lora).
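Returning to question 3 above, the separate wrapping of PEFT modules can be pictured with an auto-wrap policy along the following lines. This is only a hedged sketch of the idea, not the repository's actual wrapping code; the helper name `fsdp_auto_wrap_policy`, the use of PyTorch's private `_or_policy` helper, and the choice of transformer layer class are assumptions.

```python
# Illustrative sketch only -- not necessarily the exact policy used in this repo.
# Leaf modules whose single weight is trainable (e.g. LoRA adapters) get their own
# FSDP unit, so they keep requires_grad=True while frozen transformer blocks do not.
import functools

from torch.distributed.fsdp.wrap import (
    _or_policy,
    lambda_auto_wrap_policy,
    transformer_auto_wrap_policy,
)


def fsdp_auto_wrap_policy(transformer_layer_cls):
    def lambda_fn(module):
        # True for leaf modules whose only weight is trainable (PEFT/LoRA layers).
        return (
            len(list(module.named_children())) == 0
            and getattr(module, "weight", None) is not None
            and module.weight.requires_grad
        )

    lambda_policy = functools.partial(lambda_auto_wrap_policy, lambda_fn=lambda_fn)
    transformer_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={transformer_layer_cls},
    )
    # Wrap a module if either rule matches.
    return functools.partial(_or_policy, policies=[lambda_policy, transformer_policy])
```

The returned callable would then be passed as the `auto_wrap_policy` argument when wrapping the model with FSDP.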

+ 1 - 1
docs/LLM_finetuning.md

@@ -42,7 +42,7 @@ You can also keep most of the layers frozen and only finetune a few layers. Ther
 
 
 
-In this scenario depending on the model size, you might need to go beyond one GPU, especially if your model does not fit into one GPU for training. In this case Llama 2 7B parameter wont fit into one gpu.
+In this scenario, depending on the model size, you might need to go beyond one GPU, especially if your model does not fit into a single GPU for training. In this case the Llama 2 7B parameter model won't fit into one GPU.
 The way to think about it is that you need enough GPU memory to hold the model parameters, gradients, and optimizer states, where each of these, depending on the precision you are training in, can take up a multiple of your parameter count x bytes per element (fp32 = 4 bytes, fp16 = 2 bytes, bf16 = 2 bytes).
 For example, the AdamW optimizer keeps 2 additional states for each of your parameters, and in many cases these are kept in fp32. This implies that, depending on how many layers you are training/unfreezing, your GPU memory requirements can grow beyond one GPU.
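As a rough back-of-the-envelope illustration of the paragraph above (ignoring activations, buffers, and memory fragmentation; 7e9 is just the parameter count of the 7B model):

```python
# Rough estimate for full fine-tuning of a 7B-parameter model with AdamW.
params = 7e9                       # parameter count
weights_bytes = params * 2         # bf16/fp16 weights: 2 bytes each
grads_bytes = params * 2           # gradients kept in the same precision
optimizer_bytes = params * 4 * 2   # AdamW: two fp32 states per parameter
total_gb = (weights_bytes + grads_bytes + optimizer_bytes) / 1e9
print(f"~{total_gb:.0f} GB for weights, gradients and optimizer states")  # ~84 GB
```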
 

+ 9 - 3
inference/inference.md

@@ -1,6 +1,6 @@
 # Inference
 
-For inference we have provided an [inference script](inference.py). Depending on the type of finetuning performed during training the [inference script](inference.py) takes different arguments.
+For inference we have provided an [inference script](../inference/inference.py). Depending on the type of finetuning performed during training, the [inference script](../inference/inference.py) takes different arguments.
 If all model parameters were finetuned, the output dir of the training has to be given as the --model_name argument.
 In the case of a parameter-efficient method like LoRA, the base model has to be given as --model_name and the output dir of the training as the --peft_model argument.
 Additionally, a prompt for the model has to be provided in the form of a text file. The prompt can either be piped through standard input or given as the --prompt_file parameter.
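For orientation, the two cases described above might look like the following; the paths are placeholders and only the flags mentioned above are used:

```bash
# Full-parameter fine-tuning: point --model_name at the training output dir
python inference.py --model_name <TRAINING/OUTPUT/DIR> --prompt_file <PROMPT_FILE.txt>

# PEFT (e.g. LoRA): base model via --model_name, adapter weights via --peft_model
python inference.py --model_name <BASE/MODEL/PATH> --peft_model <TRAINING/OUTPUT/DIR> --prompt_file <PROMPT_FILE.txt>

# Alternatively, pipe the prompt through standard input
cat <PROMPT_FILE.txt> | python inference.py --model_name <PATH/TO/MODEL>
```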
@@ -41,6 +41,12 @@ Alternate inference options include:
 
 [**vLLM**](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html):
 To use vLLM you will need to install it using the instructions [here](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#installation).
-Once installed, you can use the vLLM_ineference.py script provided [here](vLLM_inference.py).
+Once installed, you can use the vLLM_inference.py script provided [here](../inference/vLLM_inference.py).
 
-[**TGI**](https://github.com/huggingface/text-generation-inference): Text Generation Inference (TGI) is another inference option available to you. For more information on how to set up and use TGI see [here](https://github.com/huggingface/text-generation-inference).
+Below is an example of how to run the vLLM_inference.py script found within the inference folder.
+
+``` bash
+python vLLM_inference.py --model_name <PATH/TO/MODEL/7B>
+```
+
+[**TGI**](https://github.com/huggingface/text-generation-inference): Text Generation Inference (TGI) is another inference option available to you. For more information on how to set up and use TGI see [here](../inference/hf-text-generation-inference/README.md).

+ 8 - 8
docs/mutli_gpu.md

@@ -4,7 +4,7 @@ To run fine-tuning on multi-GPUs, we will  make use of two packages:
 
 1. [PEFT](https://huggingface.co/blog/peft) methods, and in particular the Hugging Face [PEFT](https://github.com/huggingface/peft) library.
 
-2. [FSDP](https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html) which helps us parallelize the training over mutiple GPUs. [More details](LLM_finetuning.md/#2-full-partial-parameter-finetuning).
+2. [FSDP](https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html) which helps us parallelize the training over multiple GPUs. [More details](LLM_finetuning.md/#2-full-partial-parameter-finetuning).
 
 Given the combination of PEFT and FSDP, we would be able to fine tune a Llama 2 model on multiple GPUs in one node or multi-node.
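A hedged sketch of a launch command for this combination is shown below; the exact flag names live in `configs/training.py` and the PEFT configs, so treat `--peft_method lora` and the GPU count as illustrative assumptions:

```bash
# Illustrative only -- check configs/training.py for the authoritative flag names.
torchrun --nnodes 1 --nproc_per_node 4 llama_finetuning.py \
    --enable_fsdp --use_peft --peft_method lora \
    --model_name <PATH/TO/MODEL/7B> --pure_bf16
```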
 
@@ -21,7 +21,7 @@ pip install -r requirements.txt
 
 ## How to run it
 
-Get access to a machine with mutiple GPUs ( in this case we tested with 4 A100 and A10s).
+Get access to a machine with multiple GPUs (in this case we tested with 4 A100s and A10s).
 This runs with the `samsum_dataset` for the summarization application by default.
 
 **Multiple GPUs one node**:
@@ -68,7 +68,7 @@ sbatch multi_node.slurm
 
 ## How to run with different datasets?
 
-Currenty 4 datasets are supported that can be found in [Datasets config file](../configs/datasets.py).
+Currently, 4 datasets are supported; they can be found in the [Datasets config file](../configs/datasets.py).
 
 * `grammar_dataset`: use this [notebook](../ft_datasets/grammar_dataset/grammar_dataset_process.ipynb) to pull and process the Jfleg and C4 200M datasets for grammar checking.
 
@@ -132,9 +132,9 @@ save_optimizer: bool=False
 
 ```
 
-* [Datasets config file](../configs/datasets.py) provides the avaiable options for datasets.
+* [Datasets config file](../configs/datasets.py) provides the available options for datasets.
 
-* [peft config file](../configs/peft.py) provides the suported PEFT methods and respective settings that can be modified.
+* [peft config file](../configs/peft.py) provides the supported PEFT methods and respective settings that can be modified.
 
 * [FSDP config file](../configs/fsdp.py) provides FSDP settings such as:
 
@@ -147,12 +147,12 @@ save_optimizer: bool=False
 
         * `SHARD_GRAD_OP` shards gradients and optimizer states and keeps the parameters after the first `all_gather`. This reduces communication overhead, especially on slower networks, and is particularly beneficial in multi-node cases. It comes with the trade-off of higher memory consumption.
 
-        * `NO_SHARD` this is equivalant to DDP, does not shard model parameters, gradinets or optimizer states. It keeps the full parameter after the first `all_gather`.
+        * `NO_SHARD` is equivalent to DDP; it does not shard model parameters, gradients, or optimizer states. It keeps the full parameters after the first `all_gather`.
 
         * `HYBRID_SHARD` is available on PyTorch Nightlies. It does FSDP within a node and DDP between nodes. It is intended for multi-node cases and is helpful on slower networks, provided your model fits within one node.
 
 * `checkpoint_type` specifies the state dict checkpoint type for saving the model. `FULL_STATE_DICT` streams the state_dict of each model shard from each rank to the CPU and assembles the full state_dict on the CPU. `SHARDED_STATE_DICT` saves one checkpoint per rank and enables re-loading the model with a different world size.
 
-* `fsdp_activation_checkpointing` enables activation checkpoining for FSDP, this saves siginificant amount of memory with the trade off of recomputing itermediate activations during the backward pass. The saved memory can be re-invested in higher batch sizes to increase the throughput. We recommond you use this option.
+* `fsdp_activation_checkpointing` enables activation checkpointing for FSDP. This saves a significant amount of memory, with the trade-off of recomputing intermediate activations during the backward pass. The saved memory can be re-invested in higher batch sizes to increase throughput. We recommend you use this option.
 
-* `pure_bf16` it moves the  model to `BFloat16` and if `optimizer` is set to `anyprecision` then optimizer states will be kept in `BFloat16` as well. You can use this option if neccessary.
+* `pure_bf16` moves the model to `BFloat16` and, if `optimizer` is set to `anyprecision`, keeps the optimizer states in `BFloat16` as well. You can use this option if necessary.
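Taken together, the options above can be pictured as a config dataclass along these lines. This is an illustrative reconstruction from the bullets above, not the verbatim contents of `configs/fsdp.py`, and the defaults shown are assumptions:

```python
# Illustrative sketch of an FSDP config -- field names follow the options
# described above; the defaults here are assumptions, not the repo's actual values.
from dataclasses import dataclass

from torch.distributed.fsdp import ShardingStrategy, StateDictType


@dataclass
class fsdp_config:
    sharding_strategy: ShardingStrategy = ShardingStrategy.FULL_SHARD  # or SHARD_GRAD_OP / NO_SHARD / HYBRID_SHARD
    checkpoint_type: StateDictType = StateDictType.SHARDED_STATE_DICT  # or FULL_STATE_DICT
    fsdp_activation_checkpointing: bool = True
    pure_bf16: bool = False
    optimizer: str = "AdamW"  # "anyprecision" keeps optimizer states in BFloat16 when pure_bf16 is set
```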

+ 3 - 3
docs/single_gpu.md

@@ -40,7 +40,7 @@ The args used in the command above are:
 
 ## How to run with different datasets?
 
-Currenty 4 datasets are supported that can be found in [Datasets config file](../configs/datasets.py).
+Currently, 4 datasets are supported; they can be found in the [Datasets config file](../configs/datasets.py).
 
 * `grammar_dataset`: use this [notebook](../ft_datasets/grammar_dataset/grammar_dataset_process.ipynb) to pull and process the Jfleg and C4 200M datasets for grammar checking.
 
@@ -106,6 +106,6 @@ save_optimizer: bool=False
 
 ```
 
-* [Datasets config file](../configs/datasets.py) provides the avaiable options for datasets.
+* [Datasets config file](../configs/datasets.py) provides the available options for datasets.
 
-* [peft config file](../configs/peft.py) provides the suported PEFT methods and respective settings that can be modified.
+* [peft config file](../configs/peft.py) provides the supported PEFT methods and respective settings that can be modified.

+ 11 - 0
inference/README.md

@@ -0,0 +1,11 @@
+# Inference
+
+This folder contains inference examples for Llama 2. So far, we have provided support for three methods of inference:
+
+1. The [inference script](inference.py) provides support for Hugging Face accelerate and PEFT fine-tuned models.
+
+2. The [vLLM_inference.py](vLLM_inference.py) script takes advantage of vLLM's paged attention concept for low-latency inference.
+
+3. The [hf-text-generation-inference](hf-text-generation-inference/README.md) folder contains information on Hugging Face Text Generation Inference (TGI).
+
+For more in-depth information on inference, including safety checks and examples, see the inference documentation [here](../docs/inference.md).

+ 3 - 5
inference/chat_completion.py

@@ -62,13 +62,11 @@ def main(
     tokenizer = LlamaTokenizer.from_pretrained(model_name)
     tokenizer.add_special_tokens(
         {
-            "eos_token": "</s>",
-            "bos_token": "</s>",
-            "unk_token": "</s>",
-            "pad_token": "[PAD]",
+         
+            "pad_token": "<PAD>",
         }
     )
-
+    
     chats = format_tokens(dialogs, tokenizer)
 
     with torch.no_grad():

+ 1 - 1
inference/hf-text-generation-inference/README.md

@@ -4,7 +4,7 @@ This document shows how to serve a fine tuned LLaMA mode with HuggingFace's text
 
 ## Step 0: Merging the weights (Only required if LoRA method was used) 
 
-In case the model was fine tuned with LoRA mehtod we need to merge the weights of the base model with the adapter weight. For this we can use the script `merge_lora_weights.py` which is located in the same folder as this README file.
+In case the model was fine-tuned with the LoRA method, we need to merge the weights of the base model with the adapter weights. For this we can use the script `merge_lora_weights.py`, which is located in the same folder as this README file.
 
 The script takes the base model, the PEFT weight folder, and an output path as arguments:
 

+ 9 - 16
inference/inference.py

@@ -7,6 +7,7 @@ import fire
 import torch
 import os
 import sys
+import time
 from typing import List
 
 from transformers import LlamaTokenizer
@@ -58,22 +59,13 @@ def main(
     # Set the seeds for reproducibility
     torch.cuda.manual_seed(seed)
     torch.manual_seed(seed)
-    # model = load_model(model_name, quantization)
-    model_def = load_llama_from_config()
-    # print(dir(model_def))
-    # model_def.eval()
-    model = load_sharded_model_single_gpu(model_def, model_name)
-    model.to(torch.bfloat16)
-    model.to("cuda:0")
-    print("model has been loaded *******************")
-
-    tokenizer = LlamaTokenizer.from_pretrained("../../../hf-llama-pr/7B/")
+    
+    model = load_model(model_name, quantization)
+    tokenizer = LlamaTokenizer.from_pretrained(model_name)
     tokenizer.add_special_tokens(
         {
-            "eos_token": "</s>",
-            "bos_token": "</s>",
-            "unk_token": "</s>",
-            "pad_token": "[PAD]",
+         
+            "pad_token": "<PAD>",
         }
     )
     
@@ -104,7 +96,7 @@ def main(
 
     batch = tokenizer(user_prompt, return_tensors="pt")
     batch = {k: v.to("cuda") for k, v in batch.items()}
-    
+    start = time.perf_counter()
     with torch.no_grad():
         outputs = model.generate(
             **batch,
@@ -119,7 +111,8 @@ def main(
             length_penalty=length_penalty,
             **kwargs 
         )
-
+    e2e_inference_time = (time.perf_counter()-start)*1000
+    print(f"the inference time is {e2e_inference_time} ms")
     output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
     
     # Safety check of the model output

+ 5 - 7
llama_finetuning.py

@@ -109,13 +109,11 @@ def main(**kwargs):
     # Load the tokenizer and add special tokens
     tokenizer = LlamaTokenizer.from_pretrained(train_config.model_name)
     tokenizer.add_special_tokens(
-        {
-            "eos_token": "</s>",
-            "bos_token": "</s>",
-            "unk_token": "</s>",
-            "pad_token": '[PAD]',
-        }
-    )
+            {
+            
+                "pad_token": "<PAD>",
+            }
+        )
     if train_config.use_peft:
         peft_config = generate_peft_config(train_config, kwargs)
         model = get_peft_model(model, peft_config)
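One side note on the `<PAD>` token added above: when a tokenizer gains a token the model has never seen, the input embedding matrix generally has to grow to cover the new id. A hedged reminder of the usual Hugging Face call (whether and where this repository does this is not shown in this diff):

```python
# Assumes `model` is a transformers PreTrainedModel and `tokenizer` is the
# LlamaTokenizer configured above; this grows the embeddings to the new vocab size.
model.resize_token_embeddings(len(tokenizer))
```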

+ 13 - 12
utils/train_utils.py

@@ -35,11 +35,6 @@ from pathlib import Path
 sys.path.append(str(Path(__file__).resolve().parent.parent))
 from policies import bfSixteen, fpSixteen,bfSixteen_mixed, get_llama_wrapper
 
-scaler = ShardedGradScaler()
-
-
-
-
 def set_tokenizer_params(tokenizer: LlamaTokenizer):
     tokenizer.pad_token_id = 0
     tokenizer.padding_side = "left"
@@ -67,8 +62,11 @@ def train(model, train_dataloader,eval_dataloader, tokenizer, optimizer, lr_sche
     Returns: results dictionary containing average training and validation perplexity and loss
     """
     # Create a gradient scaler for fp16
-    scaler = torch.cuda.amp.GradScaler() if train_config.use_fp16 else None
-
+    if train_config.use_fp16 and train_config.enable_fsdp:
+        scaler = ShardedGradScaler()
+    elif train_config.use_fp16 and not train_config.enable_fsdp:
+        scaler = torch.cuda.amp.GradScaler() 
+        
     train_prep = []
     train_loss = []
     val_prep = []
@@ -85,7 +83,7 @@ def train(model, train_dataloader,eval_dataloader, tokenizer, optimizer, lr_sche
                 for key in batch.keys():
                     if train_config.enable_fsdp:
                         batch[key] = batch[key].to(local_rank)
-                    elif not train_config.quantization:
+                    else:
                         batch[key] = batch[key].to('cuda')       
                 outputs = model(**batch)
                 loss = outputs.loss
@@ -105,11 +103,9 @@ def train(model, train_dataloader,eval_dataloader, tokenizer, optimizer, lr_sche
                     loss.backward()
                     if (step + 1) % gradient_accumulation_steps == 0 or step == len(train_dataloader) - 1:
                         optimizer.step()
-                        lr_scheduler.step()
                         optimizer.zero_grad()
                         
-                print(f"\n step {step} is completed and loss is {loss.detach().float()}")
-
+                print(f"\n step {step} is completed and loss is {loss.detach().float()}")        
         # Reducing total_loss across all devices if there's more than one CUDA device
         if torch.cuda.device_count() > 1 and train_config.enable_fsdp:
             dist.all_reduce(total_loss, op=dist.ReduceOp.SUM)
@@ -123,7 +119,10 @@ def train(model, train_dataloader,eval_dataloader, tokenizer, optimizer, lr_sche
         print(f"Max CUDA memory reserved was {memtrace.max_reserved} GB")
         print(f"Cuda Malloc retires : {memtrace.cuda_malloc_retires}")
         print(f"CPU Total Peak Memory consumed during the train (max): {memtrace.cpu_peaked + memtrace.cpu_begin} GB")
-            
+        
+        # Update the learning rate as needed
+        lr_scheduler.step()
+          
         if train_config.run_validation:
             eval_ppl, eval_epoch_loss = evaluation(model, train_config, eval_dataloader, rank, tokenizer)   
             if train_config.save_model and eval_epoch_loss < best_val_loss:
@@ -159,7 +158,9 @@ def train(model, train_dataloader,eval_dataloader, tokenizer, optimizer, lr_sche
             val_loss.append(best_val_loss)
             val_prep.append(eval_ppl)
         
+        
         print(f"Epoch {epoch+1}: train_perplexity={train_perplexity:.4f}, train_epoch_loss={train_epoch_loss:.4f}")
+        lr_scheduler.step()
 
     avg_train_prep = sum(train_prep)/len(train_prep)
     avg_train_loss = sum(train_loss)/len(train_loss)