
Merge branch 'facebookresearch:main' into spellCheck

sekyondaMeta 1 year ago
commit
226a10df75

+ 1 - 1
docs/FAQ.md

@@ -12,7 +12,7 @@ Here we discuss frequently asked questions that may occur and we found useful al
 
 3. How do PEFT methods work with FSDP in terms of grad requirements/layer freezing?
 
-    We wrap the PEFT modules separate from the transfromer layer in auto_wrapping policy, that would result in PEFT models having `require_grad=True` while the rest of the model is  `require_grad=False`.
+    We wrap the PEFT modules separately from the transformer layers in the auto-wrapping policy, which results in the PEFT modules having `requires_grad=True` while the rest of the model has `requires_grad=False`.
 
 4. Can I add custom datasets?
 
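As a side note to item 3 above: the separate wrapping can be expressed with PyTorch's composable auto-wrap policies. The sketch below is illustrative, not the repo's exact code; it assumes `LlamaDecoderLayer` from `transformers` and combines two policies with `_or_policy` from `torch.distributed.fsdp.wrap`.

```python
import functools

from torch.distributed.fsdp.wrap import (
    _or_policy,
    lambda_auto_wrap_policy,
    transformer_auto_wrap_policy,
)
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

def _is_trainable_leaf(module) -> bool:
    # Leaf modules whose weight still requires grad are the PEFT adapters;
    # the frozen base-model weights fail this check.
    return (
        len(list(module.named_children())) == 0
        and getattr(module, "weight", None) is not None
        and module.weight.requires_grad
    )

# Wrap trainable PEFT adapters and frozen transformer layers as separate FSDP units.
auto_wrap_policy = functools.partial(
    _or_policy,
    policies=[
        functools.partial(lambda_auto_wrap_policy, lambda_fn=_is_trainable_leaf),
        functools.partial(
            transformer_auto_wrap_policy,
            transformer_layer_cls={LlamaDecoderLayer},
        ),
    ],
)
```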

+ 1 - 1
docs/LLM_finetuning.md

@@ -42,7 +42,7 @@ You can also keep most of the layers frozen and only finetune a few layers. Ther
 
 
 
-In this scenario depending on the model size, you might need to go beyond one GPU, especially if your model does not fit into one GPU for training. In this case Llama 2 7B parameter wont fit into one gpu.
+In this scenario, depending on the model size, you might need to go beyond one GPU, especially if your model does not fit into one GPU for training. In this case, a Llama 2 7B parameter model won't fit into one GPU.
 The way to think about it is: you need enough GPU memory to hold the model parameters, gradients, and optimizer states, where each of these, depending on the training precision, takes a multiple of your parameter count x bytes per value (4 bytes for fp32, 2 bytes for fp16 or bf16).
 For example, the AdamW optimizer keeps 2 extra states for each of your parameters, and in many cases these are kept in fp32. This implies that, depending on how many layers you are training/unfreezing, your GPU memory needs can grow beyond one GPU.
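To make the rule of thumb concrete, here is a rough back-of-the-envelope calculation for full fine-tuning of a 7B-parameter model. Activations and temporary buffers are excluded, and the bf16/fp32 split is an assumption, not a measurement:

```python
n_params = 7e9                  # Llama 2 7B

weights = n_params * 2          # bf16 parameters: 2 bytes each
grads   = n_params * 2          # one bf16 gradient per parameter
adamw   = n_params * 2 * 4      # two fp32 optimizer states per parameter

total_gb = (weights + grads + adamw) / 1e9
print(f"~{total_gb:.0f} GB before activations")  # ~84 GB -> beyond a single 80 GB GPU
```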
 

+ 7 - 7
docs/mutli_gpu.md

@@ -4,7 +4,7 @@ To run fine-tuning on multi-GPUs, we will  make use of two packages:
 
 1. [PEFT](https://huggingface.co/blog/peft) methods, in particular using the Hugging Face [PEFT](https://github.com/huggingface/peft) library.
 
-2. [FSDP](https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html) which helps us parallelize the training over mutiple GPUs. [More details](LLM_finetuning.md/#2-full-partial-parameter-finetuning).
+2. [FSDP](https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html) which helps us parallelize the training over multiple GPUs. [More details](LLM_finetuning.md/#2-full-partial-parameter-finetuning).
 
 Given the combination of PEFT and FSDP, we would be able to fine-tune a Llama 2 model on multiple GPUs in one node or across multiple nodes.
 
@@ -21,7 +21,7 @@ pip install -r requirements.txt
 
 ## How to run it
 
-Get access to a machine with mutiple GPUs ( in this case we tested with 4 A100 and A10s).
+Get access to a machine with multiple GPUs (in this case we tested with 4 A100s and A10s).
 This runs with the `samsum_dataset` for the summarization application by default.
 
 **Multiple GPUs one node**:
@@ -68,7 +68,7 @@ sbatch multi_node.slurm
 
 ## How to run with different datasets?
 
-Currenty 4 datasets are supported that can be found in [Datasets config file](../configs/datasets.py).
+Currently 4 datasets are supported; they can be found in the [Datasets config file](../configs/datasets.py).
 
 * `grammar_dataset` : use this [notebook](../ft_datasets/grammar_dataset/grammar_dataset_process.ipynb) to pull and process the Jfleg and C4 200M datasets for grammar checking.
 
@@ -134,7 +134,7 @@ save_optimizer: bool=False
 
 * [Datasets config file](../configs/datasets.py) provides the available options for datasets.
 
-* [peft config file](../configs/peft.py) provides the suported PEFT methods and respective settings that can be modified.
+* [peft config file](../configs/peft.py) provides the supported PEFT methods and respective settings that can be modified.
 
 * [FSDP config file](../configs/fsdp.py) provides FSDP settings such as:
 
@@ -147,12 +147,12 @@ save_optimizer: bool=False
 
         * `SHARD_GRAD_OP` shards gradients and optimizer states and keeps the parameters after the first `all_gather`. This reduces communication overhead, especially if you are using slower networks, and is particularly beneficial in multi-node cases. It comes with the trade-off of higher memory consumption.
 
-        * `NO_SHARD` this is equivalant to DDP, does not shard model parameters, gradinets or optimizer states. It keeps the full parameter after the first `all_gather`.
+        * `NO_SHARD` is equivalent to DDP; it does not shard model parameters, gradients, or optimizer states. It keeps the full parameters after the first `all_gather`.
 
         * `HYBRID_SHARD`, available on PyTorch nightlies, does FSDP within a node and DDP between nodes. It is meant for multi-node cases and is helpful on slower networks, given that your model fits into one node.
 
 * `checkpoint_type` specifies the state dict checkpoint type for saving the model. `FULL_STATE_DICT` streams the state_dict of each model shard from a rank to CPU and assembles the full state_dict on CPU. `SHARDED_STATE_DICT` saves one checkpoint per rank and enables re-loading the model with a different world size.
 
-* `fsdp_activation_checkpointing` enables activation checkpoining for FSDP, this saves siginificant amount of memory with the trade off of recomputing itermediate activations during the backward pass. The saved memory can be re-invested in higher batch sizes to increase the throughput. We recommond you use this option.
+* `fsdp_activation_checkpointing` enables activation checkpointing for FSDP; this saves a significant amount of memory with the trade-off of recomputing intermediate activations during the backward pass. The saved memory can be re-invested in higher batch sizes to increase throughput. We recommend you use this option.
 
-* `pure_bf16` it moves the  model to `BFloat16` and if `optimizer` is set to `anyprecision` then optimizer states will be kept in `BFloat16` as well. You can use this option if neccessary.
+* `pure_bf16` moves the model to `BFloat16`, and if `optimizer` is set to `anyprecision` then the optimizer states will be kept in `BFloat16` as well. You can use this option if necessary.
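For orientation, the sketch below shows roughly how the settings above map onto the `torch.distributed.fsdp` API. It is a minimal illustration, not the repo's wrapping code: `model` and `auto_wrap_policy` are assumed to exist, and `LlamaDecoderLayer` stands in for the transformer block class.

```python
import torch
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
)
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

# pure_bf16-style mixed precision: keep params, reductions, and buffers in bf16.
bf16_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)

model = FSDP(
    model,
    auto_wrap_policy=auto_wrap_policy,              # assumed defined elsewhere
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # or SHARD_GRAD_OP / NO_SHARD / HYBRID_SHARD
    mixed_precision=bf16_policy,
)

# fsdp_activation_checkpointing: trade recompute for memory on each block.
apply_activation_checkpointing(
    model,
    check_fn=lambda module: isinstance(module, LlamaDecoderLayer),
)
```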

+ 3 - 3
docs/single_gpu.md

@@ -40,7 +40,7 @@ The args used in the command above are:
 
 ## How to run with different datasets?
 
-Currenty 4 datasets are supported that can be found in [Datasets config file](../configs/datasets.py).
+Currently 4 datasets are supported; they can be found in the [Datasets config file](../configs/datasets.py).
 
 * `grammar_dataset` : use this [notebook](../ft_datasets/grammar_dataset/grammar_dataset_process.ipynb) to pull and process the Jfleg and C4 200M datasets for grammar checking.
 
@@ -106,6 +106,6 @@ save_optimizer: bool=False
 
 ```
 
-* [Datasets config file](../configs/datasets.py) provides the avaiable options for datasets.
+* [Datasets config file](../configs/datasets.py) provides the available options for datasets.
 
-* [peft config file](../configs/peft.py) provides the suported PEFT methods and respective settings that can be modified.
+* [peft config file](../configs/peft.py) provides the supported PEFT methods and respective settings that can be modified.

+ 1 - 1
inference/hf-text-generation-inference/README.md

@@ -4,7 +4,7 @@ This document shows how to serve a fine-tuned LLaMA model with HuggingFace's text
 
 ## Step 0: Merging the weights (Only required if LoRA method was used) 
 
-In case the model was fine tuned with LoRA mehtod we need to merge the weights of the base model with the adapter weight. For this we can use the script `merge_lora_weights.py` which is located in the same folder as this README file.
+In case the model was fine-tuned with the LoRA method, we need to merge the weights of the base model with the adapter weights. For this we can use the script `merge_lora_weights.py`, which is located in the same folder as this README file.
 
 The script takes the base model, the PEFT weight folder, and an output path as arguments:
 
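As a rough sketch of what such a merge does internally, using standard `transformers`/`peft` APIs (all paths below are placeholders, not the script's actual arguments):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("path/to/base_model")
model = PeftModel.from_pretrained(base, "path/to/peft_weights")
model = model.merge_and_unload()      # fold the LoRA deltas into the base weights

model.save_pretrained("path/to/merged_output")
AutoTokenizer.from_pretrained("path/to/base_model").save_pretrained(
    "path/to/merged_output"
)
```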

+ 12 - 12
utils/train_utils.py

@@ -35,11 +35,6 @@ from pathlib import Path
 sys.path.append(str(Path(__file__).resolve().parent.parent))
 from policies import bfSixteen, fpSixteen,bfSixteen_mixed, get_llama_wrapper
 
-scaler = ShardedGradScaler()
-
-
-
-
 def set_tokenizer_params(tokenizer: LlamaTokenizer):
     tokenizer.pad_token_id = 0
     tokenizer.padding_side = "left"
@@ -67,8 +62,11 @@ def train(model, train_dataloader,eval_dataloader, tokenizer, optimizer, lr_sche
     Returns: results dictionary containing average training and validation perplexity and loss
     """
     # Create a gradient scaler for fp16
-    scaler = torch.cuda.amp.GradScaler() if train_config.use_fp16 else None
-
+    if train_config.use_fp16 and train_config.enable_fsdp:
+        scaler = ShardedGradScaler()
+    elif train_config.use_fp16 and not train_config.enable_fsdp:
+        scaler = torch.cuda.amp.GradScaler() 
+        
     train_prep = []
     train_loss = []
     val_prep = []
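For context on the hunk above: when `use_fp16` is set, the scaler created there drives the backward pass and optimizer step further down in `train`. Roughly, assuming the loop variables (`loss`, `step`, `optimizer`, `gradient_accumulation_steps`) as defined in the surrounding function:

```python
# Sketch of the fp16 path; mirrors the non-fp16 branch shown below.
if train_config.use_fp16:
    scaler.scale(loss).backward()        # scale the loss to avoid fp16 underflow
    if (step + 1) % gradient_accumulation_steps == 0 or step == len(train_dataloader) - 1:
        scaler.step(optimizer)           # unscales gradients, then optimizer.step()
        scaler.update()                  # adjust the scale factor for the next step
        optimizer.zero_grad()
else:
    loss.backward()
```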
@@ -85,7 +83,7 @@ def train(model, train_dataloader,eval_dataloader, tokenizer, optimizer, lr_sche
                 for key in batch.keys():
                     if train_config.enable_fsdp:
                         batch[key] = batch[key].to(local_rank)
-                    elif not train_config.quantization:
+                    else:
                         batch[key] = batch[key].to('cuda')       
                 outputs = model(**batch)
                 loss = outputs.loss
@@ -105,11 +103,9 @@ def train(model, train_dataloader,eval_dataloader, tokenizer, optimizer, lr_sche
                     loss.backward()
                     if (step + 1) % gradient_accumulation_steps == 0 or step == len(train_dataloader) - 1:
                         optimizer.step()
-                        lr_scheduler.step()
                         optimizer.zero_grad()
                         
-                print(f"\n step {step} is completed and loss is {loss.detach().float()}")
-
+                print(f"\n step {step} is completed and loss is {loss.detach().float()}")        
         # Reducing total_loss across all devices if there's more than one CUDA device
         if torch.cuda.device_count() > 1 and train_config.enable_fsdp:
             dist.all_reduce(total_loss, op=dist.ReduceOp.SUM)
@@ -123,7 +119,10 @@ def train(model, train_dataloader,eval_dataloader, tokenizer, optimizer, lr_sche
         print(f"Max CUDA memory reserved was {memtrace.max_reserved} GB")
         print(f"Cuda Malloc retires : {memtrace.cuda_malloc_retires}")
         print(f"CPU Total Peak Memory consumed during the train (max): {memtrace.cpu_peaked + memtrace.cpu_begin} GB")
-            
+        
+        # Update the learning rate as needed
+        lr_scheduler.step()
+          
         if train_config.run_validation:
             eval_ppl, eval_epoch_loss = evaluation(model, train_config, eval_dataloader, rank, tokenizer)   
             if train_config.save_model and eval_epoch_loss < best_val_loss:
@@ -159,7 +158,8 @@ def train(model, train_dataloader,eval_dataloader, tokenizer, optimizer, lr_sche
             val_loss.append(best_val_loss)
             val_prep.append(eval_ppl)
         
+        
         print(f"Epoch {epoch+1}: train_perplexity={train_perplexity:.4f}, train_epoch_loss={train_epoch_loss:.4f}")
 
     avg_train_prep = sum(train_prep)/len(train_prep)
     avg_train_loss = sum(train_loss)/len(train_loss)