@@ -1,6 +1,6 @@
# Datasets and Evaluation Metrics
-The provided fine tuning script allows you to select between three datasets by passing the `dataset` arg to the `llama_finetuning.py` script. The current options are `grammar_dataset`, `alpaca_dataset`and `samsum_dataset`. Note: Use of any of the datasets should be in compliance with the dataset's underlying licenses (including but not limited to non-commercial uses)
+The provided fine tuning script allows you to select between three datasets by passing the `dataset` arg to the `llama_recipes.finetuning` module or the `llama_finetuning.py` script. The current options are `grammar_dataset`, `alpaca_dataset` and `samsum_dataset`. Note: Use of any of the datasets should be in compliance with the dataset's underlying licenses (including but not limited to non-commercial uses).
* [grammar_dataset](https://huggingface.co/datasets/jfleg) contains 150K pairs of English sentences and possible corrections.
* [alpaca_dataset](https://github.com/tatsu-lab/stanford_alpaca) provides 52K instruction-response pairs generated by `text-davinci-003`.
@@ -10,18 +10,18 @@ The provided fine tuning script allows you to select between three datasets by p
The list of available datasets can easily be extended with custom datasets by following the instructions below.
-Each dataset has a corresponding configuration (dataclass) in [configs/datasets.py](../configs/datasets.py) which contains the dataset name, training/validation split names, as well as optional parameters like datafiles etc.
+Each dataset has a corresponding configuration (dataclass) in [configs/datasets.py](../src/llama_recipes/configs/datasets.py) which contains the dataset name, the training/validation split names, and optional parameters such as data files.
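For example, a configuration following this schema might look like the sketch below; the field names are illustrative and not a verbatim copy of the file.

```python
from dataclasses import dataclass


@dataclass
class samsum_dataset:
    dataset: str = "samsum_dataset"   # name used to select the dataset
    train_split: str = "train"        # split used for training
    test_split: str = "validation"    # split used for evaluation
```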
-Additionally, there is a preprocessing function for each dataset in the [ft_datasets](../ft_datasets) folder.
+Additionally, there is a preprocessing function for each dataset in the [datasets](../src/llama_recipes/datasets) folder.
The data returned by the dataset needs to be consumable by the forward method of the fine-tuned model, which is called as ```model(**data)```.
For CausalLM models this usually means that the data needs to be in the form of a dictionary with "input_ids", "attention_mask" and "labels" fields.
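As an illustration, a single toy sample for a causal LM could look like the following; the token IDs are made up, and the exact padding and label masking depend on each dataset's preprocessing.

```python
import torch

# Hypothetical toy batch; real token IDs come from the tokenizer.
data = {
    "input_ids": torch.tensor([[1, 529, 3177, 568, 2]]),     # prompt + response tokens
    "attention_mask": torch.tensor([[1, 1, 1, 1, 1]]),       # 1 = real token, 0 = padding
    "labels": torch.tensor([[-100, -100, 3177, 568, 2]]),    # -100 positions are ignored by the loss
}

# The training loop can then compute the loss via:
# loss = model(**data).loss
```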
To add a custom dataset, the following steps need to be performed (a minimal sketch of the resulting code is shown after the list).
-1. Create a dataset configuration after the schema described above. Examples can be found in [configs/datasets.py](../configs/datasets.py).
+1. Create a dataset configuration following the schema described above. Examples can be found in [configs/datasets.py](../src/llama_recipes/configs/datasets.py).
2. Create a preprocessing routine which loads the data and returns a PyTorch-style dataset. The signature of the preprocessing function needs to be `(dataset_config, tokenizer, split_name)`, where `split_name` is the string for the train/validation split as defined in the dataclass.
-3. Register the dataset name and preprocessing function by inserting it as key and value into the DATASET_PREPROC dictionary in [utils/dataset_utils.py](../utils/dataset_utils.py)
-4. Set dataset field in training config to dataset name or use --dataset option of the llama_finetuning.py training script.
+3. Register the dataset name and preprocessing function by inserting them as key and value into the `DATASET_PREPROC` dictionary in [utils/dataset_utils.py](../src/llama_recipes/utils/dataset_utils.py).
+4. Set the dataset field in the training config to the dataset name, or use the `--dataset` option of the `llama_recipes.finetuning` module or the `llama_finetuning.py` training script.
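Putting these steps together, a custom dataset might be wired up roughly as sketched below. The configuration schema, the `(dataset_config, tokenizer, split_name)` signature and the `DATASET_PREPROC` registration follow the steps above; the `my_custom_dataset` names, file paths and the Hugging Face `datasets` loading call are placeholder assumptions for illustration.

```python
from dataclasses import dataclass

import datasets


# 1. In configs/datasets.py: a configuration following the schema described above.
@dataclass
class my_custom_dataset:
    dataset: str = "my_custom_dataset"
    train_split: str = "train"
    test_split: str = "validation"


# 2. A preprocessing routine that loads the data and returns a PyTorch-style dataset.
def get_my_custom_dataset(dataset_config, tokenizer, split_name):
    raw = datasets.load_dataset(
        "json",
        data_files={"train": "train.json", "validation": "val.json"},
        split=split_name,
    )

    def tokenize(sample):
        tokens = tokenizer(sample["text"])
        return {
            "input_ids": tokens["input_ids"],
            "attention_mask": tokens["attention_mask"],
            "labels": tokens["input_ids"].copy(),  # adapt label masking to your task
        }

    return raw.map(tokenize, remove_columns=list(raw.features))


# 3. In utils/dataset_utils.py: register the name -> preprocessing function mapping.
DATASET_PREPROC = {
    # ... existing entries ...
    "my_custom_dataset": get_my_custom_dataset,
}

# 4. Select the dataset at training time, e.g.:
#    python -m llama_recipes.finetuning --dataset my_custom_dataset ...
```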
## Application
Below we list other datasets that can be used for fine tuning, along with their main use cases.