@@ -1,6 +1,6 @@
# Datasets and Evaluation Metrics
-The provided fine tuning script allows you to select between three datasets by passing the `dataset` arg to the `llama_finetuning.py` script. The current options are `grammar_dataset`, `alpaca_dataset`and `samsum_dataset`. Note: Use of any of the datasets should be in compliance with the dataset's underlying licenses (including but not limited to non-commercial uses)
+The provided fine tuning script allows you to select between three datasets by passing the `dataset` arg to the `llama_recipes.finetuning` module or the `llama_finetuning.py` script. The current options are `grammar_dataset`, `alpaca_dataset` and `samsum_dataset`. Note: Use of any of the datasets should be in compliance with the dataset's underlying licenses (including but not limited to non-commercial uses).
* [grammar_dataset](https://huggingface.co/datasets/jfleg) contains 150K pairs of English sentences and possible corrections.
* [alpaca_dataset](https://github.com/tatsu-lab/stanford_alpaca) provides 52K instruction-response pairs generated by `text-davinci-003`.
@@ -10,18 +10,18 @@ The provided fine tuning script allows you to select between three datasets by p
The list of available datasets can easily be extended with custom datasets by following the instructions below.
-Each dataset has a corresponding configuration (dataclass) in [configs/datasets.py](../configs/datasets.py) which contains the dataset name, training/validation split names, as well as optional parameters like datafiles etc.
+Each dataset has a corresponding configuration (dataclass) in [configs/datasets.py](../src/llama_recipes/configs/datasets.py) which contains the dataset name, the training/validation split names, and optional parameters such as data files.
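For example, a configuration following this schema might look like the sketch below; the field names are illustrative and not a verbatim copy of the file.

```python
from dataclasses import dataclass


@dataclass
class samsum_dataset:
    dataset: str = "samsum_dataset"   # name used to select the dataset
    train_split: str = "train"        # split used for training
    test_split: str = "validation"    # split used for evaluation
```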
-Additionally, there is a preprocessing function for each dataset in the [ft_datasets](../ft_datasets) folder.
+Additionally, there is a preprocessing function for each dataset in the [datasets](../src/llama_recipes/datasets) folder.
The data returned by the dataset needs to be consumable by the forward method of the fine-tuned model, which is called as ```model(**data)```.
For CausalLM models this usually means that the data needs to be in the form of a dictionary with "input_ids", "attention_mask" and "labels" fields.
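As an illustration, a single toy sample for a causal LM could look like the following; the token IDs are made up, and the exact padding and label masking depend on each dataset's preprocessing.

```python
import torch

# Hypothetical toy batch; real token IDs come from the tokenizer.
data = {
    "input_ids": torch.tensor([[1, 529, 3177, 568, 2]]),     # prompt + response tokens
    "attention_mask": torch.tensor([[1, 1, 1, 1, 1]]),       # 1 = real token, 0 = padding
    "labels": torch.tensor([[-100, -100, 3177, 568, 2]]),    # -100 positions are ignored by the loss
}

# The training loop can then compute the loss via:
# loss = model(**data).loss
```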
To add a custom dataset, the following steps need to be performed (a minimal sketch of the resulting code is shown after the list).
-1. Create a dataset configuration after the schema described above. Examples can be found in [configs/datasets.py](../configs/datasets.py).
+1. Create a dataset configuration following the schema described above. Examples can be found in [configs/datasets.py](../src/llama_recipes/configs/datasets.py).
2. Create a preprocessing routine which loads the data and returns a PyTorch-style dataset. The signature of the preprocessing function needs to be `(dataset_config, tokenizer, split_name)`, where `split_name` is the string for the train/validation split as defined in the dataclass.
-3. Register the dataset name and preprocessing function by inserting it as key and value into the DATASET_PREPROC dictionary in [utils/dataset_utils.py](../utils/dataset_utils.py)
-4. Set dataset field in training config to dataset name or use --dataset option of the llama_finetuning.py training script.
+3. Register the dataset name and preprocessing function by inserting them as key and value into the `DATASET_PREPROC` dictionary in [utils/dataset_utils.py](../src/llama_recipes/utils/dataset_utils.py).
+4. Set the dataset field in the training config to the dataset name, or use the `--dataset` option of the `llama_recipes.finetuning` module or the `llama_finetuning.py` training script.
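Putting these steps together, a custom dataset might be wired up roughly as sketched below. The configuration schema, the `(dataset_config, tokenizer, split_name)` signature and the `DATASET_PREPROC` registration follow the steps above; the `my_custom_dataset` names, file paths and the Hugging Face `datasets` loading call are placeholder assumptions for illustration.

```python
from dataclasses import dataclass

import datasets


# 1. In configs/datasets.py: a configuration following the schema described above.
@dataclass
class my_custom_dataset:
    dataset: str = "my_custom_dataset"
    train_split: str = "train"
    test_split: str = "validation"


# 2. A preprocessing routine that loads the data and returns a PyTorch-style dataset.
def get_my_custom_dataset(dataset_config, tokenizer, split_name):
    raw = datasets.load_dataset(
        "json",
        data_files={"train": "train.json", "validation": "val.json"},
        split=split_name,
    )

    def tokenize(sample):
        tokens = tokenizer(sample["text"])
        return {
            "input_ids": tokens["input_ids"],
            "attention_mask": tokens["attention_mask"],
            "labels": tokens["input_ids"].copy(),  # adapt label masking to your task
        }

    return raw.map(tokenize, remove_columns=list(raw.features))


# 3. In utils/dataset_utils.py: register the name -> preprocessing function mapping.
DATASET_PREPROC = {
    # ... existing entries ...
    "my_custom_dataset": get_my_custom_dataset,
}

# 4. Select the dataset at training time, e.g.:
#    python -m llama_recipes.finetuning --dataset my_custom_dataset ...
```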
## Application
Below we list other datasets that can be used for fine tuning, along with their main use cases.