@@ -6,10 +6,35 @@ The provided fine tuning script allows you to select between three datasets by p
* [alpaca_dataset](https://github.com/tatsu-lab/stanford_alpaca) provides 52K instruction-response pairs as generated by `text-davinci-003`.
* [samsum_dataset](https://huggingface.co/datasets/samsum) contains about 16k messenger-like conversations with summaries.
-## Adding custom datasets
-
-The list of available datasets can easily be extended with custom datasets by following these instructions.
-
+## Using custom datasets
+
+The datasets available in llama-recipes are intended to give users a quick start on training their Llama model.
+There are two ways to use a custom dataset.
+The first is to provide a function that returns the dataset in a .py file, which is then passed to the command line tool.
+This does not involve changing the source code of llama-recipes.
+The second targets contributions that extend llama-recipes, as it involves changing the source code.
+
+### Training on custom data
+To supply a custom dataset, provide a single .py file that contains a function with the following signature:
+```python
+def get_custom_dataset(dataset_config, tokenizer, split: str):
+```
+For an example of `get_custom_dataset`, see the provided datasets in `llama_recipes.datasets` or [examples/custom_dataset.py](examples/custom_dataset.py).
+The `dataset_config` in the above signature is an instance of `llama_recipes.configs.datasets.custom_dataset`, including any modifications made through the command line.
+The `split` argument signals whether to return the training or validation dataset.
+The default function name is `get_custom_dataset`, but this can be changed as described below.
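
As an illustration of the signature above, here is a minimal sketch of a custom dataset file. It is not taken from llama-recipes: the in-memory `_SAMPLES` data and the `ListDataset` wrapper are hypothetical, and the sketch assumes that the tokenizer exposes `encode(text)` returning a list of token ids, that the split names are `"train"` and `"validation"`, and that the training loop expects `input_ids`, `attention_mask` and `labels` keys.

```python
# Hypothetical in-memory data; a real file would load from disk or a dataset hub.
_SAMPLES = {
    "train": [
        ("Summarize: The cat sat on the mat.", "A cat sat."),
        ("Summarize: It rained all day in Paris.", "Rain in Paris."),
    ],
    "validation": [
        ("Summarize: The dog barked at the mailman.", "A dog barked."),
    ],
}


class ListDataset:
    """Minimal map-style dataset backed by a list (hypothetical helper)."""

    def __init__(self, items):
        self.items = items

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        return self.items[idx]


def get_custom_dataset(dataset_config, tokenizer, split: str):
    # dataset_config is unused in this sketch; a real function could read
    # options from it.
    rows = []
    for prompt, answer in _SAMPLES[split]:
        prompt_ids = tokenizer.encode(prompt)
        answer_ids = tokenizer.encode(answer)
        input_ids = prompt_ids + answer_ids
        rows.append({
            "input_ids": input_ids,
            "attention_mask": [1] * len(input_ids),
            # Assumption: -100 masks the prompt tokens out of the loss.
            "labels": [-100] * len(prompt_ids) + answer_ids,
        })
    return ListDataset(rows)
```

Masking the prompt with `-100` is one common choice; training on the full sequence by copying `input_ids` into `labels` would be an equally valid variant.
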
+
+To start training with the custom dataset, set both the `--dataset` and the `--custom_dataset.file` parameters.
+```
+python -m llama_recipes.finetuning --dataset "custom_dataset" --custom_dataset.file "examples/custom_dataset.py" [TRAINING PARAMETERS]
+```
+To change the function name that is used in the .py file, append the name after a `:` like this:
+```
+python -m llama_recipes.finetuning --dataset "custom_dataset" --custom_dataset.file "examples/custom_dataset.py:get_foo" [TRAINING PARAMETERS]
+```
+This will call the function `get_foo` instead of `get_custom_dataset` when retrieving the dataset.
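
To make the `file.py:function_name` convention concrete, the following sketch shows one way such a value could be resolved into a callable. This is an illustration only and not necessarily how llama-recipes implements it; `load_dataset_function` is a hypothetical helper name, and the sketch assumes the file path itself contains no `:`.

```python
import importlib.util
from pathlib import Path


def load_dataset_function(spec: str, default_name: str = "get_custom_dataset"):
    """Resolve 'path/to/file.py[:func_name]' into the named function."""
    # Split on the first ':'; fall back to the default name if none is given.
    path_str, _, func_name = spec.partition(":")
    func_name = func_name or default_name

    path = Path(path_str)
    if not path.is_file():
        raise FileNotFoundError(f"Dataset file {path} does not exist")

    # Import the file as a module without requiring it to be on sys.path.
    module_spec = importlib.util.spec_from_file_location(path.stem, path)
    module = importlib.util.module_from_spec(module_spec)
    module_spec.loader.exec_module(module)

    try:
        return getattr(module, func_name)
    except AttributeError:
        raise ValueError(f"{path} does not define a function named {func_name!r}")
```

With this helper, `load_dataset_function("examples/custom_dataset.py:get_foo")` would return `get_foo`, while omitting the suffix would return `get_custom_dataset`.
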
+
+### Adding a new dataset
Each dataset has a corresponding configuration (dataclass) in [configs/datasets.py](../src/llama_recipes/configs/datasets.py) which contains the dataset name, training/validation split names, as well as optional parameters like datafiles etc.
Additionally, there is a preprocessing function for each dataset in the [datasets](../src/llama_recipes/datasets) folder.
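
As a rough sketch of what such a configuration dataclass might look like, consider the example below. The class name `my_dataset` and the field names are hypothetical and chosen for illustration; check the existing entries in `configs/datasets.py` for the fields the code actually expects.

```python
from dataclasses import dataclass


@dataclass
class my_dataset:
    # Name used to select the dataset on the command line (assumed convention).
    dataset: str = "my_dataset"
    # Names of the training and validation splits.
    train_split: str = "train"
    test_split: str = "validation"
    # Optional parameter, e.g. a path to a local data file (hypothetical field).
    data_path: str = ""
```

Keeping every field defaulted means the configuration can be instantiated without arguments and individual values can be overridden afterwards.
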