
Added documentation for custom dataset

Matthias Reso, 1 year ago
Commit 64bbe5bfe1
3 changed files with 35 additions and 6 deletions
  1. README.md (+1 −1)
  2. docs/Dataset.md (+29 −4)
  3. examples/README.md (+5 −1)

+ 1 - 1
README.md

@@ -101,7 +101,7 @@ If you want to dive right into single or multi GPU fine-tuning, run the examples
 All the parameters in the examples and recipes below need to be further tuned to have desired results based on the model, method, data and task at hand.
 
 **Note:**
-* To change the dataset in the commands below pass the `dataset` arg. Current options for dataset are `grammar_dataset`, `alpaca_dataset`and  `samsum_dataset`. A description of the datasets and how to add custom datasets can be found in [Dataset.md](./docs/Dataset.md). For  `grammar_dataset`, `alpaca_dataset` please make sure you use the suggested instructions from [here](./docs/single_gpu.md#how-to-run-with-different-datasets) to set them up.
+* To change the dataset in the commands below pass the `dataset` arg. Current options for integrated datasets are `grammar_dataset`, `alpaca_dataset` and `samsum_dataset`. A description of how to use your own dataset and how to add custom datasets can be found in [Dataset.md](./docs/Dataset.md#using-custom-datasets). For `grammar_dataset` and `alpaca_dataset`, please make sure you use the suggested instructions from [here](./docs/single_gpu.md#how-to-run-with-different-datasets) to set them up.
 
 * Default dataset and other LORA config has been set to `samsum_dataset`.
 
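For instance, following the invocation pattern shown in the docs/Dataset.md changes below, switching to the grammar dataset would look like this (once it has been set up per the linked instructions):
```
python -m llama_recipes.finetuning --dataset "grammar_dataset" [TRAINING PARAMETERS]
```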

+ 29 - 4
docs/Dataset.md

@@ -6,10 +6,35 @@ The provided fine tuning script allows you to select between three datasets by p
 * [alpaca_dataset](https://github.com/tatsu-lab/stanford_alpaca) provides 52K instruction-response pairs as generated by `text-davinci-003`.
 * [samsum_dataset](https://huggingface.co/datasets/samsum) contains about 16k messenger-like conversations with summaries.
 
-## Adding custom datasets
-
-The list of available datasets can easily be extended with custom datasets by following these instructions.
-
+## Using custom datasets
+
+The list of available datasets in llama-recipes is meant to give users a quick start for training their Llama model.
+There are two ways to use a custom dataset.
+The first is to provide a function returning the dataset in a .py file, which can be given to the command line tool; this does not involve changing the llama-recipes source code.
+The second way targets contributions that extend llama-recipes, as it involves changing the source code.
+
+### Training on custom data
+To supply a custom dataset, you need to provide a single .py file that contains a function with the following signature:
+```python
+def get_custom_dataset(dataset_config, tokenizer, split: str):
+```
+For an example of `get_custom_dataset`, you can look at the provided datasets in llama_recipes.datasets or [examples/custom_dataset.py](examples/custom_dataset.py).
+The `dataset_config` in the above signature will be an instance of llama_recipes.configs.dataset.custom_dataset with the modifications made through the command line.
+The `split` parameter signals whether to return the training or validation dataset.
+The default function name is `get_custom_dataset`, but this can be changed as described below.
+
+To start training with the custom dataset, we need to set the `--dataset` as well as the `--custom_dataset.file` parameter.
+```
+python -m llama_recipes.finetuning --dataset "custom_dataset" --custom_dataset.file "examples/custom_dataset.py" [TRAINING PARAMETERS]
+```
+To change the name of the function that is used from the .py file, you can append the name after a `:` like this:
+```
+python -m llama_recipes.finetuning --dataset "custom_dataset" --custom_dataset.file "examples/custom_dataset.py:get_foo" [TRAINING PARAMETERS]
+```
+This will call the function `get_foo` instead of `get_custom_dataset` when retrieving the dataset.
+
+### Adding a new dataset
 Each dataset has a corresponding configuration (dataclass) in [configs/datasets.py](../src/llama_recipes/configs/datasets.py) which contains the dataset name, training/validation split names, as well as optional parameters like datafiles etc.
 
 Additionally, there is a preprocessing function for each dataset in the [datasets](../src/llama_recipes/datasets) folder.
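As a concrete illustration of the `get_custom_dataset` signature introduced in this diff, here is a minimal sketch. It assumes the Hugging Face `datasets` library, uses the public samsum data as a stand-in for your own files, and may differ from the actual examples/custom_dataset.py:
```python
# Minimal sketch of a get_custom_dataset implementation; the actual
# examples/custom_dataset.py may differ. Assumes the Hugging Face
# `datasets` library, with samsum standing in for your own data.
from datasets import load_dataset

def get_custom_dataset(dataset_config, tokenizer, split: str):
    # `split` is the configured split name, e.g. "train" or "validation".
    dataset = load_dataset("samsum", split=split)

    def tokenize(sample):
        prompt = f"Summarize this dialog:\n{sample['dialogue']}\n---\nSummary:\n"
        tokens = tokenizer(prompt + sample["summary"], truncation=True, max_length=512)
        # For causal LM fine-tuning the labels are typically a copy of the input ids.
        tokens["labels"] = tokens["input_ids"].copy()
        return tokens

    # Tokenize every sample and drop the raw text columns.
    return dataset.map(tokenize, remove_columns=list(dataset.features))
```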

+ 5 - 1
examples/README.md

@@ -31,4 +31,8 @@ For more in depth information on inference including inference safety checks and
 
 **Note** The [sensitive topics safety checker](../src/llama_recipes/inference/safety_utils.py) utilizes AuditNLG which is an optional dependency. Please refer to installation section of the main [README.md](../README.md#install-with-optional-dependencies) for details.
 
-**Note** The **vLLM** example requires additional dependencies. Please refer to installation section of the main [README.md](../README.md#install-with-optional-dependencies) for details.
+**Note** The **vLLM** example requires additional dependencies. Please refer to installation section of the main [README.md](../README.md#install-with-optional-dependencies) for details.
+
+## Train on custom dataset
+To show how to train a model on a custom dataset, we provide an example that generates a custom dataset in [custom_dataset.py](./custom_dataset.py).
+The usage of custom datasets is further described in the datasets [README](../docs/Dataset.md#training-on-custom-data).
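Mechanically, the `file.py:func` spec described in the docs/Dataset.md changes above could plausibly be resolved with `importlib`. The following is a hypothetical sketch, not llama-recipes' actual loader code:
```python
# Illustrative sketch of how a "path/to/file.py:function_name" spec could be
# resolved; this is not the actual llama-recipes loader.
import importlib.util
from pathlib import Path

def load_dataset_fn(spec: str, default_name: str = "get_custom_dataset"):
    # Split an optional ":function_name" suffix off the file path.
    path, _, func_name = spec.partition(":")
    func_name = func_name or default_name

    # Import the .py file as an ad-hoc module and fetch the function.
    module_spec = importlib.util.spec_from_file_location(Path(path).stem, path)
    module = importlib.util.module_from_spec(module_spec)
    module_spec.loader.exec_module(module)
    return getattr(module, func_name)
```
Under that assumption, `load_dataset_fn("examples/custom_dataset.py:get_foo")` would return `get_foo`, ready to be called with the dataset config, tokenizer, and split.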