The provided fine-tuning scripts allow you to select between three datasets by passing the dataset arg to the llama_recipes.finetuning module or the recipes/finetuning/finetuning.py script. The current options are grammar_dataset, alpaca_dataset and samsum_dataset. The alpaca_dataset samples were generated with text-davinci-003. Additionally, we integrate the OpenAssistant/oasst1 dataset as an example for a custom dataset.
Note: Use of any of the datasets should be in compliance with the dataset's underlying licenses (including but not limited to non-commercial uses).
Llama-recipes supports two strategies to batch requests together.
The default setting is packing, which concatenates the tokenized samples into long sequences that fill up the context length of the model.
This is the most compute-efficient variant, as it avoids any padding and all sequences have the same length.
Samples at the boundary of the context length are truncated, and the remainder of the cut sequence is used as the start of the next long sequence.
If the amount of training data is small, this procedure might introduce a lot of noise into the training data, which can hurt the prediction performance of the fine-tuned model.
Therefore, we also support a padding strategy, which does not introduce the additional noise due to truncated sequences.
The strategy tries to minimize the efficiency loss by batching samples of similar length together so that only minimal padding is necessary.
The batching strategy can be selected through the command line parameter --batching_strategy [packing]/[padding].
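The two strategies can be illustrated with a small, library-independent sketch. It is not the llama-recipes implementation: the function names pack_samples and length_grouped_batches, the pad_id default, and the choice to drop leftover tokens at the end are all assumptions made for illustration.

```python
from __future__ import annotations

from typing import Iterable, Iterator


def pack_samples(samples: Iterable[list[int]], chunk_size: int) -> Iterator[list[int]]:
    """Concatenate tokenized samples and cut them into chunks of exactly chunk_size tokens."""
    buffer: list[int] = []
    for tokens in samples:
        buffer.extend(tokens)
        # Emit every full chunk; the remainder stays in the buffer and
        # becomes the start of the next chunk.
        while len(buffer) >= chunk_size:
            yield buffer[:chunk_size]
            buffer = buffer[chunk_size:]
    # Assumption: tokens left over at the very end are dropped in this sketch.


def length_grouped_batches(
    samples: list[list[int]], batch_size: int, pad_id: int = 0
) -> list[list[list[int]]]:
    """Batch samples of similar length together and pad each batch to its longest sample."""
    ordered = sorted(samples, key=len)
    batches = []
    for start in range(0, len(ordered), batch_size):
        batch = ordered[start:start + batch_size]
        longest = max(len(sample) for sample in batch)
        batches.append([sample + [pad_id] * (longest - len(sample)) for sample in batch])
    return batches


samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = list(pack_samples(samples, chunk_size=4))
padded = length_grouped_batches(samples, batch_size=2)
```

In the packed output a sample can be split across two chunks, which is the source of the noise mentioned above. In the padded output every sample stays intact, at the cost of a few padding tokens per batch.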
The list of available datasets in llama-recipes is supposed to give users a quick start on training their Llama model. There are two ways to use a custom dataset. The first is to provide a function returning the dataset in a .py file, which can be passed to the command line tool. This does not involve changing the source code of llama-recipes. The second is aimed at contributions that extend llama-recipes, as it involves changing the source code.
To supply a custom dataset you need to provide a single .py file which contains a function with the following signature:
def get_custom_dataset(dataset_config, tokenizer, split: str):
For an example get_custom_dataset, you can look at the provided datasets in llama_recipes.datasets or examples/custom_dataset.py.
The dataset_config in the above signature will be an instance of llama_recipes.configs.dataset.custom_dataset with the modifications made through the command line.
The split signals whether to return the training or validation dataset.
The default function name is get_custom_dataset, but this can be changed as described below.
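A minimal sketch of such a file might look as follows. Only the function signature comes from this document; the ListDataset helper, the in-memory _TEXTS data, the assumption that split equals "train" for the training set, and the use of tokenizer.encode are illustrative choices. The returned dictionaries use the "input_ids", "attention_mask" and "labels" fields that causal language models commonly expect.

```python
class ListDataset:
    """Hypothetical minimal map-style dataset: supports len() and integer indexing."""

    def __init__(self, rows):
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        return self.rows[idx]


# Hypothetical in-memory data; a real file would load its samples from disk or a hub.
_TEXTS = {
    "train": ["Hello world.", "Fine-tuning is fun."],
    "validation": ["A held-out sentence."],
}


def get_custom_dataset(dataset_config, tokenizer, split: str):
    # Assumption: split is "train" for training data and anything else for validation.
    texts = _TEXTS["train"] if split == "train" else _TEXTS["validation"]
    rows = []
    for text in texts:
        # Assumption: the tokenizer exposes encode(text) returning a list of token ids.
        ids = tokenizer.encode(text)
        rows.append(
            {
                "input_ids": ids,
                "attention_mask": [1] * len(ids),
                "labels": list(ids),  # train on every token of the sample
            }
        )
    return ListDataset(rows)
```

The dataset_config argument is unused here; a real implementation could read paths or other options from it.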
In order to start a training run with the custom dataset, we need to set the --dataset as well as the --custom_dataset.file parameter.
python -m llama_recipes.finetuning --dataset "custom_dataset" --custom_dataset.file "examples/custom_dataset.py" [TRAINING PARAMETERS]
To change the function name that is used in the .py file, you can append the name following a : like this:
python -m llama_recipes.finetuning --dataset "custom_dataset" --custom_dataset.file "examples/custom_dataset.py:get_foo" [TRAINING PARAMETERS]
This will call the function get_foo instead of get_custom_dataset when retrieving the dataset.
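To make the file:function convention concrete, here is one way such a value could be resolved with the standard library. This is a sketch and not the loader used by llama-recipes; the helper name load_dataset_function is hypothetical, and splitting on the last colon assumes the path itself contains none.

```python
import importlib.util
from pathlib import Path


def load_dataset_function(spec: str, default_name: str = "get_custom_dataset"):
    """Resolve 'path/to/file.py[:function_name]' to a callable."""
    # Split on the last colon; without a colon the default function name is used.
    path_str, sep, func_name = spec.rpartition(":")
    if not sep:
        path_str, func_name = spec, default_name
    path = Path(path_str)
    if path.suffix != ".py" or not path.is_file():
        raise ValueError(f"Expected an existing .py file, got {path_str!r}")
    # Import the file as a module without requiring it to be on sys.path.
    module_spec = importlib.util.spec_from_file_location(path.stem, path)
    module = importlib.util.module_from_spec(module_spec)
    module_spec.loader.exec_module(module)
    return getattr(module, func_name)
```

Under these assumptions, load_dataset_function("examples/custom_dataset.py:get_foo") would return get_foo, while omitting the suffix would return get_custom_dataset.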
Each dataset has a corresponding configuration (dataclass) in configs/datasets.py which contains the dataset name, training/validation split names, as well as optional parameters like datafiles etc.
Additionally, there is a preprocessing function for each dataset in the datasets folder.
The data returned by the dataset needs to be consumable by the forward method of the fine-tuned model by calling model(**data).
For CausalLM models this usually means that the data needs to be in the form of a dictionary with "input_ids", "attention_mask" and "labels" fields.
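As an illustration of that dictionary format, the sketch below builds one sample from already tokenized prompt and answer ids. The helper build_sample and its mask_prompt flag are hypothetical. The value -100 is the label that PyTorch's cross-entropy loss ignores by default, so masking the prompt with it restricts the loss to the answer tokens; whether a given dataset masks the prompt is a design choice, not something this document prescribes.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def build_sample(prompt_ids, answer_ids, mask_prompt=True):
    """Build one training sample in the input_ids / attention_mask / labels format."""
    input_ids = list(prompt_ids) + list(answer_ids)
    if mask_prompt:
        # Only the answer tokens contribute to the loss.
        labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)
    else:
        # Every token contributes to the loss.
        labels = list(input_ids)
    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),  # no padding in a single sample
        "labels": labels,
    }


sample = build_sample(prompt_ids=[5, 6, 7], answer_ids=[8, 9])
```

The plain lists shown here would still need to be batched and converted to tensors before being passed to the model.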
To add a custom dataset the following steps need to be performed.
llama_recipes.finetuning module or examples/finetuning.py training script.
Below we list other datasets and their main use cases that can be used for fine-tuning.
English quotes | 2508 | Multi-label text classification, text generation
More information on evaluation datasets can be found in HELM.