# Datasets and Evaluation Metrics

The provided fine-tuning scripts allow you to select between three datasets by passing the `dataset` arg to the `llama_recipes.finetuning` module or the [`recipes/finetuning/finetuning.py`](../finetuning.py) script. The current options are `grammar_dataset`, `alpaca_dataset` and `samsum_dataset`. Additionally, we integrate the OpenAssistant/oasst1 dataset as an [example for a custom dataset](custom_dataset.py).

Note: Use of any of the datasets should be in compliance with the dataset's underlying licenses (including but not limited to non-commercial uses).

* [grammar_dataset](https://huggingface.co/datasets/jfleg) contains 150K pairs of English sentences and possible corrections.
* [alpaca_dataset](https://github.com/tatsu-lab/stanford_alpaca) provides 52K instruction-response pairs as generated by `text-davinci-003`.
* [samsum_dataset](https://huggingface.co/datasets/samsum) contains about 16k messenger-like conversations with summaries.
* [OpenAssistant/oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1/) contains about 88k messages from assistant-style conversations.

## Batching Strategies

Llama-recipes supports two strategies to batch requests together. The default setting is `packing`, which concatenates the tokenized samples into long sequences filling up the context length of the model. This is the most compute-efficient variant as it avoids any padding and all sequences have the same length. Samples at the boundary of the context length are truncated and the remainder of the cut sequence is used as the start of the next long sequence.

If the amount of training data is small, this procedure might introduce a lot of noise into the training data, which can hurt the prediction performance of the fine-tuned model. Therefore, we also support a `padding` strategy which does not introduce the additional noise due to truncated sequences. This strategy tries to minimize the efficiency loss by batching samples of similar length together, so only minimal padding is necessary.

The batching strategy can be selected through the command line parameter `--batching_strategy [packing]/[padding]`.

## Using custom datasets

The list of available datasets in llama-recipes is supposed to give users a quick start on training their Llama model. To use a custom dataset there are two possible ways. The first provides a function returning the dataset in a .py file which can be given to the command line tool. This does not involve changing the source code of llama-recipes. The second way targets contributions which extend llama-recipes, as it involves changing the source code.

### Training on custom data

To supply a custom dataset you need to provide a single .py file which contains a function with the following signature:
```python
def get_custom_dataset(dataset_config, tokenizer, split: str):
```
For an example `get_custom_dataset` you can look at the provided datasets in llama_recipes.datasets or [examples/custom_dataset.py](custom_dataset.py). The `dataset_config` in the above signature will be an instance of llama_recipes.configs.dataset.custom_dataset with the modifications made through the command line. The split signals whether to return the training or validation dataset. The default function name is `get_custom_dataset` but this can be changed as described below.

In order to start a training with the custom dataset we need to set the `--dataset` as well as the `--custom_dataset.file` parameter.
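For illustration, a bare-bones custom dataset file could look like the following sketch. The file name, prompt format and the use of the Hugging Face `samsum` data here are purely illustrative assumptions; see [examples/custom_dataset.py](custom_dataset.py) for a complete, supported implementation.

```python
# my_custom_dataset.py -- hypothetical example file, not part of llama-recipes
from datasets import load_dataset


def get_custom_dataset(dataset_config, tokenizer, split: str):
    # Load any data source you like; the Hugging Face samsum dataset is used
    # here purely as an illustration.
    dataset = load_dataset("samsum", split=split)

    def tokenize(sample):
        prompt = f"Summarize this dialog:\n{sample['dialogue']}\n---\nSummary:\n"
        # For causal LM fine-tuning the full token sequence usually doubles as the labels.
        input_ids = tokenizer.encode(prompt + sample["summary"]) + [tokenizer.eos_token_id]
        return {
            "input_ids": input_ids,
            "attention_mask": [1] * len(input_ids),
            "labels": list(input_ids),
        }

    return dataset.map(tokenize, remove_columns=list(dataset.features))
```

With a file like this in place, training on the custom dataset can be launched as follows: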
```
python -m llama_recipes.finetuning --dataset "custom_dataset" --custom_dataset.file "examples/custom_dataset.py" [TRAINING PARAMETERS]
```
To change the function name that is used in the .py file you can append the name following a `:` like this:
```
python -m llama_recipes.finetuning --dataset "custom_dataset" --custom_dataset.file "examples/custom_dataset.py:get_foo" [TRAINING PARAMETERS]
```
This will call the function `get_foo` instead of `get_custom_dataset` when retrieving the dataset.

### Adding new dataset

Each dataset has a corresponding configuration (dataclass) in [configs/datasets.py](../../../src/llama_recipes/configs/datasets.py) which contains the dataset name, training/validation split names, as well as optional parameters like datafiles etc.

Additionally, there is a preprocessing function for each dataset in the [datasets](../../../src/llama_recipes/datasets) folder. The returned data of the dataset needs to be consumable by the forward method of the fine-tuned model by calling `model(**data)`. For CausalLM models this usually means that the data needs to be in the form of a dictionary with "input_ids", "attention_mask" and "labels" fields.

To add a custom dataset the following steps need to be performed:

1. Create a dataset configuration following the schema described above. Examples can be found in [configs/datasets.py](../../../src/llama_recipes/configs/datasets.py).
2. Create a preprocessing routine which loads the data and returns a PyTorch style dataset. The signature of the preprocessing function needs to be (dataset_config, tokenizer, split_name), where split_name will be the string for the train/validation split as defined in the dataclass.
3. Register the dataset name and preprocessing function by inserting them as key and value into the DATASET_PREPROC dictionary in [utils/dataset_utils.py](../../../src/llama_recipes/utils/dataset_utils.py).
4. Set the dataset field in the training config to the dataset name, or use the --dataset option of the `llama_recipes.finetuning` module or the examples/finetuning.py training script.

## Application

Below we list other datasets and their main use cases that can be used for fine-tuning.
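Most of the datasets below are hosted on the Hugging Face Hub and can be pulled down with the `datasets` library for a quick inspection before wiring them into a fine-tuning run. The snippet below is only a sketch; the `boolq` identifier is taken from the Q&A list that follows, and some of the other datasets require additional arguments such as a configuration name.

```python
from datasets import load_dataset

# Pull one of the datasets listed below (BoolQ, a yes/no question answering
# benchmark) for a quick look before using it for fine-tuning or evaluation.
boolq = load_dataset("boolq", split="validation")
print(boolq)      # number of rows and column names
print(boolq[0])   # one example with "question", "passage" and "answer" fields
```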
### Q&A these can be used for evaluation as well - [MMLU](https://huggingface.co/datasets/lukaemon/mmlu/viewer/astronomy/validation) - [BoolQ](https://huggingface.co/datasets/boolq) - [NarrativeQA](https://huggingface.co/datasets/narrativeqa) - [NaturalQuestions](https://huggingface.co/datasets/natural_questions) (closed-book) - [NaturalQuestions](https://huggingface.co/datasets/openbookqa) (open-book) - [QuAC](https://huggingface.co/datasets/quac) - [HellaSwag](https://huggingface.co/datasets/hellaswag) - [OpenbookQA](https://huggingface.co/datasets/openbookqa) - [TruthfulQA](https://huggingface.co/datasets/truthful_qa) ( can be helpful for fact checking/ misinformation of the model) ### instruction finetuning - [Alpaca](https://huggingface.co/datasets/yahma/alpaca-cleaned) 52k instruction tuning - [Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k) 15k 15k instruction tuning ### simple text generation for quick tests [English](https://huggingface.co/datasets/Abirate/english_quotes) quotes 2508 Multi-label text classification, text generation ### Reasoning used mostly for evaluation of LLMs - [bAbI](https://research.facebook.com/downloads/babi/) - [Dyck](https://huggingface.co/datasets/dyk) - [GSM8K](https://huggingface.co/datasets/gsm8k) - [MATH](https://github.com/hendrycks/math) - [APPS](https://huggingface.co/datasets/codeparrot/apps) - [HumanEval](https://huggingface.co/datasets/openai_humaneval) - [LSAT](https://huggingface.co/datasets/dmayhem93/agieval-lsat-ar) - [Entity matching](https://huggingface.co/datasets/lighteval/EntityMatching) ### Toxicity evaluation - [Real_toxic_prompts](https://huggingface.co/datasets/allenai/real-toxicity-prompts) ### Bias evaluation - [Crows_pair](https://huggingface.co/datasets/crows_pairs) gender bias - WinoGender gender bias ### Useful Links More information on evaluation dataset can be found in [HELM](https://crfm.stanford.edu/helm/latest/)