# Datasets and Evaluation Metrics

The provided fine-tuning script allows you to select between three datasets by passing the `dataset` arg to the `llama_finetuning.py` script. The current options are `grammar_dataset`, `alpaca_dataset` and `samsum_dataset`. Note: Use of any of the datasets should be in compliance with the dataset's underlying licenses (including but not limited to non-commercial uses).

* [grammar_dataset](https://huggingface.co/datasets/jfleg) contains 150K pairs of English sentences and possible corrections.
* [alpaca_dataset](https://github.com/tatsu-lab/stanford_alpaca) provides 52K instruction-response pairs as generated by `text-davinci-003`.
* [samsum_dataset](https://huggingface.co/datasets/samsum) contains about 16K messenger-like conversations with summaries.

## Adding custom datasets

The list of available datasets can easily be extended with custom datasets by following these instructions. Each dataset has a corresponding configuration (dataclass) in [configs/dataset.py](../configs/dataset.py) which contains the dataset name, training/validation split names, as well as optional parameters like datafiles etc.

Additionally, there is a preprocessing function for each dataset in the [ft_datasets](../ft_datasets) folder. The returned data of the dataset needs to be consumable by the forward method of the fine-tuned model by calling `model(**data)`. For CausalLM models this usually means that the data needs to be in the form of a dictionary with "input_ids", "attention_mask" and "labels" fields.

To add a custom dataset the following steps need to be performed (see the sketch after this list):

1. Create a dataset configuration after the schema described above. Examples can be found in [configs/dataset.py](../configs/dataset.py).
2. Create a preprocessing routine which loads the data and returns a PyTorch style dataset. The signature for the preprocessing function needs to be (dataset_config, tokenizer, split_name) where split_name will be the string for the train/validation split as defined in the dataclass.
3. Register the dataset name and preprocessing function by inserting it as key and value into the DATASET_PREPROC dictionary in [utils/dataset_utils.py](../utils/dataset_utils.py).
4. Set the dataset field in the training config to the dataset name or use the --dataset option of the llama_finetuning.py training script.
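Putting steps 1 and 2 together, below is a minimal sketch of what a custom dataset could look like. Everything in it (the `my_custom_dataset` config, the JSON-lines file layout with "prompt"/"completion" fields, and the `get_my_custom_dataset` helper) is hypothetical and only illustrates the expected shapes; it is not part of this repo.

```python
from dataclasses import dataclass
import json

from torch.utils.data import Dataset


# Step 1: a config dataclass following the schema in configs/dataset.py.
# The data_path field and file format are assumptions for this sketch.
@dataclass
class my_custom_dataset:
    dataset: str = "my_custom_dataset"
    train_split: str = "train"
    test_split: str = "validation"
    data_path: str = "data/my_data_{split}.jsonl"


# Step 2: a preprocessing routine returning a PyTorch-style dataset whose
# items can be fed to the model as model(**data).
class PromptCompletionDataset(Dataset):
    def __init__(self, dataset_config, tokenizer, split_name):
        # split_name selects the train/validation file as defined above.
        path = dataset_config.data_path.format(split=split_name)
        with open(path) as f:
            self.samples = [json.loads(line) for line in f]
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        sample = self.samples[idx]
        # For causal LM fine-tuning the labels are a copy of input_ids;
        # tokens that should not contribute to the loss can be set to -100.
        encoding = self.tokenizer(
            sample["prompt"] + sample["completion"],
            truncation=True,
            max_length=512,
        )
        return {
            "input_ids": encoding["input_ids"],
            "attention_mask": encoding["attention_mask"],
            "labels": list(encoding["input_ids"]),
        }


def get_my_custom_dataset(dataset_config, tokenizer, split_name):
    # Signature required by step 2: (dataset_config, tokenizer, split_name).
    return PromptCompletionDataset(dataset_config, tokenizer, split_name)
```

Step 3 then amounts to adding `"my_custom_dataset": get_my_custom_dataset` to the `DATASET_PREPROC` dictionary, and step 4 to passing `--dataset my_custom_dataset` to `llama_finetuning.py`.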
## Application

Below we list other datasets that can be used for fine-tuning, along with their main use cases.

### Q&A
These datasets can be used for evaluation as well.
- [MMLU](https://huggingface.co/datasets/lukaemon/mmlu/viewer/astronomy/validation)
- [BoolQ](https://huggingface.co/datasets/boolq)
- [NarrativeQA](https://huggingface.co/datasets/narrativeqa)
- [NaturalQuestions](https://huggingface.co/datasets/natural_questions) (closed-book)
- [NaturalQuestions](https://huggingface.co/datasets/natural_questions) (open-book)
- [QuAC](https://huggingface.co/datasets/quac)
- [HellaSwag](https://huggingface.co/datasets/hellaswag)
- [OpenbookQA](https://huggingface.co/datasets/openbookqa)
- [TruthfulQA](https://huggingface.co/datasets/truthful_qa) (can be helpful for evaluating the model's tendency toward misinformation and for fact-checking)

### Instruction finetuning
- [Alpaca](https://huggingface.co/datasets/yahma/alpaca-cleaned) 52K instruction tuning
- [Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k) 15K instruction tuning

### Simple text generation for quick tests
- [English quotes](https://huggingface.co/datasets/Abirate/english_quotes) 2,508 quotes; multi-label text classification, text generation

### Reasoning
Used mostly for evaluation of LLMs.
- [bAbI](https://research.facebook.com/downloads/babi/)
- [Dyck](https://huggingface.co/datasets/dyk)
- [GSM8K](https://huggingface.co/datasets/gsm8k)
- [MATH](https://github.com/hendrycks/math)
- [APPS](https://huggingface.co/datasets/codeparrot/apps)
- [HumanEval](https://huggingface.co/datasets/openai_humaneval)
- [LSAT](https://huggingface.co/datasets/dmayhem93/agieval-lsat-ar)
- [Entity matching](https://huggingface.co/datasets/lighteval/EntityMatching)

### Toxicity evaluation
- [RealToxicityPrompts](https://huggingface.co/datasets/allenai/real-toxicity-prompts)

### Bias evaluation
- [CrowS-Pairs](https://huggingface.co/datasets/crows_pairs) gender bias
- WinoGender gender bias

### Useful Links

More information on evaluation datasets can be found in [HELM](https://crfm.stanford.edu/helm/latest/).
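As a quick smoke test, the Hugging Face-hosted datasets listed above can be loaded with the `datasets` library; a minimal sketch, using BoolQ as an example:

```python
from datasets import load_dataset

# Load the validation split of BoolQ; other Hugging Face datasets from the
# lists above load the same way (some additionally take a configuration
# name, e.g. load_dataset("gsm8k", "main")).
boolq = load_dataset("boolq", split="validation")
print(boolq[0]["question"], "->", boolq[0]["answer"])
```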