# Datasets and Evaluation Metrics

The provided fine-tuning script allows you to select between three datasets by passing the `dataset` arg to the `llama_finetuning.py` script. The current options are `grammar_dataset`, `alpaca_dataset` and `samsum_dataset`. Note: Use of any of the datasets should be in compliance with the dataset's underlying licenses (including but not limited to non-commercial uses).

* [grammar_dataset](https://huggingface.co/datasets/jfleg) contains 150K pairs of English sentences and possible corrections.
* [alpaca_dataset](https://github.com/tatsu-lab/stanford_alpaca) provides 52K instruction-response pairs as generated by `text-davinci-003`.
* [samsum_dataset](https://huggingface.co/datasets/samsum) contains about 16K messenger-like conversations with summaries.

## Adding custom datasets

The list of available datasets can easily be extended with custom datasets by following these instructions. Each dataset has a corresponding configuration (dataclass) in [configs/dataset.py](../configs/dataset.py) which contains the dataset name, training/validation split names, as well as optional parameters like datafiles etc.

Additionally, there is a preprocessing function for each dataset in the [ft_datasets](../ft_datasets) folder. The returned data of the dataset needs to be consumable by the forward method of the fine-tuned model by calling `model(**data)`. For CausalLM models this usually means that the data needs to be in the form of a dictionary with "input_ids", "attention_mask" and "labels" fields.

To add a custom dataset the following steps need to be performed (see the sketch after this list):

1. Create a dataset configuration after the schema described above. Examples can be found in [configs/dataset.py](../configs/dataset.py).
2. Create a preprocessing routine which loads the data and returns a PyTorch style dataset. The signature for the preprocessing function needs to be (dataset_config, tokenizer, split_name) where split_name will be the string for the train/validation split as defined in the dataclass.
3. Register the dataset name and preprocessing function by inserting it as key and value into the DATASET_PREPROC dictionary in [utils/dataset_utils.py](../utils/dataset_utils.py).
4. Set the dataset field in the training config to the dataset name or use the --dataset option of the llama_finetuning.py training script.
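Putting steps 1 and 2 together, below is a minimal sketch of what a custom dataset could look like. Everything in it (the `my_custom_dataset` config, the JSON-lines file layout with "prompt"/"completion" fields, and the `get_my_custom_dataset` helper) is hypothetical and only illustrates the expected shapes; it is not part of this repo.

```python
from dataclasses import dataclass
import json

from torch.utils.data import Dataset


# Step 1: a config dataclass following the schema in configs/dataset.py.
# The data_path field and file format are assumptions for this sketch.
@dataclass
class my_custom_dataset:
    dataset: str = "my_custom_dataset"
    train_split: str = "train"
    test_split: str = "validation"
    data_path: str = "data/my_data_{split}.jsonl"


# Step 2: a preprocessing routine returning a PyTorch-style dataset whose
# items can be fed to the model as model(**data).
class PromptCompletionDataset(Dataset):
    def __init__(self, dataset_config, tokenizer, split_name):
        # split_name selects the train/validation file as defined above.
        path = dataset_config.data_path.format(split=split_name)
        with open(path) as f:
            self.samples = [json.loads(line) for line in f]
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        sample = self.samples[idx]
        # For causal LM fine-tuning the labels are a copy of input_ids;
        # tokens that should not contribute to the loss can be set to -100.
        encoding = self.tokenizer(
            sample["prompt"] + sample["completion"],
            truncation=True,
            max_length=512,
        )
        return {
            "input_ids": encoding["input_ids"],
            "attention_mask": encoding["attention_mask"],
            "labels": list(encoding["input_ids"]),
        }


def get_my_custom_dataset(dataset_config, tokenizer, split_name):
    # Signature required by step 2: (dataset_config, tokenizer, split_name).
    return PromptCompletionDataset(dataset_config, tokenizer, split_name)
```

Step 3 then amounts to adding `"my_custom_dataset": get_my_custom_dataset` to the `DATASET_PREPROC` dictionary, and step 4 to passing `--dataset my_custom_dataset` to `llama_finetuning.py`.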
## Application

Below we list other datasets that can be used for fine-tuning, along with their main use cases.

### Q&A
These datasets can be used for evaluation as well.
- [MMLU](https://huggingface.co/datasets/lukaemon/mmlu/viewer/astronomy/validation)
- [BoolQ](https://huggingface.co/datasets/boolq)
- [NarrativeQA](https://huggingface.co/datasets/narrativeqa)
- [NaturalQuestions](https://huggingface.co/datasets/natural_questions) (closed-book)
- [NaturalQuestions](https://huggingface.co/datasets/natural_questions) (open-book)
- [QuAC](https://huggingface.co/datasets/quac)
- [HellaSwag](https://huggingface.co/datasets/hellaswag)
- [OpenbookQA](https://huggingface.co/datasets/openbookqa)
- [TruthfulQA](https://huggingface.co/datasets/truthful_qa) (can be helpful for evaluating the model's tendency toward misinformation and for fact-checking)

### Instruction finetuning
- [Alpaca](https://huggingface.co/datasets/yahma/alpaca-cleaned) 52K instruction tuning
- [Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k) 15K instruction tuning

### Simple text generation for quick tests
- [English quotes](https://huggingface.co/datasets/Abirate/english_quotes) 2,508 quotes; multi-label text classification, text generation

### Reasoning
Used mostly for evaluation of LLMs.
- [bAbI](https://research.facebook.com/downloads/babi/)
- [Dyck](https://huggingface.co/datasets/dyk)
- [GSM8K](https://huggingface.co/datasets/gsm8k)
- [MATH](https://github.com/hendrycks/math)
- [APPS](https://huggingface.co/datasets/codeparrot/apps)
- [HumanEval](https://huggingface.co/datasets/openai_humaneval)
- [LSAT](https://huggingface.co/datasets/dmayhem93/agieval-lsat-ar)
- [Entity matching](https://huggingface.co/datasets/lighteval/EntityMatching)

### Toxicity evaluation
- [RealToxicityPrompts](https://huggingface.co/datasets/allenai/real-toxicity-prompts)

### Bias evaluation
- [CrowS-Pairs](https://huggingface.co/datasets/crows_pairs) gender bias
- WinoGender gender bias

### Useful Links

More information on evaluation datasets can be found in [HELM](https://crfm.stanford.edu/helm/latest/).
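As a quick smoke test, the Hugging Face-hosted datasets listed above can be loaded with the `datasets` library; a minimal sketch, using BoolQ as an example:

```python
from datasets import load_dataset

# Load the validation split of BoolQ; other Hugging Face datasets from the
# lists above load the same way (some additionally take a configuration
# name, e.g. load_dataset("gsm8k", "main")).
boolq = load_dataset("boolq", split="validation")
print(boolq[0]["question"], "->", boolq[0]["answer"])
```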