# Datasets and Evaluation Metrics

The provided fine-tuning script allows you to select between three datasets by passing the `dataset` arg to the `llama_recipes.finetuning` module or the `examples/finetuning.py` script. The current options are `grammar_dataset`, `alpaca_dataset` and `samsum_dataset`. Additionally, we integrate the OpenAssistant/oasst1 dataset as an [example for a custom dataset](../examples/custom_dataset.py).

Note: Use of any of the datasets should be in compliance with the dataset's underlying licenses (including but not limited to non-commercial uses).

* [grammar_dataset](https://huggingface.co/datasets/jfleg) contains 150K pairs of English sentences and possible corrections.
* [alpaca_dataset](https://github.com/tatsu-lab/stanford_alpaca) provides 52K instruction-response pairs as generated by `text-davinci-003`.
* [samsum_dataset](https://huggingface.co/datasets/samsum) contains about 16k messenger-like conversations with summaries.
* [OpenAssistant/oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1/) contains about 88k messages from assistant-style conversations.

## Using custom datasets

The list of available datasets in llama-recipes is meant to give users a quick start on training their Llama model. There are two ways to use a custom dataset. The first is to provide a function that returns the dataset in a .py file which can be given to the command line tool; this does not involve changing the source code of llama-recipes. The second way targets contributions which extend llama-recipes, as it involves changing the source code.

### Training on custom data

To supply a custom dataset you need to provide a single .py file which contains a function with the following signature:

```python
def get_custom_dataset(dataset_config, tokenizer, split: str):
```

For an example of `get_custom_dataset` you can look at the provided datasets in `llama_recipes.datasets` or [examples/custom_dataset.py](../examples/custom_dataset.py). The `dataset_config` in the above signature will be an instance of `llama_recipes.configs.dataset.custom_dataset` with the modifications made through the command line. The `split` argument signals whether to return the training or validation dataset. The default function name is `get_custom_dataset`, but this can be changed as described below.

In order to start training with the custom dataset we need to set the `--dataset` as well as the `--custom_dataset.file` parameter.

```
python -m llama_recipes.finetuning --dataset "custom_dataset" --custom_dataset.file "examples/custom_dataset.py" [TRAINING PARAMETERS]
```

To change the function name that is used in the .py file you can append the name after a `:` like this:

```
python -m llama_recipes.finetuning --dataset "custom_dataset" --custom_dataset.file "examples/custom_dataset.py:get_foo" [TRAINING PARAMETERS]
```

This will call the function `get_foo` instead of `get_custom_dataset` when retrieving the dataset.

### Adding new dataset

Each dataset has a corresponding configuration (dataclass) in [configs/datasets.py](../src/llama_recipes/configs/datasets.py) which contains the dataset name, training/validation split names, as well as optional parameters such as data files.

Additionally, there is a preprocessing function for each dataset in the [datasets](../src/llama_recipes/datasets) folder. The returned data of the dataset needs to be consumable by the forward method of the fine-tuned model when called as `model(**data)`. For CausalLM models this usually means that the data needs to be in the form of a dictionary with `input_ids`, `attention_mask` and `labels` fields.
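For illustration, below is a hedged sketch of what such a preprocessing function could look like. The function name `get_my_dataset`, the placeholder file paths, and the `prompt`/`response` field names are hypothetical and not part of llama-recipes; the bundled functions in the [datasets](../src/llama_recipes/datasets) folder are the authoritative examples.

```python
# Hedged sketch of a dataset preprocessing function; the paths, field names and
# function name are placeholders, not actual llama-recipes code.
import copy

import datasets


def get_my_dataset(dataset_config, tokenizer, split_name):
    # Load local JSON lines files; the paths are hypothetical placeholders.
    dataset = datasets.load_dataset(
        "json",
        data_files={"train": "my_train.jsonl", "validation": "my_eval.jsonl"},
        split=split_name,
    )

    def tokenize(sample):
        prompt_ids = tokenizer.encode(
            tokenizer.bos_token + sample["prompt"], add_special_tokens=False
        )
        response_ids = tokenizer.encode(
            sample["response"] + tokenizer.eos_token, add_special_tokens=False
        )
        input_ids = prompt_ids + response_ids
        labels = copy.deepcopy(input_ids)
        # Mask the prompt tokens so the loss is only computed on the response.
        labels[: len(prompt_ids)] = [-100] * len(prompt_ids)
        return {
            "input_ids": input_ids,
            "attention_mask": [1] * len(input_ids),
            "labels": labels,
        }

    # Each item now carries exactly the fields consumed by model(**data).
    return dataset.map(tokenize, remove_columns=list(dataset.features))
```

Masking the prompt tokens with `-100` is a common convention so that the loss is only computed on the response tokens.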
To add a custom dataset, the following steps need to be performed:

1. Create a dataset configuration after the schema described above. Examples can be found in [configs/datasets.py](../src/llama_recipes/configs/datasets.py).
2. Create a preprocessing routine which loads the data and returns a PyTorch-style dataset. The signature of the preprocessing function needs to be `(dataset_config, tokenizer, split_name)`, where `split_name` will be the string for the train/validation split as defined in the dataclass.
3. Register the dataset name and preprocessing function by inserting them as key and value into the `DATASET_PREPROC` dictionary in [utils/dataset_utils.py](../src/llama_recipes/utils/dataset_utils.py).
4. Set the `dataset` field in the training config to the dataset name, or use the `--dataset` option of the `llama_recipes.finetuning` module or the `examples/finetuning.py` training script.

## Application

Below we list other datasets that can be used for fine-tuning, grouped by their main use cases. Most of them are hosted on the Hugging Face Hub; a short loading sketch for a quick look at one of them is included at the end of this page.

### Q&A (these can be used for evaluation as well)

- [MMLU](https://huggingface.co/datasets/lukaemon/mmlu/viewer/astronomy/validation)
- [BoolQ](https://huggingface.co/datasets/boolq)
- [NarrativeQA](https://huggingface.co/datasets/narrativeqa)
- [NaturalQuestions](https://huggingface.co/datasets/natural_questions) (closed-book)
- [NaturalQuestions](https://huggingface.co/datasets/openbookqa) (open-book)
- [QuAC](https://huggingface.co/datasets/quac)
- [HellaSwag](https://huggingface.co/datasets/hellaswag)
- [OpenbookQA](https://huggingface.co/datasets/openbookqa)
- [TruthfulQA](https://huggingface.co/datasets/truthful_qa) (can be helpful for fact checking and probing the model for misinformation)

### Instruction finetuning

- [Alpaca](https://huggingface.co/datasets/yahma/alpaca-cleaned): 52k instruction tuning examples
- [Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k): 15k instruction tuning examples

### Simple text generation for quick tests

- [English quotes](https://huggingface.co/datasets/Abirate/english_quotes): 2508 quotes, usable for multi-label text classification and text generation

### Reasoning (used mostly for evaluation of LLMs)

- [bAbI](https://research.facebook.com/downloads/babi/)
- [Dyck](https://huggingface.co/datasets/dyk)
- [GSM8K](https://huggingface.co/datasets/gsm8k)
- [MATH](https://github.com/hendrycks/math)
- [APPS](https://huggingface.co/datasets/codeparrot/apps)
- [HumanEval](https://huggingface.co/datasets/openai_humaneval)
- [LSAT](https://huggingface.co/datasets/dmayhem93/agieval-lsat-ar)
- [Entity matching](https://huggingface.co/datasets/lighteval/EntityMatching)

### Toxicity evaluation

- [Real_toxic_prompts](https://huggingface.co/datasets/allenai/real-toxicity-prompts)

### Bias evaluation

- [Crows_pair](https://huggingface.co/datasets/crows_pairs): gender bias
- WinoGender: gender bias

### Useful Links

More information on evaluation datasets can be found in [HELM](https://crfm.stanford.edu/helm/latest/).
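As a practical complement to the lists above, most of these datasets can be pulled down for a quick inspection before wiring them into a fine-tuning or evaluation run. The minimal sketch below uses BoolQ purely as an example; most of the Hub-hosted datasets listed on this page load the same way.

```python
# Quick, hedged look at one of the datasets listed above; "boolq" is used
# purely as an example, most Hub-hosted datasets load the same way.
from datasets import load_dataset

dataset = load_dataset("boolq", split="validation")

print(dataset)     # number of rows and column names
print(dataset[0])  # one example with "question", "passage" and "answer" fields
```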