|
@@ -0,0 +1,304 @@
|
|
|
+{
|
|
|
+ "cells": [
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "## Running Llama2 on Google Colab using Hugging Face transformers library\n",
|
|
|
+ "This notebook goes over how you can set up and run Llama2 using Hugging Face transformers library"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "### Steps at a glance:\n",
|
|
|
+ "This demo showcases how to run the example with already converted Llama 2 weights on [Hugging Face](https://huggingface.co/meta-llama). Please Note: To use the downloads on Hugging Face, you must first request a download as shown in the steps below making sure that you are using the same email address as your Hugging Face account.\n",
|
|
|
+ "\n",
|
|
|
+ "To use already converted weights, start here:\n",
|
|
|
+ "1. Request download of model weights from the Llama website\n",
|
|
|
+ "2. Prepare the script\n",
|
|
|
+ "3. Run the example\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "Else, if you'd like to download the models locally and convert them to the HF format, follow the steps below to convert the weights:\n",
|
|
|
+ "1. Request download of model weights from the Llama website\n",
|
|
|
+ "2. Clone the llama repo and get the weights\n",
|
|
|
+ "3. Convert the model weights\n",
|
|
|
+ "4. Prepare the script\n",
|
|
|
+ "5. Run the example"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "### Using already converted weights"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "#### 1. Request download of model weights from the Llama website\n",
|
|
|
+ "Request download of model weights from the Llama website\n",
|
|
|
+ "Before you can run the model locally, you will need to get the model weights. To get the model weights, visit the [Llama website](https://llama.meta.com/) and click on “download models”. \n",
|
|
|
+ "\n",
|
|
|
+ "Fill the required information, select the models “Llama 2 & Llama Chat” and accept the terms & conditions. You will receive a URL in your email in a short time."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "#### 2. Prepare the script\n",
|
|
|
+ "\n",
|
|
|
+ "We will install the Transformers library and Accelerate library for our demo.\n",
|
|
|
+ "\n",
|
|
|
+ "The `Transformers` library provides many models to perform tasks on texts such as classification, question answering, text generation, etc.\n",
|
|
|
+ "The `accelerate` library enables the same PyTorch code to be run across any distributed configuration of GPUs and CPUs.\n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "!pip install transformers\n",
|
|
|
+ "!pip install accelerate"
|
|
|
+ ]
|
|
|
+ },
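+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The converted Llama 2 checkpoints hosted on Hugging Face are gated, so you will most likely need to authenticate with the Hugging Face account that was granted access before the weights can be downloaded. The cell below is a minimal sketch using `huggingface_hub`, which is installed as a dependency of transformers."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Authenticate with Hugging Face so the gated meta-llama checkpoints can be downloaded.\n",
+ "# Create an access token at https://huggingface.co/settings/tokens and paste it when prompted.\n",
+ "from huggingface_hub import notebook_login\n",
+ "\n",
+ "notebook_login()"
+ ]
+ },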
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Next, we will import AutoTokenizer, which is a class from the transformers library that automatically chooses the correct tokenizer for a given pre-trained model, import transformers library and torch for PyTorch.\n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "from transformers import AutoTokenizer\n",
|
|
|
+ "import transformers\n",
|
|
|
+ "import torch"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Then, we will set the model variable to a specific model we’d like to use. In this demo, we will use the 7b chat model `meta-llama/Llama-2-7b-chat-hf`."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "model = \"meta-llama/Llama-2-7b-chat-hf\"\n",
|
|
|
+ "tokenizer = AutoTokenizer.from_pretrained(model)"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Now, we will use the `from_pretrained` method of `AutoTokenizer` to create a tokenizer. This will download and cache the pre-trained tokenizer and return an instance of the appropriate tokenizer class.\n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "pipeline = transformers.pipeline(\n",
|
|
|
+ "\"text-generation\",\n",
|
|
|
+ " model=model,\n",
|
|
|
+ " torch_dtype=torch.float16,\n",
|
|
|
+ " device_map=\"auto\",\n",
|
|
|
+ ")"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "#### 3. Run the example\n",
|
|
|
+ "\n",
|
|
|
+ "Now, let’s create the pipeline for text generation. We’ll also set the device_map argument to `auto`, which means the pipeline will automatically use a GPU if one is available.\n",
|
|
|
+ "\n",
|
|
|
+ "Let’s also generate a text sequence based on the input that we provide. "
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "sequences = pipeline(\n",
|
|
|
+ " 'I have tomatoes, basil and cheese at home. What can I cook for dinner?\\n',\n",
|
|
|
+ " do_sample=True,\n",
|
|
|
+ " top_k=10,\n",
|
|
|
+ " num_return_sequences=1,\n",
|
|
|
+ " eos_token_id=tokenizer.eos_token_id,\n",
|
|
|
+ " truncation = True,\n",
|
|
|
+ " max_length=400,\n",
|
|
|
+ ")\n",
|
|
|
+ "\n",
|
|
|
+ "for seq in sequences:\n",
|
|
|
+ " print(f\"Result: {seq['generated_text']}\")"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "<br>\n",
|
|
|
+ "\n",
|
|
|
+ "### Downloading and converting weights to Hugging Face format"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "#### 1. Request download of model weights from the Llama website\n",
|
|
|
+ "Request download of model weights from the Llama website\n",
|
|
|
+ "Before you can run the model locally, you will need to get the model weights. To get the model weights, visit the [Llama website](https://llama.meta.com/) and click on “download models”. \n",
|
|
|
+ "\n",
|
|
|
+ "Fill the required information, select the models “Llama 2 & Llama Chat” and accept the terms & conditions. You will receive a URL in your email in a short time.\n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "#### 2. Clone the llama repo and get the weights\n",
|
|
|
+ "Git clone the [Llama repo](https://github.com/facebookresearch/llama.git). Enter the URL and get 7B-chat weights. This will download the tokenizer.model, and a directory llama-2-7b-chat with the weights in it.\n",
|
|
|
+ "\n",
|
|
|
+ "This example demonstrates a llama2 model with 7B-chat parameters, but the steps we follow would be similar for other llama models, as well as for other parameter models.\n",
|
|
|
+ "\n"
|
|
|
+ ]
|
|
|
+ },
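+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The cell below sketches these shell commands, assuming a Colab-style environment where `!` commands can read interactive input; the repo’s `download.sh` script prompts for the presigned URL from your email and for the model sizes to fetch. If the prompts don’t work in your environment, run the same commands in a terminal instead."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Clone the Llama repo and run its download script.\n",
+ "# When prompted, paste the URL from your email and choose the 7B-chat weights.\n",
+ "!git clone https://github.com/facebookresearch/llama.git\n",
+ "%cd llama\n",
+ "!bash download.sh"
+ ]
+ },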
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "#### 3. Convert the model weights\n",
|
|
|
+ "\n",
|
|
|
+ "* Create a link to the tokenizer:\n",
|
|
|
+ "Run `ln -h ./tokenizer.model ./llama-2-7b-chat/tokenizer.model` \n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "* Convert the model weights to run with Hugging Face:``TRANSFORM=`python -c \"import transformers;print('/'.join(transformers.__file__.split('/')[:-1])+'/models/llama/convert_llama_weights_to_hf.py')\"``\n",
|
|
|
+ "\n",
|
|
|
+ "* Then run: `pip install protobuf && python $TRANSFORM --input_dir ./llama-2-7b-chat --model_size 7B --output_dir ./llama-2-7b-chat-hf`\n"
|
|
|
+ ]
|
|
|
+ },
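+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "If you’d like to run the conversion from the notebook rather than a shell, the cell below sketches the same steps. It assumes the downloaded weights live in `./llama-2-7b-chat` and that `tokenizer.model` sits next to that directory in the current working directory."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Link the tokenizer into the checkpoint directory so the converter can find it\n",
+ "!ln ./tokenizer.model ./llama-2-7b-chat/tokenizer.model\n",
+ "\n",
+ "# Locate the conversion script that ships with the installed transformers package\n",
+ "TRANSFORM = !python -c \"import transformers;print('/'.join(transformers.__file__.split('/')[:-1])+'/models/llama/convert_llama_weights_to_hf.py')\"\n",
+ "\n",
+ "# Run the conversion; the Hugging Face checkpoint is written to ./llama-2-7b-chat-hf\n",
+ "!pip install protobuf\n",
+ "!python {TRANSFORM[0]} --input_dir ./llama-2-7b-chat --model_size 7B --output_dir ./llama-2-7b-chat-hf"
+ ]
+ },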
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "\n",
|
|
|
+ "#### 4. Prepare the script\n",
|
|
|
+ "Import the following necessary modules in your script: \n",
|
|
|
+ "* `LlamaForCausalLM` is the Llama 2 model class\n",
|
|
|
+ "* `LlamaTokenizer` prepares your prompt for the model to process\n",
|
|
|
+ "* `pipeline` is an abstraction to generate model outputs\n",
|
|
|
+ "* `torch` allows us to use PyTorch and specify the datatype we’d like to use."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "import torch\n",
|
|
|
+ "import transformers\n",
|
|
|
+ "from transformers import LlamaForCausalLM, LlamaTokenizer\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "model_dir = \"./llama-2-7b-chat-hf\"\n",
|
|
|
+ "model = LlamaForCausalLM.from_pretrained(model_dir)\n",
|
|
|
+ "\n",
|
|
|
+ "tokenizer = LlamaTokenizer.from_pretrained(model_dir)\n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "We need a way to use our model for inference. Pipeline allows us to specify which type of task the pipeline needs to run (`text-generation`), specify the model that the pipeline should use to make predictions (`model`), define the precision to use this model (`torch.float16`), device on which the pipeline should run (`device_map`) among various other options. \n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "pipeline = transformers.pipeline(\n",
|
|
|
+ " \"text-generation\",\n",
|
|
|
+ " model=model,\n",
|
|
|
+ " tokenizer=tokenizer,\n",
|
|
|
+ " torch_dtype=torch.float16,\n",
|
|
|
+ " device_map=\"auto\",\n",
|
|
|
+ ")"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Now we have our pipeline defined, and we need to provide some text prompts as inputs to our pipeline to use when it runs to generate responses (`sequences`). The pipeline shown in the example below sets `do_sample` to True, which allows us to specify the decoding strategy we’d like to use to select the next token from the probability distribution over the entire vocabulary. In our example, we are using top_k sampling. \n",
|
|
|
+ "\n",
|
|
|
+ "By changing `max_length`, you can specify how long you’d like the generated response to be. \n",
|
|
|
+ "Setting the `num_return_sequences` parameter to greater than one will let you generate more than one output.\n",
|
|
|
+ "\n",
|
|
|
+ "In your script, add the following to provide input, and information on how to run the pipeline:\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "#### 5. Run the example"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "sequences = pipeline(\n",
|
|
|
+ " 'I have tomatoes, basil and cheese at home. What can I cook for dinner?\\n',\n",
|
|
|
+ " do_sample=True,\n",
|
|
|
+ " top_k=10,\n",
|
|
|
+ " num_return_sequences=1,\n",
|
|
|
+ " eos_token_id=tokenizer.eos_token_id,\n",
|
|
|
+ " max_length=400,\n",
|
|
|
+ ")\n",
|
|
|
+ "for seq in sequences:\n",
|
|
|
+ " print(f\"{seq['generated_text']}\")\n"
|
|
|
+ ]
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "metadata": {
|
|
|
+ "kernelspec": {
|
|
|
+ "display_name": "Python 3",
|
|
|
+ "language": "python",
|
|
|
+ "name": "python3"
|
|
|
+ },
|
|
|
+ "language_info": {
|
|
|
+ "name": "python",
|
|
|
+ "version": "3.8.3"
|
|
|
+ }
|
|
|
+ },
|
|
|
+ "nbformat": 4,
|
|
|
+ "nbformat_minor": 2
|
|
|
+}
|