## Running Llama2 on Mac
This notebook shows how to set up and run Llama2 locally on a Mac using llama-cpp-python and a Llama2 model quantized with llama.cpp. It also shows how to use LangChain to ask Llama general questions.

### Steps at a glance:
1. Use CMAKE and install required packages
2. Request download of model weights from the Llama website
3. Clone the llama repo and get the weights
4. Clone the llamacpp repo and quantize the model
5. Prepare the script
6. Run the example


<br>

#### 1. Use CMAKE and install required packages

Type the following command:

In [None]:
# CMAKE_ARGS="-DLLAMA_METAL=on" enables Metal (Apple GPU) support when building llama-cpp-python,
# and FORCE_CMAKE=1 forces CMake to be used as the build system.
!CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python

# Install the other libraries used in this demo:
!pip install pypdf sentence-transformers chromadb langchain

If you are not running in a Jupyter notebook, run the commands without the leading `!`.

A brief look at the installed libraries:
- [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) provides simple Python bindings for the [llama.cpp](https://github.com/ggerganov/llama.cpp) library
- pypdf gives us the ability to work with PDFs
- sentence-transformers for text embeddings
- chromadb gives us database capabilities 
- langchain provides necessary RAG tools for this demo
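
Before moving on, it can help to confirm that the packages above are importable. The following is a minimal sketch, not part of the original notebook: the helper name `check_installed` is hypothetical, and the mapping reflects the fact that some pip package names differ from their import names (for example, `llama-cpp-python` is imported as `llama_cpp`).

```python
from importlib.util import find_spec

# pip package name -> Python import name (these differ for some packages).
REQUIRED_MODULES = {
    "llama-cpp-python": "llama_cpp",
    "pypdf": "pypdf",
    "sentence-transformers": "sentence_transformers",
    "chromadb": "chromadb",
    "langchain": "langchain",
}


def check_installed(modules):
    """Return {pip_name: bool} saying whether each import name can be found.

    Hypothetical helper for this sketch; it only looks the module up and
    does not actually import it.
    """
    return {
        pip_name: find_spec(import_name) is not None
        for pip_name, import_name in modules.items()
    }


for name, ok in check_installed(REQUIRED_MODULES).items():
    print(f"{name}: {'installed' if ok else 'MISSING'}")
```

Any package reported as `MISSING` can be installed again with the `pip install` commands from this step.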

<br>

#### 2. Request download of model weights from the Llama website
Before you can run the model locally, you need the model weights. To get them, visit the [Llama website](https://llama.meta.com/) and click on “download models”. 
Fill in the required information, select the “Llama 2 & Llama Chat” models, and accept the terms and conditions. You will receive a URL by email shortly afterwards.


<br>

#### 3. Clone the llama repo and get the weights
Git clone the [Llama repo](https://github.com/facebookresearch/llama.git). When prompted during the download, enter the URL you received by email and get the 13B weights. This example demonstrates a Llama2 model with 13B parameters, but the steps are similar for other Llama models and other parameter sizes.



<br>

#### 4. Clone the llamacpp repo and quantize the model
* Git clone the [Llamacpp repo](https://github.com/ggerganov/llama.cpp). 
* Enter the repo:
`cd llama.cpp`
* Install requirements:
`python3 -m pip install -r requirements.txt`
* Convert the weights:
`python3 convert.py <path_to_your_downloaded_llama-2-13b_model>`
* Run make to build the `quantize` executable that we will use in the next step:
`make`
* Quantize the weights:
`./quantize <path_to_your_downloaded_llama-2-13b_model>/ggml-model-f16.gguf <path_to_your_downloaded_llama-2-13b_model>/ggml-model-q4_0.gguf q4_0`
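
To see why the quantization step matters on a Mac, here is a rough back-of-the-envelope size estimate. It is a sketch under stated assumptions, not a measurement: it assumes a nominal 13 billion parameters, and it assumes the `q4_0` format stores weights in blocks of 32 with one 16-bit scale per block (18 bytes per 32 weights, or 4.5 bits per weight). Real files differ somewhat because of metadata and tensors kept at higher precision.

```python
def estimated_size_gb(n_params, bits_per_weight):
    """Approximate model size in GB (1 GB = 1e9 bytes) for a given precision."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 13e9  # assumption: nominal parameter count of the 13B model

F16_BITS = 16  # the converted f16 model uses 16 bits per weight

# Assumption: a q4_0 block holds 32 weights as a 2-byte scale plus
# 16 bytes of packed 4-bit values, i.e. 18 bytes per 32 weights.
Q4_0_BITS = (2 + 16) * 8 / 32  # 4.5 bits per weight

print(f"f16 : ~{estimated_size_gb(N_PARAMS, F16_BITS):.1f} GB")
print(f"q4_0: ~{estimated_size_gb(N_PARAMS, Q4_0_BITS):.1f} GB")
```

Under these assumptions the quantized file is a bit more than a quarter of the f16 size, which is what makes the 13B model practical to load in the memory of a typical laptop.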


#### 5. Prepare the script


In [None]:
# LangChain wrapper around llama-cpp-python, used to load and run the Llama model
from langchain.llms import LlamaCpp

# defines a chain of operations that can be performed on text input to generate the output using the LLM
from langchain.chains import LLMChain

# manages callbacks that are triggered at various stages during the execution of an LLMChain
from langchain.callbacks.manager import CallbackManager

# defines a callback that streams the output of the LLMChain to the console in real-time as it gets generated
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

# lets you define prompt templates that can be used to generate custom inputs for the LLM
from langchain.prompts import PromptTemplate


# Initialize the LangChain CallbackManager. It handles callbacks from LangChain; in this example we use it
# for token-wise streaming, so you will see the answer generated token by token as Llama responds
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

# Set up the model
llm = LlamaCpp(
    model_path="<path-to-llama-gguf-file>",
    temperature=0.0,
    top_p=1,
    n_ctx=6000,
    callback_manager=callback_manager, 
    verbose=True,
)
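
A wrong `model_path` is a common source of failures at this step. The sketch below is an optional addition, not part of the original notebook: `resolve_model_path` is a hypothetical helper that expands `~`, checks that the file exists, and checks that it has the `.gguf` extension produced in step 4, so that a bad path fails early with a clear message.

```python
from pathlib import Path


def resolve_model_path(raw_path):
    """Return an absolute path to the model file, or raise if it looks wrong.

    Hypothetical helper for this sketch: it validates the path only and
    does not inspect the file contents.
    """
    path = Path(raw_path).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"No model file found at {path}")
    if path.suffix != ".gguf":
        raise ValueError(f"Expected a .gguf file, got {path.name}")
    return str(path)
```

It could then be used when constructing the model, for example `model_path=resolve_model_path("<path-to-llama-gguf-file>")`.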

#### 6. Run the example

With the model set up, you are now ready to ask some questions. 

Here is an example of the simplest way to ask the model some general questions.

In [None]:
# Run the example
question = "who wrote the book Pride and Prejudice?"
answer = llm(question)

Alternatively, you can use LangChain's `PromptTemplate` for more flexibility in your prompts and questions. For more information, see the [LangChain prompt template documentation](https://python.langchain.com/docs/modules/model_io/prompts/prompt_templates/).

In [None]:
prompt = PromptTemplate.from_template(
    "who wrote {book}?"
)
chain = LLMChain(llm=llm, prompt=prompt)
answer = chain.run("A tale of two cities")
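
Conceptually, the template above fills the `{book}` placeholder with the value passed to the chain before the prompt is sent to the model. The plain-Python sketch below illustrates only that substitution step; it is not LangChain's implementation, and the `fill` helper is hypothetical.

```python
template = "who wrote {book}?"


def fill(template, **values):
    """Substitute named placeholders in the template (illustration only)."""
    return template.format(**values)


# The same template can be reused for different questions.
for book in ["A tale of two cities", "Pride and Prejudice"]:
    print(fill(template, book=book))
```

Reusing one template this way keeps the question wording consistent while only the variable part changes.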