## This demo app shows:
* How to use LangChain's YoutubeLoader to retrieve the captions of a YouTube video
* How to ask Llama to summarize the video content (up to Llama's input size limit) in a naive way using LangChain's `stuff` method
* How to work around Llama's max input token limit using LangChain's more sophisticated `map_reduce` and `refine` methods - see [here](https://python.langchain.com/docs/use_cases/summarization) for more info
We start by installing the necessary packages:
- [youtube-transcript-api](https://pypi.org/project/youtube-transcript-api/) to get the transcript/subtitles of a YouTube video
- [langchain](https://python.langchain.com/docs/get_started/introduction) provides necessary RAG tools for this demo
- [tiktoken](https://github.com/openai/tiktoken) BytePair Encoding tokenizer
- [pytube](https://pytube.io/en/latest/) Utility for downloading YouTube videos
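A one-line install cell covering all four (versions are not pinned here; pin them if you need reproducibility):

```python
# Install the dependencies used in this demo (run once per environment)
!pip install youtube-transcript-api langchain tiktoken pytube
```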
**Note** This example uses OctoAI to host the Llama model. If you have not set up or used OctoAI before, we suggest you take a look at the [HelloLlamaCloud](HelloLlamaCloud.ipynb) example for information on how to set up OctoAI before continuing with this example.
If you do not want to use OctoAI, you will need to make some changes to this notebook as you go along.
Let's load the YouTube video transcript using the YoutubeLoader.
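A minimal sketch (the URL below is just a placeholder; `add_video_info=True` is what pulls in `pytube`):

```python
from langchain.document_loaders import YoutubeLoader

# Placeholder URL; replace with the video you want to summarize
loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=<video_id>", add_video_info=True
)
docs = loader.load()  # a list of Documents; the transcript is in docs[0].page_content
```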
We are using OctoAI in this example to host our Llama 2 model, so you will need to get an OctoAI token.
To get the OctoAI token:
- You will need to first sign in to OctoAI with your GitHub account
- Then create a free API token [here](https://octo.ai/docs/getting-started/how-to-create-an-octoai-access-token) that you can use for a while (a month or $10 in OctoAI credits, whichever runs out first)
**Note** After the free trial ends, you will need to enter billing info to continue using Llama 2 hosted on OctoAI.
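Once you have the token, one way to provide it without hard-coding it in the notebook is via an environment variable; we assume here that the LangChain OctoAI wrapper picks up `OCTOAI_API_TOKEN`, so verify this against the version you have installed:

```python
import os
from getpass import getpass

# Paste your OctoAI API token when prompted; it stays out of the notebook source
os.environ["OCTOAI_API_TOKEN"] = getpass("OctoAI API token: ")
```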
Alternatively, you can run Llama locally; see [HelloLlamaLocal](HelloLlamaLocal.ipynb) for details on how to do that.
Next we call the Llama 2 model from OctoAI. In this example we will use the Llama 2 13B chat FP16 model, instantiated as sketched after the list below. You can find more on Llama 2 models on the [OctoAI text generation solution page](https://octoai.cloud/tools/text).
At the time of writing this notebook the following Llama models are available on OctoAI:
* llama-2-13b-chat-fp16
* llama-2-70b-chat-fp16
* codellama-7b-instruct-fp16
* codellama-13b-instruct-fp16
* codellama-34b-instruct-fp16
* codellama-70b-instruct-fp16
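A sketch of setting up `llm` with LangChain's `OctoAIEndpoint` wrapper; the endpoint URL and `model_kwargs` below are assumptions based on OctoAI's chat-completions API at the time of writing, so double-check them against the OctoAI docs:

```python
from langchain.llms.octoai_endpoint import OctoAIEndpoint

llm = OctoAIEndpoint(
    endpoint_url="https://text.octoai.run/v1/chat/completions",
    model_kwargs={
        "model": "llama-2-13b-chat-fp16",
        "messages": [
            {"role": "system",
             "content": "You are a helpful, respectful and honest assistant."}
        ],
        "max_tokens": 500,
        "temperature": 0.01,
    },
)
```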
If you are using a local Llama model, just set `llm` accordingly; see the [HelloLlamaLocal notebook](HelloLlamaLocal.ipynb).
Once everything is set up, we prompt Llama 2 to summarize the first 4000 characters of the transcript for us.
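A minimal sketch using a simple prompt with `LLMChain` (the 4000-character cut-off is a crude way to stay under the 4096-token input limit):

```python
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

prompt = PromptTemplate.from_template("Give me a summary of the text below: {text}")
chain = LLMChain(llm=llm, prompt=prompt)

# Naively truncate the transcript so the prompt fits in the context window
text = docs[0].page_content[:4000]
print(chain.run(text))
```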
Next, if we try to summarize the entire transcript in one call, we get a `RuntimeError: Your input is too long. Max input length is 4096 tokens, but you supplied 5597 tokens.`
Let's try some workarounds to see if we can summarize the entire transcript without running into the `RuntimeError`.
We will use LangChain's `load_summarize_chain` and experiment with its `chain_type` parameter.
Since the transcript is larger than the model can handle, we can split it into chunks and use the [`refine`](https://python.langchain.com/docs/modules/chains/document/refine) `chain_type` to iteratively build up an answer. You can also use the [`map_reduce`](https://python.langchain.com/docs/modules/chains/document/map_reduce) `chain_type` to summarize the chunks independently and then combine the partial summaries, map-reduce style.
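A sketch of both approaches; the `chunk_size` of 3000 characters is an assumption chosen to keep each chunk comfortably under the token limit:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain

# Split the transcript into non-overlapping chunks the model can handle
text_splitter = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=0)
split_docs = text_splitter.split_documents(docs)

# refine: summarize the first chunk, then iteratively refine with each next chunk
refine_chain = load_summarize_chain(llm, chain_type="refine")
print(refine_chain.run(split_docs))

# map_reduce: summarize each chunk independently, then combine the summaries
map_reduce_chain = load_summarize_chain(llm, chain_type="map_reduce")
print(map_reduce_chain.run(split_docs))
```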
To investigate further, let's turn on LangChain's debug mode to get an idea of how many calls are made to the model and the details of the inputs and outputs.
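Debug mode can be toggled globally:

```python
import langchain

# Log every LLM call with its full inputs and outputs
langchain.debug = True
```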
We will then run our summarization with the `stuff` and `refine` chain types and take a look at the output.
As you can see, `stuff` fails because it treats all the split documents as one and "stuffs" them into a single prompt, which is larger than Llama 2 can handle, while `refine` iterates over the documents, updating its answer as it goes.