We start by installing the necessary packages:
Note: This example uses OctoAI to host the Llama model. If you have not set up or used OctoAI before, we suggest you take a look at the HelloLlamaCloud example for information on how to set up OctoAI before continuing with this example. If you do not want to use OctoAI, you will need to make some changes to this notebook as you go along.
!pip install langchain octoai-sdk youtube-transcript-api tiktoken pytube
Let's load the YouTube video transcript using the YoutubeLoader.
from langchain.document_loaders import YoutubeLoader
loader = YoutubeLoader.from_youtube_url(
"https://www.youtube.com/watch?v=1k37OcjH7BM", add_video_info=True
)
# load the youtube video caption into Documents
docs = loader.load()
# check the docs length and content
len(docs[0].page_content), docs[0].page_content[:300]
We are using OctoAI in this example to host our Llama 2 model, so you will need an OctoAI API token. To get one, sign in to OctoAI and create an API token.
Note: After the free trial ends, you will need to enter billing info to continue to use Llama 2 hosted on OctoAI.
Alternatively, you can run Llama locally. See the HelloLlamaLocal notebook for more information.
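For reference, a minimal local setup could look like the sketch below, assuming you have downloaded a quantized Llama 2 model in GGUF format and installed llama-cpp-python (the model path is a placeholder):
from langchain.llms import LlamaCpp
llm = LlamaCpp(
    model_path="<path-to-your-llama-2-gguf-file>",  # placeholder - point this at your downloaded model file
    temperature=0.01,
    max_tokens=500,
    n_ctx=4096,  # context window of the Llama 2 models
)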
# enter your OctoAI API token, or you can use local Llama. See README for more info
from getpass import getpass
import os
OCTOAI_API_TOKEN = getpass()
os.environ["OCTOAI_API_TOKEN"] = OCTOAI_API_TOKEN
Next we call the Llama 2 model from OctoAI. In this example we will use the Llama 2 13B chat FP16 model. You can find more about Llama 2 models, and the list of models available on OctoAI at the time of writing this notebook, on the OctoAI text generation solution page.
If you are using local Llama, just set llm accordingly - see the HelloLlamaLocal notebook.
from langchain.llms.octoai_endpoint import OctoAIEndpoint
llama2_13b = "llama-2-13b-chat-fp16"
llm = OctoAIEndpoint(
endpoint_url="https://text.octoai.run/v1/chat/completions",
model_kwargs={
"model": llama2_13b,
"messages": [
{
"role": "system",
"content": "You are a helpful, respectful and honest assistant."
}
],
"max_tokens": 500,
"top_p": 1,
"temperature": 0.01
},
)
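Before building any chains, you can optionally sanity-check the endpoint with a direct call (the exact response will vary):
# quick check that the endpoint is reachable and the token works
print(llm("Please introduce yourself in one sentence."))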
Once everything is set up, we prompt Llama 2 to summarize the first 4000 characters of the transcript for us.
from langchain.prompts import ChatPromptTemplate
from langchain.chains import LLMChain
prompt = ChatPromptTemplate.from_template(
"Give me a summary of the text below: {text}?"
)
chain = LLMChain(llm=llm, prompt=prompt)
# be careful of the input text length sent to LLM
text = docs[0].page_content[:4000]
summary = chain.run(text)
# this is the summary of the first 4000 characters of the video content
print(summary)
Next we try to summarize all the content of the transcript, and we should get a RuntimeError: Your input is too long. Max input length is 4096 tokens, but you supplied 5597 tokens.
# try to get a summary of the whole content
text = docs[0].page_content
summary = chain.run(text)
print(summary)
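To see why this fails, we can get a rough token count of the transcript with tiktoken, which we installed earlier. Note that tiktoken implements OpenAI tokenizers, not Llama's, so the count is only an approximation:
import tiktoken
# approximate token count of the full transcript; Llama's own tokenizer will differ somewhat
enc = tiktoken.get_encoding("cl100k_base")
print(len(enc.encode(docs[0].page_content)))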
Let's try some workarounds to see if we can summarize the entire transcript without running into the RuntimeError. We will use LangChain's load_summarize_chain and play around with the chain_type.
from langchain.chains.summarize import load_summarize_chain
# see https://python.langchain.com/docs/use_cases/summarization for more info
chain = load_summarize_chain(llm, chain_type="stuff") # other supported methods are map_reduce and refine
chain.run(docs)
# same RuntimeError: Your input is too long; stuff only works for shorter text with input length <= 4096 tokens
chain = load_summarize_chain(llm, chain_type="refine")
# still get the "RuntimeError: Your input is too long. Max input length is 4096 tokens"
chain.run(docs)
Since the transcript is bigger than the model can handle, we can split the transcript into chunks instead and use the refine chain_type to iteratively create an answer.
from langchain.text_splitter import RecursiveCharacterTextSplitter
# we need to split the long input text
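# chunk_size is in tokens here; 3000 leaves headroom for the prompt within the model's 4096-token limit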
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
chunk_size=3000, chunk_overlap=0
)
split_docs = text_splitter.split_documents(docs)
# check the split docs lengths
len(split_docs), len(docs), len(split_docs[0].page_content), len(docs[0].page_content)
# now get the summary of the whole docs - the whole youtube content
chain = load_summarize_chain(llm, chain_type="refine")
print(str(chain.run(split_docs)))
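If you are curious how refine builds the summary chunk by chunk, you can ask the chain to return its intermediate steps as well as the final answer (a sketch using the return_intermediate_steps option):
# return the running summary produced at each refine step, not just the final one
chain = load_summarize_chain(llm, chain_type="refine", return_intermediate_steps=True)
result = chain({"input_documents": split_docs})
for i, step in enumerate(result["intermediate_steps"]):
    print(f"--- refine step {i} ---\n{step}\n")
print(result["output_text"])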
You can also use the map_reduce chain_type to implement a map-reduce-like architecture while summarizing the documents.
# another method is map_reduce
chain = load_summarize_chain(llm, chain_type="map_reduce")
print(str(chain.run(split_docs)))
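load_summarize_chain also lets you customize the prompts used for the map and combine stages; here is a sketch with our own prompt wording:
from langchain.prompts import PromptTemplate
# prompt applied to each chunk (the "map" stage)
map_prompt = PromptTemplate.from_template(
    "Write a concise summary of this portion of a video transcript:\n{text}"
)
# prompt used to merge the per-chunk summaries (the "reduce" stage)
combine_prompt = PromptTemplate.from_template(
    "Combine the following partial summaries into a single coherent summary:\n{text}"
)
chain = load_summarize_chain(
    llm, chain_type="map_reduce", map_prompt=map_prompt, combine_prompt=combine_prompt
)
print(chain.run(split_docs))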
To investigate further, let's turn on LangChain's debug mode to get an idea of how many calls are made to the model and the details of the inputs and outputs.
We will then run our summary using the stuff and refine chain_types and take a look at our output.
# to find how many calls to Llama have been made and the details of inputs and outputs of each call, set langchain to debug
import langchain
langchain.debug = True
# stuff method will cause the error in the end
chain = load_summarize_chain(llm, chain_type="stuff")
chain.run(split_docs)
# but refine works
chain = load_summarize_chain(llm, chain_type="refine")
chain.run(split_docs)
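When you are done inspecting the calls, remember to turn debug mode back off:
# restore normal logging
langchain.debug = False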
As you can see, stuff fails because it tries to treat all the split documents as one and "stuffs" them into a single prompt, which is much larger than Llama 2 can handle, while refine iteratively runs over the documents, updating its answer as it goes.