
README.md

This folder contains examples organized by topic:

| Subfolder | Description |
|---|---|
| quickstart | The "Hello World" of using Llama2; start here if you are new to using Llama2 |
| multilingual | Scripts to add a new language to Llama2 |
| finetuning | Scripts to finetune Llama2 on single-GPU and multi-GPU setups |
| inference | Scripts to deploy Llama2 for inference locally and using model servers |
| use_cases | Scripts showing common applications of Llama2 |
| responsible_ai | Scripts to use PurpleLlama for safeguarding model outputs |
| llama_api_providers | Scripts to run inference on Llama via hosted endpoints |
| benchmarks | Scripts to benchmark inference of Llama 2 models on various backends |
| code_llama | Scripts to run inference with the Code Llama models |
| evaluation | Scripts to evaluate fine-tuned Llama2 models using EleutherAI's lm-evaluation-harness |

**Note on using Replicate:** To run some of the demo apps here, you'll first need to sign in to Replicate with your GitHub account, then create a free API token here that you can use for a while. After the free trial ends, you'll need to enter billing information to continue using Llama2 hosted on Replicate. According to Replicate's run time and cost for the Llama2-13b-chat model used in our demo apps, the model "costs $0.000725 per second. Predictions typically complete within 10 seconds." This means each call to the Llama2-13b-chat model costs less than $0.01 if the call completes within 10 seconds. If you want to avoid all costs, see the section "Running Llama2 locally on Mac" above or "Running Llama2 in Google Colab" below.

**Note on using OctoAI:** You can also use OctoAI to run some of the Llama demos under OctoAI_API_examples. You can sign in to OctoAI with your Google or GitHub account, which will give you $10 of free credits usable for a month. Llama2 on OctoAI is priced at $0.00086 per 1k tokens (roughly a ~350-word LLM response), so $10 of free credits should go a very long way (about 10,000 LLM inferences).
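As a sanity check on the figures quoted above, here is a quick back-of-the-envelope calculation. The rates are the ones stated in the two notes; the assumption of ~10 seconds per Replicate prediction and ~1k tokens per OctoAI response also comes from those notes.

```python
# Rough cost estimates for the two hosted options above,
# using the pricing quoted in this README.

REPLICATE_USD_PER_SECOND = 0.000725   # Llama2-13b-chat on Replicate
OCTOAI_USD_PER_1K_TOKENS = 0.00086    # Llama2 on OctoAI
FREE_CREDITS_USD = 10.0               # OctoAI sign-up credits

# A typical Replicate prediction completes within ~10 seconds.
replicate_cost_per_call = REPLICATE_USD_PER_SECOND * 10

# Number of ~1k-token responses covered by the free OctoAI credits.
octoai_free_responses = FREE_CREDITS_USD / OCTOAI_USD_PER_1K_TOKENS

print(f"Replicate: ~${replicate_cost_per_call:.5f} per call")   # ~$0.00725, under a cent
print(f"OctoAI: ~{octoai_free_responses:,.0f} free responses")  # ~11,600, i.e. "about 10,000"
```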

Running Llama2 in Google Colab

To run Llama2 in Google Colab using llama-cpp-python, download the quantized Llama2-7b-chat model here, or follow the instructions above to build it, then upload it to your Google Drive. Note that on the free Colab T4 GPU, the call to Llama can take more than 20 minutes to return; running the notebook locally on an M1 MacBook Pro takes about 20 seconds.
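The flow above can be sketched as follows. This is a minimal illustration, not the notebook's exact code: the model filename, the Drive mount path, and the `build_llama2_chat_prompt` helper are all assumptions you should adapt to your setup.

```python
def build_llama2_chat_prompt(user_message: str,
                             system_prompt: str = "You are a helpful assistant.") -> str:
    """Wrap a user message in Llama 2's chat prompt format
    (the [INST]/<<SYS>> convention used by the chat models)."""
    return f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_message} [/INST]"

if __name__ == "__main__":
    from llama_cpp import Llama  # pip install llama-cpp-python

    # Placeholder path: assumes the quantized model file was uploaded to
    # Google Drive and the drive is mounted at /content/drive in Colab.
    llm = Llama(model_path="/content/drive/MyDrive/llama-2-7b-chat.Q4_0.gguf")

    output = llm(build_llama2_chat_prompt("Name three uses of llamas."),
                 max_tokens=256)
    print(output["choices"][0]["text"])
```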