
Update inference.md (#8)

Geeta Chauhan 1 year ago
parent
Commit
d9d45335f1
5 changed files with 32 additions and 6 deletions
  1. README.md (+2 -2)
  2. docs/Dataset.md (+1 -1)
  3. docs/FAQ.md (+9 -0)
  4. inference/inference.md (+9 -3)
  5. inference/README.md (+11 -0)

+ 2 - 2
README.md

@@ -11,7 +11,7 @@ Llama 2 is a new technology that carries potential risks with use. Testing condu
     - [Single GPU](#single-gpu)
     - [Multi GPU One Node](#multiple-gpus-one-node)
     - [Multi GPU Multi Node](#multi-gpu-multi-node)
-3. [Inference](./inference/inference.md)
+3. [Inference](./docs/inference.md)
 4. [Model Conversion](#model-conversion-to-hugging-face)
 5. [Repository Organization](#repository-organization)
 6. [License and Acceptable Use Policy](#license)
@@ -32,7 +32,7 @@ Llama 2 is a new technology that carries potential risks with use. Testing condu
 * [Multi-GPU Fine-tuning](./docs/mutli_gpu.md)
 * [LLM Fine-tuning](./docs/LLM_finetuning.md)
 * [Adding custom datasets](./docs/Dataset.md)
-* [Inference](./inference/inference.md)
+* [Inference](./docs/inference.md)
 * [FAQs](./docs/FAQ.md)
 
 ## Requirements

+ 1 - 1
docs/Dataset.md

@@ -62,7 +62,7 @@ Below we list other datasets and their main use cases that can be used for fine
 
 ### Bias evaluation
 - [Crows_pair](https://huggingface.co/datasets/crows_pairs) gender bias
-- [WinoGender] gender bias
+- WinoGender gender bias
 
 ### Useful Links
 More information on evaluation dataset can be found in [HELM](https://crfm.stanford.edu/helm/latest/)

+ 9 - 0
docs/FAQ.md

@@ -17,3 +17,12 @@ Here we discuss frequently asked questions that may occur and we found useful al
 4. Can I add custom datasets?
 
     Yes, you can find more information on how to do that [here](Dataset.md).
+
+5. What are the hardware SKU requirements for deploying these models?
+
+    Hardware requirements vary based on latency, throughput and cost constraints. For good latency, the models were split across multiple GPUs with tensor parallelism in a machine with NVIDIA A100s or H100s. But TPUs, other types of GPUs like A10G, T4, L4, or even commodity hardware can also be used to deploy these models (e.g. https://github.com/ggerganov/llama.cpp).
+    If working on a CPU, it is worth looking at this [blog post](https://www.intel.com/content/www/us/en/developer/articles/news/llama2.html) from Intel for an idea of Llama 2's performance on a CPU.
+
+6. What are the hardware SKU requirements for fine-tuning Llama pre-trained models?
+
+    Fine-tuning requirements vary based on amount of data, time to complete fine-tuning and cost constraints. To fine-tune these models we have generally used multiple NVIDIA A100 machines with data parallelism across nodes and a mix of data and tensor parallelism intra node. But using a single machine, or other GPU types like NVIDIA A10G or H100 are definitely possible (e.g. alpaca models are trained on a single RTX4090: https://github.com/tloen/alpaca-lora).
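For orientation (not part of this commit), the mix of parallelism mentioned in item 6 is usually determined by how the job is launched. A rough multi-node sketch with torchrun is shown below; the `finetuning.py` entry point and its arguments are placeholders, not a script confirmed by this diff:

``` bash
# Sketch only: 2 nodes x 8 GPUs; torchrun starts one process per GPU,
# giving data parallelism across the resulting ranks.
# "finetuning.py" and its flags are placeholders for whichever training script you use.
torchrun --nnodes 2 --nproc_per_node 8 \
    --rdzv_backend c10d --rdzv_endpoint $MASTER_ADDR:29500 \
    finetuning.py --model_name <PATH/TO/MODEL/7B> --output_dir <PATH/TO/SAVE>
```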

+ 9 - 3
inference/inference.md

@@ -1,6 +1,6 @@
 # Inference
 
-For inference we have provided an [inference script](inference.py). Depending on the type of finetuning performed during training the [inference script](inference.py) takes different arguments.
+For inference we have provided an [inference script](../inference/inference.py). Depending on the type of finetuning performed during training the [inference script](../inference/inference.py) takes different arguments.
 To finetune all model parameters the output dir of the training has to be given as --model_name argument.
 In the case of a parameter efficient method like lora the base model has to be given as --model_name and the output dir of the training has to be given as --peft_model argument.
 Additionally, a prompt for the model in the form of a text file has to be provided. The prompt file can either be piped through standard input or given as --prompt_file parameter.
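For orientation (not part of the diff), the argument combinations described above map to invocations like the following; all paths are placeholders:

``` bash
# Fully fine-tuned model: pass the training output directory as --model_name.
python inference.py --model_name <TRAINING/OUTPUT/DIR> --prompt_file prompt.txt

# PEFT method such as LoRA: base model as --model_name, adapter output dir as --peft_model.
python inference.py --model_name <BASE/MODEL/PATH> --peft_model <PEFT/OUTPUT/DIR> --prompt_file prompt.txt
```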
@@ -41,6 +41,12 @@ Alternate inference options include:
 
 [**vLLM**](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html):
 To use vLLM you will need to install it using the instructions [here](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#installation).
-Once installed, you can use the vLLM_ineference.py script provided [here](vLLM_inference.py).
+Once installed, you can use the vLLM_ineference.py script provided [here](../inference/vLLM_inference.py).
 
-[**TGI**](https://github.com/huggingface/text-generation-inference): Text Generation Inference (TGI) is another inference option available to you. For more information on how to set up and use TGI see [here](https://github.com/huggingface/text-generation-inference).
+Below is an example of how to run the vLLM_inference.py script found within the inference folder.
+
+``` bash
+python vLLM_inference.py --model_name <PATH/TO/MODEL/7B>
+```
+
+[**TGI**](https://github.com/huggingface/text-generation-inference): Text Generation Inference (TGI) is another inference option available to you. For more information on how to set up and use TGI see [here](../inference/hf-text-generation-inference/README.md).

+ 11 - 0
inference/README.md

@@ -0,0 +1,11 @@
+# Inference
+
+This folder contains inference examples for Llama 2. So far, we have provided support for three methods of inference:
+
+1. [inference script](inference.py) script provides support for Hugging Face accelerate and PEFT fine tuned models.
+
+2. [vLLM_inference.py](vLLM_inference.py) script takes advantage of vLLM's paged attention concept for low latency.
+
+3. The [hf-text-generation-inference](hf-text-generation-inference/README.md) folder contains information on Hugging Face Text Generation Inference (TGI).
+
+For more in depth information on inference including inference safety checks and examples, see the inference documentation [here](../docs/inference.md).
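As a quick usage note (paths are placeholders), the inference documentation also allows the prompt to be piped through standard input instead of passing --prompt_file:

``` bash
# Pipe a prompt file into the Hugging Face/PEFT inference script via stdin.
cat <PATH/TO/PROMPT.txt> | python inference.py --model_name <PATH/TO/MODEL/7B>
```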