Merge branch 'checkpoint_handler_path_fix' of https://github.com/facebookresearch/llama-recipes into checkpoint_handler_path_fix

Hamid Shojanazeri · 1 year ago · commit 017cadd04b

+ 79 - 0
.github/ISSUE_TEMPLATE/bug.yml

@@ -0,0 +1,79 @@
+name: 🐛 Bug Report
+description: Create a report to help us reproduce and fix the bug
+
+body:
+  - type: markdown
+    attributes:
+      value: >
+        #### Before submitting a bug, please make sure the issue hasn't already been addressed by searching through [the
+        existing and past issues](https://github.com/facebookresearch/llama-recipes/issues) and the [FAQ](https://github.com/facebookresearch/llama-recipes/blob/main/docs/FAQ.md).
+
+  - type: textarea
+    id: system-info
+    attributes:
+      label: System Info
+      description: |
+        Please share your system info with us. You can use the following command to capture your environment information:
+        python -m torch.utils.collect_env
+
+      placeholder: | 
+        PyTorch version, CUDA version, GPU type, number of GPUs...
+    validations:
+      required: true
+
+  - type: checkboxes
+    id: information-scripts-examples
+    attributes:
+      label: Information
+      description: 'The problem arises when using:'
+      options:
+        - label: "The official example scripts"
+        - label: "My own modified scripts"
+
+  - type: textarea
+    id: bug-description
+    attributes:
+      label: 🐛 Describe the bug
+      description: |
+        Please provide a clear and concise description of what the bug is.
+
+        Provide the exact command(s) that you ran with the settings, e.g. using FSDP and PEFT or pure FSDP.
+        
+        Please also paste or describe the results you observe instead of the expected results. 
+      placeholder: |
+        A clear and concise description of what the bug is.
+        
+        ```python
+        # Command that you used for running the examples
+        ```
+        Description of the results
+    validations:
+      required: true
+
+  - type: textarea
+    attributes:
+      label: Error logs
+      description: |
+       If you observe an error, please paste the error message including the **full** traceback of the exception. It may help to wrap error messages in ```` ```triple backtick blocks``` ````.
+
+      placeholder: |
+        ```
+        The error message you got, with the full traceback.
+        ```
+
+    validations:
+      required: true
+
+  
+  - type: textarea
+    id: expected-behavior
+    validations:
+      required: true
+    attributes:
+      label: Expected behavior
+      description: "A clear and concise description of what you would expect to happen."
+
+  - type: markdown
+    attributes:
+      value: >
+        Thanks for contributing 🎉!

+ 31 - 0
.github/ISSUE_TEMPLATE/feature-request.yml

@@ -0,0 +1,31 @@
+name: 🚀 Feature request
+description: Submit a proposal/request for a new llama-recipes feature
+
+body:
+- type: textarea
+  id: feature-pitch
+  attributes:
+    label: 🚀 The feature, motivation and pitch
+    description: >
+      A clear and concise description of the feature proposal. Please outline the motivation for the proposal. Is your feature request related to a specific problem? e.g., *"I'm working on X and would like Y to be possible"*. If this is related to another GitHub issue, please link here too.
+  validations:
+    required: true
+
+- type: textarea
+  id: alternatives
+  attributes:
+    label: Alternatives
+    description: >
+      A description of any alternative solutions or features you've considered, if any.
+
+- type: textarea
+  id: additional-context
+  attributes:
+    label: Additional context
+    description: >
+      Add any other context or screenshots about the feature request.
+
+- type: markdown
+  attributes:
+    value: >
+      Thanks for contributing 🎉!

+ 38 - 0
.github/PULL_REQUEST_TEMPLATE.md

@@ -0,0 +1,38 @@
+# What does this PR do?
+
+<!--
+Congratulations! You've made it this far! You're not quite done yet though.
+
+Please include a good title that fully reflects the extent of your awesome contribution.
+
+Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change.
+
+-->
+
+<!-- Remove if not applicable -->
+
+Fixes # (issue)
+
+
+## Feature/Issue validation/testing
+
+Please describe the tests that you ran to verify your changes and summarize the relevant results. Provide instructions so the tests can be reproduced.
+Please also list any relevant details for your test configuration.
+
+- [ ] Test A
+Logs for Test A
+
+- [ ] Test B
+Logs for Test B
+
+
+## Before submitting
+- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
+- [ ] Did you read the [contributor guideline](https://github.com/facebookresearch/llama-recipes/blob/main/CONTRIBUTING.md#pull-requests),
+      Pull Request section?
+- [ ] Was this discussed/approved via a GitHub issue? Please add a link
+      to it if that's the case.
+- [ ] Did you make sure to update the documentation with your changes?  
+- [ ] Did you write any new necessary tests?
+
+Thanks for contributing 🎉!

+ 66 - 0
.github/workflows/spellcheck.yml

@@ -0,0 +1,66 @@
+name: SpellCheck
+
+on:
+  push:
+    branches:
+      - main
+  pull_request:
+    branches:
+      - main
+jobs:
+  build:
+    runs-on: ubuntu-20.04
+    name: Lint changed files
+    steps:
+      - uses: actions/checkout@v3
+        with:
+          fetch-depth: 0  # OR "2" -> To retrieve the preceding commit.
+
+      - name: Check links in all markdown files
+        uses: gaurav-nelson/github-action-markdown-link-check@1.0.13
+        with:
+          use-verbose-mode: 'yes'
+          config-file: "scripts/markdown_link_check_config.json"
+
+      - name: Get changed files
+        id: changed-files
+        uses: tj-actions/changed-files@v29.0.4
+        with:
+
+          files: |
+            **/*.py
+
+  spellcheck:
+    runs-on: ubuntu-20.04
+    steps:
+      - uses: actions/checkout@v3
+
+      - name: Install dependencies
+        run: |
+          sudo apt-get install aspell aspell-en
+          pip install pyspelling
+
+      - name: Get changed files
+        id: changed-files
+        uses: tj-actions/changed-files@v29.0.4
+        with:
+          files: |
+            **/*.md
+
+      - name: Check spellings
+        run: |
+          sources=""
+          for file in ${{ steps.changed-files.outputs.all_changed_files }}; do
+            sources="${sources} -S $file"
+          done
+          if [ ! "$sources" ]; then
+            echo "No files to spellcheck"
+          else
+            pyspelling -c $GITHUB_WORKSPACE/scripts/spellcheck_conf/spellcheck.yaml --name Markdown $sources
+          fi
+
+      - name: In the case of misspellings
+        if: ${{ failure() }}
+        run: |
+          echo "Please fix the misspellings. If you are sure about some of them, "
+          echo "so append those to scripts/spellcheck_conf/wordlist.txt"

+ 1 - 1
configs/fsdp.py

@@ -13,7 +13,7 @@ class fsdp_config:
     sharding_strategy: ShardingStrategy = ShardingStrategy.FULL_SHARD
     checkpoint_type: StateDictType = StateDictType.SHARDED_STATE_DICT  # alternatively can use SHARDED_STATE_DICT save one file per rank, and can resize the world-size.
     fsdp_activation_checkpointing: bool=True
-    pure_bf16: bool = True
+    pure_bf16: bool = False
     optimizer: str= "AdamW"
 
 

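For readers unfamiliar with the `SHARDED_STATE_DICT` setting referenced in the diff above: as the comment notes, it saves one checkpoint file per rank and allows resuming with a different world size. Below is a minimal sketch of the general saving pattern using PyTorch's distributed checkpoint API; `model` and `checkpoint_dir` are placeholders, and this is not the repo's actual checkpoint_handler code.

```python
import torch.distributed.checkpoint as dist_cp
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType

def save_sharded_checkpoint(model, checkpoint_dir):
    # Each rank writes its own shard of the model state; the checkpoint can
    # later be loaded on a different number of ranks (world size).
    with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
        state_dict = {"model": model.state_dict()}
        dist_cp.save_state_dict(
            state_dict=state_dict,
            storage_writer=dist_cp.FileSystemWriter(checkpoint_dir),
        )
```
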
+ 2 - 2
docs/Dataset.md

@@ -10,7 +10,7 @@ The provided fine tuning script allows you to select between three datasets by p
 
 The list of available datasets can easily be extended with custom datasets by following these instructions.
 
-Each dataset has a corresponding configuration (dataclass) in [configs/dataset.py](../configs/dataset.py) which contains the dataset name, training/validation split names, as well as optional parameters like datafiles etc.
+Each dataset has a corresponding configuration (dataclass) in [configs/datasets.py](../configs/datasets.py) which contains the dataset name, training/validation split names, as well as optional parameters like datafiles etc.
 
 Additionally, there is a preprocessing function for each dataset in the [ft_datasets](../ft_datasets) folder.
 The returned data of the dataset needs to be consumable by the forward method of the fine-tuned model by calling ```model(**data)```.
@@ -18,7 +18,7 @@ For CausalLM models this usually means that the data needs to be in the form of
 
 To add a custom dataset the following steps need to be performed.
 
-1. Create a dataset configuration after the schema described above. Examples can be found in [configs/dataset.py](../configs/dataset.py).
+1. Create a dataset configuration after the schema described above. Examples can be found in [configs/datasets.py](../configs/datasets.py).
 2. Create a preprocessing routine which loads the data and returns a PyTorch style dataset. The signature for the preprocessing function needs to be (dataset_config, tokenizer, split_name) where split_name will be the string for train/validation split as defined in the dataclass.
 3. Register the dataset name and preprocessing function by inserting it as key and value into the DATASET_PREPROC dictionary in [utils/dataset_utils.py](../utils/dataset_utils.py)
 4. Set dataset field in training config to dataset name or use --dataset option of the llama_finetuning.py training script.

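As an illustration of step 2 in the Dataset.md instructions above, here is a minimal sketch of a custom preprocessing routine with the `(dataset_config, tokenizer, split_name)` signature. The function name, the `dataset_path` config field, and the `"text"` column are hypothetical examples, not part of the repo.

```python
import datasets  # Hugging Face datasets library

def get_my_custom_dataset(dataset_config, tokenizer, split_name):
    # split_name is the train/validation split string defined in the dataclass.
    raw = datasets.load_dataset(dataset_config.dataset_path, split=split_name)

    def tokenize(sample):
        # Produce input_ids/attention_mask/labels so that model(**data)
        # works for a CausalLM model, as described above.
        tokens = tokenizer(sample["text"])
        tokens["labels"] = tokens["input_ids"].copy()
        return tokens

    return raw.map(tokenize, remove_columns=list(raw.features))
```

Such a routine would then be registered in the DATASET_PREPROC dictionary in utils/dataset_utils.py, as described in step 3.
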
+ 3 - 0
scripts/markdown_link_check_config.json

@@ -19,6 +19,9 @@
     },
     {
       "pattern": "^http(s)?://localhost.*"
+    },
+    {
+      "pattern": "https://www.intel.com/content/www/us/en/developer/articles/news/llama2.html"
     }
   ]
 }

+ 1 - 0
utils/memory_utils.py

@@ -50,6 +50,7 @@ class MemoryTrace:
         self.end = byte2gb(torch.cuda.memory_allocated())
         self.peak = byte2gb(torch.cuda.max_memory_allocated())
         cuda_info = torch.cuda.memory_stats()
+        self.peak_active_gb = byte2gb(cuda_info["active_bytes.all.peak"])
         self.cuda_malloc_retires = cuda_info.get("num_alloc_retries", 0)
         self.peak_active_gb = byte2gb(cuda_info["active_bytes.all.peak"])
         self.m_cuda_ooms = cuda_info.get("num_ooms", 0)

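For context on the `peak_active_gb` field added above: `torch.cuda.memory_stats()` exposes the `"active_bytes.all.peak"`, `"num_alloc_retries"` and `"num_ooms"` counters read here. A standalone sketch is below; the `byte2gb` helper is a re-implementation for illustration and is only assumed to match the one in utils/memory_utils.py.

```python
import torch

def byte2gb(x):
    # Bytes-to-gigabytes conversion, assumed to mirror the helper in utils/memory_utils.py.
    return round(x / 2**30, 2)

if torch.cuda.is_available():
    stats = torch.cuda.memory_stats()
    # "active_bytes.all.peak" is the counter the new peak_active_gb field reads.
    print(f"Peak active memory: {byte2gb(stats['active_bytes.all.peak'])} GB")
    print(f"cudaMalloc retries: {stats.get('num_alloc_retries', 0)}")
    print(f"CUDA OOM events:    {stats.get('num_ooms', 0)}")
```
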
+ 0 - 1
utils/train_utils.py

@@ -130,7 +130,6 @@ def train(model, train_dataloader,eval_dataloader, tokenizer, optimizer, lr_sche
             print(f"Cuda Malloc retires : {memtrace.cuda_malloc_retires}")
             print(f"CPU Total Peak Memory consumed during the train (max): {memtrace.cpu_peaked + memtrace.cpu_begin} GB")
 
-        
         # Update the learning rate as needed
         lr_scheduler.step()