|
@@ -46,7 +46,7 @@ In this scenario depending on the model size, you might need to go beyond one GP
|
|
|
The way to think about it is that you need enough GPU memory to hold the model parameters, gradients, and optimizer states. Each of these can take up a multiple of your parameter count times the precision you are training in (fp32 is 4 bytes per value, fp16 and bf16 are 2 bytes).
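
As a rough back-of-the-envelope calculation, the sketch below adds up these three contributions for a hypothetical 7B-parameter model trained in bf16 with fp32 AdamW states (the model size and precisions here are illustrative assumptions, not values from this tutorial):

```python
# Rough memory estimate for full fine-tuning (a sketch with assumed sizes/precisions).
def training_memory_gb(n_params, param_bytes=2, grad_bytes=2,
                       optim_state_bytes=4, optim_states=2):
    params = n_params * param_bytes                        # bf16/fp16 weights
    grads = n_params * grad_bytes                          # gradients in the same precision
    optim = n_params * optim_states * optim_state_bytes    # e.g. two AdamW moments in fp32
    return (params + grads + optim) / 1024**3

# Example: a hypothetical 7B-parameter model
print(f"{training_memory_gb(7e9):.0f} GB")  # ~78 GB, already beyond a single GPU
```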
|
|
|
For example, the AdamW optimizer keeps two state tensors for each of your parameters, and in many cases these are kept in fp32. This implies that, depending on how many layers you are training or unfreezing, your memory requirements can grow beyond one GPU.
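
To see this concretely, the following sketch inspects the AdamW state of a tiny throwaway `nn.Linear` layer after one optimizer step; the `exp_avg` and `exp_avg_sq` tensors are the two extra states kept per parameter:

```python
import torch

# Minimal sketch: AdamW tracks two extra tensors per parameter, typically in fp32.
model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

loss = model(torch.randn(2, 4)).sum()
loss.backward()
optimizer.step()  # populates the optimizer state

for p in model.parameters():
    state = optimizer.state[p]
    # Each parameter gets exp_avg and exp_avg_sq of the same shape as the parameter.
    print(state["exp_avg"].shape, state["exp_avg_sq"].shape, state["exp_avg"].dtype)
```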
|
|
|
|
|
|
-**FSDP (FUlly Sharded Data Parallel)**
|
|
|
+**FSDP (Fully Sharded Data Parallel)**
|
|
|
|
|
|
|
|
|
PyTorch provides the FSDP package for training models that do not fit onto one GPU. FSDP lets you train a much larger model with the same amount of resources. Its predecessor, DDP (Distributed Data Parallel), keeps a full replica of the model on each GPU and only shards the data; at the end of the backward pass it syncs up the gradients across GPUs.
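
A minimal sketch of wrapping a model with FSDP is shown below; it assumes the script is launched with `torchrun` so the process group environment variables are set, and the `nn.Linear` model is just a placeholder:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Sketch only: assumes launch via torchrun so rank/world-size env vars exist.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(1024, 1024).cuda()
model = FSDP(model)  # parameters, gradients and optimizer states get sharded across ranks

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss = model(torch.randn(8, 1024, device="cuda")).sum()
loss.backward()
optimizer.step()
```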
|