Distributing Training
Multi-GPU Training with TRL
The trainers in TRL use 🤗 Accelerate to enable distributed training across multiple GPUs or nodes. To do so, first create an 🤗 Accelerate config file by running
```bash
accelerate config
```
and answering the questions according to your multi-GPU / multi-node setup. You can then launch distributed training by running:
```bash
accelerate launch train.py
```
We also provide config files in the examples folder that can be used as templates. To use these templates, simply pass the path to the config file when launching a job, e.g.:
```bash
accelerate launch --config_file examples/accelerate_configs/multi_gpu.yaml train.py <SCRIPT_ARGS>
```
This automatically distributes the workload across all available GPUs.
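The script passed to `accelerate launch` is a regular TRL training script and needs no distributed-specific code. Below is a minimal sketch of what `train.py` could contain, assuming the `SFTTrainer` API; the model and dataset names are placeholders:

```python
# train.py -- a minimal sketch; model and dataset names are placeholders
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer


def main():
    # Any dataset suitable for supervised fine-tuning
    dataset = load_dataset("trl-lib/Capybara", split="train")

    training_args = SFTConfig(
        output_dir="Qwen2.5-0.5B-SFT",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=1,
    )

    # The trainer uses Accelerate internally, so the same script runs on a
    # single GPU or across all processes started by `accelerate launch`.
    trainer = SFTTrainer(
        model="Qwen/Qwen2.5-0.5B",
        args=training_args,
        train_dataset=dataset,
    )
    trainer.train()


if __name__ == "__main__":
    main()
```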
Under the hood, 🤗 Accelerate creates one model per GPU. Each process:
- Processes its own batch of data
- Computes the loss and gradients for that batch
- Shares gradient updates across all GPUs
The effective batch size is calculated as:

effective batch size = per_device_train_batch_size × number of GPUs × gradient_accumulation_steps

To maintain a consistent effective batch size when scaling to multiple GPUs, make sure to update per_device_train_batch_size and gradient_accumulation_steps accordingly.
For example, the following configurations are equivalent and should yield the same results:
| Number of GPUs | Per device batch size | Gradient accumulation steps | Comments |
|---|---|---|---|
| 1 | 32 | 1 | Possibly high memory usage, but faster training |
| 1 | 4 | 8 | Lower memory usage, slower training |
| 8 | 4 | 1 | Multi-GPU to get the best of both worlds |
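For instance, the second and third rows above both give an effective batch size of 32 and differ only in two training arguments. A sketch using `SFTConfig` (any trainer config exposing these fields works the same way):

```python
from trl import SFTConfig

# Single GPU: 4 per device x 1 GPU x 8 accumulation steps = effective batch size of 32
single_gpu_args = SFTConfig(
    output_dir="my-model",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
)

# 8 GPUs (started via `accelerate launch`): 4 per device x 8 GPUs x 1 step = 32
multi_gpu_args = SFTConfig(
    output_dir="my-model",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
)
```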
Having one model per GPU can lead to high memory usage, which may not be feasible for large models or low-memory GPUs. In such cases, you can leverage DeepSpeed, which provides optimizations like model sharding, Zero Redundancy Optimizer, mixed precision training, and offloading to CPU or NVMe. Check out our DeepSpeed Integration guide for more details.
Multi-Node Training
We’re working on a guide for multi-node training. Stay tuned! 🚀