
Distributing Training

Section under construction. Feel free to contribute!

Multi-GPU Training with TRL

The trainers in TRL use 🤗 Accelerate to enable distributed training across multiple GPUs or nodes. To do so, first create an 🤗 Accelerate config file by running

accelerate config

and answering the questions according to your multi-GPU / multi-node setup. You can then launch distributed training by running:

accelerate launch train.py

We also provide config files in the examples folder that can be used as templates. To use these templates, simply pass the path to the config file when launching a job, e.g.:

accelerate launch --config_file examples/accelerate_configs/multi_gpu.yaml train.py <SCRIPT_ARGS>

This automatically distributes the workload across all available GPUs.
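Here, train.py can be any TRL training script. As a rough sketch (the model and dataset names below are placeholders, and SFTTrainer is just one of the trainers this applies to), such a script might look like the following. Note that nothing in the script is specific to multi-GPU training; accelerate launch handles process setup and data-parallel wrapping.

# train.py: a minimal SFT script (sketch; model and dataset are placeholders).
# Nothing here is multi-GPU specific; `accelerate launch` handles distribution.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")  # placeholder dataset

training_args = SFTConfig(
    output_dir="Qwen2.5-0.5B-SFT",
    per_device_train_batch_size=4,   # batch size on each GPU
    gradient_accumulation_steps=1,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",       # placeholder model
    args=training_args,
    train_dataset=dataset,
)
trainer.train()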

Under the hood, 🤗 Accelerate creates one model per GPU. Each process:

  • Processes its own batch of data
  • Computes the loss and gradients for that batch
  • Averages the gradients across all GPUs before the optimizer step

The effective batch size is calculated as:

\text{Batch Size} = \text{per\_device\_train\_batch\_size} \times \text{num\_devices} \times \text{gradient\_accumulation\_steps}

To keep the effective batch size constant when scaling to multiple GPUs, make sure to adjust per_device_train_batch_size and gradient_accumulation_steps accordingly.

For example, the following configurations are equivalent and should yield the same results:

Number of GPUs   Per device batch size   Gradient accumulation steps   Comments
1                32                      1                             Possibly high memory usage, but faster training
1                4                       8                             Lower memory usage, slower training
8                4                       1                             Multi-GPU to get the best of both worlds
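As a quick sanity check of the formula above, plain Python confirms that all three rows give the same effective batch size:

# Effective batch size for each row of the table above.
configs = [
    # (num_gpus, per_device_train_batch_size, gradient_accumulation_steps)
    (1, 32, 1),
    (1, 4, 8),
    (8, 4, 1),
]
for num_gpus, per_device_bs, grad_accum in configs:
    print(per_device_bs * num_gpus * grad_accum)  # prints 32 for every row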

Having one model per GPU can lead to high memory usage, which may not be feasible for large models or low-memory GPUs. In such cases, you can leverage DeepSpeed, which provides optimizations like model sharding, Zero Redundancy Optimizer, mixed precision training, and offloading to CPU or NVMe. Check out our DeepSpeed Integration guide for more details.
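For example, the examples folder also ships DeepSpeed ZeRO templates; assuming a ZeRO-3 config such as examples/accelerate_configs/deepspeed_zero3.yaml is present in your checkout, the launch command has the same shape as before:

accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml train.py <SCRIPT_ARGS>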

Multi-Node Training

We’re working on a guide for multi-node training. Stay tuned! 🚀
