Optimized Primitives for Multi-GPU Communication
The NVIDIA Collective Communications Library (NCCL, pronounced "Nickel") provides routines for collective communication that are optimized for NVIDIA GPUs. These routines are designed to achieve high bandwidth and low latency, which are critical for performance in multi-GPU and multi-node deep learning training and high-performance computing (HPC) applications.
NCCL supports a variety of interconnect technologies, including NVLink™ and NVSwitch™, PCIe, and network fabrics such as InfiniBand with GPUDirect RDMA. It efficiently handles data transfers between GPUs, whether they reside in the same node or in different nodes of a cluster.
Collective operations are routines that involve a group of communicating processes (in NCCL's context, these are typically GPUs within a defined communicator). Each process in the group participates by contributing data, receiving data, or both. These operations are fundamental building blocks for many parallel algorithms, especially in distributed training where model parameters or gradients need to be shared and aggregated across multiple GPUs.
NCCL implements several standard collective communication primitives. We will explore some of the most common ones: Broadcast, Reduce, All-Reduce, and All-Gather.
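Before looking at the individual operations, it helps to see how a communicator is created. Below is a minimal, single-process sketch using `ncclCommInitAll`, where one host thread drives every visible GPU. The `NCCLCHECK` macro and the `comms`/`streams` names are illustrative choices for this sketch, not part of the NCCL API.

```c
// Minimal single-process setup sketch: one host thread manages every visible GPU.
// NCCLCHECK, comms, and streams are illustrative names; error handling is reduced
// to an abort-on-failure macro.
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <nccl.h>

#define NCCLCHECK(cmd) do {                                        \
    ncclResult_t r = (cmd);                                        \
    if (r != ncclSuccess) {                                        \
        fprintf(stderr, "NCCL error %s:%d '%s'\n",                 \
                __FILE__, __LINE__, ncclGetErrorString(r));        \
        exit(EXIT_FAILURE);                                        \
    }                                                              \
} while (0)

int main(void) {
    int nDev = 0;
    cudaGetDeviceCount(&nDev);

    ncclComm_t*   comms   = (ncclComm_t*)malloc(nDev * sizeof(ncclComm_t));
    cudaStream_t* streams = (cudaStream_t*)malloc(nDev * sizeof(cudaStream_t));

    // Create one communicator per device; passing NULL uses devices 0..nDev-1.
    NCCLCHECK(ncclCommInitAll(comms, nDev, NULL));

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaStreamCreate(&streams[i]);
    }

    // ... allocate device buffers and launch collectives here ...

    for (int i = 0; i < nDev; ++i) NCCLCHECK(ncclCommDestroy(comms[i]));
    free(comms);
    free(streams);
    return 0;
}
```

In a multi-process launch (for example under MPI), each rank would instead create its own communicator with `ncclGetUniqueId` and `ncclCommInitRank`; the collective calls shown in the following sketches stay the same.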
The `ncclBroadcast` operation copies data from a single GPU (the "root" GPU) to all other GPUs in the communication group. This is useful for distributing initial model parameters or any data that needs to be identical across all participating GPUs.
The data to broadcast is read from `sendbuff` on the root GPU, and all GPUs provide a `recvbuff`. After the operation, `recvbuff` on every GPU (including the root) is filled with the data from the root's `sendbuff`; that is, `recvbuff[i] = sendbuff_root[i]` for all GPUs.
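A minimal sketch of launching `ncclBroadcast` from a single host thread driving several GPUs, assuming the `comms`, `streams`, `nDev`, and `NCCLCHECK` names from the setup sketch above; `sendbuff[i]` and `recvbuff[i]` are assumed to be per-GPU device pointers holding at least `count` floats:

```c
// Sketch: broadcast `count` floats from rank 0 to every GPU in the clique.
// Assumes comms[], streams[], nDev, NCCLCHECK, and per-GPU device buffers
// sendbuff[i] / recvbuff[i] from the setup sketch above.
size_t count = 1 << 20;
int    root  = 0;

NCCLCHECK(ncclGroupStart());   // batch one call per GPU from a single thread
for (int i = 0; i < nDev; ++i) {
    // sendbuff is only read on the root rank; recvbuff is written on every rank.
    NCCLCHECK(ncclBroadcast(sendbuff[i], recvbuff[i], count, ncclFloat,
                            root, comms[i], streams[i]));
}
NCCLCHECK(ncclGroupEnd());

// Collectives are asynchronous with respect to the host; synchronize the streams.
for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
}
```

When each GPU is driven by its own process or thread instead, the `ncclGroupStart`/`ncclGroupEnd` pair is unnecessary; each rank simply issues its single call.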
The `ncclReduce` operation takes input data from all GPUs in the group, performs an element-wise reduction (e.g., sum, product, min, max, average) on this data, and stores the final reduced result on a single specified root GPU. This is commonly used to aggregate gradients calculated on different GPUs.
Each GPU provides its input data in a `sendbuff`, and the root GPU provides a `recvbuff`. After the operation, `recvbuff` on the root GPU contains the element-wise reduced result; the `recvbuff` on other GPUs is not modified (unless it aliases their `sendbuff`, which is not typical for non-root GPUs in a reduce). For a sum reduction, `recvbuff_root[i] = sendbuff_gpu0[i] + sendbuff_gpu1[i] + ... + sendbuff_gpuN-1[i]`.
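Continuing with the same assumed setup (`comms`, `streams`, `nDev`, `NCCLCHECK`, and per-GPU device buffers), a sum reduction to rank 0 might look like this sketch:

```c
// Sketch: element-wise sum of `count` floats across GPUs; only rank 0's
// recvbuff receives the result. Buffer and helper names follow the earlier sketch.
size_t count = 1 << 20;
int    root  = 0;

NCCLCHECK(ncclGroupStart());
for (int i = 0; i < nDev; ++i) {
    // recvbuff is only significant on the root rank.
    NCCLCHECK(ncclReduce(sendbuff[i], recvbuff[i], count, ncclFloat, ncclSum,
                         root, comms[i], streams[i]));
}
NCCLCHECK(ncclGroupEnd());
```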
The `ncclAllReduce` operation is similar to Reduce, but with one key difference: the final reduced result is distributed to *all* GPUs in the communication group, not just the root. It is conceptually equivalent to a Reduce operation followed by a Broadcast of the result. This is very common for synchronizing gradients in data-parallel training, where every GPU needs the globally averaged gradients.
Each GPU provides its input data in a `sendbuff`, and all GPUs provide a `recvbuff`. After the operation, `recvbuff` on every GPU contains the same element-wise reduced result; for a sum reduction, `recvbuff_all_gpus[i] = sendbuff_gpu0[i] + sendbuff_gpu1[i] + ... + sendbuff_gpuN-1[i]`.
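Below is a sketch of an in-place all-reduce over a hypothetical per-GPU gradient buffer `gradbuff[i]`; the other names again follow the setup sketch. Passing the same pointer as both send and receive buffer selects NCCL's in-place mode.

```c
// Sketch: sum-all-reduce of `count` gradient elements; afterwards every GPU
// holds the identical sums. gradbuff is a hypothetical per-GPU device buffer.
size_t count = 1 << 20;

NCCLCHECK(ncclGroupStart());
for (int i = 0; i < nDev; ++i) {
    // sendbuff == recvbuff requests the in-place variant of the operation.
    NCCLCHECK(ncclAllReduce(gradbuff[i], gradbuff[i], count, ncclFloat,
                            ncclSum, comms[i], streams[i]));
}
NCCLCHECK(ncclGroupEnd());
```

For averaged gradients, the summed result can be divided by the number of ranks afterwards; recent NCCL versions also offer an `ncclAvg` reduction operator.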
The `ncclAllGather` operation collects data from all GPUs and distributes the combined (concatenated) data to all GPUs. Each GPU sends the content of its `sendbuff` to every other GPU, and each GPU then stores the data received from all GPUs, concatenated in rank order, into its `recvbuff`. The `recvbuff` must be large enough to hold the data from all participating GPUs (i.e., `nranks * count` elements).
Each GPU provides its data in a `sendbuff`, and all GPUs provide a `recvbuff`. After the operation, `recvbuff` on every GPU contains the concatenation of data from all GPUs, ordered by rank: `recvbuff_gpu_k = {sendbuff_gpu0, sendbuff_gpu1, ..., sendbuff_gpuN-1}` for every rank k.
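A sketch of the corresponding launch, with the same assumed setup as the earlier sketches; note that each `recvbuff[i]` must hold `nDev * sendcount` elements, while each `sendbuff[i]` holds only `sendcount`:

```c
// Sketch: each GPU contributes `sendcount` floats; every GPU receives the
// rank-ordered concatenation (nDev * sendcount floats) in its recvbuff.
size_t sendcount = 1 << 18;

NCCLCHECK(ncclGroupStart());
for (int i = 0; i < nDev; ++i) {
    NCCLCHECK(ncclAllGather(sendbuff[i], recvbuff[i], sendcount, ncclFloat,
                            comms[i], streams[i]));
}
NCCLCHECK(ncclGroupEnd());
```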