Optimized Primitives for Multi-GPU Communication
The NVIDIA Collective Communications Library (NCCL, pronounced "Nickel") provides routines for collective communication that are optimized for NVIDIA GPUs. These routines are designed to achieve high bandwidth and low latency, which are critical for performance in multi-GPU and multi-node deep learning training and high-performance computing (HPC) applications.
NCCL supports a variety of interconnect technologies, including NVLink™ and NVSwitch™, PCIe, and network fabrics such as InfiniBand with GPUDirect RDMA. It efficiently handles data transfers between GPUs, whether they reside in the same node or in different nodes of a cluster.
Collective operations are routines that involve a group of communicating processes (in NCCL's context, these are typically GPUs within a defined communicator). Each process in the group participates by contributing data, receiving data, or both. These operations are fundamental building blocks for many parallel algorithms, especially in distributed training where model parameters or gradients need to be shared and aggregated across multiple GPUs.
NCCL implements several standard collective communication primitives. We will explore some of the most common ones: Broadcast, Reduce, All-Reduce, and All-Gather.
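Before looking at the individual operations, it helps to see how a communicator is created. Below is a minimal, single-process sketch using `ncclCommInitAll`, where one host thread drives every visible GPU. The `NCCLCHECK` macro and the `comms`/`streams` names are illustrative choices for this sketch, not part of the NCCL API.

```c
// Minimal single-process setup sketch: one host thread manages every visible GPU.
// NCCLCHECK, comms, and streams are illustrative names; error handling is reduced
// to an abort-on-failure macro.
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <nccl.h>

#define NCCLCHECK(cmd) do {                                        \
    ncclResult_t r = (cmd);                                        \
    if (r != ncclSuccess) {                                        \
        fprintf(stderr, "NCCL error %s:%d '%s'\n",                 \
                __FILE__, __LINE__, ncclGetErrorString(r));        \
        exit(EXIT_FAILURE);                                        \
    }                                                              \
} while (0)

int main(void) {
    int nDev = 0;
    cudaGetDeviceCount(&nDev);

    ncclComm_t*   comms   = (ncclComm_t*)malloc(nDev * sizeof(ncclComm_t));
    cudaStream_t* streams = (cudaStream_t*)malloc(nDev * sizeof(cudaStream_t));

    // Create one communicator per device; passing NULL uses devices 0..nDev-1.
    NCCLCHECK(ncclCommInitAll(comms, nDev, NULL));

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaStreamCreate(&streams[i]);
    }

    // ... allocate device buffers and launch collectives here ...

    for (int i = 0; i < nDev; ++i) NCCLCHECK(ncclCommDestroy(comms[i]));
    free(comms);
    free(streams);
    return 0;
}
```

In a multi-process launch (for example under MPI), each rank would instead create its own communicator with `ncclGetUniqueId` and `ncclCommInitRank`; the collective calls shown in the following sketches stay the same.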
The `ncclBroadcast` operation copies data from a single GPU (the "root" GPU) to all other GPUs in the communication group. This is useful for distributing initial model parameters or any data that needs to be identical across all participating GPUs.
The data to broadcast is read from `sendbuff` on the root GPU, and all GPUs provide a `recvbuff`. After the operation, `recvbuff` on every GPU (including the root) is filled with the data from the root's `sendbuff`; that is, `recvbuff[i] = sendbuff_root[i]` for all GPUs.
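A minimal sketch of launching `ncclBroadcast` from a single host thread driving several GPUs, assuming the `comms`, `streams`, `nDev`, and `NCCLCHECK` names from the setup sketch above; `sendbuff[i]` and `recvbuff[i]` are assumed to be per-GPU device pointers holding at least `count` floats:

```c
// Sketch: broadcast `count` floats from rank 0 to every GPU in the clique.
// Assumes comms[], streams[], nDev, NCCLCHECK, and per-GPU device buffers
// sendbuff[i] / recvbuff[i] from the setup sketch above.
size_t count = 1 << 20;
int    root  = 0;

NCCLCHECK(ncclGroupStart());   // batch one call per GPU from a single thread
for (int i = 0; i < nDev; ++i) {
    // sendbuff is only read on the root rank; recvbuff is written on every rank.
    NCCLCHECK(ncclBroadcast(sendbuff[i], recvbuff[i], count, ncclFloat,
                            root, comms[i], streams[i]));
}
NCCLCHECK(ncclGroupEnd());

// Collectives are asynchronous with respect to the host; synchronize the streams.
for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
}
```

When each GPU is driven by its own process or thread instead, the `ncclGroupStart`/`ncclGroupEnd` pair is unnecessary; each rank simply issues its single call.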
The `ncclReduce` operation takes input data from all GPUs in the group, performs an element-wise reduction (e.g., sum, product, min, max, average) on this data, and stores the final reduced result on a single specified root GPU. This is commonly used to aggregate gradients calculated on different GPUs.
Each GPU provides its input data in a `sendbuff`, and the root GPU provides a `recvbuff`. After the operation, `recvbuff` on the root GPU contains the element-wise reduced result; the `recvbuff` on other GPUs is not modified (unless it aliases their `sendbuff`, which is not typical for non-root GPUs in a reduce). For a sum reduction, `recvbuff_root[i] = sendbuff_gpu0[i] + sendbuff_gpu1[i] + ... + sendbuff_gpuN-1[i]`.
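Continuing with the same assumed setup (`comms`, `streams`, `nDev`, `NCCLCHECK`, and per-GPU device buffers), a sum reduction to rank 0 might look like this sketch:

```c
// Sketch: element-wise sum of `count` floats across GPUs; only rank 0's
// recvbuff receives the result. Buffer and helper names follow the earlier sketch.
size_t count = 1 << 20;
int    root  = 0;

NCCLCHECK(ncclGroupStart());
for (int i = 0; i < nDev; ++i) {
    // recvbuff is only significant on the root rank.
    NCCLCHECK(ncclReduce(sendbuff[i], recvbuff[i], count, ncclFloat, ncclSum,
                         root, comms[i], streams[i]));
}
NCCLCHECK(ncclGroupEnd());
```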
The `ncclAllReduce` operation is similar to Reduce, but with one key difference: the final reduced result is distributed to *all* GPUs in the communication group, not just the root. It is conceptually equivalent to a Reduce operation followed by a Broadcast of the result. This is very common for synchronizing gradients in data-parallel training, where every GPU needs the globally averaged gradients.
Each GPU provides its input data in a `sendbuff`, and all GPUs provide a `recvbuff`. After the operation, `recvbuff` on every GPU contains the same element-wise reduced result; for a sum reduction, `recvbuff_all_gpus[i] = sendbuff_gpu0[i] + sendbuff_gpu1[i] + ... + sendbuff_gpuN-1[i]`.
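Below is a sketch of an in-place all-reduce over a hypothetical per-GPU gradient buffer `gradbuff[i]`; the other names again follow the setup sketch. Passing the same pointer as both send and receive buffer selects NCCL's in-place mode.

```c
// Sketch: sum-all-reduce of `count` gradient elements; afterwards every GPU
// holds the identical sums. gradbuff is a hypothetical per-GPU device buffer.
size_t count = 1 << 20;

NCCLCHECK(ncclGroupStart());
for (int i = 0; i < nDev; ++i) {
    // sendbuff == recvbuff requests the in-place variant of the operation.
    NCCLCHECK(ncclAllReduce(gradbuff[i], gradbuff[i], count, ncclFloat,
                            ncclSum, comms[i], streams[i]));
}
NCCLCHECK(ncclGroupEnd());
```

For averaged gradients, the summed result can be divided by the number of ranks afterwards; recent NCCL versions also offer an `ncclAvg` reduction operator.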
The `ncclAllGather` operation collects data from all GPUs and distributes the combined (concatenated) data to all GPUs. Each GPU sends the content of its `sendbuff` to every other GPU, and each GPU then stores the data received from all GPUs, concatenated in rank order, into its `recvbuff`. The `recvbuff` must be large enough to hold the data from all participating GPUs (i.e., `nranks * count` elements).
Each GPU provides its data in a `sendbuff`, and all GPUs provide a `recvbuff`. After the operation, `recvbuff` on every GPU contains the concatenation of data from all GPUs, ordered by rank: `recvbuff_gpu_k = {sendbuff_gpu0, sendbuff_gpu1, ..., sendbuff_gpuN-1}` for every rank k.
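A sketch of the corresponding launch, with the same assumed setup as the earlier sketches; note that each `recvbuff[i]` must hold `nDev * sendcount` elements, while each `sendbuff[i]` holds only `sendcount`:

```c
// Sketch: each GPU contributes `sendcount` floats; every GPU receives the
// rank-ordered concatenation (nDev * sendcount floats) in its recvbuff.
size_t sendcount = 1 << 18;

NCCLCHECK(ncclGroupStart());
for (int i = 0; i < nDev; ++i) {
    NCCLCHECK(ncclAllGather(sendbuff[i], recvbuff[i], sendcount, ncclFloat,
                            comms[i], streams[i]));
}
NCCLCHECK(ncclGroupEnd());
```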