NVIDIA Collective Communications Library (NCCL)

Optimized Primitives for Multi-GPU Communication

What is NCCL?

The NVIDIA Collective Communications Library (NCCL, pronounced "Nickel") is a library that provides routines for collective communication, specifically optimized for NVIDIA GPUs. These routines are designed to achieve high bandwidth and low latency, which are critical for performance in multi-GPU and multi-node deep learning training and high-performance computing (HPC) applications.

NCCL supports a variety of interconnect technologies, including NVLink™, NVSwitch™, and PCIe within a node, as well as network fabrics such as InfiniBand (with GPUDirect RDMA) across nodes. It efficiently handles data transfers between GPUs, whether they reside in the same node or on different nodes of a cluster.

Collective Operations

Collective operations are routines that involve a group of communicating processes (in NCCL's context, these are typically GPUs within a defined communicator). Each process in the group participates by contributing data, receiving data, or both. These operations are fundamental building blocks for many parallel algorithms, especially in distributed training where model parameters or gradients need to be shared and aggregated across multiple GPUs.
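As a concrete reference point for the examples that follow, here is a minimal single-process, multi-GPU sketch that creates one communicator per visible GPU using NCCL's `ncclCommInitAll`. The device count cap, array sizes, and omission of error handling are simplifications for illustration, not requirements of the API.

```c
#include <cuda_runtime.h>
#include <nccl.h>

int main(void) {
  // Single-process, multi-GPU setup: one communicator and one stream per GPU.
  // Error handling is omitted to keep the sketch short.
  int nDev = 0;
  cudaGetDeviceCount(&nDev);
  if (nDev > 8) nDev = 8;              // fixed-size arrays below assume <= 8 GPUs

  int devs[8];
  ncclComm_t comms[8];
  cudaStream_t streams[8];

  for (int i = 0; i < nDev; ++i) {
    devs[i] = i;
    cudaSetDevice(i);
    cudaStreamCreate(&streams[i]);
  }

  // Create a clique of communicators, one per selected GPU.
  ncclCommInitAll(comms, nDev, devs);

  // ... issue collective operations here (see the examples below) ...

  for (int i = 0; i < nDev; ++i) ncclCommDestroy(comms[i]);
  return 0;
}
```

Multi-node or multi-process setups would instead use ncclGetUniqueId and ncclCommInitRank, typically with one process or thread per GPU.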

NCCL implements several standard collective communication primitives. We will explore some of the most common ones: Broadcast, Reduce, All-Reduce, and All-Gather.

Broadcast

The ncclBroadcast operation copies data from a single GPU (the "root" GPU) to all other GPUs in the communication group. This is useful for distributing initial model parameters or any data that needs to be identical across all participating GPUs.

Broadcast Diagram (4 GPUs, GPU 0 is root)

Before Broadcast:
  GPU 0 (Root)  SendBuff: A   RecvBuff: ?
  GPU 1         SendBuff: -   RecvBuff: ?
  GPU 2         SendBuff: -   RecvBuff: ?
  GPU 3         SendBuff: -   RecvBuff: ?

After Broadcast (data 'A' from GPU 0 is copied to every RecvBuff):
  GPU 0 (Root)  SendBuff: A   RecvBuff: A
  GPU 1         SendBuff: -   RecvBuff: A
  GPU 2         SendBuff: -   RecvBuff: A
  GPU 3         SendBuff: -   RecvBuff: A
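A minimal sketch of issuing this broadcast from a single process that manages all GPUs, assuming the communicators and streams from the setup example above, and per-GPU device buffers sendbuff[i] and recvbuff[i] of `count` floats that have already been allocated with cudaMalloc:

```c
// Broadcast `count` floats from rank 0's sendbuff to every rank's recvbuff.
// When one thread drives several GPUs, the per-GPU calls are wrapped in a group.
ncclGroupStart();
for (int i = 0; i < nDev; ++i) {
  ncclBroadcast(sendbuff[i], recvbuff[i], count, ncclFloat,
                /*root=*/0, comms[i], streams[i]);
}
ncclGroupEnd();

// Collectives are asynchronous with respect to the host; synchronize the
// streams before using the results.
for (int i = 0; i < nDev; ++i) {
  cudaSetDevice(i);
  cudaStreamSynchronize(streams[i]);
}
```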

Reduce

The ncclReduce operation takes input data from all GPUs in the group, performs an element-wise reduction (e.g., sum, product, min, max, average) on this data, and stores the final reduced result on a single specified root GPU. This is commonly used to aggregate gradients calculated on different GPUs.

Reduce Diagram (4 GPUs, GPU 0 is root, Op: Sum)

Before Reduce:
  GPU 0 (Root)  SendBuff: A0  RecvBuff: ?
  GPU 1         SendBuff: A1  RecvBuff: -
  GPU 2         SendBuff: A2  RecvBuff: -
  GPU 3         SendBuff: A3  RecvBuff: -

After Reduce (GPU 0's RecvBuff contains S = A0+A1+A2+A3):
  GPU 0 (Root)  SendBuff: A0  RecvBuff: S
  GPU 1         SendBuff: A1  RecvBuff: -
  GPU 2         SendBuff: A2  RecvBuff: -
  GPU 3         SendBuff: A3  RecvBuff: -
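Under the same assumptions as the broadcast sketch (single process, per-GPU sendbuff/recvbuff of `count` floats), a summing reduce onto rank 0 might look like this:

```c
// Element-wise sum of `count` floats from every rank's sendbuff into
// rank 0's recvbuff. recvbuff is only significant on the root rank.
ncclGroupStart();
for (int i = 0; i < nDev; ++i) {
  ncclReduce(sendbuff[i], recvbuff[i], count, ncclFloat, ncclSum,
             /*root=*/0, comms[i], streams[i]);
}
ncclGroupEnd();
```

Other reduction operators such as ncclProd, ncclMin, and ncclMax can be substituted for ncclSum.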

All-Reduce

The ncclAllReduce operation is similar to Reduce, but with one key difference: the final reduced result is distributed to *all* GPUs in the communication group, not just the root. It's conceptually equivalent to a Reduce operation followed by a Broadcast of the result. This is very common for synchronizing gradients in data-parallel training, where every GPU needs the globally averaged gradients.

All-Reduce Diagram (4 GPUs, Op: Sum)

Before All-Reduce:
  GPU 0  SendBuff: A0  RecvBuff: ?
  GPU 1  SendBuff: A1  RecvBuff: ?
  GPU 2  SendBuff: A2  RecvBuff: ?
  GPU 3  SendBuff: A3  RecvBuff: ?

After All-Reduce (every RecvBuff contains S = A0+A1+A2+A3):
  GPU 0  SendBuff: A0  RecvBuff: S
  GPU 1  SendBuff: A1  RecvBuff: S
  GPU 2  SendBuff: A2  RecvBuff: S
  GPU 3  SendBuff: A3  RecvBuff: S
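The corresponding call, again assuming the single-process setup and per-GPU buffers from the earlier sketches, is:

```c
// Sum `count` floats across all GPUs; every rank receives the full result.
// Passing the same pointer as sendbuff and recvbuff performs the operation
// in place, which is common for gradient buffers.
ncclGroupStart();
for (int i = 0; i < nDev; ++i) {
  ncclAllReduce(sendbuff[i], recvbuff[i], count, ncclFloat, ncclSum,
                comms[i], streams[i]);
}
ncclGroupEnd();
```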

All-Gather

The ncclAllGather operation collects data from all GPUs and distributes the combined (concatenated) data to all GPUs. Each GPU contributes the contents of its sendbuff, and every GPU receives the contributions from all ranks, concatenated in rank order, in its recvbuff. The recvbuff must therefore be large enough to hold `nranks * count` elements, where `count` is the number of elements each rank sends.

All-Gather Diagram (4 GPUs)

Before All-Gather:
  GPU 0  SendBuff: D0  RecvBuff: ?
  GPU 1  SendBuff: D1  RecvBuff: ?
  GPU 2  SendBuff: D2  RecvBuff: ?
  GPU 3  SendBuff: D3  RecvBuff: ?

After All-Gather (every RecvBuff contains the concatenation D0 D1 D2 D3, in rank order):
  GPU 0  SendBuff: D0  RecvBuff: D0 D1 D2 D3
  GPU 1  SendBuff: D1  RecvBuff: D0 D1 D2 D3
  GPU 2  SendBuff: D2  RecvBuff: D0 D1 D2 D3
  GPU 3  SendBuff: D3  RecvBuff: D0 D1 D2 D3
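A sketch under the same assumptions as before, except that each rank's sendbuff holds `sendcount` floats and each recvbuff was allocated to hold nDev * sendcount floats:

```c
// Each rank contributes `sendcount` floats; every rank receives the
// concatenation of all contributions, ordered by rank.
ncclGroupStart();
for (int i = 0; i < nDev; ++i) {
  ncclAllGather(sendbuff[i], recvbuff[i], sendcount, ncclFloat,
                comms[i], streams[i]);
}
ncclGroupEnd();
```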