Queue Pairs (QPs) Explained
Queue Pairs are the cornerstone of InfiniBand communication, acting as the bridge between software applications and the hardware Host Channel Adapter (HCA). A QP is a virtual interface that pairs a Send Queue with a Receive Queue, enabling efficient data transfer with minimal OS involvement. Each QP is uniquely identified by a Queue Pair Number (QPN). The key structures are listed below, followed by a minimal creation sketch.
- Send Queue (SQ): Stores send Work Queue Elements (WQEs) detailing data to be transmitted.
- Receive Queue (RQ): Holds receive WQEs, specifying where incoming data should be placed.
- Shared Receive Queue (SRQ): Optional; allows multiple QPs to share a single RQ, conserving memory.
- Completion Queue (CQ): Stores Completion Queue Elements (CQEs) indicating the status of processed WQEs. A CQ provides feedback to the application.
- Queue Pair Context (QPC): A data structure storing operational parameters and state for the QP, managed by the HCA.
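To make these structures concrete, here is a minimal sketch of QP creation with the libibverbs API. It allocates a protection domain and a CQ, then creates an RC QP whose SQ and RQ both report completions to that same CQ. The device index, queue depths, and choice of RC transport are illustrative, not requirements.

```c
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    /* Open the first available RDMA device (index 0 is arbitrary). */
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx)
        return 1;

    struct ibv_pd *pd = ibv_alloc_pd(ctx);                     /* protection domain */
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0); /* 16-entry CQ */

    struct ibv_qp_init_attr qp_attr = {
        .send_cq = cq,              /* send completions go to this CQ */
        .recv_cq = cq,              /* receive completions go to the same CQ */
        .cap = {
            .max_send_wr  = 16,     /* SQ depth in WQEs */
            .max_recv_wr  = 16,     /* RQ depth in WQEs */
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
        .qp_type = IBV_QPT_RC,      /* Reliable Connection transport */
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &qp_attr);
    if (qp)
        printf("created QP, QPN = 0x%x\n", qp->qp_num); /* the QPN */

    /* Teardown in reverse order of creation. */
    if (qp) ibv_destroy_qp(qp);
    ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```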
The Packet's Journey: InfiniBand Sending Process
Data transmission in InfiniBand is a well-defined sequence, largely offloaded to the Host Channel Adapter (HCA) to achieve low latency and high throughput. Here's how it works:
1. Application & Work Requests (WRs)
Applications initiate data transfers by creating Work Requests (WRs) using the InfiniBand Verbs API (e.g., `ibv_post_send()`). WRs detail the operation (SEND, RDMA WRITE/READ), local data buffers, and remote memory info for RDMA.
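A sketch of this step, assuming the QP from the creation sketch above plus a buffer registered with `ibv_reg_mr()`; the function name and `wr_id` value are illustrative.

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* Post a single SEND of a buffer registered with ibv_reg_mr().
 * `qp`, `mr`, `buf`, and `len` are assumed to come from earlier setup. */
int post_one_send(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,   /* local buffer address */
        .length = len,
        .lkey   = mr->lkey,         /* local key from registration */
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,                 /* opaque id, echoed back in the CQE */
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,       /* RDMA WRITE/READ would also fill wr.rdma.* */
        .send_flags = IBV_SEND_SIGNALED, /* request a CQE when this WR completes */
    };
    struct ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr); /* 0 on success */
}
```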
2. Work Queue Elements (WQEs)
Posted WRs become Work Queue Elements (WQEs) in the QP's Send Queue (SQ) or Receive Queue (RQ). WQEs are the HCA's task list, containing all parameters for autonomous execution.
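The receive side mirrors this: receive WQEs must be posted before data arrives so the HCA has a landing buffer ready. A minimal sketch, under the same assumptions as above.

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* Post one receive WQE describing where the next incoming SEND's
 * payload should land; same assumptions about qp/mr as above. */
int post_one_recv(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_recv_wr wr = {
        .wr_id   = 2,        /* returned in the CQE once the buffer is filled */
        .sg_list = &sge,
        .num_sge = 1,
    };
    struct ibv_recv_wr *bad_wr = NULL;
    return ibv_post_recv(qp, &wr, &bad_wr);
}
```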
3. HCA in Action
The HCA takes over:
- Doorbell Notification: The application (or driver) "rings a doorbell" (a memory-mapped register write) to signal that new WQEs are available.
- WQE Fetch: Triggered by the doorbell, the HCA fetches the new WQEs from the SQ.
- Data Segmentation: If data exceeds Path MTU, the HCA segments it into multiple packets (and handles reassembly on receipt).
- Packet Header Construction: Builds InfiniBand packet headers (LRH, GRH, BTH, Extended Headers) using info from WQE and QP Context.
- CRC Calculation: Adds Invariant CRC (ICRC) and Variant CRC (VCRC) for data integrity.
The QP must be in a valid state (e.g., Ready To Send - RTS) for the HCA to process send requests.
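These states are driven from software with `ibv_modify_qp()`. Below is a condensed sketch of the standard RESET → INIT → RTR → RTS sequence for an RC QP; the remote LID/QPN/PSN values are assumed to have been exchanged out of band (e.g., over a TCP socket), and the timeout/retry constants are common illustrative defaults, not mandated values.

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* Walk an RC QP from RESET to RTS. The remote_* parameters come from
 * an out-of-band exchange (hypothetical here). Returns 0 on success. */
int rc_qp_to_rts(struct ibv_qp *qp, uint16_t remote_lid,
                 uint32_t remote_qpn, uint32_t remote_psn, uint32_t local_psn)
{
    /* RESET -> INIT: bind to a port and set remote access rights. */
    struct ibv_qp_attr init = {
        .qp_state        = IBV_QPS_INIT,
        .pkey_index      = 0,
        .port_num        = 1,
        .qp_access_flags = IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ,
    };
    if (ibv_modify_qp(qp, &init, IBV_QP_STATE | IBV_QP_PKEY_INDEX |
                                 IBV_QP_PORT | IBV_QP_ACCESS_FLAGS))
        return -1;

    /* INIT -> RTR (Ready To Receive): point at the remote QP. */
    struct ibv_qp_attr rtr = {
        .qp_state           = IBV_QPS_RTR,
        .path_mtu           = IBV_MTU_1024,
        .dest_qp_num        = remote_qpn,
        .rq_psn             = remote_psn,
        .max_dest_rd_atomic = 1,
        .min_rnr_timer      = 12,
        .ah_attr            = { .dlid = remote_lid, .port_num = 1 },
    };
    if (ibv_modify_qp(qp, &rtr, IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                                IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                                IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER))
        return -1;

    /* RTR -> RTS (Ready To Send): enable the send side. */
    struct ibv_qp_attr rts = {
        .qp_state      = IBV_QPS_RTS,
        .timeout       = 14,
        .retry_cnt     = 7,
        .rnr_retry     = 7,
        .sq_psn        = local_psn,
        .max_rd_atomic = 1,
    };
    return ibv_modify_qp(qp, &rts, IBV_QP_STATE | IBV_QP_TIMEOUT |
                                   IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY |
                                   IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC);
}
```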
4. Physical Transmission
Fully formed packets are passed to the HCA's physical layer for encoding and transmission over the InfiniBand fabric (links and switches).
5. Signaling Completion (CQs)
After processing a WQE, the HCA posts a Completion Queue Element (CQE) to the associated Completion Queue (CQ). The CQE indicates success or failure of the operation. Applications poll the CQ or use event notifications to get this feedback. This asynchronicity allows applications to "fire and forget" requests, overlapping computation and communication.
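A sketch of the polling path, assuming the CQ from the earlier creation sketch; real applications often block on an `ibv_comp_channel` instead of busy-polling.

```c
#include <stdio.h>
#include <infiniband/verbs.h>

/* Busy-poll the CQ until one completion arrives, then check its status. */
int wait_one_completion(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n;
    do {
        n = ibv_poll_cq(cq, 1, &wc);   /* non-blocking; returns #CQEs drained */
    } while (n == 0);
    if (n < 0)
        return -1;                     /* CQ in error */
    if (wc.status != IBV_WC_SUCCESS) {
        fprintf(stderr, "WR %llu failed: %s\n",
                (unsigned long long)wc.wr_id, ibv_wc_status_str(wc.status));
        return -1;
    }
    return 0;                          /* wc.wr_id identifies the posted WR */
}
```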
Visualizing the Send-Receive Flow (Mermaid Diagram)
This diagram illustrates the simplified request path for an InfiniBand send and receive operation using Mermaid.js. It shows the interaction between Applications, Verbs APIs, Kernel Drivers, Host Channel Adapters (HCAs), and the Network.
```mermaid
sequenceDiagram
    participant SA as Sender App
    participant SVA as Sender Verbs API
    participant SKD as Sender Kernel Driver
    participant SHCA as Sender HCA
    participant NET as Network Fabric
    participant RHCA as Receiver HCA
    participant RKD as Receiver Kernel Driver
    participant RVA as Receiver Verbs API
    participant RA as Receiver App

    rect rgb(254, 243, 199)
        note over SA, SHCA: Sender Side
        SA->>SVA: 1. ibv_post_send(WR_send)
        SVA->>SVA: 2. Prep WQE_send, To SQ, Ring Doorbell
        SVA->>SKD: 3. Notify HCA
        SKD->>SHCA: 4. Signal new WQE_send
        SHCA->>SHCA: 5. Fetch WQE, Get Data, Segment, Build Packet(s)
        SHCA->>NET: 6. Transmit Packet(s)
    end

    rect rgb(209, 250, 229)
        note over RHCA, RA: Receiver Side (Prior Setup)
        RA->>RVA: A. (Prior) ibv_post_recv(WR_recv) -> WQE_recv in RQ
    end

    NET->>RHCA: 7. Packet(s) Arrive

    rect rgb(209, 250, 229)
        note over RHCA, RA: Receiver Side (Processing)
        RHCA->>RHCA: 8. Validate, Reassemble, Match WQE_recv
        RHCA->>RA: 9. DMA Data to App Buffer (via MR)
        RHCA->>RVA: 10. Post CQE_recv to Receiver CQ
        RVA->>RA: 11. Polls/Event for CQE_recv, Process Data
    end

    rect rgb(254, 243, 199)
        note over SA, SHCA: Sender Side (Completion)
        SHCA->>SVA: 12. (After ACK for RC / send done) Post CQE_send
        SVA->>SA: 13. Polls/Event for CQE_send, Process Completion
    end
```
This visualization highlights the layered architecture and the distinct roles of sender and receiver components in an InfiniBand communication.
Understanding QP Transport Modes
InfiniBand QPs can be configured with different transport modes to offer varying levels of service for reliability, ordering, and connection management. Choosing the right mode is crucial for application performance.
Key Transport Modes:
- Reliable Connection (RC): Connection-oriented with reliable, in-order delivery (analogous to TCP). Supports all RDMA operations. Common for MPI.
- Unreliable Datagram (UD): Connectionless and unreliable (analogous to UDP). Supports multicast and has the lowest per-QP overhead; see the UD send sketch after this list.
- Unreliable Connection (UC): Connection-oriented but unreliable. Supports RDMA Write, but not RDMA Read or Atomics.
- eXtended Reliable Connection (XRC): A more scalable form of RC that lets one QP communicate with multiple remote processes.
- Reliable Datagram (RD): Defined in the specification, but generally not implemented in hardware or software.
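The transport mode is fixed at creation time via `qp_type` (e.g., `IBV_QPT_RC`, `IBV_QPT_UC`, `IBV_QPT_UD`). For UD, the destination travels with each WR via an address handle rather than being fixed at connection time. A sketch, assuming `remote_lid`, `remote_qpn`, and `remote_qkey` were learned out of band; the function name is illustrative, and production code would typically cache the address handle rather than create one per send.

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* Post one SEND on a UD QP. The payload must fit in a single path MTU. */
int ud_send(struct ibv_pd *pd, struct ibv_qp *qp, struct ibv_mr *mr,
            void *buf, uint32_t len,
            uint16_t remote_lid, uint32_t remote_qpn, uint32_t remote_qkey)
{
    /* Address handle: per-destination routing info for connectionless QPs. */
    struct ibv_ah_attr ah_attr = {
        .dlid     = remote_lid,
        .port_num = 1,
    };
    struct ibv_ah *ah = ibv_create_ah(pd, &ah_attr);
    if (!ah)
        return -1;

    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,       /* UD carries SENDs only, no RDMA ops */
        .send_flags = IBV_SEND_SIGNALED,
        .wr.ud = {
            .ah          = ah,           /* destination chosen per-WR */
            .remote_qpn  = remote_qpn,
            .remote_qkey = remote_qkey,
        },
    };
    struct ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```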
RDMA Operations Supported by Mode:
RC and XRC support all three RDMA operation types (Write, Read, Atomic); UC supports RDMA Write only; UD supports none of them, being limited to two-sided SEND/RECV. The table below breaks this down alongside other features.
Transport Mode Feature Comparison:
| Feature | Reliable Connection (RC) | Unreliable Connection (UC) | Unreliable Datagram (UD) | eXtended RC (XRC) |
|---|---|---|---|---|
| Reliability | Yes | No | No | Yes |
| Connection | Yes (1-to-1 QP) | Yes (1-to-1 QP) | No (many-to-many QPs) | Yes (1-to-N processes) |
| Ordering | In-order | In-order (within conn.) | Not guaranteed | In-order |
| Max msg size | ≈ 2 GB | ≈ 2 GB | One path MTU | ≈ 2 GB |
| RDMA Write | Yes | Yes | No | Yes |
| RDMA Read | Yes | No | No | Yes |
| Atomics | Yes | No | No | Yes |
| Primary use | MPI, general reliable messaging | Custom protocols | Service discovery, scalable unreliable comm. | Scalable HPC |
RC offers strong guarantees but has overhead. UD is lightweight but unreliable. XRC aims to balance RC's reliability with better scalability for HPC. The choice depends on specific application needs.