Architecting High-Performance Shared File Systems

For NVIDIA H100 GPU Clusters with NVMe Over Fabrics

This infographic synthesizes key findings from the technical report on designing and optimizing NVMe-oF based shared file systems to meet the extreme I/O demands of NVIDIA H100 GPU clusters in HPC and AI environments.

The I/O Gauntlet: NVIDIA H100's Demand for Data

NVIDIA H100 GPUs (Hopper architecture) deliver groundbreaking computational power (Report Sec 2.1). However, this performance creates an insatiable appetite for data, pushing traditional storage to its limits. Without an equally advanced storage infrastructure, these powerful GPUs can be starved, leading to underutilization and diminished ROI.

  • PCIe Gen5 x16: 128 GB/s of bidirectional bandwidth to the host system (Report Sec 2.1).
  • 4th Gen NVLink: 900 GB/s of bidirectional GPU-to-GPU bandwidth (Report Sec 2.1).
  • HBM3 Memory: up to 3.9 TB/s of memory bandwidth (PCIe H100 NVL 94 GB) (Report Sec 2.1).

These figures underscore the H100's capacity for rapid data ingestion and processing. Advanced storage solutions, particularly NVMe Over Fabrics (NVMe-oF) coupled with Parallel File Systems (PFS), are crucial to feed these data-hungry accelerators and unlock their full potential (Report Sec 1).

Core Technologies: NVMe-oF & Parallel File Systems

NVMe Over Fabrics (NVMe-oF)

NVMe-oF extends the low-latency, high-performance NVMe command set across network fabrics, enabling remote access to NVMe SSDs with near-local performance (Report Sec 2.2). This facilitates:

  • Storage Disaggregation: Scale compute and storage independently.
  • Improved Resource Utilization: Share centralized storage pools.
  • Centralized Management: Simplify storage administration.

Common transports include NVMe/TCP, NVMe/RDMA (RoCE, InfiniBand), and NVMe/FC (Report Sec 2.2).
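As a quick sanity check on an initiator host, the Linux kernel transport modules for the chosen fabric can be loaded and verified. A minimal sketch (package names and module availability vary by distribution and kernel):

```bash
# Load the NVMe-oF transport module(s) matching the chosen fabric
sudo modprobe nvme-tcp     # NVMe/TCP
sudo modprobe nvme-rdma    # NVMe/RDMA (RoCE or InfiniBand)
sudo modprobe nvme-fc      # NVMe/FC

# Confirm the modules are loaded
lsmod | grep -E 'nvme_(tcp|rdma|fc)'
```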

Parallel File Systems (PFS)

Parallel file systems provide high-speed, concurrent access to a unified file-system namespace from many compute nodes, which is essential for the large, shared datasets common in HPC/AI (Report Sec 2.3). Key components typically include:

  • Metadata Servers (MDS): manage the namespace, directories, and file layouts.
  • Data/Object Storage Servers (OSS/OST): store the actual file data, often striped across servers.
  • Clients: software on compute nodes providing POSIX access.

PFS solutions like Lustre, BeeGFS, and Spectrum Scale layer atop NVMe-oF to provide scalable storage (Report Sec 2.3, 3.2).
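To make the roles concrete, here is a Lustre-flavoured sketch of the client view: the client mounts the namespace published by the MGS/MDS, and `lfs df` reports the MDTs and OSTs backing it (the NID `10.0.1.1@tcp`, filesystem name `lfs01`, and mount point are illustrative):

```bash
# Mount the shared namespace on a compute node (Lustre client)
sudo mount -t lustre 10.0.1.1@tcp:/lfs01 /mnt/lfs01

# List the metadata (MDT) and object storage targets (OST) behind the mount
lfs df -h /mnt/lfs01
```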

Architectural Blueprint: Key Storage Decisions

Choosing Your NVMe-oF Transport Protocol

The selection of an NVMe-oF transport (TCP, RoCE, InfiniBand) is pivotal, impacting performance, cost, and complexity (Report Sec 3.1). This chart compares key aspects based on data from Table 1 of the report. For Latency, CPU Overhead, Config. Complexity, and Cost Implications, lower scores (shorter bars) are generally better. For Max Throughput, higher scores (longer bars) are better.

Consider latency needs, existing infrastructure, budget, and management expertise. RDMA options (RoCE, InfiniBand) offer lower latency, while NVMe/TCP provides operational simplicity (Report Sec 2.2, 3.1).

Selecting Your Parallel File System

Various PFS options (Lustre, BeeGFS, IBM Spectrum Scale, Ceph) can leverage NVMe-oF (Report Sec 3.2). This radar chart compares them on key features from Table 2 of the report. Scores are on a 1-5 scale, where higher is generally better.

Selection depends on scalability, metadata performance, GDS support, ease of management, data protection, and licensing (Report Sec 3.2).

The Power of NVIDIA GPUDirect Storage (GDS)

GDS creates a direct data path between GPU memory and storage (local NVMe or remote NVMe-oF), bypassing the CPU bounce buffer. This significantly reduces latency, increases bandwidth, and lowers CPU overhead, crucial for H100 performance (Report Sec 3.3).

Traditional I/O Path (Without GDS)

Storage (NVMe/NVMe-oF) → CPU System Memory (Bounce Buffer) → GPU Memory

Introduces latency and CPU overhead.

GPUDirect Storage Path

Storage (NVMe/NVMe-oF) → GPU Memory (direct DMA)

Bypasses the CPU, reduces latency, increases bandwidth, and lowers CPU load.

Effective GDS requires support from the PFS client, storage system/NVMe-oF target, and relevant drivers (Report Sec 3.3, 5.4).

Implementation Roadmap Overview

Building an NVMe-oF shared file system involves configuring the network, setting up NVMe-oF targets and initiators, and deploying the PFS (Report Sec 4).

1. Network Fabric Configuration

Configuration is highly dependent on the chosen NVMe-oF transport (Report Sec 4.1):

NVMe/RoCE (Lossless Ethernet)

  • Requires Data Center Bridging (DCB): Priority Flow Control (PFC) and Explicit Congestion Notification (ECN).
  • RDMA-capable NICs (rNICs) & DCB-capable switches.
  • More complex to configure, but delivers high performance; see the sketch below.
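As one hedged example, on NVIDIA/Mellanox rNICs the `mlnx_qos` utility can enable PFC for the priority class carrying RoCE traffic (priority 3 here is a common but site-specific choice; the interface name is illustrative, and ECN plus the switch-side DCB settings must be configured to match):

```bash
# Enable PFC on priority 3 for the RoCE-facing interface
sudo mlnx_qos -i ens1f0 --pfc 0,0,0,1,0,0,0,0

# Trust DSCP markings so storage traffic lands in the intended priority
sudo mlnx_qos -i ens1f0 --trust dscp
```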

NVMe/InfiniBand

  • Inherently lossless, native RDMA.
  • Subnet Manager (SM) required.
  • Dedicated IB HCAs & switches. Ultra-low latency.
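For InfiniBand, a minimal health check is to confirm the HCA link is active and that a subnet manager is running; the sketch below assumes the software SM (`opensm`) runs on one node, whereas managed IB switches often embed their own SM:

```bash
# Verify HCA port state and link rate
ibstat

# Start the software subnet manager if the fabric has no embedded SM
sudo systemctl enable --now opensm
```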

NVMe/TCP

  • Standard Ethernet NICs & switches.
  • No lossless fabric requirement. Simpler deployment.
  • Jumbo Frames (MTU 9000) recommended.
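A minimal NVMe/TCP fabric preparation on an initiator might look like this (the interface name is illustrative; make the MTU change persistent through your distribution's network configuration, and match it on the switch ports and target):

```bash
# Enable jumbo frames end-to-end
sudo ip link set dev ens1f0 mtu 9000

# Load the NVMe/TCP transport module
sudo modprobe nvme-tcp
```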

2. NVMe-oF Target & Initiator Setup

Targets expose storage; H100 initiators connect to it (Report Sec 4.2, 4.3).

  • Storage Server (Target): exposes NVMe namespaces (Linux kernel target or PFS gateway).
  • H100 Node (Initiator): loads the transport modules and uses `nvme-cli` (`nvme discover`, `nvme connect`, `nvme list`).

Key steps: install `nvme-cli`, discover targets, connect to subsystems, and verify the devices appear; configure multipathing and persistent connections for HA. A worked sketch follows below.
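A hedged end-to-end sketch using the Linux kernel NVMe/TCP target (configured through configfs) and `nvme-cli` on an H100 node; the NQN, backing device, IP address, and port are illustrative:

```bash
### Storage server (Linux kernel NVMe/TCP target) ###
sudo modprobe nvme-tcp && sudo modprobe nvmet-tcp   # nvmet-tcp pulls in the core nvmet module

SUBSYS=/sys/kernel/config/nvmet/subsystems/nqn.2025-01.io.example:h100-pool
PORT=/sys/kernel/config/nvmet/ports/1

# Create the subsystem (open to any host here; restrict via allowed_hosts in production)
sudo mkdir -p "$SUBSYS"
echo 1 | sudo tee "$SUBSYS/attr_allow_any_host"

# Expose a local NVMe namespace through the subsystem
sudo mkdir -p "$SUBSYS/namespaces/1"
echo -n /dev/nvme0n1 | sudo tee "$SUBSYS/namespaces/1/device_path"
echo 1 | sudo tee "$SUBSYS/namespaces/1/enable"

# Create a TCP listener and bind the subsystem to it
sudo mkdir -p "$PORT"
echo tcp       | sudo tee "$PORT/addr_trtype"
echo ipv4      | sudo tee "$PORT/addr_adrfam"
echo 10.0.0.10 | sudo tee "$PORT/addr_traddr"
echo 4420      | sudo tee "$PORT/addr_trsvcid"
sudo ln -s "$SUBSYS" "$PORT/subsystems/$(basename "$SUBSYS")"

### H100 compute node (initiator) ###
sudo modprobe nvme-tcp
sudo nvme discover -t tcp -a 10.0.0.10 -s 4420
sudo nvme connect  -t tcp -a 10.0.0.10 -s 4420 -n nqn.2025-01.io.example:h100-pool
sudo nvme list      # the remote namespace appears as a local /dev/nvmeXnY block device
```

For persistence across reboots, the target-side layout is commonly captured with `nvmetcli save`, and initiators typically rely on `nvme connect-all` driven by entries in `/etc/nvme/discovery.conf`.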

3. Parallel File System Deployment

Layer the chosen PFS atop the NVMe-oF infrastructure (Report Sec 4.4):

  1. Install PFS server software (MDS, OSS/Storage Servers).
  2. Configure PFS servers to use NVMe-oF devices (now local to them) as backend.
  3. Install PFS client software on H100 compute nodes.
  4. Create the PFS namespace and mount it on H100 clients.

Specifics vary by PFS. The PFS servers themselves act as NVMe-oF initiators toward shared NVMe-oF targets (e.g., EBOFs, i.e., Ethernet Bunch of Flash enclosures, or storage arrays); a Lustre-flavoured sketch follows below.
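As one hedged illustration of steps 2 and 4 above, a Lustre OSS formats an NVMe-oF-attached device (assumed to appear locally as `/dev/nvme1n1` after the connect) as an OST, and an H100 node then mounts the resulting namespace (filesystem name, MGS NID, and paths are illustrative):

```bash
# On a Lustre OSS: use the NVMe-oF device as the OST backend
sudo mkfs.lustre --fsname=lfs01 --mgsnode=10.0.1.1@tcp --ost --index=0 /dev/nvme1n1
sudo mkdir -p /mnt/lustre/ost0
sudo mount -t lustre /dev/nvme1n1 /mnt/lustre/ost0

# On an H100 compute node: mount the file system via the Lustre client
sudo mount -t lustre 10.0.1.1@tcp:/lfs01 /mnt/lfs01
```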

Optimizing for Peak Performance & Reliability

Achieving optimal performance requires tuning at multiple levels, from PFS clients to system configurations for GDS, and robust monitoring (Report Sec 5).

PFS Client Tuning (H100 Nodes)

  • Adjust I/O request sizes, read-ahead, concurrency.
  • Ensure NUMA-awareness (pin I/O threads & buffers locally).
  • Optimize client-side caching (consider GDS interaction).
  • Use appropriate PFS-specific mount options. (Sec 5.1)
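For example, on a Lustre client several of these knobs map to `lctl set_param` tunables; the values below are illustrative starting points rather than recommendations, and BeeGFS or Spectrum Scale expose equivalent settings through their own tooling:

```bash
# Larger read-ahead window for streaming dataset reads
sudo lctl set_param llite.*.max_read_ahead_mb=1024

# More RPCs in flight per OST connection to raise concurrency
sudo lctl set_param osc.*.max_rpcs_in_flight=16

# Allow more dirty client-side cache per OSC before forcing writeback
sudo lctl set_param osc.*.max_dirty_mb=1024
```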

Leveraging Local NVMe (H100 Nodes)

  • Use as cache for frequently read data (e.g., DDN Hot Nodes).
  • Act as burst buffer for temporary I/O.
  • Stage input datasets for lowest read latency. (Sec 5.2)
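A simple staging pattern, assuming the local NVMe is mounted at `/mnt/local_nvme`, the PFS at `/mnt/lfs01`, and a job-defined `DATASET_DIR` variable (all three are illustrative):

```bash
# Stage the input dataset onto node-local NVMe before the job starts
mkdir -p /mnt/local_nvme/$USER/dataset
rsync -a /mnt/lfs01/datasets/train/ /mnt/local_nvme/$USER/dataset/

# Point the job at the local copy for the lowest read latency
export DATASET_DIR=/mnt/local_nvme/$USER/dataset
```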

NUMA Affinity & I/O Path

  • Pin apps, GPU contexts, NIC interrupts, memory to same NUMA node.
  • Ensure GPU & NVMe-oF NIC are on same CPU socket/PCIe root for GDS. (Sec 5.3)
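The GPU-to-NIC topology and NUMA placement can be inspected and enforced with standard tools; the interface name, node IDs, and the `./data_loader` binary below are illustrative:

```bash
# Show the GPU <-> NIC PCIe/NUMA affinity matrix
nvidia-smi topo -m

# NUMA node of the NVMe-oF NIC
cat /sys/class/net/ens1f0/device/numa_node

# Pin an I/O-heavy process and its memory allocations to the matching NUMA node
numactl --cpunodebind=0 --membind=0 ./data_loader
```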

GPUDirect Storage (GDS) Best Practices

  • Verify stack compatibility (PFS, drivers, hardware).
  • Disable PCIe ACS on relevant bridges; check IOMMU settings.
  • Register GPU memory buffers (`cuFileBufRegister`).
  • Use `gdscheck` and `gdsio` for verification. (Sec 5.4)
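A hedged verification pass with the GDS tooling shipped in the CUDA toolkit; install paths and `gdsio` flags vary by CUDA version, and the target file path is illustrative:

```bash
# Report GDS platform support: driver, file system, PCIe topology, ACS/IOMMU status
/usr/local/cuda/gds/tools/gdscheck -p

# Synthetic GDS write test against the PFS mount (GPU 0, 8 workers, 10 GiB file, 1 MiB I/Os)
/usr/local/cuda/gds/tools/gdsio -f /mnt/lfs01/gds_test -d 0 -w 8 -s 10G -i 1M -x 0 -I 1
```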

Monitoring & Troubleshooting Key Metrics

Monitor across layers (Sec 5.5):

  • NVMe-oF: Latency, IOPS, Throughput, Queue Depths.
  • Network: Bandwidth, Errors, PFC/ECN stats (RoCE).
  • PFS: Client/Server stats, Metadata ops.
  • H100 Clients: CPU/GPU utilization, Memory.
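A few widely available commands give a first cut at these metrics (device and interface names are illustrative; PFS-specific counters come from the file system's own tools, e.g., Lustre's `lctl get_param`):

```bash
# NVMe device health and error counters
sudo nvme smart-log /dev/nvme0n1

# Block-level latency, IOPS, throughput, and queue depth
iostat -xm 5

# NIC counters, including pause frames/discards relevant to RoCE PFC
ethtool -S ens1f0 | grep -iE 'pause|discard|drop'

# GPU utilization, memory, power, and clocks on the H100 clients
nvidia-smi dmon -s pucm
```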

The Future is Fast: Concluding Thoughts & Outlook

Shared file systems using NVMe-oF are pivotal for H100 performance, enabling storage disaggregation and direct GPU data paths via GDS. While implementation requires careful planning, the benefits are substantial. The landscape continues to evolve (Report Sec 6):

NVMe-oF Transport Evolution

NVMe/TCP performance improvements, simplified RoCE, next-gen InfiniBand.

Deeper PFS & GDS Integration

Streamlined configurations, intelligent data placement aware of GPU/fabric topology.

Compute Express Link (CXL)

Cache-coherent memory sharing, new paradigms for fabric-attached memory/storage tiers.

AI for Storage Management (AIOps)

AI/ML for performance optimization, predictive maintenance, automated troubleshooting.

The trend points towards intelligent, disaggregated, and deeply integrated data ecosystems to meet the challenges of exascale computing and beyond.