CUDA Programming Model Explorer

An interactive guide to Grids, Blocks, Threads, Warps, and Memory.

Configure Your Thread Block

In CUDA, developers specify the number of threads per block. This choice directly impacts how the GPU hardware organizes these threads into Warps (groups of 32) for SIMT execution.

Use the slider below to visualize this relationship and its performance implications.

Slider range: 32 to 512 threads. With the slider set to 256, the readouts are:

Threads per Block: 256
Warps per Block: 8
Warp Size (Fixed): 32
Inactive Thread Slots: 0

Performance Implication:

A block size of 256 threads is a common and efficient choice. Because 256 is an exact multiple of the warp size of 32, it creates 8 full warps, so no hardware resources are wasted on inactive thread slots in the last warp.

Logical View: Programmer's Thread Block

This represents the block of threads as defined by the programmer. Each small square is a thread, colored by the warp it will belong to.

Physical View: Hardware Warp Grouping

The GPU hardware partitions the block's threads into fixed-size Warps of 32. This is how the hardware schedules and executes them.

CUDA Concepts Explained

The CUDA programming model organizes parallel computation through a hierarchy: Grids, Thread Blocks, and Threads. This structure allows developers to manage and scale parallelism effectively on the GPU.

Grid: A kernel is launched as a grid of thread blocks. A grid can be 1D, 2D, or 3D, allowing a natural mapping of computation to the problem's data structure (e.g., a 2D grid for image processing). All blocks in a grid run the same kernel code.

Thread Block: A thread block is a group of threads that execute concurrently on the same Streaming Multiprocessor (SM). Threads within a block can cooperate by sharing data via fast on-chip shared memory and can synchronize their execution using barriers like __syncthreads(). Blocks can also be 1D, 2D, or 3D.
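As a sketch of this cooperation, here is a hypothetical kernel (the name reverse_tile and the tile size are illustrative) in which each 256-thread block stages data into shared memory and uses the __syncthreads() barrier before any thread reads a peer's element:

```cuda
// Sketch: each block reverses a 256-element tile in place.
// Assumes it is launched with blockDim.x == 256.
__global__ void reverse_tile(int *data)
{
    __shared__ int tile[256];                // fast on-chip, one copy per block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[i];             // stage my element into shared memory

    __syncthreads();                         // barrier: all writes above complete
                                             // before any thread reads below

    data[i] = tile[blockDim.x - 1 - threadIdx.x];  // read a peer's element
}
```

Without the barrier, a thread could read a tile slot before the owning thread has written it.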

Thread: The fundamental unit of parallel execution. Each thread executes an instance of the kernel function. Threads are identified within their block by a unique threadIdx (which can be 1D, 2D, or 3D). A global thread ID can be computed using threadIdx and blockIdx (the block's unique ID within the grid).
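The standard 1D global-ID computation looks like this in a kernel (the kernel name scale is illustrative); the bounds check handles the common case where the array length is not a multiple of the block size:

```cuda
// Sketch: each thread scales one array element.
__global__ void scale(float *x, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID
    if (i < n)                                      // last block may overhang
        x[i] *= a;
}

// Host-side launch (illustrative): round the grid size up
// so every element gets a thread.
//   scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f);
```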

Visualizing a 2D Grid of 2D Blocks:

The diagram below illustrates how a 2D grid is composed of 2D blocks, and how threads within a block are indexed. This is analogous to Figure 3 in the NVIDIA "CUDA Refresher" blog post.

Grid (e.g., gridDim(3,2)):

  blockIdx(0,0)  blockIdx(1,0)  blockIdx(2,0)
  blockIdx(0,1)  blockIdx(1,1)  blockIdx(2,1)

Inside one block (e.g., blockDim(2,2)):

  t(0,0)  t(1,0)
  t(0,1)  t(1,1)

Each blockIdx(x,y) is a thread block; the expanded block shows its threads t(x,y), i.e., their threadIdx(x,y) values.
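A kernel matching this layout computes each thread's global (x, y) coordinate from its block and thread indices; the kernel name label_pixels and the row-major store are illustrative:

```cuda
// Sketch for gridDim = (3,2) and blockDim = (2,2):
// global width = 3 * 2 = 6 columns, height = 2 * 2 = 4 rows.
__global__ void label_pixels(int *out, int width)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // global column, 0..5
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // global row,    0..3
    out[y * width + x] = y * width + x;             // row-major store
}

// Host-side launch (illustrative):
//   dim3 grid(3, 2), block(2, 2);
//   label_pixels<<<grid, block>>>(out, 6);
```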