Unlock Peak Performance
Modern CPUs are incredibly fast, but they are often left starved for data by slow main memory. This guide provides an interactive exploration of how to write cache-efficient C code, helping you break through the "Memory Wall" and make your programs run dramatically faster.
The Memory Hierarchy
Data is stored in a hierarchy of levels: each level closer to the CPU is faster but smaller than the one below it. The goal is to keep frequently used data as close to the CPU as possible. Hover over each level to see its typical characteristics.
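You can observe these levels from software. The sketch below (illustrative, not part of the interactive demo) sweeps an array's size and times a sequential pass over it; the cost per element steps upward each time the working set outgrows a cache level. Hardware prefetching softens the steps on a sequential walk, but the trend is usually still visible:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Time a repeated sequential walk over `n` ints and return ns/element.
   Build with: cc -O2 walk.c */
static double ns_per_element(const int *buf, size_t n, size_t iters) {
    struct timespec t0, t1;
    long sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t it = 0; it < iters; it++)
        for (size_t i = 0; i < n; i++)
            sum += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    volatile long sink = sum;  /* keep the compiler from deleting the loop */
    (void)sink;
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / ((double)n * (double)iters);
}

int main(void) {
    /* Working sets from 4 KiB (fits in L1) up to 64 MiB (far beyond L3). */
    for (size_t kib = 4; kib <= 64 * 1024; kib *= 4) {
        size_t n = kib * 1024 / sizeof(int);
        size_t iters = (256u << 20) / (kib * 1024);  /* ~256 MiB of traffic per size */
        int *buf = malloc(n * sizeof(int));
        for (size_t i = 0; i < n; i++)
            buf[i] = (int)i;
        printf("%6zu KiB: %.2f ns/element\n", kib, ns_per_element(buf, n, iters));
        free(buf);
    }
    return 0;
}
```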
Core Optimization Techniques
You can actively guide the compiler and CPU by structuring your data and code thoughtfully. The following techniques demonstrate how to organize data and access it efficiently.
Array of Structs (AoS) vs. Struct of Arrays (SoA)
The way you group your data has a huge impact on cache performance. Choose a layout that matches how you access the data. Below, we want to process only the `x` coordinates of 3D points.
Array of Structs (AoS)
Good for accessing all fields of one object at a time (`p.x, p.y, p.z`).
Struct of Arrays (SoA)
Ideal for processing one field across all objects (all `x`'s).
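Here is a minimal sketch of both layouts for the x-coordinate pass described above (names are illustrative). The AoS loop drags unused `y` and `z` bytes through the cache, while the SoA loop uses every byte of every line it loads:

```c
#include <stddef.h>

#define N 1024

/* AoS: x, y, z for one point sit together in memory. Touching only x
   still pulls y and z into the cache alongside it. */
struct PointAoS { float x, y, z; };
struct PointAoS points_aos[N];

/* SoA: all x's are contiguous, so a pass over x is pure stride-1. */
struct PointsSoA {
    float x[N];
    float y[N];
    float z[N];
};
struct PointsSoA points_soa;

float sum_x_aos(void) {
    float sum = 0.0f;
    for (size_t i = 0; i < N; i++)
        sum += points_aos[i].x;  /* 12-byte stride: 2/3 of each cache line wasted */
    return sum;
}

float sum_x_soa(void) {
    float sum = 0.0f;
    for (size_t i = 0; i < N; i++)
        sum += points_soa.x[i];  /* 4-byte stride: cache- and SIMD-friendly */
    return sum;
}
```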
Loop Ordering
In C's row-major memory layout, accessing elements in the same row is fast (stride-1 access), while jumping between rows is slow. Watch how changing the loop order affects the memory access pattern on a 2D matrix.
Cache-Friendly (Row-wise)
for (row)... for (col)...
Cache-Unfriendly (Column-wise)
for (col)... for (row)...
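In C, the two patterns differ only in which loop is outermost. A minimal sketch:

```c
#define ROWS 1024
#define COLS 1024

float m[ROWS][COLS];

/* Cache-friendly: the inner loop walks along a row, touching
   consecutive addresses (stride-1). */
void scale_row_wise(float f) {
    for (int row = 0; row < ROWS; row++)
        for (int col = 0; col < COLS; col++)
            m[row][col] *= f;
}

/* Cache-unfriendly: the inner loop jumps COLS * sizeof(float) bytes
   between accesses, landing on a new cache line every time. */
void scale_col_wise(float f) {
    for (int col = 0; col < COLS; col++)
        for (int row = 0; row < ROWS; row++)
            m[row][col] *= f;
}
```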
Case Study: Matrix Multiplication
Matrix multiplication is a classic problem where cache optimization yields massive performance gains. The chart below shows the relative performance of different algorithms. Select an algorithm from the dropdown to see an explanation and, for the Naive and Ordered variants, a visualization of its inner-loop access pattern.
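As a rough sketch of those first two variants (names here follow the chart's Naive/Ordered labels): the naive `i, j, k` order walks down a column of `B` in its inner loop, while the `i, k, j` reordering makes every inner-loop access stride-1:

```c
#define N 512

float A[N][N], B[N][N], C[N][N];

/* Naive (i, j, k): the inner loop strides down a column of B,
   missing the cache on nearly every access for large N. */
void matmul_naive(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}

/* Ordered (i, k, j): both B and C are read and written row-wise in the
   inner loop, so every access is stride-1. Assumes C starts zeroed
   (true for globals; clear it before re-running). */
void matmul_ordered(void) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            float a = A[i][k];
            for (int j = 0; j < N; j++)
                C[i][j] += a * B[k][j];
        }
}
```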
Common Pitfalls
Even with good intentions, it's easy to fall into performance traps. These are subtle issues that arise from how hardware works. Click to expand each topic.
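The individual topics live in the expandable panels, but as one canonical example of a hardware-driven trap, consider false sharing: two threads writing to different variables that happen to share a cache line, forcing that line to ping-pong between cores. A minimal sketch, assuming a 64-byte cache line and POSIX threads (compile with `cc -O2 -pthread`):

```c
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000L

/* Adjacent counters share one cache line: writes from two cores force
   the line to bounce between them. */
struct { long a, b; } shared;

/* The fix: align each hot counter to its own 64-byte line. */
struct { _Alignas(64) long a; _Alignas(64) long b; } padded;

static void *bump(void *p) {
    volatile long *c = p;  /* volatile: keep every store in the loop */
    for (long i = 0; i < ITERS; i++)
        (*c)++;
    return NULL;
}

static void race(long *x, long *y, const char *label) {
    struct timespec t0, t1;
    pthread_t ta, tb;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&ta, NULL, bump, x);
    pthread_create(&tb, NULL, bump, y);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("%-15s %.2f s\n", label,
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
}

int main(void) {
    race(&shared.a, &shared.b, "false sharing:");  /* typically much slower */
    race(&padded.a, &padded.b, "padded:");
    return 0;
}
```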