Unlock Peak Performance

Modern CPUs are incredibly fast, but they are often left waiting for data from slow main memory. This guide explores how to write cache-efficient C code, bridging the "Memory Wall" to make your programs run dramatically faster.

The Memory Hierarchy

Data is stored in levels, each faster but smaller than the last. The goal is to keep frequently used data as close to the CPU as possible. Typical characteristics of each level (sizes and latencies are rough, generation-dependent figures):

- CPU Core (registers): accessed in a single cycle, but holds only a few hundred bytes.
- L1 Cache: typically 32-64 KB per core, ~4 cycles of latency.
- L2 Cache: typically 256 KB-1 MB per core, ~12 cycles.
- L3 Cache: several MB, shared across cores, ~40 cycles.
- Main Memory (DRAM): gigabytes in size, but hundreds of cycles (~100 ns) away.
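
To feel the hierarchy in action, traverse the same data with two different strides. The sketch below is a minimal, illustrative benchmark (the 4096x4096 matrix size is arbitrary, and it assumes POSIX `clock_gettime`): the row-major pass walks consecutive cache lines, while the column-major pass jumps 16 KB per access, so on most machines the first runs several times faster.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4096

static double elapsed_ms(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void) {
    /* One contiguous N x N matrix, indexed as m[i * N + j]. */
    int *m = malloc((size_t)N * N * sizeof *m);
    if (!m) return 1;
    for (size_t i = 0; i < (size_t)N * N; i++) m[i] = 1;

    struct timespec t0, t1;
    long long sum = 0;

    /* Row-major: consecutive elements share cache lines. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += m[i * N + j];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("row-major:    %.1f ms (sum=%lld)\n", elapsed_ms(t0, t1), sum);

    /* Column-major: each access lands N * sizeof(int) bytes away,
       so nearly every access misses in the cache. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum += m[i * N + j];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("column-major: %.1f ms (sum=%lld)\n", elapsed_ms(t0, t1), sum);

    free(m);
    return 0;
}
```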

Core Optimization Techniques

You can actively guide the compiler and CPU by structuring your data and code thoughtfully. The following techniques demonstrate how to organize data and access it efficiently.

Array of Structs (AoS) vs. Struct of Arrays (SoA)

The way you group your data has a huge impact on cache performance. Choose a layout that matches how you access the data. Below, we want to process only the `x` coordinates of 3D points.

Array of Structs (AoS)

Good for accessing all fields of one object at a time (`p.x, p.y, p.z`).

Struct of Arrays (SoA)

Ideal for processing one field across all objects (all `x`'s).

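As a concrete sketch of the two layouts (the struct names and sizes here are illustrative, not from a particular library), the functions below perform the `x`-only pass described above. With AoS, roughly two thirds of every fetched cache line is `y` and `z` data the loop never reads; with SoA, every byte fetched is useful, and the loop vectorizes easily.

```c
#include <stddef.h>

#define N_POINTS 1024

/* Array of Structs: x, y, z of one point are adjacent in memory. */
struct PointAoS { float x, y, z; };
struct PointAoS points_aos[N_POINTS];

/* Struct of Arrays: all x's are adjacent, then all y's, then all z's. */
struct PointsSoA {
    float x[N_POINTS];
    float y[N_POINTS];
    float z[N_POINTS];
};
struct PointsSoA points_soa;

/* AoS pass: a 64-byte cache line holds only ~5 structs, and 8 of
   every 12 bytes loaded are y/z values this loop never uses. */
float sum_x_aos(void) {
    float sum = 0.0f;
    for (size_t i = 0; i < N_POINTS; i++)
        sum += points_aos[i].x;
    return sum;
}

/* SoA pass: the x array is contiguous, so every fetched cache line
   is 100% useful data, and the stride-1 pattern vectorizes easily. */
float sum_x_soa(void) {
    float sum = 0.0f;
    for (size_t i = 0; i < N_POINTS; i++)
        sum += points_soa.x[i];
    return sum;
}
```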

Case Study: Matrix Multiplication

Matrix multiplication is a classic problem where cache optimization yields massive performance gains. The naive `i-j-k` loop order strides down a column of `B` in its inner loop, missing the cache on nearly every access; simply reordering the loops to `i-k-j` makes both inner-loop accesses sequential, and blocking (tiling) goes further by keeping small sub-matrices resident in cache.

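Here is a minimal sketch of the first two variants (the matrix size is illustrative, and blocking/tiling is omitted for brevity):

```c
#define N 512

/* Naive i-j-k order: the inner loop reads B[k][j] with stride N,
   so almost every iteration touches a new cache line of B. */
void matmul_naive(float A[N][N], float B[N][N], float C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}

/* Reordered i-k-j: the inner loop walks B[k][j] and C[i][j]
   sequentially, so both stream through consecutive cache lines. */
void matmul_ordered(float A[N][N], float B[N][N], float C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = 0.0f;
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            float a = A[i][k];
            for (int j = 0; j < N; j++)
                C[i][j] += a * B[k][j];
        }
}
```

The two functions compute the same result; only the loop order, and therefore the memory access pattern, differs.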

Common Pitfalls

Even with good intentions, it's easy to fall into performance traps. These are subtle issues that arise from how the hardware works.

False Sharing

This occurs in multi-threaded code: if two threads modify independent variables that happen to sit on the same cache line, the hardware's cache-coherency protocol forces constant, expensive invalidations, creating a "phantom" bottleneck.

Solution: Use padding to push variables used by different threads onto separate cache lines.
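
As a sketch of that fix (the 64-byte line size and the counter layout are assumptions; C11's `alignas` is used for alignment), each per-thread counter below gets its own cache line:

```c
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE 64  /* typical x86-64 line size; verify for your target */

/* Bad: adjacent counters share a cache line, so two threads
   incrementing counters_shared[0] and counters_shared[1]
   ping-pong that line between their cores. */
uint64_t counters_shared[8];

/* Good: each counter is aligned to, and padded out to, its own
   cache line, so the threads never invalidate each other's lines. */
struct padded_counter {
    alignas(CACHE_LINE) uint64_t value;
    char pad[CACHE_LINE - sizeof(uint64_t)];
};
struct padded_counter counters_padded[8];
```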

Cache Thrashing (Conflict Misses)

This happens when your access pattern causes different memory locations to map to the same cache set (caches are divided into sets, and each address can be cached in only one of them). The locations repeatedly evict each other, producing constant misses even when the data would otherwise fit in the cache.

Solution: Change data layout, add padding to alter memory addresses, or reorder data access to break the conflicting pattern.
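
This classically bites power-of-two strides. In the sketch below (the matrix dimensions and padding amount are illustrative), the columns of a 4096-wide `float` matrix are exactly 16 KB apart, so successive elements of one column tend to map to the same cache set; padding each row by a few elements spreads them across different sets:

```c
#define ROWS 1024
#define COLS 4096          /* power-of-two row length: column elements collide */
#define PAD  16            /* a few extra elements per row breaks the pattern */

/* Walking down a column advances the address by COLS * sizeof(float)
   = 16 KB per step. Addresses that far apart often map to the same
   cache set, so the column's elements keep evicting each other. */
float grid[ROWS][COLS];

/* Same data, but each row carries PAD unused elements. Column
   elements are now (COLS + PAD) * sizeof(float) apart, which
   distributes them across different cache sets. */
float grid_padded[ROWS][COLS + PAD];

float sum_column(int j) {
    float sum = 0.0f;
    for (int i = 0; i < ROWS; i++)
        sum += grid_padded[i][j];  /* padded layout: far fewer conflict misses */
    return sum;
}
```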