Cache Crash Course - Conclusion (Part 5)

The conclusion to our deep dive into the CPU cache.

  • 2020/07/07

New here? Here's part one, which you should read first if you haven't already.

In the previous sections we explored the impact of the CPU data and instruction caches on performance.

So, what have we learned?

Profile! Profile! Profile!

On more than one occasion while working on the benchmark, I discovered that a seemingly innocuous change had a non-trivial negative impact on performance. Nothing illustrates this more clearly than the jump from Version 1 to Version 2.
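The series' benchmark code isn't reproduced here, but if you want a starting point for your own measurements, a minimal timing harness looks something like the sketch below. This is just an illustration, not the series' actual harness: `workload` is a hypothetical stand-in for whatever you want to measure, and for anything serious you should reach for a real profiler rather than hand-rolled timers.

```cpp
#include <chrono>
#include <cstdio>
#include <numeric>
#include <vector>

// Hypothetical workload: sum a large vector. Substitute whatever code
// you actually care about measuring.
long long workload() {
    std::vector<int> data(1'000'000, 1);
    return std::accumulate(data.begin(), data.end(), 0LL);
}

int main() {
    using clock = std::chrono::steady_clock;

    const auto start = clock::now();
    const long long result = workload();
    const auto stop = clock::now();

    // Print the result too, so the optimizer can't discard the work.
    const auto us = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
    std::printf("result=%lld, took %lld us\n",
                result, static_cast<long long>(us.count()));
}
```

Even a crude timer like this would have flagged the Version 1 to Version 2 regression immediately, which is the whole point: measure first, guess never.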

Predictability is key to unlocking CPU performance

The performance impact of switching from a list to a vector is astonishing. Given that we’re traversing the container in order and accessing every element, this is the best case for both a list and an array, so the gap on show here demonstrates just how much predictability matters. Modern CPUs are fantastic at predicting memory read patterns and prefetching data before it’s asked for.

Now, you might have spotted the random-sized allocation in there, added to simulate something vaguely resembling a normal memory layout rather than a pristine virtual memory space. However, it actually has less impact than you’d think. I fully expected Version 2 to pull closer to Version 3 once this randomness was eliminated (as we’d expect the allocations to land very close together), but that isn’t actually the case – at least not on my hardware and OS.
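To make the list-versus-vector comparison concrete, here is a minimal sketch of the kind of traversal being measured. This isn’t the series’ benchmark code; it’s a self-contained illustration of the access patterns involved.

```cpp
#include <chrono>
#include <cstdio>
#include <list>
#include <numeric>
#include <vector>

// Traverse a container linearly and sum its elements.
template <typename Container>
long long sum(const Container& c) {
    return std::accumulate(c.begin(), c.end(), 0LL);
}

int main() {
    constexpr int kCount = 10'000'000;

    // std::vector: elements are contiguous, so the hardware prefetcher
    // can stream cache lines ahead of the traversal.
    std::vector<int> vec(kCount, 1);

    // std::list: every node is a separate allocation, so each step is a
    // dependent pointer load to a potentially distant cache line.
    std::list<int> lst(vec.begin(), vec.end());

    using clock = std::chrono::steady_clock;

    const auto t0 = clock::now();
    const long long vecSum = sum(vec);
    const auto t1 = clock::now();
    const long long lstSum = sum(lst);
    const auto t2 = clock::now();

    auto ms = [](auto a, auto b) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(b - a).count();
    };
    std::printf("vector: sum=%lld in %lld ms\n", vecSum, static_cast<long long>(ms(t0, t1)));
    std::printf("list:   sum=%lld in %lld ms\n", lstSum, static_cast<long long>(ms(t1, t2)));
}
```

Note that because this sketch allocates all the list nodes in one burst, a typical allocator will hand back nearly contiguous nodes – much like the “randomness removed” case described above – and on most systems the list still loses badly, because the CPU can’t prove the next node is adjacent and has to wait on each pointer load.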

Don’t update what you don’t need to update

I didn’t include this in the benchmark itself, but the fastest way to speed it up is simply to do less work. As a quick test, this drops the runtime to about 0.1 ms with all the other optimizations applied.

This may seem obvious, but if nothing interacts with your data, then it probably doesn’t need an update. Likewise, if an object isn’t visible to the user and can’t move, could you defer its update until the user next sees it and, if necessary, account for the lost update time?
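A common way to express this is a dirty flag combined with a visibility check. The sketch below uses hypothetical names, not anything from the series’ benchmark, and assumes your game loop passes in a current time value.

```cpp
#include <cstdio>

// A minimal sketch of skipping updates for objects that nothing has
// touched and nobody can see, then catching up on the skipped time.
struct Object {
    bool dirty = false;        // set by anything that interacts with the object
    bool visible = false;      // set by your visibility/culling pass
    double lastUpdateTime = 0.0;

    void update(double now) {
        // Untouched and unseen: skip the work entirely.
        if (!dirty && !visible) {
            return;
        }

        // Catch up for however long we skipped, then do the real work.
        const double elapsed = now - lastUpdateTime;
        step(elapsed);

        lastUpdateTime = now;
        dirty = false;
    }

    void step(double elapsed) {
        std::printf("updating with elapsed = %f\n", elapsed);
    }
};

int main() {
    Object obj;
    obj.update(1.0);           // skipped: clean and invisible
    obj.visible = true;
    obj.update(2.0);           // runs, covering the skipped interval
}
```

The trade-off is that the first update after a long skip has to handle a large elapsed time sensibly, so this only works for objects whose state can be fast-forwarded.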

Padding

The more useful data you can cram into the cache lines you fetch, the fewer cache misses you’ll have to wait on. Not every cache miss incurs an immediate stall, but each one may displace other data in the cache that you’ll need later. This means that optimizing cache usage in one area can improve cache performance in another.
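One cheap win here is the padding the compiler inserts inside your structs. The example below uses a hypothetical struct, not one from the benchmark, and the exact sizes depend on your platform’s ABI; but simply ordering members from largest to smallest often shrinks an object enough that more of them fit in each 64-byte cache line.

```cpp
#include <cstdint>
#include <cstdio>

// Members ordered carelessly: the compiler inserts padding to keep
// the 8-byte doubles aligned.
struct Loose {
    bool          alive;     // 1 byte + 7 bytes padding
    double        position;  // 8 bytes
    bool          visible;   // 1 byte + 1 byte padding
    std::uint16_t id;        // 2 bytes + 4 bytes padding
    double        velocity;  // 8 bytes
};

// Same members, largest first: far less padding, so more objects fit
// per cache line fetched.
struct Packed {
    double        position;
    double        velocity;
    std::uint16_t id;
    bool          alive;
    bool          visible;   // + padding to round the struct up to 24
};

int main() {
    std::printf("Loose:  %zu bytes\n", sizeof(Loose));   // typically 32
    std::printf("Packed: %zu bytes\n", sizeof(Packed));  // typically 24
}
```

The same members, the same data, yet roughly a third more objects per cache line in the packed layout on a typical 64-bit target.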

Seriously, have you profiled it yet?

I can’t stress enough how important this is! Likewise, make sure the test cases you’re examining are realistic. Is your test representative of how real users will actually use your app or game? Don’t optimize cases that don’t matter.

The end

We’ve reached the end of our journey into the cache. I hope you learned something along the way because I certainly did!

If you want to continue your journey onward without me, I suggest reading further into data-oriented design. Game Programming Patterns by Robert Nystrom is excellent, and its Data Locality chapter is especially relevant here. Scott Meyers also gave a talk titled CPU Caches and Why You Care that is well worth your time. Finally, it may be worth a quick peek into the world of cache-oblivious algorithms. This blog post explores the topic and links to additional resources.

Stay safe and don’t splash that cache!