Cache Crash Course - Benchmarking (Part 4)
Part four of our deep dive into the CPU cache and how to optimize for it. This part puts the concepts together and walks through a semi-realistic scenario of going from cache zero to cache hero.
New here? Here's part one, which you should read first if you haven't already.
By now you should understand the concepts of caching and be ready to apply them to your own code. In this section we will explore the real-world impact of these changes in a semi-realistic scenario. What most people don't realize is just how large that impact can be; I certainly didn't. I was expecting big wins, but the numbers I encountered while writing this benchmark still shocked me.
So, what happens if you apply all these rules? Or rather, what happens if we do the exact opposite and then fix it? I wrote a small benchmark representative of all the mistakes covered in the earlier parts, then fixed them all in a series of changes. It's a bit contrived, but bear with me.
These are the rules I stuck to:
- No data can be lost. All the data present in the app is assumed useful to the program, but we don’t care about how that data is accessed (or initialized) outside of our “hot path.”
- This is not a general optimization demo, so I can’t modify the core of the algorithm, just how the data it needs is accessed.
- Because we don't necessarily need to write the result for every object, we pick a subset of objects, based on their index, that write their result back. You can pretend this selection is based on the value written or some other Boolean. This is controlled by the PARTIAL_WRITEBACK define (see the sketch after this list).
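As a rough illustration of that last rule, here is a minimal sketch of index-based partial write-back; the Object type, Compute function, and selection predicate are all hypothetical stand-ins, not the benchmark's actual identifiers:

```cpp
#include <cstddef>
#include <vector>

#define PARTIAL_WRITEBACK 1

// Hypothetical stand-ins for the benchmark's real types.
struct Object { float value; };
float Compute(const Object& o) { return o.value * 2.0f; }

void Run(const std::vector<Object>& objects, std::vector<float>& results)
{
    for (std::size_t i = 0; i < objects.size(); ++i)
    {
        const float r = Compute(objects[i]);
#if PARTIAL_WRITEBACK
        // Only a subset of objects, chosen purely by index,
        // write their result back.
        if ((i & 3) == 0)
            results[i] = r;
#else
        results[i] = r;
#endif
    }
}
```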
The code
I've uploaded the code for this benchmark to GitHub if you wish to run it on your own machine or experiment beyond what I've outlined here.
The different Version header files in this folder represent the modifiable part of the benchmark; the rest stays the same across runs. The interface is a little unusual, but it was structured this way to allow every aspect of the data access to be manipulated across the versions. What follows is a brief summary of the changes made in each version, each building on the previous one.
Version 1
This is our starting point. Imagine this codebase has evolved over several years and through several developers. Every bad practice seen here is one I've seen in real production code. In fact, this isn't even that bad by comparison...
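The real code is on GitHub, but a hypothetical sketch of the kind of layout Version 1 represents might look like this: a separately allocated Transform behind a pointer, a 4-byte Boolean type, hot and cold members interleaved, and everything stored in a std::list.

```cpp
#include <list>

using BOOL32 = int;               // 4-byte Boolean, mandated by "coding standards"

struct Transform                  // separately allocated, reached via a pointer
{
    float position[3];
    float rotation[4];
};

struct GameObject
{
    BOOL32     active;
    char       name[64];          // cold data mixed in with hot data
    Transform* transform;         // extra indirection on the hot path
    double     lastUpdateTime;
    BOOL32     dirty;
    float      result;
};

std::list<GameObject> g_objects;  // nodes scattered across the heap
```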
Version 2
Here we remove the separate Transform object, placing the data it contains directly alongside the other data we use and removing a level of indirection. Now, something interesting happens: performance actually decreases!
This version is ~8% slower than Version 1, even though we removed an indirection. Flattening the Transform grew the struct size by 2%, which caught me off guard as a disproportionate impact; if we remove 16 bytes from somewhere else in the struct, the performance penalty disappears. There's no immediate performance benefit from this change like I was expecting, but at least, after accounting for the size increase, we haven't regressed. You may still see a performance benefit from flattening your own data, however.
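Continuing the hypothetical layout from the Version 1 sketch, the change amounts to embedding the Transform's members directly:

```cpp
using BOOL32 = int;  // as in the Version 1 sketch

// Version 2 (sketch): the Transform's members are embedded directly
// in the object instead of being reached through a pointer.
struct GameObject
{
    BOOL32 active;
    char   name[64];
    float  position[3];   // formerly behind transform->
    float  rotation[4];
    double lastUpdateTime;
    BOOL32 dirty;
    float  result;
};
```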
Version 3
Now we swap the std::list for a std::vector. That's the only change, and it has the largest performance impact of all of them: depending on hardware, it is anywhere from 2.5x to 60x faster! This demonstrates how much memory read predictability matters to performance.
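The swap itself is a one-line change. What matters is the memory layout behind it: a std::vector stores its elements contiguously, so iteration is a linear walk the hardware prefetcher can stay ahead of, while std::list nodes can be scattered anywhere on the heap. A minimal sketch (the GameObject placeholder is hypothetical):

```cpp
#include <list>
#include <vector>

struct GameObject { float result; /* ... */ };  // hypothetical placeholder

// Before: each element lives in its own heap node, so iterating the
// list chases pointers to unpredictable addresses.
// std::list<GameObject> g_objects;

// After: elements sit in one contiguous allocation, so iteration is a
// predictable linear walk through memory.
std::vector<GameObject> g_objects;
```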
Version 4
Now we tidy up the mess of data and reorder it to reduce padding. Depending on memory bandwidth this can have a decent performance impact, as you reduce the amount of space wasted on padding in the cache lines fetched. My high-end CPU and fast memory throughput diminish the measured gain here, however.
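To illustrate the idea (a contrived example, not the benchmark's actual struct): ordering members from largest alignment requirement to smallest removes most compiler-inserted padding. On a typical 64-bit target:

```cpp
#include <cstdint>

// Padded: 24 bytes on a typical 64-bit target.
struct Padded
{
    std::uint8_t  a;   // 1 byte + 7 bytes padding before b
    double        b;   // 8 bytes
    std::uint16_t c;   // 2 bytes + 6 bytes tail padding
};

// Reordered largest-first: 16 bytes, same members.
struct Packed
{
    double        b;
    std::uint16_t c;
    std::uint8_t  a;   // 5 bytes tail padding remain
};

static_assert(sizeof(Packed) < sizeof(Padded), "reordering removed padding");
```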
Version 5
Here we swap the 4-byte Boolean type for a 1-byte one. The history of how this might have made it into coding standards is complex, but I've worked with coding standards that mandated using a 32-bit type as a Boolean.
Again, the gains here are minimal unless there are a large number of Booleans in use, in which case the small savings can add up.
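A sketch of the change, assuming the common case where the built-in bool occupies a single byte:

```cpp
#include <cstdint>

// Before: a 32-bit "Boolean", as some coding standards mandate.
// using Flag = std::int32_t;

// After: the built-in 1-byte bool.
using Flag = bool;

struct Flags
{
    Flag visible;
    Flag dirty;
    Flag active;
    Flag pendingDelete;
};
// With 32-bit Flags this struct would be 16 bytes; with bool it is 4.
```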
Version 6
Next, we move the data we don't need on our "hot path" out into a separate structure, keeping a pointer to it from our primary structure. I found that in some cases this was actually slower (despite our structure shrinking from 656 bytes to 88 bytes), but those remain edge cases; in most cases you will see a performance boost from making this change. And if you do hit one of the edge cases it should be obvious, because you'll have profiled the change, right?
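A hypothetical sketch of the hot/cold split (these names are illustrative, not the benchmark's):

```cpp
#include <vector>

// Cold data: needed by the program, but never touched on the hot path.
struct GameObjectCold
{
    char   name[64];
    double lastUpdateTime;
    // ... everything else the hot loop never reads
};

// Hot data: small, so more objects fit in each cache line fetched.
struct GameObjectHot
{
    float           position[3];
    float           rotation[4];
    bool            active;
    float           result;
    GameObjectCold* cold;   // still reachable, just out of the way
};

std::vector<GameObjectHot> g_objects;
```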
Version 7
In version 7 we go further and batch data that is used together into separate arrays. The pointer to the "other" data has been removed from the primary structure entirely; instead we rely on the object's index to access the other data sets.
The result data is also written directly to its own array. We likely get additional benefits here by never reading the result value, only writing it directly to its target location.
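Sketched as a structure-of-arrays with hypothetical names, the arrays are kept in sync by index and results are written straight to their own array:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Version 7 (sketch): structure-of-arrays. Data used together is
// batched, kept in sync by index, and the cold data is no longer
// referenced from the hot data at all.
struct Scene
{
    std::vector<float>        positionX, positionY, positionZ;  // hot
    std::vector<std::uint8_t> active;   // 1-byte flags (std::vector<bool>
                                        // is a packed specialization,
                                        // avoided here for clarity)
    std::vector<float>        results;  // write-only target
    // cold data lives in its own arrays, indexed the same way
};

void Update(Scene& s)
{
    for (std::size_t i = 0; i < s.positionX.size(); ++i)
    {
        if (!s.active[i])
            continue;
        // The result is written directly to its target location and
        // never read back on this path.
        s.results[i] = s.positionX[i] + s.positionY[i] + s.positionZ[i];
    }
}
```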
If there were still padding in our structure at this point, we might consider splitting each individual member into its own array and keeping the arrays in sync. Since we have no padding or alignment waste, however, and always read all the data, there's no performance benefit to doing this here; if either of those conditions didn't hold, it could yield a further speedup.
Results
A quick side note before we examine the results: if you are profiling with Visual Studio, it is important that you don't run with the debugger attached, even for release builds (the profiler is fine; it's just the debugger to avoid). The severity of the performance hit varies between Visual Studio versions, but an attached debugger enables additional debugging features (such as the debug allocator and iterator debugging) that make reproducing bugs easier yet can negatively impact performance. So when running performance-sensitive code such as this benchmark, run it from outside Visual Studio.
Below is a diff tool to help you compare the code between the versions against the results above.
Are you ready to jump to conclusions (part five)?