Erik McClure: Memory Management on Modern CPUs

The following post is going to make very little sense:

Modern CPUs have started to branch out into parallelism and increased cache size due to their inability to decrease latency, since electricity can only travel so fast. A signal from one side of the CPU can no longer get to the other side in a single cycle. This has led to an exponential increase in cache size and other bizarre optimizations, such as the CPU attempting to guess what the code is going to do. In some cases the CPU might even decide to do something completely different then what the assembly told it to do just for the sake of speed, which can introduce serious issues for threading. The cache, however, is what we are concerned with. L1 cache takes 2 cycles, L2 takes about 14 cycles, and I don't know what L3 takes but its probably somewhere around 50. RAM on the other hand takes, on average, nearly 200 cycles. This means that cache misses are ridiculously expensive and made even worse by the fact that you can't really be sure that the CPU is actually doing what you're telling it to do. There's little we can do about the CPU running off with your code, but we can help it along by paying attention to data locality.

Linked lists have long had issues with data locality, and this makes it even more obvious. However, the answer isn't immediately obvious. We could switch over to vectors and arrays, sure, but in practice these incur even worse overheads when you simply iterate through them. One solution here is to combine the two - make a linked list that's a vector. However, while that does achieve what we're doing here, there is an even better, if slightly more complicated solution.

The number one slowdown in any program is memory allocation. If you aren't allocating something on the stack, it will likely cost as much time just to allocate a tiny chunk of memory then your entire function's execution time. This actually gets to the point of total absurdity when dealing with rapidly allocated and de-allocated small chunks of memory. In one example, a 2D graphics engine must assemble a list of pictures that are sorted by an optimization function, every single frame. If you even attempt to do this with the default memory allocation, it slows the application down to an absolute crawl. If, however, we implement our own memory management, the speed increases are somewhere around %15000. No, I'm not kidding, and yes, that is based off real-world testing.

The general attitude of the C++ community is that home-baked memory allocators are overkill, but as CPU technology advances, it turns out that in practice, the opposite is true. The default allocator was built in the 1990s and on modern architecture is pathetically slow. This, however, does not have to be a necessity. I have built several allocators for my graphics engine and every single one of them has increased performance by a factor of at least 2. The sheer amount of speed that can be gained from custom memory allocators is astounding, and there is a reason for it.

Custom memory allocators have a very distinct feature - they often concentrate a single class type into one compressed memory allocation space. When used with in conjunction with linked lists we get the best of both worlds, especially since when we free the memory, we aren't actually freeing the memory. We're just marking it as unused and then using it again next time we need it. Obviously I've only used this in situations where its obvious that the memory will be needed again very soon, but if done properly it can be expanded greatly. The fact that the memory is all in one place means that it results in a massive reduction of cache misses, which on modern hardware translates to a whole lot of cycles. Your program will run a lot faster if it isn't spending half its execution time just waiting for memory.

Those of you who are still following this explanation may have an idea of where I'm going with this. To answer the problem of data locality and memory management in modern applications, we require an entirely new kind of memory management, one based around locality, not memory efficiency (although as we'll see later, the resulting allocator can actually succeed at being efficient at the same time).

We start with a simple concept - an allocator for a given class type. All classes are, by definition, the same size. This means that any allocator designed for a single class will be incredibly fast. Usually in this situation, a bucket based approach is ideal, since we want the data to be close together, but we don't want to be dealing with stupidly large chunks of memory either. However we must also take note of how important data locality is on modern hardware, so we want those buckets to be comparatively large. But then we have another problem - what if a given class is only instantiated once? The answer is to have a bucket size algorithm that looks something like this:

Of course, this is made a lot easier if we know what the intended use of the class is in the first place. As long as the programmer tells the allocator how many classes of a given type to start with, we can skip to that amount on our allocation curve. Notice that at some point, the allocation curve flattens out to a 1:1 ratio, since if we're allocating anything larger the 32kb, its probably going to be something like 50 megs and should be delegated to the default memory allocator anyway.

Now we have an efficient method of building an allocator for a single type of class. But if we think about what we are actually doing here, an interesting optimization comes to mind. We are basing this allocator off 2 concepts: similar classes often end up being accessed often, and they are the same size.

Wait, they're the same size. What if we extended our concept such that if two classes are the same size, they get thrown into the same allocator? But we have a problem: going back to our original example, if we have a program-critical linked list that should obviously have a memory locale of its own in its own allocator or the entire point of reducing cache misses is lost. This is where we stumble on a very interesting concept: We allow the programmer to define their own locality.

What we do is have multiple allocators for each given class size. If we have several classes that are all 12 bytes (the sheer amount of classes that are 12 bytes is actually quite astounding), they'll be grouped into a single 12-byte bucket allocator. What happens is that we can assign an Index ID to this allocator. If a programmer has a class that he wants to have single allocator for, he can input an index ID of, for example, 12. This will cause another allocator to be created and used only for that class. However, we introduce the additional possibility of assigning an allocator for say, 2 or 3 classes. If they all have the same unique index ID, they'll be routed to the same allocator. This allows the programmer to have an unprecedented amount of control over data locality.

To complete our memory management pool, we require a global allocator to take care of our smaller allocators. The global allocator manages a gigantic pool of memory that is segmented into chunks for the various n-byte pools. When a new pool requires an amount of memory that is greater then the largest chunk available, the global allocator grabs another chunk from the default allocator. Furthermore, whenever this occurs, it triggers an optimization run on the memory pools. Empty buckets are freed and memory is moved where it is feasible. The global allocator will also need to pay attention to where the memory is freed so that, even if we have to allocate a giant chunk for some very large byte pool, we then have a bunch of memory that's available for less-used pools, again allowing us to avoid allocating additional memory. The combination of all 3 concepts allows a relatively efficient and super fast cache-aware memory allocator that can be implicitly built into any class by the inclusion of a very simple static template class. Almost everything can be done behind the scenes, and we never even overload the global new operator, since participating classes are marked (there are a buttload of reasons why we wouldn't want to overload the global new operator but I won't go into that here).

No tests of this system are available as I have not built it yet, but I will report on the results once I have them.

Erik McClure

December 5, 2009

Memory Management on Modern CPUs

No comments:

Post a Comment