We are next-door neighbors to one of the floor advisors, who patrol the hallways to make sure people don't go raping each other and whatnot. Our door is also propped open because it gets too hot in our room; Unfortunately that means we get to hear everything that goes on in the hallways, such as:
9:32 PM - A group of idiots walks past our door talking loudly when one of them trips and screams like an idiot. They all proceed to laugh hysterically right outside our door.
9:34 PM - We hear a very loud conversation through our wall in the RA's room next door, We begin talking about how pointless the quiet hours are.
9:36 PM - An RA tells us to be quiet and shuts our door.
WHAT THE FUCK?! Where were you guys like FOUR FUCKING MINUTES AGO?! But hey, you know what, I don't care. I don't even care if we get cited with a noise violation because I'm getting the _FUCK_ out of this idiotic place.
88 hours to go.
December 11, 2009
December 5, 2009
Memory Management on Modern CPUs
The following post is going to make very little sense:
Modern CPUs have started to branch out into parallelism and increased cache size due to their inability to decrease latency, since electricity can only travel so fast. A signal from one side of the CPU can no longer get to the other side in a single cycle. This has led to an exponential increase in cache size and other bizarre optimizations, such as the CPU attempting to guess what the code is going to do. In some cases the CPU might even decide to do something completely different then what the assembly told it to do just for the sake of speed, which can introduce serious issues for threading. The cache, however, is what we are concerned with. L1 cache takes 2 cycles, L2 takes about 14 cycles, and I don't know what L3 takes but its probably somewhere around 50. RAM on the other hand takes, on average, nearly 200 cycles. This means that cache misses are ridiculously expensive and made even worse by the fact that you can't really be sure that the CPU is actually doing what you're telling it to do. There's little we can do about the CPU running off with your code, but we can help it along by paying attention to data locality.
Linked lists have long had issues with data locality, and this makes it even more obvious. However, the answer isn't immediately obvious. We could switch over to vectors and arrays, sure, but in practice these incur even worse overheads when you simply iterate through them. One solution here is to combine the two - make a linked list that's a vector. However, while that does achieve what we're doing here, there is an even better, if slightly more complicated solution.
The number one slowdown in any program is memory allocation. If you aren't allocating something on the stack, it will likely cost as much time just to allocate a tiny chunk of memory then your entire function's execution time. This actually gets to the point of total absurdity when dealing with rapidly allocated and de-allocated small chunks of memory. In one example, a 2D graphics engine must assemble a list of pictures that are sorted by an optimization function, every single frame. If you even attempt to do this with the default memory allocation, it slows the application down to an absolute crawl. If, however, we implement our own memory management, the speed increases are somewhere around %15000. No, I'm not kidding, and yes, that is based off real-world testing.
The general attitude of the C++ community is that home-baked memory allocators are overkill, but as CPU technology advances, it turns out that in practice, the opposite is true. The default allocator was built in the 1990s and on modern architecture is pathetically slow. This, however, does not have to be a necessity. I have built several allocators for my graphics engine and every single one of them has increased performance by a factor of at least 2. The sheer amount of speed that can be gained from custom memory allocators is astounding, and there is a reason for it.
Custom memory allocators have a very distinct feature - they often concentrate a single class type into one compressed memory allocation space. When used with in conjunction with linked lists we get the best of both worlds, especially since when we free the memory, we aren't actually freeing the memory. We're just marking it as unused and then using it again next time we need it. Obviously I've only used this in situations where its obvious that the memory will be needed again very soon, but if done properly it can be expanded greatly. The fact that the memory is all in one place means that it results in a massive reduction of cache misses, which on modern hardware translates to a whole lot of cycles. Your program will run a lot faster if it isn't spending half its execution time just waiting for memory.
Those of you who are still following this explanation may have an idea of where I'm going with this. To answer the problem of data locality and memory management in modern applications, we require an entirely new kind of memory management, one based around locality, not memory efficiency (although as we'll see later, the resulting allocator can actually succeed at being efficient at the same time).
We start with a simple concept - an allocator for a given class type. All classes are, by definition, the same size. This means that any allocator designed for a single class will be incredibly fast. Usually in this situation, a bucket based approach is ideal, since we want the data to be close together, but we don't want to be dealing with stupidly large chunks of memory either. However we must also take note of how important data locality is on modern hardware, so we want those buckets to be comparatively large. But then we have another problem - what if a given class is only instantiated once? The answer is to have a bucket size algorithm that looks something like this:
Of course, this is made a lot easier if we know what the intended use of the class is in the first place. As long as the programmer tells the allocator how many classes of a given type to start with, we can skip to that amount on our allocation curve. Notice that at some point, the allocation curve flattens out to a 1:1 ratio, since if we're allocating anything larger the 32kb, its probably going to be something like 50 megs and should be delegated to the default memory allocator anyway.
Now we have an efficient method of building an allocator for a single type of class. But if we think about what we are actually doing here, an interesting optimization comes to mind. We are basing this allocator off 2 concepts: similar classes often end up being accessed often, and they are the same size.
Wait, they're the same size. What if we extended our concept such that if two classes are the same size, they get thrown into the same allocator? But we have a problem: going back to our original example, if we have a program-critical linked list that should obviously have a memory locale of its own in its own allocator or the entire point of reducing cache misses is lost. This is where we stumble on a very interesting concept: We allow the programmer to define their own locality.
What we do is have multiple allocators for each given class size. If we have several classes that are all 12 bytes (the sheer amount of classes that are 12 bytes is actually quite astounding), they'll be grouped into a single 12-byte bucket allocator. What happens is that we can assign an Index ID to this allocator. If a programmer has a class that he wants to have single allocator for, he can input an index ID of, for example, 12. This will cause another allocator to be created and used only for that class. However, we introduce the additional possibility of assigning an allocator for say, 2 or 3 classes. If they all have the same unique index ID, they'll be routed to the same allocator. This allows the programmer to have an unprecedented amount of control over data locality.
To complete our memory management pool, we require a global allocator to take care of our smaller allocators. The global allocator manages a gigantic pool of memory that is segmented into chunks for the various n-byte pools. When a new pool requires an amount of memory that is greater then the largest chunk available, the global allocator grabs another chunk from the default allocator. Furthermore, whenever this occurs, it triggers an optimization run on the memory pools. Empty buckets are freed and memory is moved where it is feasible. The global allocator will also need to pay attention to where the memory is freed so that, even if we have to allocate a giant chunk for some very large byte pool, we then have a bunch of memory that's available for less-used pools, again allowing us to avoid allocating additional memory. The combination of all 3 concepts allows a relatively efficient and super fast cache-aware memory allocator that can be implicitly built into any class by the inclusion of a very simple static template class. Almost everything can be done behind the scenes, and we never even overload the global new operator, since participating classes are marked (there are a buttload of reasons why we wouldn't want to overload the global new operator but I won't go into that here).
No tests of this system are available as I have not built it yet, but I will report on the results once I have them.
Modern CPUs have started to branch out into parallelism and increased cache size due to their inability to decrease latency, since electricity can only travel so fast. A signal from one side of the CPU can no longer get to the other side in a single cycle. This has led to an exponential increase in cache size and other bizarre optimizations, such as the CPU attempting to guess what the code is going to do. In some cases the CPU might even decide to do something completely different then what the assembly told it to do just for the sake of speed, which can introduce serious issues for threading. The cache, however, is what we are concerned with. L1 cache takes 2 cycles, L2 takes about 14 cycles, and I don't know what L3 takes but its probably somewhere around 50. RAM on the other hand takes, on average, nearly 200 cycles. This means that cache misses are ridiculously expensive and made even worse by the fact that you can't really be sure that the CPU is actually doing what you're telling it to do. There's little we can do about the CPU running off with your code, but we can help it along by paying attention to data locality.
Linked lists have long had issues with data locality, and this makes it even more obvious. However, the answer isn't immediately obvious. We could switch over to vectors and arrays, sure, but in practice these incur even worse overheads when you simply iterate through them. One solution here is to combine the two - make a linked list that's a vector. However, while that does achieve what we're doing here, there is an even better, if slightly more complicated solution.
The number one slowdown in any program is memory allocation. If you aren't allocating something on the stack, it will likely cost as much time just to allocate a tiny chunk of memory then your entire function's execution time. This actually gets to the point of total absurdity when dealing with rapidly allocated and de-allocated small chunks of memory. In one example, a 2D graphics engine must assemble a list of pictures that are sorted by an optimization function, every single frame. If you even attempt to do this with the default memory allocation, it slows the application down to an absolute crawl. If, however, we implement our own memory management, the speed increases are somewhere around %15000. No, I'm not kidding, and yes, that is based off real-world testing.
The general attitude of the C++ community is that home-baked memory allocators are overkill, but as CPU technology advances, it turns out that in practice, the opposite is true. The default allocator was built in the 1990s and on modern architecture is pathetically slow. This, however, does not have to be a necessity. I have built several allocators for my graphics engine and every single one of them has increased performance by a factor of at least 2. The sheer amount of speed that can be gained from custom memory allocators is astounding, and there is a reason for it.
Custom memory allocators have a very distinct feature - they often concentrate a single class type into one compressed memory allocation space. When used with in conjunction with linked lists we get the best of both worlds, especially since when we free the memory, we aren't actually freeing the memory. We're just marking it as unused and then using it again next time we need it. Obviously I've only used this in situations where its obvious that the memory will be needed again very soon, but if done properly it can be expanded greatly. The fact that the memory is all in one place means that it results in a massive reduction of cache misses, which on modern hardware translates to a whole lot of cycles. Your program will run a lot faster if it isn't spending half its execution time just waiting for memory.
Those of you who are still following this explanation may have an idea of where I'm going with this. To answer the problem of data locality and memory management in modern applications, we require an entirely new kind of memory management, one based around locality, not memory efficiency (although as we'll see later, the resulting allocator can actually succeed at being efficient at the same time).
We start with a simple concept - an allocator for a given class type. All classes are, by definition, the same size. This means that any allocator designed for a single class will be incredibly fast. Usually in this situation, a bucket based approach is ideal, since we want the data to be close together, but we don't want to be dealing with stupidly large chunks of memory either. However we must also take note of how important data locality is on modern hardware, so we want those buckets to be comparatively large. But then we have another problem - what if a given class is only instantiated once? The answer is to have a bucket size algorithm that looks something like this:
Of course, this is made a lot easier if we know what the intended use of the class is in the first place. As long as the programmer tells the allocator how many classes of a given type to start with, we can skip to that amount on our allocation curve. Notice that at some point, the allocation curve flattens out to a 1:1 ratio, since if we're allocating anything larger the 32kb, its probably going to be something like 50 megs and should be delegated to the default memory allocator anyway.
Now we have an efficient method of building an allocator for a single type of class. But if we think about what we are actually doing here, an interesting optimization comes to mind. We are basing this allocator off 2 concepts: similar classes often end up being accessed often, and they are the same size.
Wait, they're the same size. What if we extended our concept such that if two classes are the same size, they get thrown into the same allocator? But we have a problem: going back to our original example, if we have a program-critical linked list that should obviously have a memory locale of its own in its own allocator or the entire point of reducing cache misses is lost. This is where we stumble on a very interesting concept: We allow the programmer to define their own locality.
What we do is have multiple allocators for each given class size. If we have several classes that are all 12 bytes (the sheer amount of classes that are 12 bytes is actually quite astounding), they'll be grouped into a single 12-byte bucket allocator. What happens is that we can assign an Index ID to this allocator. If a programmer has a class that he wants to have single allocator for, he can input an index ID of, for example, 12. This will cause another allocator to be created and used only for that class. However, we introduce the additional possibility of assigning an allocator for say, 2 or 3 classes. If they all have the same unique index ID, they'll be routed to the same allocator. This allows the programmer to have an unprecedented amount of control over data locality.
To complete our memory management pool, we require a global allocator to take care of our smaller allocators. The global allocator manages a gigantic pool of memory that is segmented into chunks for the various n-byte pools. When a new pool requires an amount of memory that is greater then the largest chunk available, the global allocator grabs another chunk from the default allocator. Furthermore, whenever this occurs, it triggers an optimization run on the memory pools. Empty buckets are freed and memory is moved where it is feasible. The global allocator will also need to pay attention to where the memory is freed so that, even if we have to allocate a giant chunk for some very large byte pool, we then have a bunch of memory that's available for less-used pools, again allowing us to avoid allocating additional memory. The combination of all 3 concepts allows a relatively efficient and super fast cache-aware memory allocator that can be implicitly built into any class by the inclusion of a very simple static template class. Almost everything can be done behind the scenes, and we never even overload the global new operator, since participating classes are marked (there are a buttload of reasons why we wouldn't want to overload the global new operator but I won't go into that here).
No tests of this system are available as I have not built it yet, but I will report on the results once I have them.
December 4, 2009
Today is the best day in the world
Because I got into MATH 125.
of course its now 3 AM so tomorrow is going to suck balls but I DONT CARE.
of course its now 3 AM so tomorrow is going to suck balls but I DONT CARE.
December 3, 2009
I'm Popular!
How do I know? Because iZone stole 6 of my songs, without even bothering to rename them. It all started when a friend of mine pointed out that someone had stolen Retarded Windows Song, which eventually led me to a guy giving out his e-mail so he can use The Dark Temple in his game.
That was the most fun I've ever had writing an e-mail.
I reported him, but not before taking this screenshot.
That was the most fun I've ever had writing an e-mail.
I reported him, but not before taking this screenshot.
Turning a 3D animation into a 2D animation
I don't have anyone on MSN to excitedly spew out this idea to so I'll just do this as a journal entry. Everyone knows about Celshading. Cheap Anime games use it to try and make 3D look 2D and often fail spectacularly, although its still a nifty effect.
After studying The Secret of NIMH, I've noticed several particular aspects of the movie that could feasibly be recreated in a 3D graphics engine in such a way that it would be indistinguishable from the real thing, provided it did not break certain constraints.
Shading: Cel-shading is not used all the time. Modern anime-inspired animation often likes to cel-shade everything, but if you look at many classic films, it's actually only used in instances of high contrast, and more particularly its very unevenly done, even if its a relatively stable light source. In a 3D context this would mean flatshading almost all the dynamic models and implementing a very special lighting system that would only trigger with certain "tagged" lights (there is no possible way to dynamically figure this out in a manner consistent with traditional animation). In contrast to this, the backgrounds are extremely detailed paintings, which often end up being lit with either celshading or an interesting gradient effect.
A separate interpretative shader for the background could be adopted (along with a special case for lighting if necessary) to create the painting effect, but the most striking feature about all backgrounds in all 2D animations is the fact that they are always static. That is to say, if the character is "walking around a table," the background is really just a long, looping single picture. This is an effect that is exceedingly difficult to duplicate in 3D because it is very much not 3D. One way to get around this problem is to take a different approach, and that is to take advantage of the fact that in 99% of circumstances in a 2D animation, a given camera angle will stay on a flat 2D plane, even if the characters are supposedly moving around in 3 dimensions. Because of this, we can orient our camera for the beginning of a scene, and then unwrap the entire background's 3D model into a panoramic representation. If done correctly, it would allow an artist to build a 3D model and then orient the camera and figure out how the panning in a 2D context would be done from that perspective.
Lines: The lineart in a traditional 2D animation turns out to be one of the most important aspects, which is to be expected. However an unexpected detail that most of us miss when watching an animated film is the fact that those black lines are actually very, very thin. Forget the thick black lines you see them attempting to do in those idiotic 3D hackups, all of the lines in traditional animation are thin, and the exact same width. Except for eyebrows, they are all the same width.
Hence, getting the lines right is going to be a make-or-break part of a 3D interpretative shader. This is usually done via an edge shader by doing a per-pixel depth check. In theory, this would also allow us to make lines that are relatively closer to the viewer thicker, and ones farther away, thinner, but it turns out that this isn't actually needed if we're mimicking traditional 2D animation (this is probably because doing so would be a total bitch to animate). I'm going to come back to this later so keep it in mind.
Getting lines right is hard for humans in the first place, and often requires meticulous design of a 3D model, but its even more ridiculous for furry mice. Hair in particular requires a static 3D model or it goes all over the place. The requirements for fur, however, are less strict, and often it looks completely wrong if you attempt to model it. The trick is to get the shader to do what a human does when examining where to put tufts of fur. Steep angles, joints, and 1-3 possible curls on longer stretches, equidistant apart with a slight deviation of approximately 10% of the length of the stretch. This is influenced by how "furry" a given section is; for example, Mrs. Brisby has a particularly furry chin and thus we end up with a large tuft there. what gets more ridiculous, however, is that if your going to get this to line up with the lines is that you have to model all these tufts of fur on the model itself so that the edge shader can properly line it. This can be done either with a displacement map or with a geometry shader.
Realtime: The possibility of a geometry shader opens up the stunning possibility of doing this in realtime. Because this algorithm translates a detailed 3D scene into a traditional 2D animation, it would suddenly become possible to do what was previously impossible - Design a 2D game that is dynamic, able to generate an entire, completely random forest, and then figure out how to render it as a background that looks painted and is unwrapped into a 2D panorama. The character that you control could look hand drawn and yet respond to literally anything using procedural animation. You could explore any angle in any way possible and the game could still figure out how its supposed to look.
Future work: This technique allows an animation to do things that were not previously possible, such as animating lines that take into account their distance from the viewer, as well as characters being able to interact with background elements without the background elements having to be cel-shaded. Even weirder possibilities include seamless integration of realistic CGI characters with cel-shaded ones, and putting cel-shaded characters in a realistic CGI environment, a la Roger Rabbit except there would be no combining footage - it'd all just work in a single environment and the 2D characters would be able to perfectly respond to their surroundings.
So yeah that's my essay for the week.
Also I figured out that you can embed a Hulu video into LiveJournal, so for those of you who are still without the grace of The Secret of NIMH, here it is:
After studying The Secret of NIMH, I've noticed several particular aspects of the movie that could feasibly be recreated in a 3D graphics engine in such a way that it would be indistinguishable from the real thing, provided it did not break certain constraints.
Shading: Cel-shading is not used all the time. Modern anime-inspired animation often likes to cel-shade everything, but if you look at many classic films, it's actually only used in instances of high contrast, and more particularly its very unevenly done, even if its a relatively stable light source. In a 3D context this would mean flatshading almost all the dynamic models and implementing a very special lighting system that would only trigger with certain "tagged" lights (there is no possible way to dynamically figure this out in a manner consistent with traditional animation). In contrast to this, the backgrounds are extremely detailed paintings, which often end up being lit with either celshading or an interesting gradient effect.
A separate interpretative shader for the background could be adopted (along with a special case for lighting if necessary) to create the painting effect, but the most striking feature about all backgrounds in all 2D animations is the fact that they are always static. That is to say, if the character is "walking around a table," the background is really just a long, looping single picture. This is an effect that is exceedingly difficult to duplicate in 3D because it is very much not 3D. One way to get around this problem is to take a different approach, and that is to take advantage of the fact that in 99% of circumstances in a 2D animation, a given camera angle will stay on a flat 2D plane, even if the characters are supposedly moving around in 3 dimensions. Because of this, we can orient our camera for the beginning of a scene, and then unwrap the entire background's 3D model into a panoramic representation. If done correctly, it would allow an artist to build a 3D model and then orient the camera and figure out how the panning in a 2D context would be done from that perspective.
Lines: The lineart in a traditional 2D animation turns out to be one of the most important aspects, which is to be expected. However an unexpected detail that most of us miss when watching an animated film is the fact that those black lines are actually very, very thin. Forget the thick black lines you see them attempting to do in those idiotic 3D hackups, all of the lines in traditional animation are thin, and the exact same width. Except for eyebrows, they are all the same width.
Hence, getting the lines right is going to be a make-or-break part of a 3D interpretative shader. This is usually done via an edge shader by doing a per-pixel depth check. In theory, this would also allow us to make lines that are relatively closer to the viewer thicker, and ones farther away, thinner, but it turns out that this isn't actually needed if we're mimicking traditional 2D animation (this is probably because doing so would be a total bitch to animate). I'm going to come back to this later so keep it in mind.
Getting lines right is hard for humans in the first place, and often requires meticulous design of a 3D model, but its even more ridiculous for furry mice. Hair in particular requires a static 3D model or it goes all over the place. The requirements for fur, however, are less strict, and often it looks completely wrong if you attempt to model it. The trick is to get the shader to do what a human does when examining where to put tufts of fur. Steep angles, joints, and 1-3 possible curls on longer stretches, equidistant apart with a slight deviation of approximately 10% of the length of the stretch. This is influenced by how "furry" a given section is; for example, Mrs. Brisby has a particularly furry chin and thus we end up with a large tuft there. what gets more ridiculous, however, is that if your going to get this to line up with the lines is that you have to model all these tufts of fur on the model itself so that the edge shader can properly line it. This can be done either with a displacement map or with a geometry shader.
Realtime: The possibility of a geometry shader opens up the stunning possibility of doing this in realtime. Because this algorithm translates a detailed 3D scene into a traditional 2D animation, it would suddenly become possible to do what was previously impossible - Design a 2D game that is dynamic, able to generate an entire, completely random forest, and then figure out how to render it as a background that looks painted and is unwrapped into a 2D panorama. The character that you control could look hand drawn and yet respond to literally anything using procedural animation. You could explore any angle in any way possible and the game could still figure out how its supposed to look.
Future work: This technique allows an animation to do things that were not previously possible, such as animating lines that take into account their distance from the viewer, as well as characters being able to interact with background elements without the background elements having to be cel-shaded. Even weirder possibilities include seamless integration of realistic CGI characters with cel-shaded ones, and putting cel-shaded characters in a realistic CGI environment, a la Roger Rabbit except there would be no combining footage - it'd all just work in a single environment and the 2D characters would be able to perfectly respond to their surroundings.
So yeah that's my essay for the week.
Also I figured out that you can embed a Hulu video into LiveJournal, so for those of you who are still without the grace of The Secret of NIMH, here it is: