Erik McClure: Multithreading Problems In Game Design

May 23, 2012


A couple of years ago, when I first started designing a game engine to unify Box2D and my graphics engine, I thought this was a superb opportunity to join all the cool kids and multithread it. I mean, all the other game developers were talking about having a thread for graphics, a thread for physics, a thread for audio, and so on. So I spent a lot of time teaching myself various lockless threading techniques and building quite a few iterations of various multithreading structures. Almost all of them failed spectacularly for various reasons, but the underlying problem was always the same: they were too complicated.

I eventually settled on a single worker thread that was sent off to start working on the physics at the beginning of a frame render. Then, at the beginning of each subsequent frame, I would check to see if the physics were done, and if so, sync the physics and graphics and start another physics iteration. It was a very clean solution, but fundamentally flawed. For one, it introduces an inescapable frame of input lag.

Single Thread Low Load

  FRAME 1   +----+
            |    |
  Input1 -> |    |
            |[__]| Physics
            |[__]| Render
  FRAME 2   +----+ INPUT 1 ON BACKBUFFER
  Input2 -> |    |
  Process ->|    |
            |[__]| Physics
  Input3 -> |[__]| Render
  FRAME 3   +----+ INPUT 2 ON BACKBUFFER, INPUT 1 VISIBLE
            |    |
            |    |
  Process ->|[__]| Physics
            |[__]| Render
  FRAME 4   +----+ INPUT 3 ON BACKBUFFER, INPUT 2 VISIBLE

Multi Thread Low Load

  FRAME 1   +----+
            |    |
            |    |
  Input1 -> |    |
            |[__]| Render/Physics START
  FRAME 2   +----+
  Input2 -> |____| Physics END
            |    |
            |    |
  Input3 -> |[__]| Render/Physics START
  FRAME 3   +----+ INPUT 1 ON BACKBUFFER
            |____|
            |    | PHYSICS END
            |    |
            |____| Render/Physics START
  FRAME 4   +----+ INPUT 2 ON BACKBUFFER, INPUT 1 VISIBLE

Multithreading this way, by definition, means that any given physics update is only reflected in the next rendered frame, because the entire point is to start rendering the current frame as soon as you start calculating physics. This causes a number of latency issues, and it also requires a separate "physics update" function that is executed only during the physics/graphics sync. As I soon found out, this is a massive architectural complication, especially when you try to put in scripting or let other languages use your engine.
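The one-frame delay can be reproduced in a few lines. This is only a sketch under simplifying assumptions, not the engine described above: `physics_step`, `run_single_threaded` and `run_with_physics_thread` are hypothetical names, the "physics" is a single addition, and the sync point blocks on the worker (modelling the low-load case where physics always finishes in time) instead of polling it.

```python
from concurrent.futures import ThreadPoolExecutor

def physics_step(position, move):
    """Stand-in for a physics update: apply this frame's input to the state."""
    return position + move

def run_single_threaded(inputs):
    """Input -> physics -> render, all inside the same frame."""
    position, rendered = 0, []
    for move in inputs:
        position = physics_step(position, move)
        rendered.append(position)      # the frame already reflects this input
    return rendered

def run_with_physics_thread(inputs):
    """Physics is started on a worker and only synced at the next frame."""
    synced, pending, rendered = 0, None, []
    with ThreadPoolExecutor(max_workers=1) as worker:
        for move in inputs:
            if pending is not None:
                # Sync point. Blocking here models the low-load case where
                # physics always finishes before the next frame begins.
                synced = pending.result()
            pending = worker.submit(physics_step, synced, move)
            rendered.append(synced)    # drawn while physics is still running
    return rendered

inputs = [1, 1, 1, 1]
single = run_single_threaded(inputs)
threaded = run_with_physics_thread(inputs)
print(single)    # [1, 2, 3, 4]
print(threaded)  # [0, 1, 2, 3]
```

The threaded version always draws the state from one frame earlier, which is the extra frame of lag shown in the diagrams.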

There is another, more subtle problem with dedicated threads for graphics/physics/audio/AI/anything: it doesn't scale. Let's say you have a platformer. AI will be contained inside the game logic, and the vast majority of your CPU time will be spent in graphics, physics, or both. That means your game effectively has only two threads doing any meaningful amount of work. Modern processors have 8 logical cores¹, and the best one currently available has 12. You're using two of them. On that 12-core processor, you introduced all this complexity and input lag just so you could use 16.7% of the processor (2 of 12 cores) instead of 8.3% (1 of 12).
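The percentages are easy to check. A quick sketch, assuming each busy thread saturates exactly one logical core (`utilization` is a hypothetical helper, not part of any engine):

```python
def utilization(busy_threads, logical_cores):
    """Fraction of the processor in use if each busy thread saturates one core."""
    return min(busy_threads, logical_cores) / logical_cores

for cores in (8, 12):
    one = utilization(1, cores)
    two = utilization(2, cores)
    print(f"{cores} logical cores: {one:.1%} with one thread, {two:.1%} with two")
```

On 12 logical cores this gives roughly 8.3% for one busy thread and 16.7% for two; on 8 logical cores, 12.5% and 25%.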

Instead of trying to create a thread for each individual component, you need to go deeper. You have to parallelize each individual component separately, then tie them together in a single-threaded design. This has the added bonus of being vastly more friendly to single-core CPUs that can't run threads in parallel (like certain phones), because the parallelization goes on at a lower level and is invisible to the overall architecture of the library. So instead of having a graphics thread and a physics thread, you simply call the physics update, then call the graphics update, and inside each physics and graphics update you spawn enough worker threads to match the number of cores you have to work with and concurrently process as much stuff as possible. This eliminates latency problems and complicated library designs, and it scales with the number of cores. Even if your initial implementation of concurrency won't handle 32 cores, because the concurrency is encapsulated inside the engine, you can just go back and change it later without ever having to modify any programs that use the engine.
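A rough sketch of that shape, with everything in it assumed for illustration: the `Engine` class, its method names and the toy one-dimensional "bodies" are hypothetical, and Python threads are used only to show the structure (CPython's global interpreter lock means pure-Python work like this will not actually run faster).

```python
import os
from concurrent.futures import ThreadPoolExecutor

class Engine:
    """Hypothetical engine: callers see plain blocking calls, threads stay inside."""

    def __init__(self, bodies, workers=None):
        self.bodies = bodies                    # list of (position, velocity)
        self.frame = []                         # last "rendered" output
        self._workers = workers or os.cpu_count() or 1
        self._pool = ThreadPoolExecutor(max_workers=self._workers)

    def _parallel_map(self, fn, items):
        # Split the work into roughly one chunk per worker, run the chunks
        # concurrently, and rejoin the results (in order) before returning.
        size = max(1, -(-len(items) // self._workers))   # ceiling division
        chunks = [items[i:i + size] for i in range(0, len(items), size)]
        results = self._pool.map(lambda chunk: [fn(x) for x in chunk], chunks)
        return [y for chunk in results for y in chunk]

    def update_physics(self, dt):
        self.bodies = self._parallel_map(
            lambda b: (b[0] + b[1] * dt, b[1]), self.bodies)

    def update_graphics(self):
        self.frame = self._parallel_map(lambda b: round(b[0], 3), self.bodies)

    def close(self):
        self._pool.shutdown()

# The game loop itself stays single-threaded and in order.
engine = Engine([(0.0, 1.0), (10.0, -2.0), (5.0, 0.5)], workers=4)
for _ in range(3):
    engine.update_physics(dt=0.5)
    engine.update_graphics()
engine.close()
print(engine.frame)  # [1.5, 7.0, 5.75]
```

Because the worker pool never leaks out of `Engine`, the chunking strategy or worker count can be changed later without touching the loop that calls it.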

Consequently, don't try to multithread your games. It isn't worth it. Separately parallelize each individual component instead and write your game engine single-threaded; only use additional threads for asynchronous activities like resource loading.
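Resource loading is the kind of asynchronous work that does suit its own thread. A minimal sketch, assuming a hypothetical `load_resource` stand-in for slow I/O and made-up file names; the only shared state is a pair of queues, and the main loop polls for finished loads without ever blocking:

```python
import queue
import threading

def load_resource(name):
    """Stand-in for slow disk or network I/O (hypothetical loader)."""
    return f"<data for {name}>"

def loader_thread(requests, finished):
    # Runs in the background; the only shared state is the two queues.
    while True:
        name = requests.get()
        if name is None:               # sentinel: shut down
            break
        finished.put((name, load_resource(name)))

requests, finished = queue.Queue(), queue.Queue()
worker = threading.Thread(target=loader_thread, args=(requests, finished), daemon=True)
worker.start()

for name in ("level1.map", "player.png"):
    requests.put(name)
requests.put(None)

resources = {}
frames = 0
while len(resources) < 2:
    # ... input, physics update and graphics update would run here as usual ...
    frames += 1
    try:
        while True:                    # pick up whatever has finished, never block
            name, data = finished.get_nowait()
            resources[name] = data
    except queue.Empty:
        pass

worker.join()
print(sorted(resources))  # ['level1.map', 'player.png']
```

The game loop keeps its single-threaded structure; loaded data simply shows up in `resources` a few frames after it was requested.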

¹ The processors actually only have 4 or 6 physical cores, but use hyperthreading techniques so that 8 or 12 logical cores are presented to the operating system. From a software point of view, however, this is immaterial.

27 comments:

FeepingCreature, May 24, 2012, 4:19:00 AM
Well-written and well-reasoned. Thank you for this post.
ioquatix, May 24, 2012, 7:05:00 AM
Interesting experience. I've personally found that multi-threading is a lot more stable as long as you manage the latency. As you've said, it isn't easy, but it can be done. I normally use threads for physics and rendering, and then also sub-divide the work in individual threads as is possible - e.g. using 2 or 4 threads for a particle simulation update, splitting the world up and working in different areas in the physics loop on different threads. Additionally, networking and audio are easier to manage on their own threads and then feed events back into the main physics loop. The physics loop can be tightly timed (e.g. running 120hz or more, tightly synchronised with other sub-systems or over the network) and I've also been investigating running a physics loop with double buffering to avoid locking for rendering. There are many interesting possibilities to utilise 8 hardware backed threads or more!
Christian, May 24, 2012, 7:37:00 AM
Modern gaming design would be well served making use of the vastly superior floating point operations of the GPU. I know many of the newer video cards are increasing in cores, memory, and speed all the time. How much work have you done leveraging the GPU vs CPU?
Unknown, May 24, 2012, 9:46:00 AM
How do you manage rendering, when OpenGL for example can only handle being rendered to from one thread?
Unknown, Oct 10, 2012, 8:57:00 PM
Hi, I've also created multithreading (separate update and render thread), but don't come to the conclusion that it isn't worth it. True, it is very difficult to implement decently, but it is worth it for games that require lots of CPU. The goal is to separate the render-state from the world-state so you can triple buffer the render state. The delay you talk about can be 'solved' using partial synchronization and extrapolation.

For an explanation of the problem please check out:
http://blog.slapware.eu/game-engine/programming/multithreaded-renderloop-part1/
http://blog.slapware.eu/game-engine/programming/multithreaded-renderloop-part2/

The website discusses the problems and shows a decent (though difficult) solution. Both source and binaries can be downloaded.
Unknown, Oct 12, 2012, 3:35:00 AM
I sense you too have a lot of knowledge about the subject, which makes this an interesting discussion. I'll try not to make it into some sort of flame-war between two techniques, because I really like to read what you have to say :)

1. Mutexes
I totally agree with you. Mutexes should be avoided wherever possible.

2. Determinism is possible.
Floating point errors occur, true, but if the floating point errors are consistent then there isn't any problem. This is also possible cross-platform because floating-point operations are standardized. Most compilers even have flags that let you change the way floating-point operations are handled.
Network issues are no problem either, unless you use UDP (or some other protocol). For TCP/IP the order of the packets is guaranteed. This does not mean that packets arrive at your computer at the same time, but TCP uses a buffer and automatically asks for re-transmissions if needed. Your application will receive all packets in order. If there are any problems (some packet will never be received, even after asking for re-transmissions) the connection will just be dropped. Packet corruption is handled by check-sums (in both ip-packet-header and tcp-packet-header).

The complexity of a solution is never a real issue for me because I enjoy figuring out what the best approach is (for my situation). The technique I used is not as far-fetched as you might think. It is a technique called triple buffering (not to be confused with triple buffering on the video card). CryENGINE 3 is another one of those great triple-A game engines. I'm not going to compare these engines; let's just say that most of these engines are pretty darn good. I'm certainly not going to compare something I have written in 1~2 weeks with a triple-A engine, especially because I have written this implementation in C#, lol. It is very difficult to look at a certain engine and determine what techniques they apply. They often limit themselves to saying things like "we are supporting volumetric clouds" or "we have a taskmanager to optimally use all cores in your CPU" without diving into the specifics of 'how' it is implemented. I wouldn't be surprised if these engines are also able to apply a form of triple buffering on render states.

I also agree with you that you'll need some sort of task manager to optimally use cores in your CPU/GPU. And the technique you describe seems pretty solid. We actually agree on most topics except for the one where I prefer a separation of 'task-manager' for GPU and CPU, while you just say that just one task-manager for both is good enough. I also use triple render-state buffering, but I probably wouldn't have implemented that if I didn't need determinism.

A task-manager with 50 workers for a dual-core CPU is silly, as you will probably agree. There should be some balance between the number of cores and the number of threads. The GPU and CPU are intrinsically different. They contain a different number of cores and they are connected through a bus that only allows one 'message' to be sent to it at a time (protected by mutexes in the video card driver). Since the bus is protected with mutexes, it should also seem silly to have more than 1 thread trying to send a message to the video card. Because they are so different, it makes more sense to separate the queues so we are able to fine-tune these queues (perhaps even setting thread-priorities).
Unknown, Oct 12, 2012, 3:34:00 PM
"If that is your definition of determinism"? Come on, it is a normal English word with a clear-cut definition. It is not like I'm making up words here. It's also really important for games (not all types of games though), so I thought you would be fairly familiar with the term.

My extrapolation does not invalidate the physics state. Assuming the update-thread runs at 60 FPS, then in the worst case the extrapolation will account for at most 16.7 ms of movement (and ~8.3 ms on average). This extrapolation is only used for rendering and not for modifying the physics state. If something goes through a wall because of extrapolation, then at the next update frame it will already have been set to the right coordinates.

You keep pointing out problems where there are none, yet you continue to insist that there are. The comments you make, like the one on invalidating a physics state, make me doubt you have actually read it. I get the impression that you have just glanced over it and thought that it's just some overly complicated gibberish. I don't mind if you don't want to read it, but please don't come up with obviously wrong 'problems' that my engine has.

If you make all calculations depend on the previous frame and gather input at the start of the frame, then you're missing one important step. The physics should run at a fixed number of steps per second, shall we say 60 FPS? Now imagine, if you will, your game being rendered on a 120 Hertz monitor. If you decide to render at 120 FPS you're only rendering 60 new frames per second. Any discrepancy between update-FPS and render-FPS will lead to nasty side effects! Simply increasing how many times you update per second won't help you that much. For your engine it is really weird to fix the update-thread to a certain FPS, as they are just 'jobs added to the queue'.

The best thing you said is "I have absolutely no idea how ensuring that your engine was deterministic had any effect on your engine design.". Maybe this is the reason why I have such a difficult time explaining to you how determinism should be enforced. On the other hand you say "every single solution I have ever built for this [...] will be deterministic", another obviously wrong statement. If you render your game at 1000 FPS, how often do you call your physics library? 1000 times per second, right? You want to update the positions 1000 times, otherwise rendering at such a high speed has no real benefit. Calculating physics with a time step determined by rendering speed will break determinism. Box2D physics are only deterministic if the timestep is constant.

Civilization 5 is a turn based game which simplifies creating network-protocols and replay-files, so there is no real added benefit for adding determinism in such a game. This is why I stated previously that the engine for Civilization 5 has it very easy. An RTS has way more hurdles to overcome so I'm more impressed with the engine used in Supreme Commander 2 and Starcraft 2, even if they don't look as beautiful. It is ignorant to think that CIV's approach is a one-solution-fits-all, which is exactly why I've posted the links on a deterministic non-blocking render loop.

If you'd like, I can come up with some resources that explain what determinism is, why it is important for (certain types of) games, why it is difficult to have determinism without fixing render-FPS and how determinism helps in simplifying network protocols for games (among other things).

Do you now understand why your game engine is not able to be deterministic without severe consequences?

Thanks for the link on CIV5's engine presentation. I really wish I was there to hear about the details :) Looking at those sheets really makes me want to continue experimenting with game engine techniques.
Unknown, Oct 12, 2012, 4:33:00 PM
I didn't SAY your extrapolation invalidated the physics state, I said that you RENDER an invalid physics state!

I keep pointing out problems and then you keep misinterpreting them.

Uh, every single solution I have ever written involves constant timesteps. Because I use Box2D. So yes, it is all deterministic. Sorry. Doing physics with a constant timestep is as simple as using Box2D's accumulation method, which my method would do. Furthermore, I argue that trying to render frames in between physics calculations is worthless because the extrapolation simply introduces artifacts, and if you can't render physics at 60 FPS you're screwed anyway. Hence, physics are blocking, and if they aren't blocking you either 1. aren't deterministic or 2. are extrapolating positions widely and are therefore rendering frames with an invalid physics state, even if the internal physics state is correct.

So I contend that trying to render faster than the physics timestep is useless, and therefore your solution is useless.
Unknown, Oct 12, 2012, 5:47:00 PM
Rendering an invalid physics state when using extrapolation is sporadic, and if it happens it probably doesn't even take 10 ms. So you are telling me that you are limiting your game to 60 FPS just to be deterministic? Wow, gamers with monitors with any other frequency than 60 Hertz must be pretty disappointed with your engine. Wait, even if you do not want to be deterministic you are using fixed time steps? You are aware that it is possible to use dynamic length time steps so the physics keep up with the render FPS, right?

Let me guess what you are going to say: "If the player wants 120 Hz, I'll just crank up the physics to 120 FPS", showing that you still don't get what we are trying to achieve by ensuring determinism.

I 'can' render faster than the physics time step. If physics runs at a meager 20 FPS, everything still runs smoothly (because of a render FPS of 1000+ FPS and extrapolation). I can't even see those nasty physics artifacts you seem to dislike so much. At an update of 60 FPS this becomes even less of a problem. If you can call it a problem, as it isn't even noticeable.
Necrowizzard, Nov 26, 2012, 1:11:00 PM
Besides the 'flame-war' going on.

There are actually many disadvantages to such a multithreaded architecture, but I currently still like it.

The need of conceptually dividing the functionality turns out to be positive for maintainability in my case. The borders between the modules are hard to overcome and therefore you can't take the lazy path of just hacking something in between. A lot of trouble comes from adding fast features that are not separated enough. So in general I found that it makes it easier to work on individual problems. Still the overhead is large.

The general problem is that writing a multithreaded game engine isn't just the work of a few days, therefore the actual performance differences are hard to measure. My current architecture runs with 4 different modules which are basically communicating via double-buffering. And I would also encourage others to pursue a similar path, if they can live with a steep curve of getting things together in the first place. I believe that separating really pays off when going further along the road. Currently I wouldn't go back to a single threaded/module approach. Still, it is possible that my opinion may change in the future.