multithreading

Tea For God is a VR game. This means that it has to render two images, one for each eye. And it has to do so at 90 frames per second. There are several approaches to rendering and more and more are being developed. I tried a few and the one that worked the fastest for Tea For God was the simplest one: rendering two separate images. Note that for different engines and different projects, different solutions might work best.

But this is not a post about rendering. It’s a post about multithreading. If you know a thing or two about multithreading, skip the next few paragraphs. One more thing before we go: initially I was considering breaking this into a few posts but decided to just have one bigger post.

On a single thread, all actions are performed in a sequence. One after another.

What is multithreading? A long time ago (I don’t exactly remember when, but it was more than 10 years ago, maybe more like 15 or even 20), most computers had a single core. Most games were running single threaded. This means that everything was happening one after another. Every single thing. In general, first the input was read, then the game logic was run, everything related to it, and then that very frame was rendered and presented to the user.

Multiple threads allow performing two or more actions at the same time.

Then multiple cores came and developers realised that they could run a few things at the same time. Imagine that each core is a person and they are given specific tasks. With just a few cores, the easiest approach was to have one core do the rendering, a second one do the game logic and if there is another one, it can stream things or maybe do some background tasks. This shift meant that we had to prepare a game frame but we couldn’t render it while we were preparing it, so we could only render the previous frame. So when we present a rendered frame to the player, there might already be another game frame ready. The player sees the past. But that is exactly what happens in our brain anyway. What we see is not what is really there, but what happened some very short time ago. This happens due to the synchronisation of different signals that come from our body. So it’s not such a big deal when there is a high enough framerate.

Because it is not possible to render a frame that is currently processed, in multithreaded games, while we are processing a new game frame, we render the last one. Note that a rendered game frame is presented to the player when a new game frame is already processed.
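The “render the previous frame while preparing the next one” idea boils down to double buffering the frame data. A minimal C++ sketch (the names are mine, not the game’s):

```cpp
#include <array>

// Double-buffered game state: while frame N is being built by the game
// thread, the renderer reads the already finished frame N-1.
struct FrameData { int frameIndex = -1; };

struct FramePair {
    std::array<FrameData, 2> frames;
    int current = 0;

    FrameData&       gameFrame()   { return frames[current]; }     // being written
    const FrameData& renderFrame() { return frames[current ^ 1]; } // being read
    void flip() { current ^= 1; } // swap roles at the frame boundary
};
```

The game thread writes into one slot, the render thread reads the other, and a flip at the frame boundary swaps the roles.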

We shifted to rendering the previous frame, but we were still preparing a game frame on one thread. With 2 or 4 cores available, this was not a big deal. But as many computers now have 8 cores and will soon have more, this puts us in a situation where some of the cores remain unused by the game. Some developers decided to change that and there are numerous different approaches to multithreading.

This post covers how I dealt with multithreading in Tea For God.

In the game, I try to use all available cores (I have a system that tells how many cores can be assigned to a given type of job; each core may do a few different things). This required most of the work to be broken into the most basic, simple tasks while keeping communication between them. Back to the “cores are people” example. Imagine that two different people want to use the same thing. For example, a drawing board.

The simplest case of locking. If this happened thousands of times per frame, a lot of time would be spent waiting. For very rare cases it is acceptable, though. At least to me.

While two different people are able to look at it or read anything from it, they should avoid situations where they both draw or write on it. Sometimes nothing bad will happen, but sometimes they will run into each other. And with this understanding, they can’t use it at the same time. The simplest solution is to have them queue. But this means that one of them will idle. In my opinion, this is acceptable as long as they don’t idle for too long. “Too long” here means that we hold exclusive access only for the time we actually deal with the thing. We should not block anyone else when we use it only occasionally and spend most of our time doing other stuff.
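In code, “have them queue” is just a lock around the rare shared operation. A minimal C++ sketch (illustrative names, not the game’s code):

```cpp
#include <mutex>
#include <vector>

// The shared "drawing board": rarely written to, so a plain lock is fine.
struct Board {
    std::mutex lock;
    std::vector<int> marks;

    // Writing is exclusive: whoever takes the lock first draws,
    // the other thread waits (idles) for a short moment.
    void draw(int mark) {
        std::lock_guard<std::mutex> guard(lock);
        marks.push_back(mark);
    }
};
```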

The main thread is on core 1. System and render jobs are done there. Before it starts to render, it focuses on preparing a scene to render. If it finishes rendering and there is still work to do, it assumes the role of a basic worker. Core 2 helps with preparing a scene to render. Core 3 gets the sound scene, both preparation and dealing with the sound system. Core 4 is responsible for background tasks; these are nav-mesh tasks. Loading is not used at the moment as everything sits in memory. Async allows asynchronous jobs, although because of how I approach them, there might be just one asynchronous job being processed at a time. Workers do everything else: the actual game frame. Note that this is a setup for a given frame. What really happens is that prepare render and sound can be on any core. And one worker also assumes the role of a game loop handler.

But this is acceptable only when it happens occasionally. And many things during a game frame happen not just once per frame but dozens or hundreds of times per frame. We just can’t lock everything all the time. But we can arrange the work in such a manner that in a given period of time, all tasks just read common data without modifying it and prepare stuff in separate parts of memory, so following tasks can read it and do something else.

If it is not possible to make something parallel without locking, and we have very clear functionality in a given step, it might be better to divide it into more steps that could be parallelised more easily.

And with this idea in mind, I divided a game frame into lots of small steps. This is kind of obvious because a game frame is already divided into a lot of steps, but I pushed it a bit further. I divided it into steps that have a common input and separate outputs (and don’t require knowing what’s in other tasks’ outputs). For example, there is a step that is responsible just for gathering data about collisions. It checks the world; for each object it checks if it collides and how it should move to resolve the collision, but it doesn’t resolve any collision. That happens in the following steps. The next one prepares movement. It takes into account what the AI wants, what the player wants and what the collisions tell us is possible. But it still does not do any movement. Up to this moment, we don’t modify any general state. All we change is the internal state of objects. What’s more, we don’t change their actual velocity, because that might be used by other objects to determine how to behave. We just set up the “next” velocity. When everything is solved and we know how and where we move, we… just move. And this is where we modify anything in the world. This is also when some conflicts may occur. But they may happen only sometimes: when we move objects from one room to another. And that’s when I have locks. Because movement between rooms happens rarely (in terms of a frame: at 90 frames per second, with a few dozen NPCs, some of them standing, some moving within a room, changing rooms happens really rarely).
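The “prepare the next velocity, apply it later” split can be sketched as a two-phase update (hypothetical names, not the game’s code): during the logic part everyone may read any object’s current velocity but writes only its own `nextVelocity`, so no locks are needed; a later step applies the prepared values.

```cpp
#include <vector>

// Object state split into "current" (read by everyone) and
// "next" (written only by the object's own task).
struct Object {
    float velocity     = 0.0f; // read-only during the logic part
    float nextVelocity = 0.0f; // written during the logic part
    float position     = 0.0f;
};

// Logic part: compute the desired velocity, touch nothing shared.
void prepareMovement(std::vector<Object>& objects) {
    for (auto& obj : objects)
        obj.nextVelocity = obj.velocity * 0.9f + 1.0f; // placeholder rule
}

// Physical part: a later step applies the prepared values.
void moveEverything(std::vector<Object>& objects, float dt) {
    for (auto& obj : objects) {
        obj.velocity  = obj.nextVelocity;
        obj.position += obj.velocity * dt;
    }
}
```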

This is the game loop zoomed in, running on eight cores. The highlighted one, core 2, is the game loop, running game logic, physics etc. You can notice that 8.2% of the time is spent waiting on spinlocks. These are visible as red lines. Most of them happen during render scene building and presence links building. They will be investigated.

This should explain the basic approach of how I deal with a game frame. Now, I will present a list of all the parts, everything that happens during a game frame. The list is quite big at the moment but I want to show you how fragmented the frame is. This is only the game frame, without the system reading input, rendering, audio etc:

  • process AI messages
  • get collisions information
  • get presence information (gravity, a surface we’re walking on etc)
  • AI perception (check what AI sees)
  • AI logic (this is when AI thinks; this is done with latent functions, I shall cover the AI in a separate post; the most important thing here is that AI does not change anything in the world, it just gives orders about what should happen)
  • locomotion (change from how and where AI wants to move into the actual velocity we want it to move with; this is also a step in which navigation data is used, but the navigation paths have to be prepared in the “advance logic” part)
  • doors logic (decide whether to open or shut them)
  • prepare movement, the first part (take everything from above, collisions, AI requirements, also what is possible for given object and calculate velocity that we would like to use in this frame)
  • animation logic – at the object level
  • calculate preliminary poses (that results from the animation logic)
  • animation controllers’ logic (this is when animation controllers advance but they are not doing actual IK calculations etc, they just plan their actions)
  • prepare movement, the second part (take any feedback from the animation and get the actual velocity)
    NOTE: up to this point we haven’t changed a thing in the world. We did a lot of work, but we haven’t changed a thing in the world. This is quite important and I will describe why, later on. Everything before this point is called “a logic part of a game frame”. Everything later is considered “a physical part of a game frame”.
  • move everything using velocities prepared in previous steps
  • post move actions (this is for presence module only, just for storing some data and triggering kill Z)
  • open and close doors
  • advance Points of Interest in rooms
  • advance Points of Interest in objects
  • post move actions (this is for gameplay modules)
  • calculate final pose (this is the final animation advancement) and update attachments
  • custom modules
  • timers
  • advance temporary objects (these are particles, projectiles etc.; basically, this part is everything above, but done only for those special objects)
  • build presence links (“presence links” define in which rooms objects are)
  • finalise frame (in this step, objects are readied for rendering: bones’ transforms are stored, materials are prepared; also sounds are advanced)
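Conceptually, the list above is a fixed pipeline: each step runs over the objects that need it, and the next step starts only when the previous one has finished. A much simplified C++ sketch (in the game each step fans out into jobs across cores with a barrier in between; here it runs sequentially):

```cpp
#include <functional>
#include <string>
#include <vector>

// A frame step: a name plus the work to run for this step.
// In the real game each step is split into jobs across cores and the
// next step starts only when all jobs of the previous one are done.
struct FrameStep {
    std::string name;
    std::function<void()> run;
};

void runGameFrame(const std::vector<FrameStep>& steps) {
    for (const auto& step : steps)
        step.run(); // here: sequential; in the game: fan out, then barrier
}
```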

But this is only the gameplay side of a game frame. We’re still left with rendering, sound and things that do not fall into a game frame.

The render/system thread is at the top (core 1). To the left of it, there is system message handling, VR advancing, calculation of the player’s hands after the latest VR poses are read and a few other small things. Of the highlighted parts, on the left, there is render scene building (there is a second render scene for the second eye on core 6). Next, in the middle, there is a big space: this is waiting for the rendering of the last frame to finish. Then, the last bit is the rendering of the current frame. This is also when the latest head pose is read and the render scenes are modified to reflect that. On core 5, the highlighted bit is preparing a sound scene and sending it to the sound system.

Rendering and sound are handled in a similar way:

  • build a scene (create a proxy for everything that is in the world, with the state that is relevant for rendering/sound). For the rendering, we actually create two scenes, one for each eye. For the sound, we’re ok with one scene. This has to be done within the logic part of the game frame. All scenes are prepared on different threads. This allows dealing with them as fast as possible without stopping game frame advancement.
  • render the created scenes or update the audio system (FMOD in my case). For the rendering, we do one extra thing: just before we render a frame, we read the latest pose from the VR system, to have the most up-to-date camera location. Because we have already built the render scene, we have to alter it a bit if required (especially when, because of the pose difference, we have moved through a portal).

The highlighted part is the game frame’s logic part. This is when nothing in the world is modified. The green line is when render scenes are created and the world can start getting modified. The yellow is when things are actually getting modified. Note that for more complex render scenes, when there are more rooms visible, the render scene preparation may take more time, moving the green line right.

You should notice by now that I try to pack as much work that doesn’t affect the actual state of the world as possible into the first part of game frame advancement. The latter part happens on any free core, as the render/sound threads have the highest priority. Also, the rendering thread is the main thread. There is one selected thread for game frame advancement but it only creates tasks/jobs and deals with a few other things that I will cover soon.

Very dark and very light rectangles are actual jobs being performed. You can notice that before each one gets started, a bit of time is spent on picking up another job to perform. The total time required is 0.0657ms. 0.0046ms is spent on collecting jobs to process.

It is important to mention that job management takes some time. Switching between jobs/tasks also takes time. That’s why I batch them. The batch size depends on how many jobs there are: the more data to process, the bigger the batches. This way I have as little idle time as possible and I also benefit from batching. At some point, I was considering storing the advancement time for each step but at the moment this is not required. Game frame advancement fits nicely, with a huge extra buffer.
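A possible way to pick the batch size from the job count, in the spirit of “more data, bigger batches” (the numbers are illustrative, not the game’s actual heuristic):

```cpp
#include <algorithm>
#include <cstddef>

// Pick a batch size from the number of jobs: more data, bigger batches,
// so per-job scheduling overhead is paid once per batch instead.
std::size_t batchSize(std::size_t jobCount, std::size_t workerCount) {
    // Aim for a few batches per worker so cores can still load-balance.
    std::size_t target = jobCount / (workerCount * 4);
    return std::max<std::size_t>(1, target);
}
```

Aiming for a few batches per worker keeps the scheduling overhead low while still leaving the cores something to pick up when the load is uneven.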

The gaps between individual jobs are gone and the time went down to 0.0524ms, even though collecting jobs requires much more time compared to the non-batched version (0.0083ms). These times are pretty consistent, although the number of samples I compared, 10, is quite small. Still, this could be improved: you should notice that one core didn’t do anything (not exactly sure why) and core 5 took more time because the load was not even. And yeah, the game loop time is much higher but that’s not related to the batching.

There is also one more optimization that I have. Some of the tasks are not advanced every frame. If an object is not visible (actually, if the room the object is in was not visible for some time), it may skip some advancements: collisions, AI (which is latent anyway), animations. This saves a lot.

This pictures how often a particular advancement is done for various objects. For the player, we always have to advance every frame. If we have an NPC that is not next to us but close enough, we can skip every odd frame. For more distant NPCs we may skip 2 or 3 frames. Sometimes, if possible, it might be good to skip updating altogether.
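A simple way to express “skip every odd frame” or “skip 2 or 3 frames” is a per-object period check, staggered by object id so the skipped objects don’t all pile up in the same frame. A sketch, assuming a global frame counter (not the game’s actual code):

```cpp
#include <cstdint>

// Decide whether an object advances this frame, based on how many
// frames it is allowed to skip (0 = every frame, 1 = every other, ...).
// Adding the object id staggers updates across frames.
bool shouldAdvance(std::uint32_t frameIndex, std::uint32_t objectId,
                   std::uint32_t framesToSkip) {
    std::uint32_t period = framesToSkip + 1;
    return (frameIndex + objectId) % period == 0;
}
```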

When I was working on AAA games, we had to resort to such optimizations when it came to animations. The more distant an object was, the more frames were skipped for it. This led to very strange bugs. It turned out that other parts of the code depended on animations advancing every frame. Sometimes it was for particular states, sometimes for particular objects; we had to make sure that in these cases, animations were advanced every frame. To avoid such bugs, which can end with the optimization being disabled, I decided to use this approach early in the project. The performance gain is significant. The game loop time goes down by about 20-40%.

Latent code in C++. What happens here is that the code seems to pause for 1 second. Actually, it just jumps out and we get back to it at the proper time. I based my solution on an article by Maciej Sawitus.

Oh, there’s one more thing similar to this. AI code uses latent functions. This means that for many frames the AI is just waiting. Then it does a bit in one frame and waits more. If you have lots of AI characters, you may end up doing heavy logic-related work for only a very few of them in each frame.
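For the curious, the classic switch/`__LINE__` trick behind such latent functions (the article mentioned above describes it in depth) can be boiled down to something like this. This is a toy version, not the game’s actual macros; it assumes a `dt` parameter is in scope where `LATENT_WAIT` is used:

```cpp
// Minimal protothread-style latent function. The function "pauses" by
// returning false and resumes at the stored label on the next call.
struct LatentState {
    int   label    = 0;    // where to resume
    float waitLeft = 0.0f; // remaining wait time in seconds
};

#define LATENT_BEGIN(state)   switch ((state).label) { case 0:
#define LATENT_WAIT(state, t) (state).waitLeft = (t); (state).label = __LINE__; \
                              case __LINE__: if (((state).waitLeft -= dt) > 0.0f) return false;
#define LATENT_END(state)     } (state).label = 0; return true;

// Returns true when the whole script has finished.
bool aiThink(LatentState& state, float dt, int& actionsDone) {
    LATENT_BEGIN(state)
        ++actionsDone;            // do a bit of work...
        LATENT_WAIT(state, 1.0f)  // ...then "sleep" for one second
        ++actionsDone;            // resumes here about a second later
    LATENT_END(state)
}
```

Both uses of `__LINE__` inside the multi-line macro expand to the invocation line, which is what makes the stored label and the case label match.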

Some of the asynchronous tasks may take much longer than one game frame. Most of the work might be done without accessing data processed during a game frame, but when we need to access or modify that data, it should be done in the synchronous window of a game frame.

Back to the topic. I showed how game frame advancement, rendering and sound are connected to each other. But this is not everything that is happening in the game. There are also extra tasks that may span over a few frames. Nav-mesh building, various asynchronous tasks.

Nav-mesh building is quite obvious. When I create a room or change something in it, the nav-mesh is built. When a nav-mesh has to be built, there is a request for a new nav-mesh building task; we manage all that at the beginning of a game frame, before we create scenes and advance game-related stuff.

World generation is actually a lot of smaller asynchronous tasks. Currently, it is possible to offload work to a different core and allow it to be processed “in the background” while the game advances in a normal way. With a single thread, one would have to break longer tasks into really small bits and process each bit during a normal game frame. That approach would require special treatment of tasks, allowing them to be interrupted and continued at particular times.

Various asynchronous tasks are anything that takes some time and does not have to be done immediately. These might be the generation of the world, adding details to rooms, spawning NPCs etc. Most of the work in these tasks does not affect the world. They are completely separate from anything that happens in the game. Most of the time. Because sometimes we have to sync with a game frame. We want to read something from the world or put a newly created object into the game. Any asynchronous task may switch to synchronous mode, just for a very brief moment. This moment is just after nav related tasks management and before scenes and game stuff. Most of the time, when an asynchronous task wants to do that, it will be the one waiting. The game frame is much more important and actually it just allows an asynchronous task to do something.

This is a zoomed-in asynchronous task. On core 1 (running the render/system thread) you see a green bit (this is the pre-advance (game) function), then there is a small blue bit followed by a longer one. The longer one is rendering (preparing render scenes and actual rendering). That small blue bit is managing jobs. The system looks for the next job, which might be either asynchronous or synchronous. Synchronous jobs are performed immediately; asynchronous jobs are issued and delegated to another core. Here, core 2 takes on the asynchronous job. That huge grey part is actually waiting to switch to synchronous mode, which happens just after the game loop is processed. Being in synchronous mode is displayed here as small red rectangles.

There are a few additional mechanisms on top of that. One of them is related to object activation. Because we may have multiple objects, or nested ones, we don’t want to activate them (put them into the world) one after another but all at once. We queue them and then mark them to be processed. The actual activation is divided into two parts. The first, getting objects ready to be activated, is an asynchronous job and it may require creating new objects (attachments etc). The second part is a very simple synchronous job: adding the readied objects to the world in a batch.
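A sketch of that two-phase activation (illustrative names; the real version deals with attachments, nested objects etc.): the asynchronous part readies objects without touching the world, the synchronous part adds them in one batch.

```cpp
#include <string>
#include <vector>

struct World { std::vector<std::string> objects; };

struct ActivationQueue {
    std::vector<std::string> pending; // queued for activation
    std::vector<std::string> ready;   // readied, waiting for the sync window

    // Asynchronous part: prepare everything, touch nothing in the world.
    void prepare() {
        for (auto& name : pending)
            ready.push_back(name + " (readied)");
        pending.clear();
    }

    // Synchronous part: a very simple batched insert into the world.
    void commit(World& world) {
        world.objects.insert(world.objects.end(), ready.begin(), ready.end());
        ready.clear();
    }
};
```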

If we added asynchronous tasks to spawn objects 5 and 6, object 5’s attachments would be processed after object 6 was spawned. For two objects this doesn’t seem to be an issue; for a few dozen you get objects missing or appearing out of the anticipated order.

Another example of such a system is “delayed object creation”. It just helps to create objects in order (because each object may have sub-objects that should be created together). If we wanted to create three different objects and added an asynchronous task for each one AND each one required more asynchronous tasks to be created (because we want something to be done, but not right now, only after we finish the current thing), all of them would interweave. To avoid that, object creation tasks are put into the delayed object creation queue and a new object is created only if there are no asynchronous tasks currently running/queued.
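The delayed object creation queue can be sketched like this (a toy version; `runningAsyncTasks` stands in for whatever the real system uses to know that the current creation still has work in flight):

```cpp
#include <deque>
#include <string>
#include <vector>

// Delayed object creation: requests wait in a queue and the next one
// starts only when no asynchronous task is in flight, so one object's
// sub-tasks never interleave with another object's.
struct DelayedCreation {
    std::deque<std::string> queue;   // objects waiting to be created
    int runningAsyncTasks = 0;       // tasks spawned by the current creation
    std::vector<std::string> created;

    void request(const std::string& name) { queue.push_back(name); }

    // Called once per frame by the task manager.
    void tick() {
        if (runningAsyncTasks == 0 && !queue.empty()) {
            created.push_back(queue.front());
            queue.pop_front();
        }
    }
};
```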

Core 2 is running the long asynchronous job “create next part” (this is creating a new level). During that time it creates a level layout, rooms, doors, connects stuff, generates the rooms themselves and so on. Every time a new important thing is created, be it a room, a door or an object, it has to be done as a synchronous job. Such synchronous jobs are represented here as red rectangles (on core 2). As you can also notice, the active level is much smaller at this moment (because it’s not yet created, right?). The window for synchronous tasks is when the game loop is not processed (that huge space in the lower part).

A very important thing to remember is that during asynchronous tasks we should not try to access anything that exists in the world, or the world itself. The world is constantly changing and at different periods of time, different things in it are being modified, added, deleted, replaced. That’s why asynchronous tasks run alongside the game and hop in during the synchronous window to do something.

Asserting asynchronous work, or Multi-Read/Single-Write guardians, are there only in development builds to make sure that a given part of the code is executed under certain assumptions.

Problems I ran into? Many. I have lots of concurrency checks to make sure I run either in a synchronous task or in an asynchronous task. I have lots of mechanisms to make sure that something is read/written when it is expected to be read/written. All that stuff runs only in development mode. As I already mentioned, I also have locks (spinlocks and multi-read/single-write locks) that might result in short waiting times. They are there to make it easier to wrap your head around what’s going on and to avoid adding extra mechanisms to queue, process and distribute stuff.

One of the problems I ran into quite early during development was trying to avoid breaking the frame into more and more steps. Especially when I noticed that the time required just to get all the required objects and process them was getting bigger and bigger, and at one point more time was spent administrating tasks than doing the actual work. I solved that by batching jobs but also by selecting and marking objects that require something to be done in a particular step. Ideally, it would be great to avoid steps that do very little work, but if it is not possible to put that work somewhere else, it’s better to waste the extra administration time and have clean code.

I had one single moment when I was devastated and wanted to give up. It was when I added lots of background objects to the game. Two things happened. The level generation time rose to two minutes. From 10 seconds. And the framerate dropped from much more than 90 fps to 20 fps. I didn’t know which one was worse. And I didn’t want to get rid of all those background objects. Solutions came quite quickly:

These are two different windows from two different zones, sharing the same vista. The vistas are separate entities that can be referenced/shared by different rooms.

First, I decided to share vistas between windows. Most of the time you see the same stuff when you look out of a window. Having a different light direction that is applied during rendering only helps too. Hey, smoke and mirrors!

This is what is output to the console in development builds (that’s also why it is much slower and there are “SLOW” warnings). Right after the level is generated (“generated current region in 6.42s”) the game generates one navmesh and proceeds to handle delayed object creation. One after another.

Second, I divided everything into “we need that to have the level running” and “this can wait”. This sped up level generation a lot. It was then just 5 seconds to generate the required content. Everything else, the NPCs, decorations, vistas etc., is created when the level starts. Because the station door takes a while to open and because the player’s movement speed is limited (how fast can you run?), there is a lot of additional time to create all that stuff. And in the final game, a player may want to buy/sell stuff, craft something etc. This is when I added more complex asynchronous task management (before, it was just “add asynchronous task”; after that I had the synchronous window, world jobs (asynchronous tasks) and the “delayed object creation” queue). Right now, the levels are more complex and there are more NPCs but a shorter level still takes just a few seconds to create.

It is enough to make an object static. And it saves lots of CPU when you have dozens of such objects.

I still had to deal with the framerate. That’s when I introduced a “static object” marker. If an object does not change during its lifetime, it’s advanced just once and then left as it is. It can be switched back to an active state, if there is such a need.
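The marker itself can be as simple as an early-out in the advance function (a sketch with made-up names):

```cpp
// "Static object" marker: advance once, then skip until something
// reactivates the object.
struct GameObject {
    bool isStatic     = false;
    bool advancedOnce = false;
    int  advanceCount = 0; // stands in for the expensive per-frame work

    void advance() {
        if (isStatic && advancedOnce)
            return;        // settled: costs almost nothing per frame
        ++advanceCount;    // the actual (expensive) update would be here
        advancedOnce = true;
    }

    void reactivate() { advancedOnce = false; } // back to the active state
};
```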

I was back to short level generation time and framerate back to more than 90. With lots of additional objects in the scene.

I have heard many times that multithreading may result in the most horrible kind of bugs. And there are a few kinds:

Task C requires the output of tasks B and E. Task B requires the output of tasks A and E. Task E requires the output of tasks A, B and D. Tasks B and E are deadlocked. Of course, in real life you wouldn’t have two tasks depending directly on each other. But you could have a few dozen tasks out of which you could separate two chains that depend on each other’s results.

1. Deadlocks. Everything freezes because a task is waiting for another task that is actually waiting for the first task’s results. Avoid such dependencies. It is tempting to just add a bunch of tasks, tell them what they require and what they provide, and have a system auto-solve that. Sooner or later someone will add a task that breaks that careful chain of dependencies and puts a loop in it. And it will get triggered only once in a while. Break the game frame into discrete steps with a defined purpose. Add more steps if required. Avoid putting stuff inside something else just because it’s easier. If you can’t be sure what is happening in a particular step, this will hit you. Badly.

This example looks like something easy to avoid (always lock data A first, then data B). In real cases, it might be subfunctions locking stuff, lots of various things that can get locked, etc. Unless you keep locking things for very short periods of time, just to perform single, simple operations.

2. Deadlocks may also happen because lots of stuff is locked at the same time. A task wants to take a few different locks. And another task wants access to the same stuff but maybe takes the locks in a different order. Keep locking in the same order? Maybe, but it is better to lock as little as possible. And for as short a period of time as possible. There might still be issues with memory access and such, though; you have to remember about that too. Such locking should be considered a solution only for cases that happen rarely, where locks are there to prevent those one-in-a-million situations.
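When two locks really must be held together, the standard library already solves the ordering problem: `std::scoped_lock` (C++17) acquires all given mutexes with a deadlock-avoidance algorithm, so two tasks naming them in a different order cannot deadlock each other. A minimal sketch (illustrative types, not the game’s code):

```cpp
#include <mutex>

struct DataA { std::mutex lock; int value = 0; };
struct DataB { std::mutex lock; int value = 0; };

void transfer(DataA& a, DataB& b, int amount) {
    // Acquires both mutexes atomically with respect to deadlock:
    // either we get both or we wait, regardless of argument order
    // used elsewhere in the codebase.
    std::scoped_lock guard(a.lock, b.lock);
    a.value -= amount;
    b.value += amount;
}
```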

3. Something is broken somewhere because something else is doing something. That’s a lot of “some”s but when you look at multithreading bugs, you end up with lots of stuff happening at the same time. And sometimes when you are notified about an issue, you check the code and don’t see anything suspicious in the thread where the bug occurred. Then you realise that there are 5 or 6 different threads doing their own duties, that one of them could have done something just a fraction of a millisecond ago, and you can’t even tell which one because the call stack only tells you what is happening right now. This is a problem similar to spaghetti code. You just add one more dimension to it and you have multithreading spaghetti. Logging stuff may slow you down (especially when you mess something up and introduce locks there; I had locks in my performance measuring tools, which resulted in just one task doing actual work while the others waited for the performance tool to be available).

My approach is to try to prevent such things from happening. Clear inputs and outputs that don’t mess with each other. Have things separated. For most cases, it should be quite easy to do so. The biggest offender here is the gameplay code. Dealing damage etc. You can either queue stuff to process in a separate step or lock. Some of these bugs are easy to repro (with dual wielding, shooting from both guns at once at the same target resulted in damage code running for the same object on two different threads).

Core 2 runs the game loop (taking 4.24ms); the other ones are either system/render or just workers. This is a bigger level, running at 90fps, and there is still lots of time left to do stuff.

One thing that is good and bad at the same time is that with multithreading working properly and lots of optimizations, you may end up using 30% of the CPU. Which may make you write some odd code that is not the best, not the fastest. Because why would you care if you still have lots of time to waste? At least make it easy to read. And limit such cases to AI and gameplay; they go way beyond what’s happening right now and sometimes require a bigger picture to understand. But, for example, collision detection code? Single purpose. Detect collisions. Do whatever it takes to make it run as fast as possible.

That’s it. This should give you an insight into how I managed to get the game running at 90fps and avoid common multithreading issues (at least neither I nor anyone who tried the game ran into them; yes, I had crashes, but they were not related to multithreading). If you’d like to ask me something or share your experience, please do so. Even if you want to tell me that this system is pure nonsense. Well, it works, but I am sure there might be better approaches, and if not me, others will benefit if they learn about a better one.