Parallel with GPU

How GPU Works

Explaining "How GPU Works" properly in a very short space is a challenging task, and there are already numerous resources available that do it well.

Therefore, in this chapter, I will simply explain the prerequisite knowledge required to understand the upcoming topics.

A Large Number of Threads

You can think of a GPU as a CPU with many cores.

On top of this, the GPU can also manage a huge number of threads at the logical level and let them work in parallel, at least logically.

SIMT

However, if we wanted to manage and run these threads as independently as a CPU does, the hardware resources required would turn the GPU into a giant computer. The GPU therefore makes trade-offs: its cores are not as complex and powerful as those of a CPU. Additionally, GPU threads are managed in groups, where threads within the same group execute the same instructions in lockstep.
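This lockstep (SIMT) execution has a cost when threads branch. The following is a toy Python simulation, not real GPU code: when lanes in a group take different sides of a branch, the whole group executes both sides, with inactive lanes masked out, so the group pays for both paths.

```python
# Toy simulation of SIMT lockstep execution (illustrative only):
# a group of "lanes" runs one shared instruction stream, and a branch
# is handled by masking lanes rather than letting threads diverge.

def simt_branch(values):
    """Each lane holds one value; both branch bodies run for the
    whole group, with inactive lanes masked out."""
    results = list(values)
    steps = 0

    # Branch condition: even values take the 'then' side.
    mask_even = [v % 2 == 0 for v in values]

    # The group executes the 'then' side for every lane...
    steps += 1
    for i, active in enumerate(mask_even):
        if active:
            results[i] = values[i] // 2

    # ...then the 'else' side, again as a whole group.
    steps += 1
    for i, active in enumerate(mask_even):
        if not active:
            results[i] = 3 * values[i] + 1

    # Two steps were paid even though each lane needed only one.
    return results, steps

print(simt_branch([4, 7, 10, 5]))  # ([2, 22, 5, 16], 2)
```

If every lane had taken the same side of the branch, only one step would have been needed; divergence within a group is what makes the second step unavoidable.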

Compute Shader

A Compute Shader is a Shader Stage that is used entirely for computing arbitrary information. While it can do rendering, it is generally used for tasks not directly related to drawing triangles and pixels. From: https://www.khronos.org/opengl/wiki/Compute_Shader

Unreal Engine makes extensive use of Compute Shaders for GPU-driven logic, including tasks such as software ray tracing calculations and GPU-based culling.

Parallelism is Important

Compared to the left, the right one spends less time and keeps the cores fully used

For logic executed in the form of Compute Shaders, parallelism is a crucial aspect.

If one thread within a group has a heavier workload than the others, the remaining threads will sit idle, waiting for it to finish.

This is one of the optimizations that can and should be focused on without changing the computational complexity of the algorithm.
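A simple cost model makes this concrete. In the sketch below (plain Python, purely illustrative), a group finishes only when its slowest thread does, so the same total work distributed unevenly wastes far more cycles in idle waiting.

```python
# Toy model: a thread group finishes only when its slowest thread
# does, so one heavy thread leaves the rest of the group idle.

def group_time(workloads):
    """Time for the whole group = the heaviest thread's workload."""
    return max(workloads)

def idle_time(workloads):
    """Total cycles the other threads spend waiting for the slowest."""
    t = group_time(workloads)
    return sum(t - w for w in workloads)

unbalanced = [10, 1, 1, 1]   # one thread does almost everything
balanced   = [4, 3, 3, 3]    # same total work (13), spread evenly

print(group_time(unbalanced), idle_time(unbalanced))  # 10 27
print(group_time(balanced), idle_time(balanced))      # 4 3
```

Both distributions perform 13 units of work, i.e. the algorithmic complexity is unchanged, but the balanced version finishes in less than half the time with far fewer idle cycles.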

Persistent Threads

In the Compute Shader-related Passes of Unreal Engine, Persistent Threads are extensively used to reduce the number of Passes.

This means that within a Compute Shader Pass, the same Thread may have different responsibilities in different stages of the Compute Shader. For example:

  • Thread 0 is initially responsible for copying data from VRAM to the Group Shared storage space.
  • Next, Thread 0, along with other Threads, acts as a Work thread and executes processing in parallel.
  • Finally, Thread 0 may gather all the results and write them to the output Buffer.

This technique reduces the number of Passes: instead of three separate Passes, one for each thread responsibility, a single Pass suffices.
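The three stages above can be sketched as a toy Python simulation (not HLSL). The stage boundaries marked as barriers stand in for a group-wide synchronization such as HLSL's GroupMemoryBarrierWithGroupSync(); the data and stage logic here are invented for illustration.

```python
# Toy sketch of the persistent-thread pattern: the same threads take
# on three different roles within a single "pass", separated by
# barriers (which stand in for group-wide synchronization on a GPU).

def persistent_pass(vram_input, group_size=4):
    group_shared = [0] * group_size  # stands in for groupshared memory

    # Stage 1: thread 0 alone copies data from "VRAM" to group shared.
    for tid in range(group_size):
        if tid == 0:
            for i, v in enumerate(vram_input):
                group_shared[i] = v
    # --- barrier: all threads wait until stage 1 is complete ---

    # Stage 2: every thread acts as a work thread on its own slice.
    for tid in range(group_size):
        group_shared[tid] = group_shared[tid] * group_shared[tid]
    # --- barrier ---

    # Stage 3: thread 0 gathers the results into the output buffer.
    output = []
    for tid in range(group_size):
        if tid == 0:
            output = [sum(group_shared)]
    return output

print(persistent_pass([1, 2, 3, 4]))  # [30]
```

Note how thread 0 appears in all three stages with a different responsibility each time; on a real GPU the barriers are what make it safe for its role to change between stages.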

Thread Mapping

In the following content, I will discuss Thread Mapping multiple times. Because of the Persistent Threads design, I need to explain explicitly how tasks are assigned to the different groups and threads in each execution stage of the Compute Shader.

If you are not familiar with Compute Shader, please make sure to read the following content to help you visualize how they work in parallel.