Motion Blur

Overview

Motion blur blurs objects based on their motion. The system works by way of a full-screen velocity map that is created at a reduced resolution; objects are blurred based on their contribution to this map. The image below shows a visualization of the velocity map, available in-editor or while your game is running.

Therefore, at a high level, the rendering of Motion Blur can be seen as two stages:

  • Rendering Velocity Map
  • Blurring in the corresponding direction based on Velocity Map

Source of Motion Vector

Motion Vector is not equivalent to Velocity. Instead, we should think of the Motion Vector as answering the following question:

In screen space, where was the current pixel located in the previous frame? The difference vector between the two positions is the Motion Vector.

Furthermore, there are actually two sources of Motion Vector:

  • Camera
  • Object itself

The movement and rotation of the camera cause a global displacement of the pixels on the screen. And obviously, the movement of an object itself moves its corresponding pixels.

Velocity Map Rendering

In the Base Pass of GBuffer rendering, each vertex's previous-frame position is computed from that object's previous-frame transformation matrix. The current-frame position is then compared with the previous-frame position to obtain the velocity vector.

image
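The reprojection above can be sketched on the CPU. This is a minimal illustration, not UE's actual shader code; `project` and the matrix layout are assumptions made for the example.

```python
def project(m, p):
    """Apply a hypothetical 4x4 row-major matrix to a 3D point; return NDC xy."""
    hp = p + (1.0,)  # homogeneous coordinate
    x, y, z, w = (sum(m[r][c] * hp[c] for c in range(4)) for r in range(4))
    return (x / w, y / w)

def motion_vector(world_pos, prev_world_pos, view_proj, prev_view_proj):
    # Project the same vertex with this frame's and last frame's matrices.
    curr = project(view_proj, world_pos)
    prev = project(prev_view_proj, prev_world_pos)
    # The motion vector points from the previous to the current screen position.
    return (curr[0] - prev[0], curr[1] - prev[1])
```

Note that both sources of motion are covered at once: a camera move changes `view_proj` vs. `prev_view_proj`, while an object move changes `world_pos` vs. `prev_world_pos`.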

Visualize Velocity Map

Through the Show → Visualize → Motion Blur option, you can visualize the Velocity Map.

image
image

Motion Blur Rendering

Flatten

The calculation of Velocity Flatten is shown in the following diagram:

image
  • Convert each pixel's Motion Vector from the Cartesian coordinate system to the polar coordinate system.
  • Then, through a parallel process similar to reduce-sum, compute the range of polar-coordinate velocities within the current local region, per tile.
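The two steps can be sketched as follows. This is an illustrative CPU version only; the function names are made up for the example, and the per-tile reduction is written as plain `min`/`max` rather than the GPU parallel reduce.

```python
import math

def to_polar(vx, vy):
    # Length and angle of a motion vector.
    return (math.hypot(vx, vy), math.atan2(vy, vx))

def tile_velocity_range(polar_velocities):
    # Min/max range over one tile, mirroring what the parallel reduce computes.
    lengths = [length for length, _ in polar_velocities]
    angles = [angle for _, angle in polar_velocities]
    return (min(lengths), max(lengths)), (min(angles), max(angles))
```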

Min Depth Trick

When calculating Motion Vectors, instead of directly calculating based on the current pixel position, a fast search is performed on the surrounding area of the current pixel. The pixel with the lowest depth (i.e., closest to the camera) among the surrounding pixels is used as the reference for calculation.

According to the code comments, this helps generate higher quality contours.
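A minimal sketch of that neighborhood search, assuming a simple square window (the actual search pattern in the shader may differ):

```python
def closest_depth_offset(depth, x, y, radius=1):
    """Return the offset of the neighbor with the smallest depth value
    (i.e., closest to the camera) around pixel (x, y)."""
    best = (0, 0)
    best_depth = depth[y][x]
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            ny, nx = y + dy, x + dx
            if 0 <= ny < len(depth) and 0 <= nx < len(depth[0]):
                if depth[ny][nx] < best_depth:
                    best_depth = depth[ny][nx]
                    best = (dx, dy)
    return best
```

The Motion Vector is then fetched at the offset pixel instead of the center pixel.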

Implementation Details of Parallel Reduce

In the ReduceVelocityFlattenTile function, you will see the following code:

	FVelocityRange VelocityPolarRange = SetupPolarVelocityRange(VelocityPolar);
	WritePolarVelocityRangeToLDS(GroupIndex, VelocityPolarRange);
	GroupMemoryBarrierWithGroupSync();
	VelocityFlattenStep(GroupIndex,  128,  VelocityPolarRange);
	GroupMemoryBarrierWithGroupSync();
	VelocityFlattenStep(GroupIndex,  64,  VelocityPolarRange);
	GroupMemoryBarrierWithGroupSync();

	VelocityFlattenStep(GroupIndex,  32,  VelocityPolarRange);
	VelocityFlattenStep(GroupIndex,  16,  VelocityPolarRange);
	VelocityFlattenStep(GroupIndex,   8,  VelocityPolarRange);
	VelocityFlattenStep(GroupIndex,   4,  VelocityPolarRange);
	VelocityFlattenStep(GroupIndex,   2,  VelocityPolarRange);
	VelocityFlattenStep(GroupIndex,   1,  VelocityPolarRange);
	OutVelocityPolarRange = VelocityPolarRange;

I was confused when I first read this piece of code:

Why do 128 and 64 require GroupMemoryBarrierWithGroupSync, but the subsequent ones don't?

I found the answer in this Slide:

Simply put:

  • If we imagine GPU threads as CPU threads, then each call to VelocityFlattenStep must be followed by a GroupMemoryBarrierWithGroupSync to ensure that the next call to VelocityFlattenStep sees the latest data written by other threads.
    • This mental model is correct for the 128 and 64 steps.
  • However, once the number of active threads is small enough to fit within a single GPU warp, the entire warp executes in SIMD lockstep. We can imagine all of its threads executing the code line by line, in sync.
  • In that case, once one thread has executed VelocityFlattenStep, from the perspective of the whole warp, every thread has finished executing it.
  • Therefore, GroupMemoryBarrierWithGroupSync is no longer needed.
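A CPU simulation makes the tree reduction easy to see. This sketch uses max as a stand-in for the min/max range merge, and the comments mark where the barrier matters; it is an illustration, not the shader itself.

```python
def reduce_max(values, warp_size=32):
    """CPU sketch of the tree reduction in ReduceVelocityFlattenTile."""
    lds = list(values)          # stands in for groupshared memory (LDS)
    stride = len(values) // 2   # 128 for a 256-thread group
    while stride >= 1:
        # On the GPU, threads with GroupIndex < stride run this in parallel.
        for i in range(stride):
            lds[i] = max(lds[i], lds[i + stride])
        # A GroupMemoryBarrierWithGroupSync is needed only while more than one
        # warp is still active (stride > warp_size); once the remaining active
        # threads fit inside a single warp, they already execute in lockstep.
        stride //= 2
    return lds[0]
```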

Tile Scatter

Next, UE scatters each tile's velocity information. That is, it spreads the tile's maximum and minimum velocities outward on both sides along the velocity direction.

image

The problem encountered here is that the direction of the Motion Vector may be tilted and its length may be long.

If we scattered this information with a naive compute-shader kernel, the efficiency would be very low. Therefore, UE cleverly implements the scatter using the existing rasterization system.

image

How to calculate the scatter area?

image
  • UE first uses Instanced Rendering to draw the same number of square quads as pixels.
  • In the vertex shader, it samples the tile the quad belongs to and offsets the vertex positions based on the maximum velocity in that tile.
    • This effectively stretches the quad according to the velocity.
  • The quad is then rotated to align with the velocity direction, which handles velocity directions that are tilted.
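What the vertex shader does per quad can be sketched geometrically. This is a hypothetical CPU illustration; the exact sizing and rotation math in UE's shader will differ in detail.

```python
import math

def scatter_quad(tile_center, max_velocity):
    """Stretch a unit quad along the tile's max velocity, then rotate it to
    match the velocity direction. Returns the four corner positions."""
    vx, vy = max_velocity
    length = math.hypot(vx, vy)
    half_len = 0.5 + length        # stretched half-extent along the velocity
    angle = math.atan2(vy, vx)
    c, s = math.cos(angle), math.sin(angle)
    corners = []
    for ux, uy in [(-1, -1), (1, -1), (1, 1), (-1, 1)]:
        # Local corner: scaled along the velocity axis, unit-wide across it,
        # then rotated into the velocity's frame.
        lx, ly = ux * half_len, uy * 0.5
        corners.append((tile_center[0] + lx * c - ly * s,
                        tile_center[1] + lx * s + ly * c))
    return corners
```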

So, how does UE compute the minimum and maximum velocity within a region? It uses the Z-test again to solve this problem:

  • First, it sets the depth test to Less and outputs the length of the Min Velocity directly as depth. This means only the value of the pixel with the smallest velocity in an overlapping region is preserved.
  • image
  • Next, it sets the depth test to Greater and outputs the length of the Max Velocity directly as depth. However, only the B and A channels of the Render Target are written (corresponding to Max Velocity). This means only the pixel with the largest velocity in an overlapping region records its information in the Max Velocity channels.
  • image
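The effect of the two depth-tested passes can be emulated in plain code. This is only a sketch of the idea; the record layout (`[min_len, min_dir, max_len, max_dir]`) and function name are made up for the example.

```python
def scatter_min_max(target, footprint, min_v, max_v):
    """Emulate the two passes: a Less pass that keeps the smallest Min
    Velocity (first two channels) and a Greater pass that keeps the largest
    Max Velocity (last two channels), per overlapped pixel.
    target: dict pixel -> [min_len, min_dir, max_len, max_dir]
    min_v, max_v: (length, direction) pairs for one scattered tile."""
    for p in footprint:
        rec = target.setdefault(p, [float('inf'), 0.0, float('-inf'), 0.0])
        # Pass 1: depth test Less on |min velocity|, writes the min channels.
        if min_v[0] < rec[0]:
            rec[0], rec[1] = min_v
        # Pass 2: depth test Greater on |max velocity|, writes the max channels.
        if max_v[0] > rec[2]:
            rec[2], rec[3] = max_v
```

After two overlapping tiles scatter onto the same pixel, that pixel holds the smaller of the two min velocities and the larger of the two max velocities.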

Tile Classify

Since we only need to focus on the tiles that actually contain Motion Vectors, UE chooses to merge all the tiles in the frame. This is similar to the merging system seen in the earlier Lumen analysis.

image

If you still find the merging part in the middle difficult to picture, that's normal.

Let's focus on only one type of Tile and assume we have only 4 threads divided into 2 groups. This diagram might help you better visualize what is happening:

image
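The toy setup from the diagram can also be written out as code. This is a sequential stand-in for what the GPU does with a group-wide count plus an atomically reserved output slot; the function name and the `count()`-as-atomic-counter trick are illustrative assumptions.

```python
from itertools import count

def classify_tiles(tile_has_motion, group_size=2):
    """Compact the indices of tiles that contain motion into a dense list,
    with threads split into fixed-size groups as in the diagram."""
    out = {}
    next_slot = count()  # stands in for an atomically incremented counter
    for group_start in range(0, len(tile_has_motion), group_size):
        # Each "group" of threads appends its active tiles in turn.
        for tile in range(group_start,
                          min(group_start + group_size, len(tile_has_motion))):
            if tile_has_motion[tile]:
                out[next(next_slot)] = tile
    return [out[i] for i in range(len(out))]
```

The later blur passes then only dispatch work for the compacted list instead of every tile on screen.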

Motion Blur Filter

image

Color Sample

  • The total number of samples is SampleCount, which is computed from the length of the velocity vector, taking the maximum value within the current tile.
  • Afterwards, the samples are divided into groups of four:
    • Each group shares the same base length.
    • The length of the sampling vector within each group gets two random perturbations.
    • Sampling is done in both the positive and negative directions.
    • Therefore, there are four samples per group.
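The grouping scheme above can be sketched as follows. This is an assumed illustration of the pattern (shared base length, two jittered lengths, mirrored directions), not UE's exact sampling code.

```python
import random

def sample_offsets(sample_count, velocity, seed=0):
    """Generate blur sample offsets in groups of four along the velocity."""
    rng = random.Random(seed)
    vx, vy = velocity
    offsets = []
    groups = sample_count // 4
    for g in range(groups):
        base = (g + 0.5) / groups          # shared base length for the group
        for _ in range(2):                 # two random perturbations per group
            t = base + (rng.random() - 0.5) / groups
            offsets.append((t * vx, t * vy))     # positive direction
            offsets.append((-t * vx, -t * vy))   # negative direction
    return offsets
```

The scene color is then fetched at each offset and averaged.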

Compare sampling only along the motion-vector direction vs. sampling in both the positive and negative directions:

Only one direction
Both directions