Temporal Super Resolution

⚠️
There is little public information about Unreal Engine 5's TSR, so I cannot guarantee the accuracy or timeliness of the analysis in this section.
  1. If your engine version is not 5.2, the source code you see may be completely different from what I saw.
  2. The following analysis may not reflect the author's original intention; I can only speculate about the original implementation ideas.

Temporal AA and Super Resolution

In the official documentation, TSR technology is described as follows:

Temporal Super Resolution has the following properties:
  • Rendered frames approach the quality of native 4K with input resolutions as low as 1080p.
  • Less "ghosting" artifacts against high-frequency backgrounds than was visible with Unreal Engine 4's default Temporal Anti-Aliasing method.
  • Reduced flickering on geometry with high complexity.
  • Supports Dynamic Resolution scaling on console platforms.
  • Runs on any hardware that supports D3D11, D3D12, Vulkan, Metal, PlayStation 5, and Xbox Series S | X.
  • Shaders are optimized specifically for PlayStation 5 and Xbox Series S | X GPU architectures.

So, we should understand TSR as a combination of two technologies: anti-aliasing and supersampling.

Our next analysis is divided into two parts:

  • How to implement anti-aliasing
  • How to implement supersampling

Background

ℹ️
If you are already familiar with the knowledge related to Anti-aliasing and Super Resolution, you can skip this section.

Aliasing

image

What causes the appearance of jagged edges during rendering?

In the absence of supersampling, the shading of a pixel is determined solely by the color of the triangle covering the center of that pixel, independent of its surroundings. This is not actually correct: a pixel may be partially covered by several triangles. If we subdivide the pixel and then merge the sub-samples, we obtain a color that blends the colors of the two triangles, which is the true color this pixel should have.

image
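Conceptually (this is not engine code), "subdivide and then merge" amounts to shading several sub-sample positions inside the pixel and averaging them. In the sketch below, ShadeAt is a hypothetical function returning the color of whichever triangle covers a screen-space point:

```cpp
struct Color { float R, G, B; };

// Hypothetical shading function: returns the color of whichever triangle
// covers the given screen-space point.
Color ShadeAt(float X, float Y);

// Conceptual 2x2 supersampling of one pixel: shade four sub-sample
// positions inside the pixel and average them, which blends the colors
// of all triangles partially covering the pixel.
Color SupersamplePixel(int PixelX, int PixelY)
{
    const float SubOffsets[4][2] = {
        {0.25f, 0.25f}, {0.75f, 0.25f}, {0.25f, 0.75f}, {0.75f, 0.75f}
    };

    Color Sum = {0.0f, 0.0f, 0.0f};
    for (const auto& Offset : SubOffsets)
    {
        const Color Sample = ShadeAt(PixelX + Offset[0], PixelY + Offset[1]);
        Sum.R += Sample.R;
        Sum.G += Sample.G;
        Sum.B += Sample.B;
    }
    // The averaged result approximates the pixel's true coverage color.
    return {Sum.R * 0.25f, Sum.G * 0.25f, Sum.B * 0.25f};
}
```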

To summarize the problem we just discussed:

Aliasing comes from our inability to capture information smaller than 1 pixel.

Anti-Aliasing

There are many ways to compensate for this deficiency. For the sake of brevity, we will only discuss methods related to TAA.

One approach, as shown in the image above, is to directly subdivide the pixels and then render, which means we need to make four pixel shader requests.

image

If we allow for some loss of accuracy, we can save some pixel shader execution requests. This is the basic idea behind MSAA.

image

Please note: the actual MSAA is much more complicated than this diagram. This is just a simple overview of the concept.

Another idea: if we spread the 4 sampling tasks over 4 frames, can we also reduce the number of pixel shader runs? The answer is yes.

When rendering each frame, we slightly offset (jitter) the camera's projection matrix, so that the pixel centers fall on different sample positions and thus potentially on different triangles.

We can accumulate the results in a running-average buffer to achieve color blending, or record them in a higher-resolution buffer. Either way, because we use more samples to reconstruct each pixel, we get a smoother result.

image
image
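A conceptual sketch of this idea (not TSR's actual code): a hypothetical 4-frame jitter sequence offsets the sample position each frame, and a running average blends the new sample into the history:

```cpp
struct Color  { float R, G, B; };
struct Jitter { float X, Y; };

// Hypothetical 4-frame sub-pixel jitter sequence: each frame the projection
// is offset so pixel centers land on a different sub-pixel position.
Jitter GetFrameJitter(unsigned FrameIndex)
{
    static const Jitter Sequence[4] = {
        { 0.25f,  0.25f}, {-0.25f,  0.25f}, { 0.25f, -0.25f}, {-0.25f, -0.25f}
    };
    return Sequence[FrameIndex % 4];
}

// Blend the current frame's jittered sample into a running-average history.
// A small BlendWeight keeps most of the history, so roughly the last
// 1 / BlendWeight frames contribute to each pixel.
Color AccumulateTemporalSample(Color History, Color CurrentSample, float BlendWeight = 0.1f)
{
    return {
        History.R + (CurrentSample.R - History.R) * BlendWeight,
        History.G + (CurrentSample.G - History.G) * BlendWeight,
        History.B + (CurrentSample.B - History.B) * BlendWeight
    };
}
```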
ℹ️
Once again, a reminder: this is a "conceptual explanation". If you would like to learn more about how MSAA works, please refer to the relevant material. Here is a more detailed explanation. Regarding TAA, we will discuss it further in the following section.

Execute Time

However, before we start analyzing those two topics, let us first discuss when TSR executes and the overall structure of its operation.

image

As shown in this image from the official website, TSR runs after the Depth of Field pass and, by default, before the other post-process effects are executed.

Overview

The following image shows the approximate execution flow of TSR. It may look confusing at first; don't worry, we will delve into many of its parts.

image

History Buffer and Feedback

As the name suggests, TSR needs to store historical information. In the diagram, the historical information that will be used for the next frame is marked with a dashed line. I call this "Feedback".

It should be noted that part of the HistoryBuffer actually has twice the resolution of the upscaled final output.

Spatial Anti-aliasing

In this step, TSR analyzes the Luma of the input image and extracts, for each pixel, the distance to the nearest edge in order to guide the subsequent steps. It also outputs the noise intensity.

Why are edges so important? In the era of UE4, TAA was criticized for often making the image look a bit blurry.

In fact, the human eye is more sensitive to edges. Therefore, when performing anti-aliasing and super-resolution, "protecting" edges is a very important thing.

If you want to know more about the process, this flowchart below may help you visualize it. Please note that the height of the Cube in the image does not represent the height in 3D space, but rather the Luma intensity of the pixel.

image
  • First, TSR attempts to obtain differentials by analyzing the information surrounding the pixels. This is simply achieved by subtracting the central pixel from the surrounding pixels.
  • Furthermore, TSR analyzes the variance of the Luma of the surrounding pixels, which guides the subsequent calculations.
  • After this, TSR attempts to find the direction of the edges. Note that there are only two edge directions here: vertical and diagonal. All analysis is done in screen space.
  • Based on the edge direction, TSR determines the direction of the edge search and ultimately obtains the length of the edge. The maximum search length is 8 pixels in both positive and negative directions.
  • Finally, the edge length is stored in the output texture.
ℹ️
For readers who wish to personally read the source code: this section of the TSR code uses a technique called "dual pixel vectorization" to achieve calculations for 2 pixels at once in a single lane. However, this technology will not be enabled on the platform analyzed in this study, so it will be skipped in the analysis.
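Below is a heavily simplified, illustrative sketch of the edge-length search described above; it is not the actual dual-pixel-vectorized shader. It assumes a hypothetical Luma(x, y) lookup, a horizontal edge, and a simple contrast-based stop criterion; the real shader handles both edge directions and uses its own thresholds.

```cpp
#include <cmath>

// Hypothetical: luma of the input pixel at integer screen coordinates.
float Luma(int X, int Y);

// Walk along a (here: horizontal) edge from the pixel (X, Y) and return how
// many pixels the edge extends, up to 8 in the positive and negative directions.
int FindEdgeLength(int X, int Y)
{
    const float Contrast = std::abs(Luma(X, Y) - Luma(X, Y + 1)); // edge strength at the start
    const float StopThreshold = 0.5f * Contrast;                  // assumed stop criterion

    int Length = 0;
    for (int Dir = -1; Dir <= 1; Dir += 2)                        // negative, then positive direction
    {
        for (int Step = 1; Step <= 8; ++Step)
        {
            const float StepContrast =
                std::abs(Luma(X + Dir * Step, Y) - Luma(X + Dir * Step, Y + 1));
            if (StepContrast < StopThreshold)
            {
                break;                                            // the edge has ended in this direction
            }
            ++Length;
        }
    }
    return Length;                                                // stored into the output texture
}
```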

Update History

To understand the underlying principles, let's start by discussing the simplest case:

If we want to distribute the rendering work of a frame over multiple frames, how can we do it?

One very simple idea is Checkerboard Rendering. Many PS4 Pro games use this method to output 4K-resolution images.

image

Why a checkerboard? Imagine another approach: rendering only the top or the bottom half of the screen each frame. If the camera moves, we would then see a very obvious dividing line across the middle.

image

On the other hand, in a perfect world, the camera, scene, and light sources do not move. In such a perfect world, rendering 1 pixel per frame and waiting for a long time would still result in a perfect image.

This leads to an important issue for any system attempting to use temporal sampling to reconstruct higher resolution outputs:

How to reject unnecessary samples? How to determine the validity of historical data?

If we cannot answer this question well, we will encounter the famous Ghosting phenomenon:

image

Different game engines use different approaches. The Unreal Engine's approach has been extensively discussed in the official PPT, so be sure to check that out.

Now let's start analyzing Unreal Engine's implementation.

TL;DR

When deciding to what extent historical information should be mixed with current frame input, TSR takes into account the following information:

  • Whether the current pixel has just been disoccluded or has just come on screen: if so, the weight of historical information is reduced
  • The velocity information of the current pixel: if it is moving at high speed, the weight of historical information is reduced. However, additional processing is applied to high-contrast areas to avoid flickering.
  • The Luma information of the current and historical pixels
  • The reliability information of the historical pixels: this comes from the MetaData Buffer.

Overview

This is what we have to deal with even when we only need to render a Cube. Note that I skipped the parts related to transparent rendering to make it easier to follow.

image

Overall, the implementation of this section can be seen as follows:

  • FilteredInputColor: Calculate the input color of the current frame and apply filtering.
  • PrevHighFrequencyColor: Calculate the color of the historical frames and apply filtering.
  • Calculate the blending weights of the input color and history color separately.
  • Blend them together and write them into the output history buffer.

Thread Mapping and Sample Position

image
  • Threads are dispatched to cover every pixel of the output history buffer's resolution.
  • Since the output history buffer has 4 times as many pixels as the input, every group of 4 output pixels samples the InputColorTexture at the same position; those 4 pixels in the output history buffer are then computed and updated.

It should be noted that this is just a simple input-output correspondence. The description here only involves the "position of the sampled center pixel".

In the actual implementation, pixels around the center points of both Input and Output are also sampled; we will analyze this further in the following sections.
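A small sketch of the center correspondence described above, assuming the output history buffer has twice the input resolution on each axis (4× the pixel count). The helper names are mine, not the shader's:

```cpp
struct IntPoint   { int X, Y; };
struct FloatPoint { float X, Y; };

// For an output-history pixel, find the input pixel whose center it samples.
// With a 2x upscale per axis, every 2x2 block of output pixels maps to the
// same input pixel center K.
IntPoint OutputToInputCenter(IntPoint OutputPixel)
{
    return { OutputPixel.X / 2, OutputPixel.Y / 2 };
}

// Sub-pixel offset of the output pixel center O relative to the sampled
// input pixel center K, in input-pixel units (each component is +/- 0.25).
FloatPoint OutputOffsetInInput(IntPoint OutputPixel)
{
    return { (OutputPixel.X + 0.5f) / 2.0f - (OutputPixel.X / 2 + 0.5f),
             (OutputPixel.Y + 0.5f) / 2.0f - (OutputPixel.Y / 2 + 0.5f) };
}
```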

Filter Input Color

TSR does not blend based on a single input color; it first analyzes and filters the center pixel and the surrounding pixels sampled from the input color.

Why? Please note that when we output the history buffer, the situation we actually face is this:

We need to output the result to the pixel location where O is located. What should the value of O be?

image
  • A very simple answer: directly output the value of K. If we do this, we simply duplicate each pixel of the original input into a 2×2 block; that is, all the green pixels covered by K take the value of K.
  • image
  • A smarter answer: we sample not only K but also the surrounding pixels (0–8), taking the position of O into account, so as to further improve the quality of our output at O.
    • A simple example is: bilinear interpolation can produce better results.
    • image
  • TSR considers more information than bilinear and trilinear interpolation, which we will explain in the following.

When we talk about Filters, it inevitably involves the following 2 questions:

  • Which pixels to sample?
  • How to calculate the weight for each pixel to blend?

Input Color Sample Pattern

In the author's test environment, 6 pixels are sampled for Input Color.

Why 6 pixels?

  • If it is a cross shape, there should be a total of 5 pixels: up, down, left, right, and center.
  • If it is a square shape, there should be 3 x 3 = 9 pixels.

Neither seems to match the quantity of 6 pixels.

The best way to understand it is: starting from the cross shape, TSR samples one additional pixel in the direction of the relative offset between the input center K and the output position O.

Please refer to the following image:

image
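Under that interpretation, here is a sketch of how the 6 sample offsets could be assembled: the 5 cross taps plus one extra diagonal tap toward the output position O. The ordering and names are illustrative, not the shader's.

```cpp
#include <array>

struct IntOffset { int X, Y; };

// Build the 6 input-color sample offsets around the center pixel K:
// the cross (center, left, right, up, down) plus one extra diagonal tap
// in the direction of the output position O relative to K.
std::array<IntOffset, 6> BuildInputSamplePattern(float OffsetToOutputX, float OffsetToOutputY)
{
    const int SignX = (OffsetToOutputX >= 0.0f) ? 1 : -1;
    const int SignY = (OffsetToOutputY >= 0.0f) ? 1 : -1;

    return {{
        { 0,  0},              // center K
        {-1,  0}, { 1,  0},    // left, right
        { 0, -1}, { 0,  1},    // up, down
        { SignX, SignY }       // extra tap toward the output position O
    }};
}
```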

Weight

The pixel blending weight of Input Color is composed of two parts: SampleSpatialWeight and ToneWeight.

Let's first focus on SampleSpatialWeight:

image

To repeat, our current calculation is: for a pixel near the center of InputColor sampling, how confident are we in blending it with History to influence the current InputColor?

ℹ️
In other words, how strong is the correlation between the surrounding pixels and the one being processed?

From the calculation process, this weight reflects the following rules:

  • If the current pixel is disoccluded or off-screen, the blending weight selects OffScreenInputToHistoryFactor. This value is calculated by NoiseFiltering, considering the noise intensity near the current pixel.
    • If the noise is low, meaning the local pixels are relatively smooth, then the correlation is strong, and we can blend this pixel as well.
    • If the noise is high, we need to lower the correlation of this pixel to avoid losing high-frequency details and causing blurring. Obviously, this comes at the cost of introducing some noise.
  • Otherwise, it transitions between OffScreenInputToHistoryFactor and InputToHistoryFactor. The correlation now depends on several parameters:
    • LowFrequencyRejection: from RejectShadingPass. Determines whether the pixels in this area should be blended based on the low-frequency information of the image. If the low-frequency changes dramatically in this area, this pixel should not be blended, and its correlation is likely low.
    • IsRefining: calculated by CoarseRejectedPrevWeight and CoarseRefiningPrevWeight. These two weights respectively represent the historical weights related to coarse pixels when rejecting and refining historical data. These two weights are directly related to the effectiveness of historical data.
      • If CoarseRejectedPrevWeight (the weight when rejecting historical data) is less than CoarseRefiningPrevWeight (the weight when refining historical data), this means that the correspondence between the current frame and the historical data is not very good, so it is more likely to reject historical data rather than refine it. This is to avoid using unreliable historical data to refine current pixels and avoid errors.
      • Conversely, if CoarseRejectedPrevWeight is greater than or equal to CoarseRefiningPrevWeight, it means that there is a good correspondence between the current pixel and the historical data, so it is more likely to refine with historical data.

By comparison, the ToneWeight calculation is simpler: it first transforms InputColor into the YCoCg color space to obtain Luma, then computes the weight using HdrWeightY.

$f(Luma) = \max\left(0.0001,\ \frac{1}{4 \cdot Luma + 4}\right)$, which, up to a constant factor, behaves like $\max\left(0.0001,\ \frac{1}{Luma + 1}\right)$.
If you read the source code directly, you may share my confusion: why is it + 4?
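As an illustration of the formula above (the function name HdrWeightY comes from the text; the constants follow the formula, and the YCoCg luma weights are the standard ones):

```cpp
#include <algorithm>

struct Color { float R, G, B; };

// Luma (Y) of the YCoCg transform of an RGB color.
float LumaYCoCg(Color C)
{
    return 0.25f * C.R + 0.5f * C.G + 0.25f * C.B;
}

// Tone weight: high-luma (HDR) samples get a smaller blending weight,
// following f(Luma) = max(0.0001, 1 / (4 * Luma + 4)).
float ToneWeight(Color C)
{
    const float Luma = LumaYCoCg(C);
    return std::max(0.0001f, 1.0f / (4.0f * Luma + 4.0f));
}
```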

Filter History Color

Why do we need to filter the History Color? Why not sample the History Buffer from the previous frame based on the output position O?

The answer is that the History Buffer from the previous frame may not match the current frame completely. There are situations where they don't match, such as:

  • Camera rotation
  • Movement of objects in the scene

Here, we'll focus on the situation of camera rotation: we need to calculate the correct sampling position based on the previous and current camera matrices.

image

So, assume the current sampling position O corresponds to position P in the previous frame. Note that P is a floating-point coordinate, because the historical position corresponding to O does not necessarily fall exactly on the center of a HistoryBuffer pixel.

The question is, how do we perform interpolation to obtain the value corresponding to P?

image

The simplest implementation is to find the Pixel range that P is in. A smoother approach is to perform bilinear interpolation on the surrounding pixels based on P's position.

However, TSR uses a more complex approach. It selects a cross-shaped pixel area with P at the center, totaling 5 pixels, and then uses the Catmull-Rom kernel for sampling.

image
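For reference, this is the 1D Catmull-Rom weight function that such a resampling is built on. The shader uses an optimized cross-shaped 5-tap variant; this sketch only shows the underlying kernel, where x is the distance (in history pixels) from the reconstruction point P to a sample.

```cpp
#include <cmath>

// Catmull-Rom weight for a sample at distance x (in pixels) from the
// reconstruction point P. Non-zero over |x| < 2; the negative lobes between
// 1 and 2 are what keep the history sharper than plain bilinear filtering.
float CatmullRomWeight(float x)
{
    x = std::fabs(x);
    if (x < 1.0f)
    {
        return 1.5f * x * x * x - 2.5f * x * x + 1.0f;
    }
    if (x < 2.0f)
    {
        return -0.5f * x * x * x + 2.5f * x * x - 4.0f * x + 2.0f;
    }
    return 0.0f;
}
```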

So at this point, we have analyzed the first half of the entire Filter process, which is the part of Filter History Info.

Next, the obtained PrevHighFrequencyColor is clamped using the Min and Max of the Input Color samples.

So how do we choose between the clamped and the unclamped version? This is determined by the History Clamp calculation on the right.

  • If the current area is disoccluded, it means that the historical information is not very reliable, and the Clamp version should be used.
  • If LowFrequencyRejection is low, which tends to reject this historical pixel, then the Clamp version should be used.
  • If none of these is true and the confidence is high, then the non-Clamp version should be used.
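A minimal sketch of the Min/Max clamp mentioned above, assuming InputMin and InputMax are the componentwise minimum and maximum of the gathered input-color samples; this is the classic way to pull history back toward the current frame's neighborhood when it has drifted:

```cpp
#include <algorithm>

struct Color { float R, G, B; };

// Clamp the reprojected history color to the min/max box spanned by the
// input-color samples gathered around the current pixel.
Color ClampHistoryToInputBox(Color History, Color InputMin, Color InputMax)
{
    return {
        std::clamp(History.R, InputMin.R, InputMax.R),
        std::clamp(History.G, InputMin.G, InputMax.G),
        std::clamp(History.B, InputMin.B, InputMax.B)
    };
}
```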

Blend Between Input and History

This calculation is very complex, and it's difficult for me to analyze all the content step by step for the readers.

Therefore, based on this figure, I will focus on one question: which factors determine the weight of History?

image

Tone Weight

The Tone Weight section is located in the bottom right corner of the previous image.

image

The rules for this part are similar to the previous formula: the higher the Luma, the lower the weight.

$f(Luma) = \max\left(0.0001,\ \frac{1}{4 \cdot Luma + 4}\right)$, which, up to a constant factor, behaves like $\max\left(0.0001,\ \frac{1}{Luma + 1}\right)$.

Speed-based Validity with Luma Clamping

First, MaxValidity acts as a threshold that reduces the historical weight. This is achieved by taking the Min with the PrevWeight calculated in the previous step.

image

In other words, if TSR determines that the current pixel position is not stable enough, it will tend to choose the input pixel InputColor.

image

How is "stability" defined? TSR calculates it from the velocity information of the current pixel: obviously, the faster the region corresponding to the current pixel moves, the less stable it is.

image

However, the increased reliance on Input Color obviously leads to flickering in unstable regions. To address this, TSR adds an additional step that compares the History (HighFrequencyColor) with the InputColor to identify high-contrast areas. For these areas, the MaxValidity value is increased to achieve greater stability.
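As a rough illustration (the constants and names below are mine, not TSR's), the behavior described above could be sketched like this: validity falls off with the pixel's screen-space velocity, and a high-contrast measure raises it back up to suppress flicker.

```cpp
#include <algorithm>
#include <cmath>

// Map a pixel's screen-space velocity (in pixels per frame) to a history
// validity factor in [0, 1]: fast-moving pixels trust the history less.
// HighContrast raises the cap again so unstable but high-contrast regions
// do not flicker by relying too heavily on the raw input color.
float ComputeMaxValidity(float VelocityPixels, float HighContrast)
{
    const float SpeedValidity = std::exp(-0.25f * VelocityPixels);      // assumed falloff
    const float ContrastBoost = std::clamp(HighContrast, 0.0f, 1.0f);   // 1 = very high contrast
    return std::max(SpeedValidity, 0.5f * ContrastBoost);               // illustrative combination
}
```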

History Weight

image

Let's focus on this area.

Several important control parameters are shown here:

  • ClampedPrevHistoryValidity from the historical MetaData Buffer: The validity information of the previous history is recorded here, pixel by pixel. Note that this is different from the validity judgment based on the current frame's velocity.
  • MinRejectionBlendFactor: The validity judgment obtained by comparing the low-frequency information of images.
  • CurrentWeight: Will be detailed later
  • CoarseCurrentContribution: The alternative solution when Current Weight is 0

Now let’s discuss CurrentWeight:

image

For KernelInputToHistoryAlignmentFactor:

  • When a pixel is off-screen or occluded, the value of KernelInputToHistoryAlignmentFactor will be maximal, i.e. 1.0. This may be because in these cases, the historical data is less valuable to the current pixel, so more weight needs to be given to the current frame.
  • When LowFrequencyRejection increases, KernelInputToHistoryLerp also increases, which causes KernelInputToHistoryAlignmentFactor to be closer to InputToHistoryFactor.
  • If some refinement operation is being performed (bIsRefining is true), then the value of KernelInputToHistoryAlignmentFactor will also be closer to InputToHistoryFactor.

Finally, two hyper-parameters from the constant buffer control all of this (a hedged sketch combining them follows the list below):

  • InputToHistoryFactor: participates in the calculation of KernelInputToHistoryAlignmentFactor; it balances how strongly the current frame's input data is relied upon when the historical data is unreliable.
  • HistoryHisteresis: overall control of the CurrentWeight result; it determines how quickly the historical information is updated.
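Putting the rules above together, here is a rough, illustrative reconstruction of how KernelInputToHistoryAlignmentFactor and CurrentWeight could be derived. The names follow the text above, but the exact formula in the shader may differ; treat this as a sketch of the described behavior, not the actual implementation.

```cpp
#include <algorithm>

// Illustrative reconstruction of the rules described above:
// - off-screen / disoccluded pixels force the factor to its maximum (1.0),
// - otherwise the factor moves toward InputToHistoryFactor as the
//   low-frequency rejection grows or while the history is being refined.
float ComputeAlignmentFactor(bool bOffScreenOrDisoccluded,
                             bool bIsRefining,
                             float LowFrequencyRejection,   // grows when low-frequency content rejects history
                             float InputToHistoryFactor)    // constant-buffer hyper-parameter
{
    if (bOffScreenOrDisoccluded)
    {
        return 1.0f;                                         // fully favor the current frame
    }

    float KernelInputToHistoryLerp = LowFrequencyRejection;  // increases with LowFrequencyRejection
    if (bIsRefining)
    {
        KernelInputToHistoryLerp = 1.0f;                     // refinement also pulls toward the factor
    }

    // Lerp from 1.0 (current-frame biased) toward InputToHistoryFactor.
    return 1.0f + (InputToHistoryFactor - 1.0f) * std::clamp(KernelInputToHistoryLerp, 0.0f, 1.0f);
}

// CurrentWeight is then scaled by the HistoryHisteresis hyper-parameter,
// which controls how quickly the history is allowed to update (assumption).
float ComputeCurrentWeight(float AlignmentFactor, float HistoryHisteresis)
{
    return AlignmentFactor * HistoryHisteresis;
}
```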

Warning: Implementation Changed

Please note that if you have read the Temporal AA presentation (PPT) for Unreal Engine 4, you may remember the YCoCg bounding box used to decide whether to reject historical information. This part of the code has been modified during the Unreal Engine 5 iterations, so I do not believe Unreal Engine still uses that approach.

image


Resolve History

How do we output the data of the higher-resolution History Buffer as an image at the final screen resolution?

The answer is simple: combine the history pixels covered by each output pixel's mapping region into a final output value.

So the question becomes: how to filter and combine these pixels?

A simple solution is to take the average value of these pixels.

However, TSR chooses a better solution: the Mitchell-Netravali filter.
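For reference, here is a sketch of the Mitchell-Netravali kernel with the common B = C = 1/3 parameters (whether TSR uses exactly these constants is not confirmed here). Compared with Catmull-Rom it trades a little sharpness for less ringing, which is a reasonable property when collapsing the double-resolution history down to the output resolution.

```cpp
#include <cmath>

// Mitchell-Netravali cubic filter weight at distance x (in pixels),
// using the classic B = C = 1/3 parameterization. Support is |x| < 2.
float MitchellNetravaliWeight(float x, float B = 1.0f / 3.0f, float C = 1.0f / 3.0f)
{
    x = std::fabs(x);
    if (x < 1.0f)
    {
        return ((12.0f - 9.0f * B - 6.0f * C) * x * x * x
              + (-18.0f + 12.0f * B + 6.0f * C) * x * x
              + (6.0f - 2.0f * B)) / 6.0f;
    }
    if (x < 2.0f)
    {
        return ((-B - 6.0f * C) * x * x * x
              + (6.0f * B + 30.0f * C) * x * x
              + (-12.0f * B - 48.0f * C) * x
              + (8.0f * B + 24.0f * C)) / 6.0f;
    }
    return 0.0f;
}
```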