Friday, November 27, 2020

Reverse engineering the rendering of The Witcher 3, part 21 - the painted world

This post is a part of the series "Reverse engineering the rendering of The Witcher 3".


If you have played the "Hearts of Stone" expansion pack you might recall one of the quests in which Geralt finds himself in the so-called "painted world". What makes it special is that it looks much different from the "normal" world.

The following two screenshots present the game before and after this effect is applied. This quest is the only opportunity to see this effect in the game.



As you can see, the effect makes the normally rendered scene look like a painting; I get some "The Starry Night" vibes from it.

Let's find out how this has been implemented.

It's just one fullscreen postprocessing pass, applied near the end of the postfx chain. 
I posted the assembly (418 lines!) here. In this post I'll focus on the general idea and will omit some game-specific nuances (for instance, the game performs a stencil test to exclude characters from the final effect).

Taking a look at the beginning of the assembly, there are a number of snippets similar to this one:
  12: frc r3.x, r3.x  
  13: mov r1.w, r0.w  
  14: dp2 r0.z, r0.zwzz, l(25.979601, 156.466003, 0.000000, 0.000000)  
  15: sincos r0.z, null, r0.z  
  16: mul r0.z, r0.z, l(43758.546875)  
This indicates the use of one of those 'random' shader functions; you've probably seen this one (or one of its variants) already. Note that the constants 25.9796 and 156.466 in the dp2 above are simply 2.0 * (12.9898, 78.233):
 float random(in float2 st)   
 {  
    return frac(sin(dot(st.xy, 2.0*float2(12.9898,78.233))) * 43758.5453123);  
 }  

The shader starts with these lines:
   0: mul r0.xy, v0.xyxx, cb12[23].zwzz  
   1: mul r0.y, r0.y, cb12[23].y  
   2: div r0.y, r0.y, cb12[23].x  
   3: mul r0.z, r0.y, l(50.000000)  
   4: round_ni r1.z, r0.z  
   5: mul r0.z, r0.x, l(50.000000)  
   6: round_ni r2.x, r0.z  
   7: mov r1.w, r2.x  

which we can write in HLSL like so:

   static const float TILES = 50;

   float2 pixelPos = Input.PositionH.xy; // SV_Position  
   float2 screenSize = viewportSize.xy;  
   float2 pixelSize = viewportInvSize;  
   
   float2 Texcoords = pixelPos * pixelSize;  
   
   float invAspectRatio = (Texcoords.y * screenSize.y) / screenSize.x;  
     
   // Divide the screen into tiles  
   float2 st;  
   st.x = floor(Texcoords.x * TILES);  
   st.y = floor(invAspectRatio * TILES);  

Using the standard postfx inputs (fullscreen resolution, pixel position, pixel size) we divide the screen into a number of 'tiles'. Along the X axis there are 50 of them in total, and about 28 along Y (50 * 1080/1920 ≈ 28 for a 16:9 aspect ratio). The tile values are integers, running 0, 1, 2, ... along the X axis and similarly along Y.



Note that most pixels share the same 'tile' value.

Once the 'tile' values are obtained, they serve as inputs to the random function. Calling
 float noise1 = random(st);  
   
 return noise1.xxxx;  
gives:

which visualizes the tiles nicely.

However, the shader does not call random just once. There are 16 calls to it:
   // get noise  
   float noise1 = random(st);  
   float noise2 = random(st + float2(0, 2));  
   float noise3 = random(st + float2(-1, -1));  
   float noise4 = random(st + float2(2, 2));  
   float noise5 = random(st + float2(2, 0));  
   float noise6 = random(st + float2(2, -1));  
   float noise7 = random(st + float2(-1, 0));  
   float noise8 = random(st + float2(1, 0));  
   float noise9 = random(st + float2(1, 2));  
   float noise10 = random(st + float2(-1, 2));  
   float noise11 = random(st + float2(2, 1));  
   float noise12 = random(st + float2(0, 1));  
   float noise13 = random(st + float2(-1, 1));  
   float noise14 = random(st + float2(0, -1));  
   float noise15 = random(st + float2(1, 1));  
   float noise16 = random(st + float2(1, -1));  

And a small diagram which visualizes the order of the calls:


The shader now performs a number of calculations to get relatively smooth noise (we're not interested in a blocky appearance!). 

First of all, each of the center 'blocks' (1, 8, 12, 15) gets a weight of 0.25, and we do something similar for its neighbours. It'll be clearer once visualized.

Let's start with block no. 1:
for the center block (1) we assign a weight of 0.25 (1/4), for the red ones a weight of 0.125 (1/8), and for the blue ones a weight of 0.0625 (1/16)

In the shader this can be implemented like so:
   float p1 = 0.0;  
   p1 += (noise3 + noise16 + noise13 + noise15) / 16.0;  
   p1 += (noise14 + noise12 + noise7 + noise8) / 8.0;  
   p1 += noise1 / 4.0;  

For block no. 8 it's the same idea:

   float p8 = 0.0;  
   p8 += (noise14 + noise6 + noise12 + noise11) / 16.0;  
   p8 += (noise16 + noise15 + noise1 + noise5) / 8.0;  
   p8 += noise8 / 4.0;  

And the same idea goes for blocks 12 and 15.
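For completeness, following the same pattern (corner neighbours weighted 1/16, direct neighbours 1/8, the center block 1/4) and matching the offsets of the noise calls above, p12 and p15 could be written roughly like this (my reconstruction rather than the game's exact assembly):

   float p12 = 0.0;  
   p12 += (noise7 + noise8 + noise10 + noise9) / 16.0;  
   p12 += (noise1 + noise2 + noise13 + noise15) / 8.0;  
   p12 += noise12 / 4.0;  

   float p15 = 0.0;  
   p15 += (noise1 + noise5 + noise2 + noise4) / 16.0;  
   p15 += (noise8 + noise9 + noise12 + noise11) / 8.0;  
   p15 += noise15 / 4.0;  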

Here's how p1 looks: 

Much smoother! Yet it's still blocky because most pixels share the same 'tile' value (float2 st).

To remove the blockiness, the shader performs a number of interpolations. Ideally we would like to have values in the [0-1] range within each tile. These can be calculated with:
   // Get frac parts for interpolation  
   float interp_x = TILES*Texcoords.x - st.x;  
   float interp_y = TILES*invAspectRatio - st.y;  

Here's interp_x:

and interp_y:

Having the values, interpolation goes as:
   // First, interpolate between the 'lower' ones: 12 and 15  
   float t_lower = lerp(p12, p15, interp_x);  
  
   // Then interpolate between the 'upper' ones: 1 and 8  
   float t_upper = lerp(p1, p8, interp_x);  
     
   // final interpolate  
   float t = lerp(t_upper, t_lower, interp_y);  

This gives us:

To get circular patterns on the screen, some basic trigonometry is applied:
   t *= 6.283185;  // ~2*PI, full phase
     
   float2 offsets;  
   offsets.x = sin(t);  
   offsets.y = cos(t);  

And here is a visualization of the offsets (with * 0.5 + 0.5 applied to squish them from the [-1;1] to the [0;1] range):


Having the offsets, we can perform the texture sampling (the game uses a point-clamp sampler):
   float2 minUv = float2(0.5, 0.5) * pixelSize;  
   float2 maxUv = (viewportSize.xy - float2(0.5, 0.5)) * pixelSize;  
   
   float3 color = 0.0.xxx;  
   [unroll] for (int i = -5; i <= 5; i++)  
   {  
     float2 Tex = Texcoords + i * pixelSize*offsets;  
     Tex = clamp(Tex, minUv, maxUv); 
       
     color += texture0.Sample( samplerPointClamp, Tex ).rgb;  
   }  
   color /= float(11);  
   
   return float4(color, 1.0);
And that's pretty much the idea. The code above is, of course, a simplified version. Again, the game does a little bit more (using the depth buffer to weaken the effect, using the stencil buffer to exclude characters, etc.).

I put the shader on Shadertoy if you want to experiment with it.

Monday, July 27, 2020

Reverse engineering the rendering of The Witcher 3, part 20 - light shafts

This post is a part of the series "Reverse engineering the rendering of The Witcher 3".


In this 20th (wow) post I am going to cover how the "Light Shafts" (aka crepuscular rays, God rays, sunbeams...) effect from "The Witcher 3: Wild Hunt" works. I focus not only on presenting the idea of the algorithm, but also on providing the game-specific implementation details, code snippets and a Shadertoy the reader can experiment with.

First, a brief showcase of the effect:

Before applying the effect
After applying the effect
In the fourth edition of "Real-Time Rendering" there is a high-level overview of the effect:
"First, a high-contrast correction curve is applied to the input image to isolate the unoccluded parts of the sun. Next, radial blurs, centered on the sun, are applied to the image (...) in a series, each operating on the output of the previous one (...). The final image of the flare is combined additively with the original scene rendering."

As it states, there are 4 steps involved:
- Isolating the unoccluded parts of the Sun
- Applying a high-contrast correction curve to extract the brightest pixels from the first step
- Performing a series of radial blurs
- Additive blending with the main scene color

I am going to cover all of them, providing extra details and example code where needed.

The description from the book serves to give a brief idea of how the effect works. In practice, it's a bit more complex. For example, in the implementation of The Witcher 3, the light shafts are strictly intertwined with bloom. You cannot have the light shafts working if the bloom is turned off. Maybe that's because (it's just my guess), at some later point, the light shafts are combined together with the (not fully blurred yet) bloom - therefore both effects can benefit from the blur.


1. Isolating the unoccluded parts of the Sun

The first step is simple - extracting sky pixels. It starts with a fullscreen floating-point LDR buffer which has already been processed with tonemapping (and later possibly with: depth of field, drunk effect and motion blur). The scene depth buffer is used as well.

Input 1 - fullscreen LDR color buffer
Input 2 - fullscreen depth buffer
Let's take a look at the pixel shader assembly:
 ps_5_0  
    dcl_globalFlags refactoringAllowed  
    dcl_constantbuffer cb12[23], immediateIndexed  
    dcl_resource_texture2d (float,float,float,float) t0  
    dcl_resource_texture2d (float,float,float,float) t1  
    dcl_input_ps_siv v0.xy, position  
    dcl_output o0.xyzw  
    dcl_temps 2  
   0: ftoi r0.xy, v0.xyxx  
   1: mov r0.zw, l(0, 0, 0, 0)  
   2: ld_indexable(texture2d)(float,float,float,float) r1.x, r0.xyww, t1.xyzw  
   3: ld_indexable(texture2d)(float,float,float,float) r0.xyz, r0.xyzw, t0.xyzw  
   4: mad r0.w, r1.x, cb12[22].x, cb12[22].y  
   5: ge r0.w, r0.w, l(1.0000)  
   6: and r0.w, r0.w, l(1.0000)  
   7: mul o0.xyz, r0.wwww, r0.xyzx  
   8: mov o0.w, l(1.0000)  
   9: ret  

This is a typical fullscreen pass. In this simple shader the input textures are: an R11G11B10_FLOAT color texture (t0) and depth (t1). The shader outputs sky pixels only - the ones which pass the (depth == 1.0) test.

The Witcher 3 uses reversed depth, so the far plane (the sky) is stored as 0.0 - black - in the depth texture. A flip (1.0 - depth) is performed at line 4 so the sky pixels can be tested against 1.0.
The output is a fullscreen R11G11B10_FLOAT texture.

Output. Note that sky pixels are the only ones left.

This shader can be written like so in HLSL:
 Texture2D TexColor : register (t0);  
 Texture2D TexDepth : register (t1);  
   
 float4 LightShafts_SkyOnlyPS( in float4 Position : SV_Position ) : SV_Target  
 {  
   int3 pos = int3(Position.xy, 0);  
   float depth = TexDepth.Load(pos).r;  
   float3 color = TexColor.Load(pos).rgb;  
     
   // Perform '1.0 - depth' flip.
   // g_depthScale = -1.0
   // g_depthBias = 1.0
   depth = depth * g_depthScale + g_depthBias;  
     
   bool isSky = (depth >= 1.0);  
   color *= isSky;  
     
   return float4(color, 1.0);  
 }  



2. Applying high-contrast correction curve

Having the sky pixels only, it's time to extract the brightest of them.
Input - fullscreen texture with sky pixels only
Output - the brightest pixels have been exposed
Note that while the output texture is in fact fullscreen (1920x1080), the result is rendered at half-res (960x540) into the upper-left region of the texture.

The shader used for extracting the brightest pixels for light shafts is the same as for the bloom - usually called a threshold pass. Here is its assembly:
 ps_5_0  
    dcl_globalFlags refactoringAllowed  
    dcl_constantbuffer cb3[10], immediateIndexed  
    dcl_sampler s0, mode_default  
    dcl_resource_texture2d (float,float,float,float) t0  
    dcl_input_ps_siv v0.xy, position  
    dcl_output o0.xyzw  
    dcl_temps 4  
   0: ftoi r0.xy, v0.xyxx  
   1: ftoi r0.zw, cb3[4].zzzw  
   2: iadd r0.xy, -r0.zwzz, r0.xyxx  
   3: itof r0.xy, r0.xyxx  
   4: add r0.xy, r0.xyxx, l(0.5000, 0.5000, 0.0000, 0.0000)  
   5: div r0.xy, r0.xyxx, cb3[3].xyxx  
   6: round_z r0.zw, cb3[4].xxxy  
   7: mad r0.xy, r0.xyxx, cb3[1].xyxx, r0.zwzz  
   8: div r0.xy, r0.xyxx, cb3[0].xyxx  
   9: div r1.xyzw, l(-1.0000, -1.0000, 1.0000, -1.0000), cb3[0].xyxy  
  10: add r1.xyzw, r0.xyxy, r1.xyzw  
  11: add r2.xyzw, cb3[5].xyzw, l(0.5000, 0.5000, 0.5000, 0.5000)  
  12: div r2.xyzw, r2.xyzw, cb3[0].xyxy  
  13: max r1.xyzw, r1.xyzw, r2.xyxy  
  14: min r1.xyzw, r2.zwzw, r1.xyzw  
  15: sample_l(texture2d)(float,float,float,float) r3.xyz, r1.xyxx, t0.xyzw, s0, l(0)  
  16: sample_l(texture2d)(float,float,float,float) r1.xyz, r1.zwzz, t0.xyzw, s0, l(0)  
  17: add r1.xyz, r1.xyzx, r3.xyzx  
  18: div r0.zw, l(0.0000, 0.0000, -1.0000, 1.0000), cb3[0].xxxy  
  19: add r0.zw, r0.zzzw, r0.xxxy  
  20: max r0.zw, r2.xxxy, r0.zzzw  
  21: min r0.zw, r2.zzzw, r0.zzzw  
  22: sample_l(texture2d)(float,float,float,float) r3.xyz, r0.zwzz, t0.xyzw, s0, l(0)  
  23: add r1.xyz, r1.xyzx, r3.xyzx  
  24: div r0.zw, l(1.0000, 1.0000, 1.0000, 1.0000), cb3[0].xxxy  
  25: add r0.xy, r0.zwzz, r0.xyxx  
  26: max r0.xy, r2.xyxx, r0.xyxx  
  27: min r0.xy, r2.zwzz, r0.xyxx  
  28: sample_l(texture2d)(float,float,float,float) r0.xyz, r0.xyxx, t0.xyzw, s0, l(0)  
  29: add r0.xyz, r0.xyzx, r1.xyzx  
  30: mul r0.xyz, r0.xyzx, l(0.2500, 0.2500, 0.2500, 0.0000)  
  31: dp3 r0.w, cb3[9].xyzx, r0.xyzx  
  32: add r1.x, r0.w, -cb3[8].x  
  33: max r0.w, r0.w, l(0.0001)  
  34: max r1.x, r1.x, l(0)  
  35: mul_sat r1.y, r1.x, cb3[8].y  
  36: mul r1.x, r1.y, r1.x  
  37: min r1.x, r1.x, cb3[7].x  
  38: div r0.w, r1.x, r0.w  
  39: mul r0.xyz, r0.wwww, r0.xyzx  
  40: mul o0.xyz, r0.xyzx, cb3[6].xxxx  
  41: mov o0.w, l(0)  
  42: ret  

Now I am going to explain how this shader works. I focus on the general idea here - the shader from the game allows the use of an arbitrary region of the input texture; I will omit these nuances for simplicity's sake.

First, let's find out a pixel location in [0-1] texture space.
 float2 textureUV = Input.SV_Position.xy;  
 textureUV += float2( 0.5, 0.5 );  
 textureUV /= outputTextureSize.xy;   // outputTextureSize = (960, 540)  


Then, the shader determines the minimum and maximum possible sampling coordinates (in this case the input texture has a size of 1920x1080):
 // Minimum (0,0) and Maximum (1919,1079) input pixel  
 float2 outputTextureMinPixel = cb3_v5.xy;  
 float2 outputTextureMaxPixel = cb3_v5.zw;  
     
 // Calculate min/max sampling coordinates  
 float2 minSamplingUV = outputTextureMinPixel + float2(0.5, 0.5);  
 float2 maxSamplingUV = outputTextureMaxPixel + float2(0.5, 0.5);  
 minSamplingUV /= inputTextureSize.xy; // inputTextureSize = (1920, 1080)  
 maxSamplingUV /= inputTextureSize.xy;  


Next, for each processed pixel, the shader samples 4 pixels in its neighborhood with respect to the just calculated min/max texture coordinates. After that an average color is calculated:
 static const float2 g_offsets[4] =  
 {  
  float2( -1, -1 ),  
  float2(  1, -1 ),  
  float2( -1,  1 ),  
  float2(  1,  1 )  
 };  
   
 float3 color = 0;  
 [unroll] for (int i = 0; i < 4; i++)  
 {  
   float2 offset = g_offsets[i] / inputTextureSize.xy; // (1920, 1080)  
   float2 samplingUv = textureUV + offset;  
   samplingUv = clamp( samplingUv, minSamplingUV, maxSamplingUV );  
   
   color += TexColor.SampleLevel( samplerLinearClamp, samplingUv, 0 ).rgb;  
 }  
   
 // calculate average color  
 color /= 4.0;  

The last (and the most important) bit is to determine the bloom amount.

This part starts by calculating the brightness of the just-calculated average color (note that up to this point in the shader all calculations were in LDR). It's possible to use various operators here, like the maximum component of the RGB color (see here). The Witcher 3 uses a set of RGB weights to determine brightness (assembly line 31):

brightness = dot(color.rgb, weights.rgb);

Using such an operator allows for some flexibility; for instance, for underwater scenes artists might want a bloom that uses only the green and blue channels. Also, working in LDR keeps the whole RGB range within [0-1], which is more intuitive while tweaking the weights.

Once brightness is obtained, the next parameter in the bloom calculations is threshold. It is used to cut off less bright pixels from further consideration:

contribution = max( 0.0, brightness - threshold );

The following assembly snippet covers calculation of bloom amount for a pixel:
  31: dp3 r0.w, cb3[9].xyzx, r0.xyzx  
  32: add r1.x, r0.w, -cb3[8].x  
  33: max r0.w, r0.w, l(0.0001)  
  34: max r1.x, r1.x, l(0)  
  35: mul_sat r1.y, r1.x, cb3[8].y  
  36: mul r1.x, r1.y, r1.x  
  37: min r1.x, r1.x, cb3[7].x  
  38: div r0.w, r1.x, r0.w  
  39: mul r0.xyz, r0.wwww, r0.xyzx  
  40: mul o0.xyz, r0.xyzx, cb3[6].xxxx  

In general, calculating the bloom amount in The Witcher 3 can be thought of as a function of four variables:
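Reconstructing it from the HLSL further below (and ignoring the inner saturate for readability), the function is approximately:

$$ f_{bloom}(b, t, n_{max}, n_{scale}) = \frac{\min\left(n_{max},\; n_{scale} \cdot \max(0,\, b - t)^{2}\right)}{\max(b,\; 0.0001)} $$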



where:
fbloom - amount of bloom for a pixel,
b - brightness,
t - threshold,
nmax - maximum allowed value of the numerator,
nscale - scale factor for the contribution parabola.

So far, I haven't mentioned the last two parameters, but they are quite simple. nscale is a scale factor for the previously introduced contribution squared, whereas nmax is the maximum allowed numerator value.

To illustrate the function, let's assume that we don't care about nmax and let's pick 0.15 for t and 0.8 for nscale. Therefore, the equation simplifies to the following function of one variable:
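Written out (again, based on the HLSL shown later), that simplified function is approximately:

$$ f(b) = \frac{0.8 \cdot \max(0,\, b - 0.15)^{2}}{b} $$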


The graph of this function looks like this:

To understand why "max(0)" is needed, here is graph of the same function but max(0, b-0.15) is replaced with just b-0.15:

The function has its zero at 0.15, which is the threshold value. With max, the threshold serves as a cutoff, so everything below it is zeroed. Without max, the function rises as brightness approaches zero, due to the 1/x factor.

Once bloomAmount is obtained, it's multiplied with the average color calculated earlier. However, there is one more multiplication at the end, by a "final boost" value (assembly line 40).
 float bloomAmount;  
   
 // ... calculate bloom amount ... //     
 color.rgb *= bloomAmount;  

 // Perform final boost
 color.rgb *= finalBoost;

 return float4(color.rgb, 0.0);


For reference, here are the values from the constant buffer for this particular scene:
The ones of interest are:

cb3_v9.rgb - bloom weights,
cb3_v8.x - brightness threshold,
cb3_v8.y - numerator scale,
cb3_v7.x - maximum numerator value,
cb3_v6.x - final boost (in every frame tested, this is 100)

With that in mind, everything related to the bloom threshold could be written in HLSL like so:
 // Inputs:  
 const float3 bloomWeights = params9.rgb;  
 const float bloomThreshold = params8.x;  
 const float bloomScale = params8.y;  
 const float bloomMaxNominator = params7.x;  
 const float finalBoost = params6.x;  
   
 float brightness = dot( bloomWeights, color );  
 float contribution = max( brightness - bloomThreshold, 0.0);  
   
 // Calculate amount of bloom.  
 // The Witcher 3 uses a hyperbola:   
 // f(x) = ( b*(x-a)^2 ) / x  
 // where:  
 //  
 // x - input brightness  
 // f(x) - amount of bloom for given brightness  
 // b = bloom scale  
 // a = bloom threshold    
 contribution *= saturate( contribution * bloomScale );   
 contribution = min(contribution, bloomMaxNominator); // nominator can't be too high  
   
 const float bloomAmount = contribution / max(brightness, 0.0001);        // avoid too small denominator  
   
 // Apply the bloom amount to color  
 color *= bloomAmount;  
   
 // Perform final boost (usually multiplying by 100)  
 color *= finalBoost;  
   
 return float4(color, 0);  

Here I present a few debug screenshots I obtained by changing selected bloom parameters:

Weights = float3(0.0, 0.0, 1.0)



Weights = float3(0.0, 0.0, 0.5)



Threshold = 1.0


Threshold = 1.5



3. Radial blurs

The crucial part of the light shafts in The Witcher 3 is implemented using two radial blur passes. Before I get to those, here is the input texture to the first one - obtained by further downsampling the result of the second step just covered:


The first radial blur pass renders at 1/16th of the scene resolution (480x270). For the considered frame it gives the following result, with a characteristic ring pattern:


The output from the first pass is the input to the second pass, which produces:

To gently introduce the reader to radial blur and where the rings come from, here is a Shadertoy with a basic implementation: here.

The reader is encouraged to take a look and investigate it. There are a few defines which control the blur; each one is described.

Why is the multi-pass approach used? The answer is quite simple - performance (obviously). Using more than one pass reduces the complexity of the algorithm significantly. The provided Shadertoy demonstrates it. In the multi-pass scenario three passes are used, with at most 8 texture fetches each. In total that's 8*3 = 24 samples. To achieve the same effect in just one pass, the shader would need to sample the texture 8^3 = 512 times per pixel.

As for the implementation in The Witcher 3, the first radial filter takes 16 sparsely placed samples that are subsequently refined in the second pass (it uses the same number of taps, but spaced more closely, to fill the gaps) - this gives a fairly long blur distance while keeping the perf (number of samples needed) reasonable. Usually the spacing in these sorts of filters is chosen so that the last pass fills all the gaps and avoids Moiré patterns.

Both passes use the same shader - which is almost 300 lines long, so I put it on pastebin for convenience. The shader assembly may seem very intimidating at first, but actually it's not that bad. A brief look at it and it's quite easy to distinguish two separate unrolled loops (16 steps) which account for ~75% of the whole assembly.

The idea of using radial blur for God rays is not new - it was used in the demoscene back in the past millennium [4]. A similar approach was apparently also used in The Witcher 2 [2]. Crysis used this technique as well, but the blur was performed on the depth buffer [1].

Let's start the analysis of the shader.

3.1. Finding required positions

The first step is to determine the current pixel position and the min and max possible sampling texture coordinates - it's quite similar to the bloom threshold shader.

Here is a part of assembly responsible for it:
   0: round_z r0.xyzw, cb3[3].zwxy  
   1: add r0.xy, -r0.xyxx, v0.xyxx  
   2: add r1.xy, r0.zwzz, l(0.500000, 0.500000, 0.000000, 0.000000)  
   3: add r1.zw, r0.zzzw, cb3[2].xxxy  
   4: add r1.zw, r1.zzzw, l(0.000000, 0.000000, -0.500000, -0.500000)  
   5: div r1.xyzw, r1.xyzw, cb3[0].xyxy  
   6: round_z r0.xy, r0.xyxx  
   7: add r0.xy, r0.zwzz, r0.xyxx  
   8: add r0.xy, r0.xyxx, l(0.500000, 0.500000, 0.000000, 0.000000)  
   9: div r0.xy, r0.xyxx, cb3[0].xyxx  


In HLSL this can be written like so:
 const float2 inputRegionSize =  cb3_v2.xy; // (480, 270)  
 const float2 inputTextureSize = cb3_v0.xy; // (960, 540)  
   
 // Determine min and max possible position to sample from input texture  
 float2 minPosition = float2(0.5, 0.5);   
 float2 maxPosition = inputRegionSize - float2(0.5, 0.5);  

 minPosition /= inputTextureSize;  
 maxPosition /= inputTextureSize;  
 
 // Determine position of currently processed pixel
 float2 pixelPosition = Input.SV_Position.xy;
 pixelPosition += float2(0.5, 0.5);  
 pixelPosition /= inputTextureSize;  


Next, let's find the center of the radial blur (in the case of The Witcher 3, the Sun) in texture space:
 const float2 sunScreenPos = cb3_v4.xy; // usually [0-1] range  
 float2 radialBlurCenter = lerp(minPosition, maxPosition, sunScreenPos); // (0.0-0.5) range!  

It's just a linear interpolation between the just-obtained min and max sampling positions. The question remains: how to find sunScreenPos on the CPU side? To answer it, let's consider a surface and the usual vectors used for lighting it:
Image credit: "Real-Time Rendering, 4th Edition" - provided under Fair Use

l is a normalized light vector pointing towards the Sun. It can be 'transformed' into a point by multiplying this vector by a large value, like the far plane distance, which yields a 'position' of the Sun in world space. Then it can be projected to screen space.
Here is a pseudocode for it:
 // Obtain world position of the Sun from light vector towards the Sun  
 Vec3 sunPos_W = lightVectorTowardsSun.Normalize() * camera->GetFarPlane();  
   
 // Position of the Sun in clip space  
 Vec3 sunPos_C = camera->GetMatrixViewProj().TransformPoint(sunPos_W);  
   
 // viewport transform  
 // https://docs.microsoft.com/en-us/windows/win32/direct3d11/d3d10-graphics-programming-guide-rasterizer-stage-getting-started  
 Vec2 sunPos2D;  
 sunPos2D.x = (1.f + sunPos_C.x) * (float) GetBackbufferWidth() / 2.f;  
 sunPos2D.y = (1.f - sunPos_C.y) * (float) GetBackbufferHeight() / 2.f;  
   
 sunPos2D.x /= (float) GetBackbufferWidth();  
 sunPos2D.y /= (float) GetBackbufferHeight();  

 // submit to shader
 LightShaftsShader->SetSunPosition(sunPos2D);

Since this effect comes from the Sun / the Moon only, there is room for optimization: by checking the dot product of the light vector and the camera's 'look' vector, processing of the light shafts can be skipped if the camera is facing away from the Sun.


3.2. Checking if a pixel qualifies for the radial filter

The next chunk of assembly determines if the radial filter should be applied at all (if not, black is returned immediately):
  12: div r3.x, cb3[0].x, cb3[0].y  
  13: add r2.zw, r0.xxxy, -r2.xxxy  
  14: mov r3.y, l(1.0000)  
  15: mul r2.zw, r2.zzzw, r3.xxxy  
  16: dp2 r0.z, r2.zwzz, r2.zwzz  
  17: sqrt r0.z, r0.z  
  18: div r0.z, r0.z, r0.w  
  19: div_sat r0.z, r0.z, cb3[5].x  
  20: lt r2.z, r0.z, l(1.0000)  
  21: if_nz r2.z  
      ...  
  284: else  
  285:  mov r0.xyz, l(0, 0, 0, 0)  
  286: endif  
  287: mov o0.xyz, r0.xyzx  
  288: mov o0.w, l(1.0000)  
  289: ret  

This step starts by calculating the distance from the radial blur center to the currently processed pixel (lines 16-17).

Here is a visualization of the calculated length:

There is a problem with the length estimated this way, though: the shader expects a circular pattern. To fix it, the vector needs to be 'squished' by the aspect ratio of the texture (line 15). Doing so fixes the problem:
Taking the aspect ratio into account gives such a circle regardless of the actual aspect ratio, so it's easy to find pixels which are outside the desired radius. In this shader the length < 1.0 condition is the main factor deciding whether a pixel qualifies for the radial filter or not.

Before the condition check there are two more things to look at, though.

First, the length is estimated in the area from minPosition to maxPosition, roughly from (0, 0) to (0.5, 0.5). What happens now is a remapping of the length as if it were calculated in the (0, 0) to (1, 1) range - see line 18. It is done by dividing by (maxPosition.y - minPosition.y), which is pretty much multiplying by two.

Second, let's introduce a parameter which scales step length - stepScaleFactor, by which the length is divided - line 19.

See the following screenshots, which present how stepScaleFactor affects the output of the first radial filter. The red area indicates pixels which fail the length < 1.0 condition.

stepScaleFactor = 1.0


stepScaleFactor = 1.5


stepScaleFactor = 1.75


stepScaleFactor = 2.0



Example HLSL snippet:
 const float stepScaleFactor = cb3_v5.x;  


 float2 squishParams = float2(inputTextureSize.x / inputTextureSize.y, 1.0);  
 float2 sunToPixel = pixelPosition - radialBlurCenter;  
 sunToPixel *= squishParams;  
   
 float dist = length( sunToPixel ); 
   
 float y_diff = maxPosition.y - minPosition.y; // about 0.5  
 dist /= y_diff;   // Rescale as if length was calculated in [0-1] range (basically, mult by 2)  
 dist = saturate( dist / stepScaleFactor );  
   
 float3 finalColor = 0.0.xxx;  
 if (dist < 1.0)
 {
    ...
 }  


3.3. Finding step vector and number of steps

In general, the third step can be described like this: Calculate step vectors and number of steps for both passes, then select one set based on the index of the current one.

It starts just as in the Shadertoy:
 if (dist < 1.0)  
 {  
   float2 pixelToSun = normalize( radialBlurCenter - pixelPosition );  
   float maxDist = length( radialBlurCenter - pixelPosition );  
   
   // Step vectors for the first stage (step_0) and the second one (step_1)  
   float2 step_0 = pixelToSun;  
   float2 step_1 = pixelToSun;  

Having a normalized pixel-to-center vector, a single step is now calculated. As mentioned earlier, The Witcher 3 uses 16 samples. For the second pass the taps are placed more closely to fill the gaps left by the first one:
 // Length of a single step 
 const float stepLength_0 = 1.0 / 16.0;
 const float stepLength_1 = (1.0 / 16.0) * (1.0 / 16.0);

 step_0 *= stepLength_0;  
 step_1 *= stepLength_1; 

Now the step vectors are adjusted with currently used parameters:

First, stepScaleFactor is taken into account - the larger the factor, the longer a single step is.

Second, because pixelToSun was normalized earlier, it has to be remapped back to the (0.0 - 0.5) range since the upper-left quadrant of the input texture will be sampled.

 // Scale step vectors  
 step_0 *= stepScaleFactor;  
 step_1 *= stepScaleFactor;  
   
 // The step vectors have been normalized earlier (pixelToSun),   
 // so scale them down so they match resolution   
 // of input texture region  
 step_0 *= y_diff;  
 step_1 *= y_diff;  

Another important thing to be aware of is the aspect ratio.

First, the previously normalized pixelToSun vector is stretched horizontally, which yields a vector with length >= 1. Then the step vectors are divided by this new length:
 // Take aspect ratio into account  
 // Note - pixelToSun is at this point normalized, so its length = 1  
 pixelToSun *= squishParams;  // squishParams = (~1.77777777, 1)
 
 // Adjust step vectors  
 step_0 /= length(pixelToSun);  
 step_1 /= length(pixelToSun);  

Here is a debug screenshot which shows the result of the first radial filter without taking aspect ratio into account while determining step vectors:

And here it is with the aspect ratio taken into account:

Once the step vectors are obtained, finding a number of steps for blur is easy:
 int CalculateRadialBlurSteps(float fMaxDist, float fStepLength)  
 {  
    float fNumSteps = fMaxDist / fStepLength;  
    fNumSteps += 1.0;  // add 1 for tap #0 (just sampling the input uv)
     
    return (int)fNumSteps;  
 }

 // * Calculate number of steps ("rings")  
 int numSteps_0 = CalculateRadialBlurSteps( maxDist, length(step_0) );  
 int numSteps_1 = CalculateRadialBlurSteps( maxDist, length(step_1) );  


The index of the current radial blur pass is provided as a constant buffer value (0 for the first pass, 1 for the second one). To select the proper step vector and number of steps, a simple conditional is performed:
 // Select radial blur params for this pass  
 const int iRadialBlurPass = (int)cb3_v4.z;  
   
 float2 dir;  
 int numSteps;  
   
 if (iRadialBlurPass == 0)  
 {  
    dir = step_0;  
    numSteps = numSteps_0;  
 }  
 else if (iRadialBlurPass == 1)  
 {  
    dir = step_1;  
    numSteps = numSteps_1;  
 }  


So far, I have assumed that the radial center is within the texture, but this is not always the case. There could be a situation where the Sun is off screen yet some remnants of the shafts are still visible.

Consider a situation where the radial blur center is outside the [0-1] range:


In the (rough) example above there are 8 steps in total. Note that the last two are outside the input texture region, so there is no point in sampling them since there is no data there. To reduce the number of texture fetches, the calculations can be clamped with respect to minPosition and maxPosition, and this is exactly what this shader does.

The idea is to find the proper distances to the texture edges depending on the direction of the just-selected step vector:



To obtain toMin:
toMin = pixelPosition - minPosition

To obtain toMax:
toMax = maxPosition - pixelPosition

To determine which distances will be used for clamping, the following test is performed: stepVector > (0, 0), which means 'towards maxPosition'. Note that this test is done separately for the X and Y axes, so the final result can be, as in the example above, (toMaxX, toMinY).

At the very end the result is divided by abs(stepVector), which yields the maximum number of steps for both the X and Y axes. The lesser one is taken.

This clamping is performed by the following assembly snippet:
  48:  lt r3.xy, l(0, 0, 0, 0), r2.xyxx  
  49:  add r1.zw, -r0.xxxy, r1.zzzw  
  50:  add r1.xy, -r1.xyxx, r0.xyxx  
  51:  movc r1.xy, r3.xyxx, r1.zwzz, r1.xyxx  
  52:  div r1.xy, r1.xyxx, abs(r2.xyxx)  
  53:  add r1.xy, r1.xyxx, l(1.000000, 1.000000, 0.000000, 0.000000)  
  54:  ftoi r1.xy, r1.xyxx  
  55:  imin r1.x, r1.x, r2.z  
  56:  imin r1.x, r1.y, r1.x  

To provide some context, the values in the registers at line 48 are:
r2.xy - just selected dir
r0.xy - pixelPosition,
r1.xy - minPosition,
r1.zw - maxPosition

The equivalent HLSL can be written similarly to this:
 int CalculateMaxRadialSteps_ScreenSpace(float2 vStepVector, float2 pixelPosition, float2 minTexCoord, float2 maxTexCoord)  
 {  
   // Determine the minimum required number of steps to sample.  
   // This is used for cases when radial blur center is outside of screen  
   // Think of it as clamping number of steps using texture space  
     
   // The step vector is "pixel to center" - find its direction towards texture space  
   float2 direction = (vStepVector > float2(0,0));  
     
   // Distances from current pixel position to min and max texture coordinates.  
   // Note all of them are always positive - think of them as distances, not as vectors.  
   float2 distancesToMax = maxTexCoord - pixelPosition;
   float2 distancesToMin = pixelPosition - minTexCoord;  
     
   // Select proper distances depending on direction of the step vector.  
   // Note that we can select, for instance, X from 'distancesToMax' and Y from 'distancesFromMin'.  
   float2 selectedDistances = direction ? distancesToMax : distancesToMin;  
     
   // Calculate number of steps in texture space  
   float2 fResults = selectedDistances / abs(vStepVector);  
     
   // Add "1" for the first sample  
   fResults += float2(1.0, 1.0);  
     
   int2 iResults = (int2) fResults;     
   return min(iResults.x, iResults.y);  
 }  

To obtain the final number of taps another min is performed:
 // Determine the minimum required number of steps  
 // This is used for cases when radial blur center is outside of screen  
 int step_UV = CalculateMaxRadialSteps_ScreenSpace(dir, pixelPosition, minPosition, maxPosition);  
   
 // Final number of steps  
 numSteps = min( numSteps, step_UV );  
   



3.4. Sampling texture and final touches

This is (really) the last step of the radial blur. Again, I start by describing the big picture, then I will explain the details.

At this point what the shader has is a step vector, the final number of taps and the index of the radial blur pass.

Let's take a look at assembly which covers the general flow of this step:
 57:  if_nz r0.w  
     ... sampling loop for the second pass ...  
 108: else  
     ... sampling loop for the first pass  ...  
 273: endif  
 274: mul r1.xyz, r1.yzwy, l(0.062500, 0.062500, 0.062500, 0.000000)  
 275: add r0.x, -r0.z, l(1.000000)  
 276: log r0.x, r0.x  
 277: mul r0.x, r0.x, cb3[5].y  
 278: exp r0.x, r0.x  
 279: mul r0.y, r0.z, r0.z  
 280: mad r0.y, r0.y, cb3[5].z, l(1.000000)  
 281: div r0.x, r0.x, r0.y  
 282: mul r0.xyz, r0.xxxx, r1.xyzx  
 283: movc r0.xyz, r0.wwww, r0.xyzx, r1.xyzx  

Depending on the index of the pass, a different loop is executed, which puts the calculated color into the r1.xyz registers.

Lines 274-283 are common to both passes, and I will start by describing this part.

Common

Regardless of the current pass, an average color is calculated at line 274 - it is just a division by 16. For the first radial blur pass, this color is returned immediately since r0.w is zero (line 283).
However, for the second pass an extra fadeout / falloff parameter is calculated and multiplied with the color.

The falloff parameter, whose range is [0-1], is used to decrease the brightness of pixels the further they are from the radial center. It's calculated from dist, and this is probably why the dist < 1.0 test was performed earlier. There are two parameters which control how fast the falloff from the center is: exponent and attenuation (that's how I named them).

It's quite simple to write it in HLSL. Again, please keep in mind that color*falloffFactor multiplication occurs at the end of the second pass:
 const float FinalFalloffExponent = cb3_v5.y;  
 const float FinalAttenuation = cb3_v5.z;  
   
 float nominator = pow(1.0 - dist, FinalFalloffExponent);  
 float denominator = FinalAttenuation*dist*dist + 1;  
   
 float falloffFactor = nominator / denominator;  

This is how falloffFactor looks in most cases:

And this is how the final frame would look if the falloff wasn't used:



The first loop
All that's left to cover are the sampling loops. I will start with the first pass. Here's a snippet from it:
  150:   ilt r5.xyzw, l(4, 5, 6, 7), r1.xxxx  
  151:   and r5.xyzw, r5.xyzw, l(1.000000, 1.000000, 1.000000, 1.000000)  
  152:   sample_l(texture2d)(float,float,float,float) r4.xyz, r4.zwzz, t0.xyzw, s0, l(0)  
  153:   dp3 r2.z, l(0.212600, 0.715200, 0.072200, 0.000000), r4.xyzx  
  154:   add r2.w, r2.z, -cb3[7].x  
  155:   max r2.zw, r2.zzzw, l(0.000000, 0.000000, 0.000100, 0.000000)  
  156:   mul_sat r3.w, r2.w, cb3[7].y  
  157:   mul r2.w, r2.w, r3.w  
  158:   div r2.z, r2.w, r2.z  
  159:   mul r4.xyz, r2.zzzz, r4.xyzx  
  160:   mad r3.xyz, r5.xxxx, r4.xyzx, r3.xyzx  

It's important to remember that this is an unrolled loop, so it will always execute 16 times. To find out if the current step is going to be used, the shader checks if the desired number of taps (r1.x) is greater than the current step number.

Obtaining sample position is the same as in the Shadertoy:
 if (iRadialBlurPass == 0) 
 {
    [unroll]  
    for (int step = 0; step < 16; step++)  
    {              
        bool doStep = (numSteps > step);  
        
        float2 samplingUV = pixelPosition + dir*step;  
        float3 color = texture0.SampleLevel( samplerLinearClamp, samplingUV, 0 ).rgb;  

The interesting aspect of the first loop is that it uses a luminance threshold to darken less bright pixels. The idea is the same as for the bloom threshold described earlier; it just uses standard RGB luminance weights, and different cutoff and scale factors are provided:
   static const float3 LUMINANCE_RGB = float3(0.2126, 0.7152, 0.0722);
   
   ...

   const float threshold = cb3_v7.x;  
   const float scale = cb3_v7.y;  
   
   float pixelBrightness = dot(LUMINANCE_RGB, color);  
   
   float contribution = max(pixelBrightness - threshold, 0.0);  
   contribution *= saturate(contribution * scale);  
   
   float amount = contribution / max(pixelBrightness, 0.0001);  
   
   color *= amount;         
   
   // Add (or not) sample  
   finalColor_radialBlur += doStep*color;  
 } // if (iRadialBlurPass == 0)  

At the end of the loop a sample is added or discarded, depending on doStep boolean.


And a quick demonstration of how the luminance threshold affects the first blur.

From this frame (threshold = ~0.77):

Threshold = 0

Threshold = 1.25

Threshold = 2


The second loop
The loop for the second pass is much simpler - again, it's just like in the Shadertoy provided:
 if (iRadialBlurPass == 1)  
 {  
   [unroll]
   for (int step = 0; step < 16; step++)  
   {  
     bool doStep = (numSteps > step);  
       
     float2 samplingUV = pixelPosition + dir*step;  
     float3 color = doStep*texture0.SampleLevel( samplerLinearClamp, samplingUV, 0 ).rgb;  
   
     finalColor_radialBlur += color;  
   }  
     
   const float3 secondPassColor = cb3_v6.rgb;  
   finalColor_radialBlur *= secondPassColor;  
 }  

After all 16 taps in the second pass are obtained, the accumulated result is multiplied by a special secondPassColor value from the constant buffer which serves as a tint for the shafts - it's artist-driven.
For instance, during a sunset secondPassColor will represent a more reddish or orangish color. For this particular scene, it's:

float3(740, 452, 379)

Playing with this value allows one, for instance, to change the light shafts' color to a more blueish one:



As for the second pass, there are a few things worth noting. First, extra operations are performed - multiplying by secondPassColor and then, after dividing by 16, the falloff parameter is taken into account.

Second, the pass uses a blend state with the Maximum operator for color. The render target bound at this point contains the downscaled full scene color (sky + objects), produced using the bloom threshold shader I explained earlier.

Render target before the second radial blur pass:

Render target after the second radial blur pass (note: output from radial blur shader has been processed with simple Reinhard tonemapping operator so brighter HDR scene pixels can actually be noticed):


4. Blending

The final blending just uses an additive blend state with the original scene rendering, so I think no extra explanation is necessary here except for a few notes:

- The fullscreen position needs to be remapped so it matches the upper-left quadrant of the half-res texture,
- The light shafts + the bloom are modulated at this point with a lens dirt texture - which not everyone likes (a rough sketch of the combine follows below).
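Below is a minimal sketch of how such a combine pass might look, assuming the shafts + bloom live in the upper-left quadrant of a half-res texture and that the lens dirt is applied as a simple multiply; all names and registers here are hypothetical, and the actual game shader certainly does more:

 Texture2D TexShaftsAndBloom : register (t0); // half-res, result in the upper-left quadrant  
 Texture2D TexLensDirt       : register (t1); // hypothetical lens dirt texture  
   
 SamplerState samplerLinearClamp : register (s0);  
   
 float4 LightShafts_CombinePS( in float4 Position : SV_Position,  
                               in float2 Texcoords : TEXCOORD0 ) : SV_Target  
 {  
    // Remap the fullscreen UV so it lands in the upper-left quadrant  
    float2 quadrantUV = Texcoords * 0.5;  
   
    float3 shaftsAndBloom = TexShaftsAndBloom.SampleLevel( samplerLinearClamp, quadrantUV, 0 ).rgb;  
    float3 lensDirt = TexLensDirt.Sample( samplerLinearClamp, Texcoords ).rgb;  
   
    // Modulate with the lens dirt; the additive blend state (ONE, ONE)  
    // adds the result on top of the original scene color  
    return float4( shaftsAndBloom * lensDirt, 0.0 );  
 }  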


Summary

Four steps - some of them trivial, some of them more complex (looking at you, radial blur) - and this is the end of this journey! Hopefully the light shafts from The Witcher 3 are less mysterious right now.

Because the effect is completely screen-based, its obvious weakness is the lack of light shafts when not looking at the Sun. Nowadays it is probably more desirable to look into volumetric solutions, like, for instance, the one from Red Dead Redemption 2 [3].

Below, there is a short list of references which I think the reader may find useful.

I have covered a lot of ground here. Nevertheless, I do hope you had as much fun as I did.


Acknowledgments

I would like to thank Michał Iwanicki and Eric Haines for their valuable suggestions and feedback.


References

[1] Tiago Sousa, "Crysis Next Gen Effects". GDC 2008. slides

[2] Bartłomiej Wroński, "Volumetric fog: Unified compute shader-based solution to atmospheric scattering". Advances in Real-Time Rendering in Games course, SIGGRAPH 2014. slides course page

[3] Fabian Bauer, "Creating the Atmospheric World of Red Dead Redemption 2: A Complete and Integrated Solution". Advances in Real-Time Rendering in Games course, SIGGRAPH 2019. slides course page

[4] Kenny Mitchell, "Volumetric Light Scattering as a Post-Process". GPU Gems 3, chapter 13. read

[5] Masaki Kawase, "Frame Buffer Postprocessing Effects in double-S. T. E. A. L (Wreckless)". GDC 2003 (note: this one was not referenced in the post but I still find it worth a check)