Tuesday, October 31, 2017

Higher Quality Vertex Normals

I use the Oct16 format to encode my vertex normals; this format stores two 8-bit channels in an octahedral mapping.

Most of the time this was sufficient, but under certain conditions artifacts were visible, such as on the surface of a smoothly varying sphere rendered with triplanar texturing, whose blend weights are based on the normals.

Here is a visualization of the triplanar weights generated from the Oct16 normals.


There is a very obvious diamond pattern visible.
Even switching to Oct20 (10 bits per channel) does not completely solve this; the diamonds are much smaller, but they persist.

Oct16, but with custom scale/bias

Instead of adding bits, I decided to take advantage of the fact that most triangle patches only cover a limited range of world-space normals.

I track min/max per channel for the entire patch, then encode the normals so that the full range of bits is used.

Decoding in the shader requires custom scale and bias parameters per channel (4 floats for the two-channel Oct16).

There are no extra instructions, as a fixed scale of 2 and bias of -1 was already being used to transform from the [0,1] range to [-1,1].

The second image was encoded this way: the normals still use Oct16, so only 16 bits per normal, but with a custom scale/bias per patch.

In the majority of cases this provides several extra bits of precision, and in the worst case it degrades back to standard Oct16.
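To make the scheme concrete, here is a minimal Python sketch of the encode/decode path. This is my own illustrative version, not the engine code; `oct_encode` is the usual octahedral projection, and the min/max tracking and scale/bias follow the description above.

```python
def oct_encode(n):
    """Standard octahedral projection: unit length is not required,
    since x/(|x|+|y|+|z|) depends only on direction."""
    x, y, z = n
    s = abs(x) + abs(y) + abs(z)
    u, v = x / s, y / s
    if z < 0.0:  # fold the lower hemisphere outward
        u, v = ((1.0 - abs(v)) * (1.0 if u >= 0.0 else -1.0),
                (1.0 - abs(u)) * (1.0 if v >= 0.0 else -1.0))
    return (u, v)

def encode_patch(normals):
    """Quantize a patch of normals to 8 bits per channel, using the
    patch's own min/max range instead of the fixed [-1,1] range."""
    coords = [oct_encode(n) for n in normals]
    lo = [min(c[i] for c in coords) for i in (0, 1)]
    hi = [max(c[i] for c in coords) for i in (0, 1)]
    # the decoder applies: value = q/255 * scale + bias
    scale = [max(hi[i] - lo[i], 1e-8) for i in (0, 1)]
    bias = lo
    quantized = [tuple(round((c[i] - bias[i]) / scale[i] * 255)
                       for i in (0, 1)) for c in coords]
    return quantized, scale, bias

def decode(q, scale, bias):
    """Mirror of the shader-side decode, with per-patch scale/bias."""
    return tuple(q[i] / 255.0 * scale[i] + bias[i] for i in (0, 1))
```

For a patch whose normals span only a tenth of the octahedral range, this buys roughly three extra bits of precision per channel for free; a patch spanning the full range reduces to plain Oct16.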

Monday, October 30, 2017

Faster Triplanar Texturing

Here is a method I created to improve performance when using triplanar texturing.
I also think it looks better.

The standard triplanar texturing algorithm you will find in various places on the internet looks something like this:

float3 TriPlanarBlendWeightsStandard(float3 normal) {
    float3 blend_weights = abs(normal);
    blend_weights = blend_weights - 0.55;
    blend_weights = max(blend_weights, 0);
    float rcpBlend = 1.0 / (blend_weights.x + blend_weights.y + blend_weights.z);
    return blend_weights * rcpBlend;
}

If we visualize the blend zones this is what it looks like.

Red/Green/Blue represent one texture sample.

Yellow/pink/cyan represent two textures samples.

And in the white corner we need all three.

As we can see, the blend width is not constant: it is very small in the corner and quite wide along the axis-aligned edges.

The corner has barely any blending, as we have pushed our blend zone out as far as possible by subtracting 0.55 (anything over 1/sqrt(3), about 0.577, results in negative blend zones in the corner).

This results in needless texture sampling along aligned edges, stealing away our precious bandwidth.
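A scalar port of the standard weights makes the corner math concrete: at the corner direction each component of abs(normal) is 1/sqrt(3), about 0.577, so subtracting 0.55 leaves only about 0.027 of weight per axis before normalization. This is a Python sketch mirroring the HLSL above, not the shader itself.

```python
def triplanar_weights_standard(n):
    """Scalar mirror of TriPlanarBlendWeightsStandard:
    abs, subtract 0.55, clamp to 0, normalize to sum to 1."""
    w = [max(abs(c) - 0.55, 0.0) for c in n]
    total = sum(w)
    return [c / total for c in w]
```

At the corner direction all three weights survive (barely), while a fully axis-aligned normal collapses to a single texture sample, which matches the visualization.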

Constant Blend Width

What we want is something more like this-- constant blend width.

We do this by working in max-norm distance instead of Euclidean distance, as our planes are axis aligned anyway.

Here is the modified code that generates this:
float3 TriPlanarBlendWeightsConstantOverlap(float3 normal) {
    //float3 blend_weights = abs(normal);
    float3 blend_weights = normal * normal;
    float maxBlend = max(blend_weights.x, max(blend_weights.y, blend_weights.z));
    blend_weights = blend_weights - maxBlend * 0.9f;
    blend_weights = max(blend_weights, 0);
    float rcpBlend = 1.0 / (blend_weights.x + blend_weights.y + blend_weights.z);
    return blend_weights * rcpBlend;
}

You can adjust the blend width by changing the 0.9 scalar.
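The effect of that knob can be checked on the CPU. This is a Python sketch of the constant-overlap function, with the 0.9 scalar exposed as a parameter `k` (my naming, not the shader's):

```python
def triplanar_weights_constant(n, k=0.9):
    """Scalar mirror of TriPlanarBlendWeightsConstantOverlap.
    k is the blend-width knob (0.9 in the shader); values closer
    to 1.0 give a narrower blend zone."""
    w = [c * c for c in n]           # squaring replaces abs()
    m = max(w)
    w = [max(c - m * k, 0.0) for c in w]
    total = sum(w)
    return [c / total for c in w]
```

Because each weight is compared against the current maximum rather than a fixed threshold, the single-sample region scales with the dominant axis, which is what keeps the blend width roughly constant.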

On my GPU the constant version runs slightly faster, likely because there are fewer pixels where more than one texture sample is required.

I believe it also looks better, as there is less smearing along the axis-aligned edges.

Here is a shadertoy I created if you want to play with it.

Saturday, October 28, 2017

Barycentric Coordinates in Pixel Shader

I needed a way to perform smooth blending between per vertex materials.
Basically I needed barycentric coordinates plus access to each vertex's material in the pixel shader.

Geometry Shader method:  Assign the coordinates: (1,0,0), (0,1,0), (0,0,1) to the vertices of the triangle.  Also write the three materials to each vertex.
This method is easy to implement but has terrible performance on many cards, since it requires a geometry shader.  When enabled on my AMD card, FPS drops to half or less.

AMD AGS Driver extension:  AMD has a library called AGS_SDK which exposes driver extensions, one of these is direct access to barycentric coordinates in the pixel shader.  It also allows for direct access to any of the attributes from the 3 vertices that make up the triangle.
This method is very fast and works well if you have an AMD card that supports it.

float2 bary2d = AmdDxExtShaderIntrinsics_IjBarycentricCoords(AmdDxExtShaderIntrinsicsBarycentric_PerspCenter);
 //reconstruct the 3rd coordinate
float3 bary = float3(1.0 - bary2d.x - bary2d.y, bary2d.y, bary2d.x);

//extract materials
float m0 = AmdDxExtShaderIntrinsics_VertexParameterComponent(0, 1, 0);
float m1 = AmdDxExtShaderIntrinsics_VertexParameterComponent(1, 1, 0);
float m2 = AmdDxExtShaderIntrinsics_VertexParameterComponent(2, 1, 0);

Nvidia FastGeometryShader: Nvidia also has driver extensions (NVAPI), and one of these is the "fast geometry shader" for when you only need a subset of the features geometry shaders offer.
It should be possible to use this to pass down barycentric coordinates and materials, but I do not have an Nvidia card to test this on.

Domain Shader?: I haven't tried this method, but it might be possible to pass down barycentric coordinates from a domain shader.

Embed Into Vertex Data: Another option is to enlarge the vertex, and embed the barycentric coordinates and the 3 materials directly into it.
 This is probably a better fallback than the GS, although it does have the downside of reducing vertex reuse, since many vertices that were previously identical would now differ.
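The vertex-reuse cost is easy to quantify. This hypothetical Python sketch expands an indexed mesh the way the fallback describes, embedding barycentric coordinates and the triangle's three materials into every vertex:

```python
def expand_with_barycentrics(positions, triangles, materials):
    """Embed barycentric coords and the triangle's 3 materials into
    each vertex. Vertices that were shared between triangles become
    distinct, so indexed vertex reuse is lost."""
    bary = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
    out = []
    for tri in triangles:
        mats = tuple(materials[i] for i in tri)   # all 3 materials, per vertex
        for corner, vi in enumerate(tri):
            out.append((positions[vi], bary[corner], mats))
    return out
```

Two triangles sharing an edge go from 4 indexed vertices to 6 unique expanded vertices, since the shared pair now carries different barycentric/material data in each triangle.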

AVX2 Gather

Masked Gather vs Unmasked Gather

AVX2 has masked gather instructions (_mm_mask_i32gather_epi32, etc.). These take two additional parameters: a mask, and a vector of default values used for lanes where the mask is false.
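Per lane, the masked variants behave as follows. This is a scalar Python emulation of the semantics as I read them from the Intel docs, not the intrinsic itself:

```python
def masked_gather(src, base, indices, mask):
    """Per-lane semantics of a masked gather: load base[index] where
    the mask is set, otherwise pass through the src (default) lane."""
    return [base[i] if m else s for s, i, m in zip(src, indices, mask)]
```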

I was hoping masked gathers would be accelerated, such that when most of the lanes were masked off the gather would complete sooner, but this does not appear to be the case.

The performance of the two was very similar, though masked gathers were consistently slightly slower than unmasked ones.

 Load vs Gather vs Software Gather

To compare gather with load, I created a buffer and ran through it in linear order, summing the values.
I forced the gathers to load from the same indices the load was operating on: indices (0,1,2,3,4,5,6,7), incremented by 8 each loop iteration.

Software gather loaded each index using scalar loads instead of the hardware intrinsics.
Gather was generally ~1.2-1.5x faster than software gather.
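The software-gather baseline is just scalar loads from explicit indices. A Python sketch of that baseline (the original benchmark was written with AVX2 intrinsics, not Python):

```python
def software_gather_sum(buf, width=8):
    """Sum a buffer by 'gathering' each element with an individual
    scalar load from an explicit index list, 8 lanes at a time --
    the baseline the hardware gather was compared against."""
    total = 0
    for start in range(0, len(buf), width):
        indices = range(start, start + width)  # (0..7), then (8..15), ...
        total += sum(buf[i] for i in indices)
    return total
```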

Performance depended on the cache level the buffer fit into.

Buffer fits in L1: Load is ~10x faster than Gather.

Buffer fits in L2: Load is ~3.5x faster than Gather.

Buffer larger than L2: Load tapers off to ~2x faster than Gather.

This was all run on a Haswell; newer chips might perform differently.