Monday, November 27, 2017

GPU Texture Format for PBR

Visuals with PBR.  


Objective: encode a full PBR set of channels into as few bits as possible, with good performance.

From testing, it appears that each additional texture adds significant cost: using 3 textures to encode the PBR signal at 2 bytes per texel is much more costly than using 2 textures at the same overall number of bytes per texel.

The PBR system I'm using has 8 channels:

3 base color
2 normal 
1 roughness
1 metallic
1 AO

If we attempt to store the normal in BC5, a two-channel format designed specifically for tangent space normals, we have 6 channels remaining and cannot fit them into a single texture, as no BC format supports more than 4 channels.
So we cannot use BC5.

There are two good options I've found instead, both using the same layout: two textures, each with 4 channels.
On anything D3D11 and newer, BC7 can be used.
For pre-D3D11 systems, BC3 textures can be used instead. 

The normal will be split across the two alpha channels, which should help preserve its precision.

Texture1: RGB = base color, A = normal.x
Texture2: RGB = (ao, roughness, metallic), A = normal.y

*Texture1 can be safely set to SRGB, as both BC3 and BC7 treat the alpha channel as linear.
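
To make the layout concrete, here is a minimal C++ sketch of the decode step, mirroring what the shader does after sampling the two textures; the names are illustrative. The normal's z component is reconstructed from x and y:

#include <cmath>
#include <algorithm>

struct Float4 { float x, y, z, w; };

struct PBRTexel {
    float baseColor[3];
    float normal[3];
    float roughness, metallic, ao;
};

// t1/t2 are the filtered results of sampling Texture1 and Texture2.
PBRTexel DecodePBR(Float4 t1, Float4 t2) {
    PBRTexel p;
    p.baseColor[0] = t1.x; p.baseColor[1] = t1.y; p.baseColor[2] = t1.z;
    p.ao = t2.x; p.roughness = t2.y; p.metallic = t2.z;
    // Normal x/y live in the two alpha channels; remap [0,1] -> [-1,1].
    float nx = t1.w * 2.0f - 1.0f;
    float ny = t2.w * 2.0f - 1.0f;
    // Reconstruct z, assuming the tangent space normal points out of the surface.
    float nz = std::sqrt(std::max(0.0f, 1.0f - nx * nx - ny * ny));
    p.normal[0] = nx; p.normal[1] = ny; p.normal[2] = nz;
    return p;
}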

Uncompressed signal: 8 bytes per texel
Compressed: 2 bytes per texel in both BC3 and BC7 formats

Encoding speed:

Using AMD Compressonator, BC3 is fast to encode; even with quality set high it churns through textures fairly quickly.

Another encoder I tested was Crunch, a BC1/BC3 compressor that applies a lossy entropy reduction algorithm on top of the lossy block compression algorithm; this enables crunched BC1/BC3 files to compress much smaller on disk.
I decided not to use it because the compressor was very slow, and I feel that BC1 already looks less than stellar (the endpoints are only 5:6:5); throw in even more artifacts from Crunch and the textures just didn't look very good.

AMD Compressonator's BC7 encoding is not nearly as fast as its BC3.
This is understandable as the format is vastly more complex.

With the quality set to low, it still takes much longer than BC3 at high quality. 

BC Format Impact on Rendering
There is no observable difference in rendering performance between BC3 and BC7 on my AMD 280x.  
Both are observably faster than uncompressed, which is not surprising given that uncompressed is 4x larger.

BC3 vs BC7 Visual Quality: 

I have only run BC7 high quality on a few images; I'd probably have to run it overnight and then some to generate high quality BC7 for everything.

 Comparing low quality BC7 vs high quality BC3:

BC3's RGB part (identical to BC1) can only encode 4 possible colors in each 4x4 block; BC7 is far less limited.

For noisy images the difference isn't all that noticeable, but if you look closely BC7 generally has slightly more detail.

For anything with smooth gradients BC7 is clearly superior.


BC3 has dedicated 8-bit endpoints and 3-bit indices for the alpha channel, while BC7 may or may not have dedicated indices for alpha, as this is chosen on a per-block basis.
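
To make that concrete, here is a minimal C++ sketch of how a BC3 alpha block decodes; the function name is mine, but the layout (two 8-bit endpoints followed by sixteen 3-bit indices) is the standard one:

#include <cstdint>

// Decodes the 8-byte alpha block of a BC3 texture into 16 alpha values,
// one per texel of the 4x4 block.
void DecodeBC3AlphaBlock(const uint8_t block[8], uint8_t out[16]) {
    uint8_t a0 = block[0], a1 = block[1];
    uint8_t palette[8];
    palette[0] = a0;
    palette[1] = a1;
    if (a0 > a1) {
        for (int i = 1; i < 7; ++i)   // 6 interpolated values
            palette[1 + i] = (uint8_t)(((7 - i) * a0 + i * a1) / 7);
    } else {
        for (int i = 1; i < 5; ++i)   // 4 interpolated values...
            palette[1 + i] = (uint8_t)(((5 - i) * a0 + i * a1) / 5);
        palette[6] = 0;               // ...plus explicit 0 and 255
        palette[7] = 255;
    }
    uint64_t bits = 0;
    for (int i = 0; i < 6; ++i)       // 48 bits of 3-bit indices
        bits |= (uint64_t)block[2 + i] << (8 * i);
    for (int i = 0; i < 16; ++i)
        out[i] = palette[(bits >> (3 * i)) & 7];
}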

There is no obvious difference in the normals, but when I zoom in I can occasionally spot areas where BC3 appears to have done a better job. This is rare, though, and the overall improvement in the other channels outweighs this small loss. Running BC7 at high quality may change this as well.

Size on Disk:
Both BC3 and BC7 are 8 bits per pixel.
When further compressed on disk, in this case with zstd, the BC7 files are generally about 1-2% smaller.

I tried lzham (an LZMA variant), but the files are only about 5% smaller than zstd level 19, which is not worth the 5x slower decode.

Possible/Future Improvements:

1. Quality of all channels can be improved by tracking min/max for the entire image and then re-normalizing it. This would require 2 floats per channel in the shader to decode, though.

2. The normals in the normal map are in Euclidean space; this wastes bits, since some values are never used. Octahedral coordinates make better use of the available bits, and decoding isn't really much different.
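
Here is a minimal C++ sketch of the standard octahedral mapping; the helper names are mine:

#include <cmath>

struct Vec2 { float x, y; };
struct Vec3 { float x, y, z; };

static float SignNotZero(float v) { return v >= 0.0f ? 1.0f : -1.0f; }

// Maps a unit vector to a point in [-1,1]^2.
Vec2 OctEncode(Vec3 n) {
    float l1 = fabsf(n.x) + fabsf(n.y) + fabsf(n.z);
    Vec2 p = { n.x / l1, n.y / l1 };
    if (n.z < 0) {
        // Fold the lower hemisphere over the diagonals.
        Vec2 q = { (1.0f - fabsf(p.y)) * SignNotZero(p.x),
                   (1.0f - fabsf(p.x)) * SignNotZero(p.y) };
        p = q;
    }
    return p;
}

// Recovers a unit vector from the octahedral coordinates.
Vec3 OctDecode(Vec2 p) {
    Vec3 n = { p.x, p.y, 1.0f - fabsf(p.x) - fabsf(p.y) };
    if (n.z < 0) {
        float ox = (1.0f - fabsf(n.y)) * SignNotZero(n.x);
        float oy = (1.0f - fabsf(n.x)) * SignNotZero(n.y);
        n.x = ox; n.y = oy;
    }
    float len = sqrtf(n.x * n.x + n.y * n.y + n.z * n.z);
    n.x /= len; n.y /= len; n.z /= len;
    return n;
}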

Metal channel is active for many of the objects seen here

Tuesday, October 31, 2017

Higher Quality Vertex Normals

I use the Oct16 format to encode my vertex normals; this format is two 8-bit channels in an octahedral mapping.

Most of the time this was sufficient, but under certain conditions artifacts were visible-- such as on the surface of a smoothly varying sphere with triplanar texturing, whose blend weights are based on the normals.

Here is a visualization of the Triplanar weights generated from the Oct16 normals.


There is a very obvious diamond pattern visible.
Even switching to Oct20 (10 bits per channel) does not completely solve this; the diamonds are much smaller, but they persist.

Oct16, but with custom scale/bias

Instead of adding bits, I decided to take advantage of the fact that most triangle patches only use a limited range of the world space normals.

I track min/max per channel for the entire patch, then encode the normals so that the full range of bits is used.

Decoding in the shader requires a custom scale and bias parameter per channel (4 floats for the two-channel Oct16).

There are no extra instructions, as a fixed scale of 2 and bias of -1 was already being used to transform from the [0,1] range to [-1,1].

The 2nd image was encoded this way; the normals are still Oct16, so only 16 bits per normal, but with a custom scale/bias per patch.

 In the majority of cases this provides many extra bits of precision, and in the worst case it degrades back to standard Oct16.
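
Here is a minimal C++ sketch of the encode side, assuming the normals have already been mapped into the [-1,1]^2 octahedral square; the names are illustrative:

#include <algorithm>
#include <cstdint>
#include <cstddef>

struct PatchScaleBias { float scale[2]; float bias[2]; };

// Quantizes a patch of oct-mapped normals to 8 bits per channel, spending the
// full 0..255 range on just the span of values this patch actually uses.
PatchScaleBias EncodePatch(const float (*oct)[2], uint8_t (*out)[2], size_t count) {
    float mn[2] = {  1.0f,  1.0f };
    float mx[2] = { -1.0f, -1.0f };
    for (size_t i = 0; i < count; ++i)
        for (int c = 0; c < 2; ++c) {
            mn[c] = std::min(mn[c], oct[i][c]);
            mx[c] = std::max(mx[c], oct[i][c]);
        }
    PatchScaleBias sb;
    for (int c = 0; c < 2; ++c) {
        sb.scale[c] = std::max(mx[c] - mn[c], 1e-6f);
        sb.bias[c]  = mn[c];
    }
    for (size_t i = 0; i < count; ++i)
        for (int c = 0; c < 2; ++c) {
            float t = (oct[i][c] - sb.bias[c]) / sb.scale[c]; // [0,1] within the patch
            out[i][c] = (uint8_t)(t * 255.0f + 0.5f);
        }
    // The shader reconstructs: value = stored / 255.0 * scale + bias,
    // the same instruction count as the fixed *2 - 1 transform.
    return sb;
}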

Monday, October 30, 2017

Faster Triplanar Texturing

Here is a method I created to improve performance when using Triplanar texturing.
I also think it looks better.

So the standard triplanar texturing algorithm you will find in various places on the internet looks something like this:

float3 TriPlanarBlendWeightsStandard(float3 normal) {
    // weight each plane by how closely the normal aligns with its axis
    float3 blend_weights = abs(normal);
    // push the blend zone outward; only normals near a corner keep multiple weights
    blend_weights = blend_weights - 0.55;
    blend_weights = max(blend_weights, 0);
    // renormalize so the weights sum to 1
    float rcpBlend = 1.0 / (blend_weights.x + blend_weights.y + blend_weights.z);
    return blend_weights * rcpBlend;
}

If we visualize the blend zones this is what it looks like.

Red/Green/Blue represent one texture sample.

Yellow/pink/cyan represent two textures samples.

And in the white corner we need all three.

As we can see, the blend width is not constant: it is very small in the corner and quite wide along axis-aligned edges.

The corner has barely any blending, as we have pushed our blend zone out as far as possible by subtracting 0.55 (anything over 1/sqrt(3) ≈ 0.577 results in negative blend zones in the corner).

This results in needless texture sampling along aligned edges, stealing away our precious bandwidth.

Constant Blend Width

What we want is something more like this-- constant blend width.

We do this by working in max norm distance instead of Euclidean, as our planes are axis aligned anyway.

Here is the modified code that generates this:
float3 TriPlanarBlendWeightsConstantOverlap(float3 normal) {
    //float3 blend_weights = abs(normal);
    float3 blend_weights = normal * normal;
    // measure against the dominant axis (max norm) instead of a fixed threshold
    float maxBlend = max(blend_weights.x, max(blend_weights.y, blend_weights.z));
    blend_weights = blend_weights - maxBlend * 0.9f;
    blend_weights = max(blend_weights, 0);
    // renormalize so the weights sum to 1
    float rcpBlend = 1.0 / (blend_weights.x + blend_weights.y + blend_weights.z);
    return blend_weights * rcpBlend;
}

You can adjust the blend width by changing the 0.9 scalar.

On my GPU the constant version runs slightly faster, likely because there are fewer pixels where more than one texture sample is required.

I believe it also looks better, as there is less smearing along axis-aligned edges.

Here is a shadertoy I created if you want to play with it.

Saturday, October 28, 2017

Barycentric Coordinates in Pixel Shader

I needed a way to perform smooth blending between per vertex materials.
Basically I needed barycentric coordinates + access to each vertices material in the pixel shader.

Geometry Shader method:  Assign the coordinates: (1,0,0), (0,1,0), (0,0,1) to the vertices of the triangle.  Also write the three materials to each vertex.
This method is easy to implement but has terrible performance on many cards, since it requires a geometry shader.  When enabled on my AMD card, FPS drops to half or less.

AMD AGS Driver extension:  AMD has a library called AGS_SDK which exposes driver extensions, one of these is direct access to barycentric coordinates in the pixel shader.  It also allows for direct access to any of the attributes from the 3 vertices that make up the triangle.
This method is very fast and works well if you have an AMD card that supports it.

float2 bary2d = AmdDxExtShaderIntrinsics_IjBarycentricCoords(AmdDxExtShaderIntrinsicsBarycentric_PerspCenter);
 //reconstruct the 3rd coordinate
float3 bary = float3(1.0 - bary2d.x - bary2d.y, bary2d.y, bary2d.x);

//extract materials
float m0 = AmdDxExtShaderIntrinsics_VertexParameterComponent(0, 1, 0);
float m1 = AmdDxExtShaderIntrinsics_VertexParameterComponent(1, 1, 0);
float m2 = AmdDxExtShaderIntrinsics_VertexParameterComponent(2, 1, 0);

Nvidia FastGeometryShader: Nvidia also has driver extensions (NVAPI), and one of these is the "fast geometry shader", for when you only need a subset of the features geometry shaders offer.
It should be possible to use this to pass down barycentric coordinates & materials, but I do not have an Nvidia card to test this on.

Domain Shader?: I haven't tried this method, but I think it might be possible to pass down barycentric coordinates from a domain shader?

Embed Into Vertex Data: Another option is to enlarge the vertex, and embed the barycentric coordinates and the 3 materials directly into it, as sketched below.
This is probably a better fallback than the GS, although it does have the downside of reducing vertex reuse, since many vertices that were previously identical would now differ.
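
Here is a rough C++ sketch of what the enlarged vertex could look like; the layout is illustrative, not my exact format:

#include <cstdint>

// Each corner of a triangle gets its own copy of the vertex, carrying a one-hot
// barycentric coordinate plus the materials of all three corners.
struct BlendVertex {
    float   position[3];
    float   normalOct[2];
    float   bary[3];       // (1,0,0), (0,1,0), or (0,0,1) depending on the corner
    uint8_t materials[3];  // materials of all 3 corners, same order in each copy
    uint8_t pad;
};
// The pixel shader receives 'bary' interpolated across the triangle for free,
// while 'materials' is constant across the triangle, so each corner's material
// can be blended with its matching barycentric weight.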

AVX2 Gather

Masked Gather vs Unmasked Gather

AVX2 has masked gather instructions (_mm_mask_i32gather_epi32, etc.); these take two additional parameters: a mask, and a default value that is used when the mask is false.

  I was hoping masked gathers would be accelerated, such that when most of the lanes were masked off, the gather would complete sooner, but this does not appear to be the case.

The performance of masked and unmasked gathers was very similar, though masked gathers were consistently slightly slower than unmasked gathers.

 Load vs Gather vs Software Gather

To compare gather with load, I created a buffer and ran through it in linear order, summing the values.
I forced the gathers to load from the same indices the load was operating on: indices (0,1,2,3,4,5,6,7), incremented by 8 for each loop.

Software gather loaded each index using scalar loads instead of the hardware intrinsic.
Hardware gather was generally ~1.2-1.5x faster than software gather.
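
Here is a minimal C++ sketch of the three inner loops being compared; the function names are mine, the timing harness is omitted, and count is assumed to be a multiple of 8:

#include <immintrin.h>
#include <cstdint>

// Horizontal sum of the 8 lanes of an accumulator.
static int64_t HorizontalSum(__m256i v) {
    alignas(32) int32_t lanes[8];
    _mm256_store_si256((__m256i*)lanes, v);
    int64_t sum = 0;
    for (int i = 0; i < 8; ++i) sum += lanes[i];
    return sum;
}

// Plain vector loads, linear order.
int64_t SumLoad(const int32_t* buf, int count) {
    __m256i acc = _mm256_setzero_si256();
    for (int i = 0; i < count; i += 8)
        acc = _mm256_add_epi32(acc, _mm256_loadu_si256((const __m256i*)(buf + i)));
    return HorizontalSum(acc);
}

// Hardware gather, forced to the same linear indices as the loads.
int64_t SumGather(const int32_t* buf, int count) {
    __m256i acc = _mm256_setzero_si256();
    __m256i idx = _mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    const __m256i step = _mm256_set1_epi32(8);
    for (int i = 0; i < count; i += 8) {
        acc = _mm256_add_epi32(acc, _mm256_i32gather_epi32(buf, idx, 4));
        idx = _mm256_add_epi32(idx, step);
    }
    return HorizontalSum(acc);
}

// Software gather: fetch each index with scalar loads, then sum.
int64_t SumSoftwareGather(const int32_t* buf, int count) {
    int64_t sum = 0;
    for (int i = 0; i < count; i += 8)
        for (int j = 0; j < 8; ++j)
            sum += buf[i + j];
    return sum;
}

The masked variant swaps the gather for _mm256_mask_i32gather_epi32, which adds the mask and default-value arguments.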

Performance depended on the cache level the buffer fit into.

Buffer fits in L1: Load is ~10x faster than Gather
Buffer fits in L2: Load is ~3.5x faster than Gather
Buffer greater than L2: Load tapers off to ~2x faster than Gather

This was all run on a Haswell; newer chips might perform differently.

Monday, August 21, 2017

Moment of Inertia of a Distance Field

While adding physics support to my voxel engine, I needed a reasonably accurate & fast method to calculate the moment of inertia, volume, and center of mass for an arbitrary distance field.

I've written this C++ code to do so.
Feed it a regularly spaced grid of points and distances.
The points must fully bound the negative space of the field to be accurate.
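
Here is a minimal C++ sketch of the approach, treating each sample with a negative distance as a solid cube of side 'spacing' and assuming unit density; the names are illustrative and it is simplified relative to my actual code:

#include <cstddef>

struct InertiaResult {
    double volume;
    double com[3];      // center of mass
    double inertia[6];  // Ixx, Iyy, Izz, Ixy, Ixz, Iyz about the COM, unit density
};

InertiaResult EstimateInertia(const float* distances, const float (*points)[3],
                              size_t count, double spacing) {
    double cell = spacing * spacing * spacing;   // volume of one sample cell
    InertiaResult r = {};
    // First pass: volume and center of mass.
    for (size_t i = 0; i < count; ++i) {
        if (distances[i] >= 0) continue;         // negative distance = inside
        r.volume += cell;
        for (int a = 0; a < 3; ++a) r.com[a] += cell * points[i][a];
    }
    if (r.volume > 0)
        for (int a = 0; a < 3; ++a) r.com[a] /= r.volume;
    // Second pass: inertia tensor about the COM (scale by density for real mass).
    for (size_t i = 0; i < count; ++i) {
        if (distances[i] >= 0) continue;
        double x = points[i][0] - r.com[0];
        double y = points[i][1] - r.com[1];
        double z = points[i][2] - r.com[2];
        r.inertia[0] += cell * (y * y + z * z);  // Ixx
        r.inertia[1] += cell * (x * x + z * z);  // Iyy
        r.inertia[2] += cell * (x * x + y * y);  // Izz
        r.inertia[3] -= cell * x * y;            // Ixy
        r.inertia[4] -= cell * x * z;            // Ixz
        r.inertia[5] -= cell * y * z;            // Iyz
    }
    return r;
}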

For common primitives, such as a sphere, there are established formulas that we can compare against.

For a 10^3 distance field of a sphere, the estimate is off by about 1%, close enough for me.
If more accuracy is needed, the sample rate can be increased.

Thursday, January 5, 2017

Video of my Distance Field Engine

This is a short clip showing my engine. 
It uses distance fields to represent all geometry.
Don't confuse it with a Minecraft-like engine--those use 1 bit per voxel (filled/not filled).
It is written in very SIMD heavy C++.