Tuesday, October 31, 2017

Higher Quality Vertex Normals

 I use the Oct16 format to encode my vertex normals. This format stores two 8-bit channels in an octahedral mapping.

  Most of the time this was sufficient, but under certain conditions artifacts were visible, such as on the surface of a smoothly varying sphere using triplanar texturing, whose weights are based on the normals.

Here is a visualization of the Triplanar weights generated from the Oct16 normals.


There is a very obvious diamond pattern visible.
Even switching to Oct20 (10 bits per channel) does not completely solve this: the diamonds are much smaller, but they persist.

Oct16, but with custom scale/bias

Instead of adding bits, I decided to take advantage of the fact that most triangle patches only use a limited range of the world space normals.

I track min/max per channel for the entire patch, then encode the normals so that the full range of bits is used.

Decoding in the shader requires a custom scale and bias parameter per channel (4 floats for the two-channel Oct16).

There are no extra instructions, since a fixed scale of 2 and bias of -1 was already being applied to transform from the [0,1] range to the [-1,1] range.

The 2nd image was encoded this way. The normals are still using Oct16, so only 16 bits per normal, but with a custom scale/bias per patch.

 In the majority of cases this provides many extra bits of precision, and in the worst case it degrades back to standard Oct16.
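As a rough illustration of the per-patch scale/bias idea, here is a minimal CPU-side C++ sketch. It is not the engine's actual encoder: the names (`EncodePatch`, `DecodeNormal`, `Patch`, `MaxError`) are made up for this example, and the octahedral mapping is the commonly used formulation, assumed to match what Oct16 does here. With `customRange` off it falls back to the fixed scale of 2 and bias of -1.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vec2 { float x, y; };
struct Vec3 { float x, y, z; };

inline float SignNotZero(float v) { return v >= 0.0f ? 1.0f : -1.0f; }

// Unit vector -> octahedral coordinates in [-1,1]^2.
inline Vec2 OctEncode(Vec3 n) {
    float s = std::fabs(n.x) + std::fabs(n.y) + std::fabs(n.z);
    Vec2 e{n.x / s, n.y / s};
    if (n.z < 0.0f)  // fold the lower hemisphere over the diagonals
        e = {(1.0f - std::fabs(e.y)) * SignNotZero(e.x),
             (1.0f - std::fabs(e.x)) * SignNotZero(e.y)};
    return e;
}

// Octahedral coordinates -> unit vector.
inline Vec3 OctDecode(Vec2 e) {
    Vec3 n{e.x, e.y, 1.0f - std::fabs(e.x) - std::fabs(e.y)};
    if (n.z < 0.0f) {
        n.x = (1.0f - std::fabs(e.y)) * SignNotZero(e.x);
        n.y = (1.0f - std::fabs(e.x)) * SignNotZero(e.y);
    }
    float len = std::sqrt(n.x * n.x + n.y * n.y + n.z * n.z);
    return {n.x / len, n.y / len, n.z / len};
}

struct Packed { uint8_t u, v; };
struct Patch {
    Vec2 scale, bias;             // the 4 floats handed to the shader
    std::vector<Packed> normals;  // still 16 bits per normal
};

inline uint8_t Quantize(float v, float lo, float range) {
    float t = range > 0.0f ? (v - lo) / range : 0.0f;
    t = std::min(std::max(t, 0.0f), 1.0f);
    return static_cast<uint8_t>(std::lround(t * 255.0f));
}

Patch EncodePatch(const std::vector<Vec3>& normals, bool customRange) {
    std::vector<Vec2> oct;
    for (Vec3 n : normals) oct.push_back(OctEncode(n));
    Vec2 lo{-1.0f, -1.0f}, hi{1.0f, 1.0f};  // standard Oct16 range
    if (customRange && !oct.empty()) {      // track min/max per channel
        lo = hi = oct[0];
        for (Vec2 e : oct) {
            lo = {std::min(lo.x, e.x), std::min(lo.y, e.y)};
            hi = {std::max(hi.x, e.x), std::max(hi.y, e.y)};
        }
    }
    Patch p;
    p.scale = {hi.x - lo.x, hi.y - lo.y};
    p.bias = lo;
    for (Vec2 e : oct)
        p.normals.push_back({Quantize(e.x, lo.x, p.scale.x),
                             Quantize(e.y, lo.y, p.scale.y)});
    return p;
}

// Same multiply-add as the fixed 2 / -1 decode, just with per-patch constants.
Vec3 DecodeNormal(const Patch& p, std::size_t i) {
    Vec2 e{p.normals[i].u / 255.0f * p.scale.x + p.bias.x,
           p.normals[i].v / 255.0f * p.scale.y + p.bias.y};
    return OctDecode(e);
}

// Largest distance between an original normal and its decoded version.
float MaxError(const std::vector<Vec3>& normals, bool customRange) {
    Patch p = EncodePatch(normals, customRange);
    float worst = 0.0f;
    for (std::size_t i = 0; i < normals.size(); ++i) {
        Vec3 d = DecodeNormal(p, i);
        float dx = d.x - normals[i].x, dy = d.y - normals[i].y, dz = d.z - normals[i].z;
        worst = std::max(worst, std::sqrt(dx * dx + dy * dy + dz * dz));
    }
    return worst;
}
```

Calling `MaxError(patch, true)` and `MaxError(patch, false)` on the same patch gives a quick way to see how much the tighter range helps for that patch.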

Monday, October 30, 2017

Faster Triplanar Texturing

Here is a method I created to improve performance when using Triplanar texturing.
I also think it looks better.

The standard triplanar texturing algorithm you will find in various places on the internet looks something like this.

float3 TriPlanarBlendWeightsStandard(float3 normal) {
    float3 blend_weights = abs(normal);
    // push the blend zone outward; the offset must stay below 1/sqrt(3)
    blend_weights = blend_weights - 0.55;
    blend_weights = max(blend_weights, 0);
    // normalize so the weights sum to 1
    float rcpBlend = 1.0 / (blend_weights.x + blend_weights.y + blend_weights.z);
    return blend_weights * rcpBlend;
}

If we visualize the blend zones this is what it looks like.

Red/Green/Blue represent one texture sample.

Yellow/pink/cyan represent two texture samples.

And in the white corner we need all three.

As we can see, the blend width is not constant: it is very small in the corner and quite wide along axis-aligned edges.

The corner has barely any blending because we have pushed the blend zone out as far as possible by subtracting 0.55. (Subtracting anything over 1/sqrt(3), about 0.577, makes all three weights negative at the corner, so nothing is left to blend there.)

This results in needless texture sampling along aligned edges, stealing away our precious bandwidth.

Constant Blend Width

What we want is something more like this-- constant blend width.

We do this by working in max norm distance instead of Euclidean distance, as our planes are axis aligned anyway.

Here is the modified code that generates this:
float3 TriPlanarBlendWeightsConstantOverlap(float3 normal) {
    //float3 blend_weights = abs(normal);
    float3 blend_weights = normal*normal;
    float maxBlend = max(blend_weights.x, max(blend_weights.y, blend_weights.z));
    // keep only the weights that are within 10% of the largest one
    blend_weights = blend_weights - maxBlend*0.9f;
    blend_weights = max(blend_weights, 0);
    // normalize so the weights sum to 1
    float rcpBlend = 1.0 / (blend_weights.x + blend_weights.y + blend_weights.z);
    return blend_weights*rcpBlend;
}

 You can adjust the blend width by changing the 0.9 scalar.

On my GPU the constant version runs slightly faster, likely because there are fewer pixels where more than one texture sample is required.

I believe it also looks better, as there is less smearing along axis-aligned edges.
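To put a rough number on how many directions need more than one texture sample, here is a small C++ port of the two weight functions with a counting loop. This is only a sketch: `MultiSampleFraction` and `SampleCount` are names invented for this example, and the directions come from a Fibonacci spiral over the sphere, which only approximates how normals are distributed on real geometry.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Float3 { float x, y, z; };

// Shared tail of both functions: clamp to zero, then normalize to sum 1.
inline Float3 ClampAndNormalize(Float3 w) {
    w = {std::max(w.x, 0.0f), std::max(w.y, 0.0f), std::max(w.z, 0.0f)};
    float rcp = 1.0f / (w.x + w.y + w.z);
    return {w.x * rcp, w.y * rcp, w.z * rcp};
}

// Port of TriPlanarBlendWeightsStandard (offset 0.55).
Float3 WeightsStandard(Float3 n) {
    return ClampAndNormalize({std::fabs(n.x) - 0.55f,
                              std::fabs(n.y) - 0.55f,
                              std::fabs(n.z) - 0.55f});
}

// Port of TriPlanarBlendWeightsConstantOverlap (scalar 0.9).
Float3 WeightsConstantOverlap(Float3 n) {
    Float3 sq{n.x * n.x, n.y * n.y, n.z * n.z};
    float cut = 0.9f * std::max(sq.x, std::max(sq.y, sq.z));
    return ClampAndNormalize({sq.x - cut, sq.y - cut, sq.z - cut});
}

// Number of textures that would actually have to be sampled.
inline int SampleCount(Float3 w) {
    return (w.x > 0.0f) + (w.y > 0.0f) + (w.z > 0.0f);
}

// Fraction of roughly evenly spread unit directions that need
// more than one texture sample.
double MultiSampleFraction(Float3 (*weights)(Float3), int count = 100000) {
    const double goldenAngle = 3.14159265358979323846 * (3.0 - std::sqrt(5.0));
    int multi = 0;
    for (int i = 0; i < count; ++i) {
        double z = 1.0 - 2.0 * (i + 0.5) / count;
        double r = std::sqrt(1.0 - z * z);
        double phi = goldenAngle * i;
        Float3 n{float(r * std::cos(phi)), float(r * std::sin(phi)), float(z)};
        if (SampleCount(weights(n)) > 1) ++multi;
    }
    return double(multi) / count;
}
```

Comparing `MultiSampleFraction(WeightsStandard)` with `MultiSampleFraction(WeightsConstantOverlap)` shows how much of the sphere falls into a blend zone for each variant. The actual GPU cost also depends on screen coverage and texture cache behavior, which this does not model.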

Here is a shadertoy I created if you want to play with it

Saturday, October 28, 2017

Barycentric Coordinates in Pixel Shader

I needed a way to perform smooth blending between per vertex materials.
Basically I needed barycentric coordinates plus access to each vertex's material in the pixel shader.

Geometry Shader method:  Assign the coordinates: (1,0,0), (0,1,0), (0,0,1) to the vertices of the triangle.  Also write the three materials to each vertex.
This method is easy to implement but has terrible performance on many cards, since it requires a geometry shader.  When enabled on my AMD card, FPS drops to half or less.

AMD AGS Driver extension:  AMD has a library called AGS_SDK which exposes driver extensions. One of these is direct access to barycentric coordinates in the pixel shader.  It also allows for direct access to any of the attributes from the 3 vertices that make up the triangle.
This method is very fast and works well if you have an AMD card that supports it.

float2 bary2d = AmdDxExtShaderIntrinsics_IjBarycentricCoords(AmdDxExtShaderIntrinsicsBarycentric_PerspCenter);
 //reconstruct the 3rd coordinate
float3 bary = float3(1.0 - bary2d.x - bary2d.y, bary2d.y, bary2d.x);

//extract materials
float m0 = AmdDxExtShaderIntrinsics_VertexParameterComponent(0, 1, 0);
float m1 = AmdDxExtShaderIntrinsics_VertexParameterComponent(1, 1, 0);
float m2 = AmdDxExtShaderIntrinsics_VertexParameterComponent(2, 1, 0);

Nvidia FastGeometryShader: Nvidia also has driver extensions (NVAPI), and one of these is the "fast geometry shader" for when you only need a subset of the features geometry shaders offer.
 It should be possible to use this to pass down barycentric coordinates and materials, but I do not have an Nvidia card to test this on.

Domain Shader?: I haven't tried this method, but I think it might be possible to pass down barycentric coordinates from a domain shader?

Embed Into Vertex Data: Another option is to enlarge the vertex, and embed the barycentric coordinates and the 3 materials directly into it.
 This is probably a better fallback than the GS, although it does have the downside of reducing vertex reuse, since many vertices that were previously identical would now differ.
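A minimal C++ sketch of what the embedding step could look like on the CPU side. The vertex layouts and the names `Vertex`, `FatVertex`, `ExpandTriangles` and `BlendMaterials` are assumptions made for this example, not the engine's real formats.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical source vertex: position plus one material id.
struct Vertex { float pos[3]; uint8_t material; };

// Enlarged vertex: carries its own barycentric corner and the
// material ids of all three vertices of its triangle.
struct FatVertex {
    float pos[3];
    float bary[3];         // (1,0,0), (0,1,0) or (0,0,1)
    uint8_t materials[3];  // identical on all three corners of a triangle
};

// Un-indexes the mesh: every triangle gets three private vertices,
// which is where the vertex reuse is lost.
std::vector<FatVertex> ExpandTriangles(const std::vector<Vertex>& verts,
                                       const std::vector<uint32_t>& indices) {
    std::vector<FatVertex> out;
    out.reserve(indices.size());
    for (std::size_t t = 0; t + 2 < indices.size(); t += 3) {
        for (int corner = 0; corner < 3; ++corner) {
            const Vertex& v = verts[indices[t + corner]];
            FatVertex f{};
            for (int k = 0; k < 3; ++k) {
                f.pos[k] = v.pos[k];
                f.bary[k] = (k == corner) ? 1.0f : 0.0f;
                f.materials[k] = verts[indices[t + k]].material;
            }
            out.push_back(f);
        }
    }
    return out;
}

// What the pixel shader would do with the interpolated barycentrics:
// weight one value per material (e.g. a sampled color channel).
float BlendMaterials(const float bary[3], const float valuePerMaterial[3]) {
    return bary[0] * valuePerMaterial[0] +
           bary[1] * valuePerMaterial[1] +
           bary[2] * valuePerMaterial[2];
}
```

For a quad made of 4 shared vertices and 6 indices, this produces 6 enlarged vertices, which illustrates the reuse cost mentioned above.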

AVX2 Gather

Masked Gather vs Unmasked Gather

  AVX2 has masked gather instructions (_mm_mask_i32gather_epi32 etc). These take two additional parameters: a mask, and a default value that is used when the mask is false.

  I was hoping masked gathers would be accelerated, such that when most of the lanes were masked off, the gather would complete sooner, but this does not appear to be the case.

   The performance of masked and unmasked gathers was very similar, but masked gathers were consistently slower than unmasked gathers.

 Load vs Gather vs Software Gather

To compare gather with load, I created a buffer and ran through it in linear order, summing the values.
 I forced the gathers to load from the same indices the load was operating on: indices (0,1,2,3,4,5,6,7), incremented by 8 for each loop.

Software gather loaded each index using scalar loads instead of the hardware intrinsics.
Gather was generally ~1.2-1.5x faster than software gather.
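The three access patterns can be sketched in C++ roughly as follows. This is a reconstruction of the setup described above, not the original benchmark: the function names are invented, the timing harness is left out, and the hardware gather path is only compiled when the compiler defines `__AVX2__`. The plain loop in `SumLinear` stands in for the vector load and relies on the compiler to vectorize it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>
#ifdef __AVX2__
#include <immintrin.h>
#endif

// All variants sum the buffer in groups of 8 and ignore a ragged tail.

// "Load": straight linear pass over the buffer.
int32_t SumLinear(const std::vector<int32_t>& buf) {
    int32_t sum = 0;
    for (std::size_t i = 0; i + 8 <= buf.size(); i += 8)
        for (std::size_t k = 0; k < 8; ++k) sum += buf[i + k];
    return sum;
}

// "Software gather": 8 explicit indices, one scalar load per index.
int32_t SumSoftwareGather(const std::vector<int32_t>& buf) {
    int32_t idx[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    int32_t lanes[8] = {0, 0, 0, 0, 0, 0, 0, 0};
    for (std::size_t i = 0; i + 8 <= buf.size(); i += 8)
        for (int k = 0; k < 8; ++k) {
            lanes[k] += buf[static_cast<std::size_t>(idx[k])];
            idx[k] += 8;  // same indices the linear load would touch
        }
    int32_t sum = 0;
    for (int k = 0; k < 8; ++k) sum += lanes[k];
    return sum;
}

#ifdef __AVX2__
// "Gather": the same indices, fetched with the AVX2 gather intrinsic.
int32_t SumHardwareGather(const std::vector<int32_t>& buf) {
    __m256i idx = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    const __m256i step = _mm256_set1_epi32(8);
    __m256i acc = _mm256_setzero_si256();
    for (std::size_t i = 0; i + 8 <= buf.size(); i += 8) {
        acc = _mm256_add_epi32(acc, _mm256_i32gather_epi32(buf.data(), idx, 4));
        idx = _mm256_add_epi32(idx, step);
    }
    alignas(32) int32_t lanes[8];
    _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), acc);
    int32_t sum = 0;
    for (int k = 0; k < 8; ++k) sum += lanes[k];
    return sum;
}
#endif
```

All variants should return the same sum for the same buffer, which makes for a cheap sanity check before timing them against buffers sized for each cache level.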

 Performance depended on the cache level that the buffer fit into.

Buffer fits in L1

Load is ~10x faster than Gather

Buffer fits in L2

Load is ~3.5x faster than Gather

Buffer greater than L2

Load tapers off to ~2.x faster than Gather

This was all run on a Haswell. Newer chips might perform differently.

Monday, August 21, 2017

Moment of Inertia of a Distance Field

While adding physics support to my voxel engine I needed a reasonably accurate and fast method to calculate the moment of inertia, volume, and center of mass for an arbitrary distance field.

I've written this C++ code to do so.
Feed it a regularly spaced grid of points and distances.
The points must fully bound the negative space of the field to be accurate.

For common primitives, such as a sphere, there are established formulas that we can compare against.

For a 10^3 distance field of a sphere, the estimate is off by about 1%, close enough for me.
If more accuracy is needed, the sample rate can be increased.
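Since the linked code is not reproduced here, the following is only a minimal sketch of one way to do this kind of estimate, not the original implementation. It assumes unit density, samples at cell centers of a cubic grid, and approximates how much of each cell is solid from the distance value with `clamp(0.5 - d/h, 0, 1)`, which requires `d` to be a true distance near the surface. The names `EstimateMassProps`, `MassProps` and `DistanceFn` are invented for this example, and only the diagonal of the inertia tensor is computed.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <functional>

struct MassProps {
    double volume = 0.0;
    double com[3] = {0.0, 0.0, 0.0};
    double inertia[3] = {0.0, 0.0, 0.0};  // Ixx, Iyy, Izz about the COM, unit density
};

// Signed distance: negative inside the shape.
using DistanceFn = std::function<double(double, double, double)>;

// Samples an n^3 grid of cell centers covering the cube [lo, hi]^3.
MassProps EstimateMassProps(const DistanceFn& dist, double lo, double hi, int n) {
    const double h = (hi - lo) / n;
    const double cellVolume = h * h * h;
    double mass = 0.0;
    double first[3] = {0.0, 0.0, 0.0};   // sum of w * p
    double second[3] = {0.0, 0.0, 0.0};  // sum of w * p^2
    for (int k = 0; k < n; ++k)
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < n; ++i) {
                const double p[3] = {lo + (i + 0.5) * h,
                                     lo + (j + 0.5) * h,
                                     lo + (k + 0.5) * h};
                const double d = dist(p[0], p[1], p[2]);
                // Assumed coverage model: fully solid half a cell inside,
                // empty half a cell outside, linear in between.
                const double coverage = std::min(std::max(0.5 - d / h, 0.0), 1.0);
                const double w = coverage * cellVolume;
                mass += w;
                for (int a = 0; a < 3; ++a) {
                    first[a] += w * p[a];
                    second[a] += w * p[a] * p[a];
                }
            }
    MassProps r;
    r.volume = mass;
    if (mass <= 0.0) return r;
    double spread[3];  // second moment per axis, shifted to the COM
    for (int a = 0; a < 3; ++a) {
        r.com[a] = first[a] / mass;
        spread[a] = second[a] - mass * r.com[a] * r.com[a];
    }
    r.inertia[0] = spread[1] + spread[2];
    r.inertia[1] = spread[0] + spread[2];
    r.inertia[2] = spread[0] + spread[1];
    return r;
}
```

For a unit sphere the result can be compared against the closed-form volume 4/3·pi·r^3 and, at unit density, the moment of inertia 8/15·pi·r^5.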

Thursday, January 5, 2017

Video of my Distance Field Engine

This is a short clip showing my engine. 
It uses distance fields to represent all geometry.
Don't confuse it with a Minecraft-like engine. Those use 1 bit per voxel: filled or not filled.
It is written in very SIMD heavy C++.

Thursday, December 22, 2016

gflops on various processors

This is the execution engine for Haswell.

Port 0 and 1 can both execute FMA/FMul.

I'm going to write down general Gflops ratings for commonly used CPUs, broken down by how those numbers are calculated. This is mostly for future reference for myself.

Haswell i7 4770k at 3.5ghz.

8(AVX) * 2(FMA) * 2(two FMA ports) * 4(cores) * 3.5(ghz) =448 gflop

Kabylake i7 7700k: nothing much has changed here, but it is clocked at 4.2ghz.
It does have faster div/sqrt, and fadd can run on two ports, but that is not reflected in the flops rating.

8(AVX) * 2(FMA) * 2(two FMA ports) * 4(cores) * 4.2(ghz) =537.6 gflop

AMD chips support AVX/AVX2, but internally they only execute 128 bits at a time.

Xbox One Jaguar AMD CPU:

4(fake AVX) * 2(ports) * 8(cores)* 1.75ghz  =112 gflops

AMD Zen CPU: the exact ghz isn't known, but a demonstration had it at 3.4.
It supports AVX2, but breaks it into 2x4 SSE internally (half the throughput of Intel).

4(fake AVX2) * 2(FMA) * 2(two FMA ports I *think*) * 8(cores) * 3.4(ghz) = 435.2 gflop

Intel Skylake Xeon added AVX512 support. Unfortunately it appears AVX512 will not appear in consumer CPUs until 2018/19.
 I believe Intel will be upping the core count to either 6 or 8 for the K line by this time.

Future Intel K chip with AVX512:
16(AVX512) * 2(FMA) * 2(two FMA ports) * 6-8(cores) * 3.5-4.2(ghz) = between 1344 and 2150 gflops
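The products above all follow the same pattern, which can be written as one small helper. `PeakGflops` is a name made up for this sketch. It only reproduces the arithmetic of this post and says nothing about what a real workload achieves.

```cpp
#include <cassert>
#include <cmath>

// Theoretical single-precision peak:
// SIMD lanes * flops per lane per instruction (2 for FMA, 1 otherwise)
// * execution ports * cores * clock in GHz.
constexpr double PeakGflops(int simdLanes, int flopsPerLane, int ports,
                            int cores, double ghz) {
    return simdLanes * flopsPerLane * ports * cores * ghz;
}

// The figures from this post, using the same factors as the text.
constexpr double kHaswell4770k = PeakGflops(8, 2, 2, 4, 3.5);
constexpr double kKabylake7700k = PeakGflops(8, 2, 2, 4, 4.2);
constexpr double kJaguarXboxOne = PeakGflops(4, 1, 2, 8, 1.75);
constexpr double kZenAt3_4 = PeakGflops(4, 2, 2, 8, 3.4);
constexpr double kFutureAvx512Low = PeakGflops(16, 2, 2, 6, 3.5);
constexpr double kFutureAvx512High = PeakGflops(16, 2, 2, 8, 4.2);
```

Keeping the factors as separate parameters makes it easy to see which term changes between chips, for example the missing FMA factor on Jaguar.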

 Now Haswell can only decode 4 instructions per clock, so keeping it fed with 2 FMAs per cycle is not always going to be possible.
 It takes 5 cycles to retire an FMA, so you need 10 FMAs in flight to maximize throughput.
With Kabylake/Skylake, FMA retires in 4 cycles, so only 8 are required.
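The in-flight count is just latency times the number of FMA ports, and in code it translates into that many independent accumulator chains. Below is a scalar C++ sketch of the idea. `FmasInFlight` and `DotProduct` are illustrative names, and in real SIMD code each accumulator would be a full vector register rather than a single float.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Independent FMAs needed to keep every FMA port busy each cycle.
constexpr int FmasInFlight(int latencyCycles, int fmaPorts) {
    return latencyCycles * fmaPorts;
}

// Dot product with `Chains` independent accumulators, so consecutive
// FMAs do not have to wait on each other's results.
template <int Chains>
float DotProduct(const float* a, const float* b, std::size_t n) {
    float acc[Chains] = {};
    std::size_t i = 0;
    for (; i + Chains <= n; i += Chains)
        for (int c = 0; c < Chains; ++c)
            acc[c] = std::fma(a[i + c], b[i + c], acc[c]);
    float sum = 0.0f;
    for (int c = 0; c < Chains; ++c) sum += acc[c];
    for (; i < n; ++i) sum = std::fma(a[i], b[i], sum);  // leftover elements
    return sum;
}
```

With the numbers above, `FmasInFlight(5, 2)` gives the 10 chains for Haswell and `FmasInFlight(4, 2)` the 8 for Kabylake/Skylake, so `DotProduct<10>` or `DotProduct<8>` would be the matching instantiations.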

Hyperthreading can help, but again, with only 4 instructions decoded per cycle, decoding might bottleneck it.

On Haswell Port 5 can also execute integer vector ops, so if you mixed int/float it might be possible to compute above the "gflops" rating, although this would be with integer math.