Monday, August 21, 2017

Moment of Inertia of a Distance Field

While adding physics support to my voxel engine I needed a reasonable accurate & fast method to calculate the moment of inertia, volume, and center of mass for an arbitrary distance field.

I've written this C++ code to do so.
Feed it a regularly spaced grid of points and distances.
The points must fully bound the negative space of the field to be accurate.

The common primitives, such as a sphere, there are established formulas that we can compare against.

For a 10^3 distance field of a sphere, the estimate is off by about 1%, close enough for me.
If more accuracy is needed, the sample rate can be increased.

Thursday, January 5, 2017

Video of my Distance Field Engine

This is a short clip showing my engine. 
It uses distance fields to represent all geometry.
Don't confuse it with a minecraft like engine--those use 1 bit filled/not filled.
It is written in very SIMD heavy C++.

Thursday, December 22, 2016

gflops on various processors

This is the execution engine for Haswell.

Port 0 and 1 can both execute FMA/FMul.

I'm going to write down general Gflops ratings for commonly used CPU's, broken down by how those numbers are calculated. This is mostly for future reference for myself.

Haswell i7 4770k at 3.5ghz.

8(AVX) * 2(FMA) * 2(two FMA ports) * 4(cores) * 3.5(ghz) =448 gflop

Kabylake i7 7770k: nothing much has changed here, but it is clocked at 4.2ghz.
It does have faster div/sqrt and fadd can run on two ports, but that is not reflected in flops rating.

8(AVX) * 2(FMA) * 2(two FMA ports) * 4(cores) * 4.2(ghz) =537.6 gflop

AMD chips support AVX/AVX2, but internally it only executes 128bits at a time.

Xbox One Jaguar AMD CPU:

4(fake AVX) * 2(ports) * 8(cores)* 1.75ghz  =112 gflops

AMD Zen CPU: the exact ghz isn't know, but demonstration had it at 3.4.
It supports AVX2, but breaks it into 2x4 SSE internally(half throughput of intel)

4(fake AVX2) * 2(FMA) * 2(two FMA ports I *think*) * 8(cores) * 3.4(ghz) = 435.2 gflop

Intel Skylake Xeon added AVX512 support, unfortunately it appears AVX512 will not appear in consumer CPU's until 2018/19
 I believe intel will be upping core count to either 6 or 8 for the k line by this time.

Future Intel K chip with AVX512:
16(AVX512) * 2(FMA) * 2(two FMA ports) * 6-8(cores) * 3.5-4.2(ghz) = between 1344 to 2150 gflops

 Now Haswell can only decode 4 instructions per clock so keeping it fed with 2 FMA's per cycle is not always going to be possible.
 It takes 5 cycles to retire FMA, so you need 10 FMA's in flight to maximize throughput.
With kabylake/skylake, FMA retires in 4 cycles, so only 8 are required.

Hyperthreading can help, but again, with only 4 instructions decoded per cycle, decoding might bottleneck it.

On Haswell Port 5 can also execute integer vector ops, so if you mixed int/float it might be possible to compute above the "gflops" rating, although this would be with integer math.

Thursday, October 27, 2016

Cluster Culling

AMD's GPUOpen has an article on Cluster Culling

Basically for a given mesh cluster, you can often perform a variant of backface culling on the entire cluster. 

You do this by calculating a cone that represents the region in which the cluster is not visible. 
Any viewer located within the cone, is unable to see the cluster, so it can be culled.

 AMD implementation works like this: 

  1. Find the average normal of the cluster
  2. Take the dot product of each normal against the average normal, and find the minimum. 
  3. Use this as cone angle, anything greater than 0 can be culled in some situations.  
They also do some other work involving the bounding box, to prevent some errors cases they had to deal with. 

This is a smallest circle problem, the AMD solution using the average axis is rarely going to produce the tightest circle.

For my code I run multiple algorithms, the average, the min/max axis, and then run 1 round of ritters method over the data using whichever axis was the best. The average axis is pretty bad generally, so even just using min/max axis is a good improvement.

If you want an exact algorithm, you could try this method, although it will be slower to calculate.

The cull rate various heavily depending on the scene. It is also much more effective at higher details(smaller cluster size).  Sometimes it is only 1%, but I have seen it go up to around 15%.  

My engine does not generate clusters if they are outside the frustum or occluded, which reduces opportunities for culling. 
In a standard game engine with offline generated content the cull rate would likely be higher.

Monday, September 5, 2016

Signed Distance Field Volume Compression

Here is a voxelized stanford dragon mesh. 
Longest axis is 512 here.
 Uncompressed, the volume is ~200mb(f32). 
Spent some time working on a compression method. 
 It supports random access without prior decompression like S3/dxt.
It turns out SDFs are very compressible if you think about which information really matters.
Lossy, but not really observable. 
 New size: 3.39 mb in memory
On disk with zstd(22): ~800kb  

Here are some papers to investigate in the future, although these do not store SDF's, they are more like SVO and only store filled/not filled.

And this one for storing color information.

Monday, August 29, 2016

some links

Normal compression with SFM: better quality and faster decode than octahedral mapping, which is what I am currently using. Here is shadertoy link to an IQ's.

AMD Polaris: The reduced cost for small triangles is what most interests me here

GPUOpen: ATI open source with hair, shadows, gpu compute etc

Screen Space Reflections: implementation details

C survey: undefined behavior yadda yadda

compilers blog

math stuff

LZSSE: faster decompression than lz4

small lz4 -- smaller lz4 compatible files

corner wang tiles

fractal stuff

hg_sdf + puoet

povray: list of shapes supported has some interesting shapes

Sunday, August 21, 2016

Summed Area Table

For my future reference:)

A Summed area table(SAT) can be used to query the sum of values over a rectangular region.

From this you can also derive the average value, by dividing by the # of pixels in the rectangle.

It can be used as an alternative to mip mapping.

One advantage over mip mapping is that the query region can be an arbitrary rectangle, unlike mip mapping which is square.

A disadvantage is that that it requires more and more precision as you approach the lower right(the final value is the sum of all previous values).
Thus SAT generally requires increased memory.