Thursday, December 22, 2016

gflops on various processors

This is the execution engine for Haswell.

Port 0 and 1 can both execute FMA/FMul.

I'm going to write down general Gflops ratings for commonly used CPUs, broken down by how those numbers are calculated. This is mostly for future reference for myself.

Haswell i7-4770K at 3.5GHz.

8 (AVX) * 2 (FMA) * 2 (two FMA ports) * 4 (cores) * 3.5 (GHz) = 448 gflops

Kaby Lake i7-7700K: nothing much has changed here, but it is clocked at 4.2GHz.
It does have faster div/sqrt, and fadd can run on two ports, but that is not reflected in the flops rating.

8 (AVX) * 2 (FMA) * 2 (two FMA ports) * 4 (cores) * 4.2 (GHz) = 537.6 gflops
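All of these ratings follow the same pattern, so here is a small helper that keeps the arithmetic straight (the function and parameter names are my own, a sketch rather than anything from a vendor spec):

```cpp
// Peak GFLOPS = SIMD lanes * 2 (an FMA counts as a mul plus an add)
//             * FMA ports * cores * GHz.
double PeakGflops(int simdLanes, int fmaPorts, int cores, double ghz) {
    return simdLanes * 2.0 * fmaPorts * cores * ghz;
}
```

PeakGflops(8, 2, 4, 3.5) reproduces the 448 gflops figure for the 4770K above.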

AMD chips support AVX/AVX2, but internally they only execute 128 bits at a time.

Xbox One Jaguar AMD CPU:

4 (fake AVX) * 2 (ports) * 8 (cores) * 1.75 (GHz) = 112 gflops

AMD Zen CPU: the exact GHz isn't known, but a demonstration had it at 3.4.
It supports AVX2, but breaks it into 2x4-wide SSE internally (half the throughput of Intel).

4 (fake AVX2) * 2 (FMA) * 2 (two FMA ports, I *think*) * 8 (cores) * 3.4 (GHz) = 435.2 gflops

Intel Skylake Xeon added AVX-512 support; unfortunately, it appears AVX-512 will not reach consumer CPUs until 2018/19.
I believe Intel will be upping the core count to either 6 or 8 for the K line by then.

Future Intel K chip with AVX-512:
16 (AVX-512) * 2 (FMA) * 2 (two FMA ports) * 6-8 (cores) * 3.5-4.2 (GHz) = between 1344 and 2150 gflops

Now, Haswell can only decode 4 instructions per clock, so keeping it fed with 2 FMAs per cycle is not always going to be possible.
FMA latency is 5 cycles, so you need 10 FMAs in flight to maximize throughput.
With Kaby Lake/Skylake, FMA latency drops to 4 cycles, so only 8 are required.
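The dependency-chain point can be illustrated without intrinsics: a reduction with one accumulator serializes on the FMA latency, while independent accumulators (ideally 10 on Haswell: 5-cycle latency times 2 ports) let the hardware overlap them. A scalar sketch of the idea, names my own:

```cpp
#include <cstddef>

// One accumulator: every iteration's FMA must wait for the previous one.
float DotSingleAcc(const float* a, const float* b, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += a[i] * b[i];
    return acc;
}

// Two independent accumulators: the two chains can issue in parallel.
// Real Haswell code would want 10 chains (and AVX vectors).
float DotDualAcc(const float* a, const float* b, size_t n) {
    float acc0 = 0.0f, acc1 = 0.0f;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc0 += a[i] * b[i];
        acc1 += a[i + 1] * b[i + 1];
    }
    for (; i < n; ++i)
        acc0 += a[i] * b[i];
    return acc0 + acc1;
}
```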

Hyperthreading can help, but again, with only 4 instructions decoded per cycle, decode might be the bottleneck.

On Haswell, Port 5 can also execute vector integer ops, so if you mix int and float work it might be possible to compute above the "gflops" rating, although the extra throughput would be integer math.

Thursday, October 27, 2016

Cluster Culling

AMD's GPUOpen has an article on Cluster Culling

Basically for a given mesh cluster, you can often perform a variant of backface culling on the entire cluster. 

You do this by calculating a cone that represents the region in which the cluster is not visible. 
Any viewer located within the cone is unable to see the cluster, so the cluster can be culled.

AMD's implementation works like this:

  1. Find the average normal of the cluster
  2. Take the dot product of each normal against the average normal, and find the minimum. 
  3. Use this as the cone angle; any cluster whose minimum dot product is greater than 0 can be culled in some situations.
They also do some additional work involving the bounding box, to prevent some error cases they had to deal with.
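Steps 1-3 are straightforward to write down; here is a minimal sketch of the cone build (types and names are mine, not AMD's code):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };

static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

static Vec3 normalize(Vec3 v) {
    float len = std::sqrt(dot(v, v));
    return {v.x / len, v.y / len, v.z / len};
}

struct CullCone { Vec3 axis; float minDot; };

// Steps 1-3: average the triangle normals, then find the minimum dot
// product of any normal against that average.
CullCone BuildCullCone(const std::vector<Vec3>& normals) {
    Vec3 avg = {0, 0, 0};
    for (const Vec3& n : normals) { avg.x += n.x; avg.y += n.y; avg.z += n.z; }
    avg = normalize(avg);

    float minDot = 1.0f;
    for (const Vec3& n : normals)
        minDot = std::min(minDot, dot(n, avg));

    // minDot <= 0 means the normals span a hemisphere or more, so no viewer
    // position sees only back faces and the cluster cannot be cone-culled.
    return {avg, minDot};
}
```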

This is essentially a smallest-circle problem, and the AMD solution of using the average axis is rarely going to produce the tightest cone.

For my code I run multiple algorithms: the average axis, the min/max axis, and then one round of Ritter's method over the data using whichever axis was best. The average axis is generally pretty bad, so even just using the min/max axis is a good improvement.

If you want an exact algorithm, you could try this method, although it will be slower to calculate.

The cull rate varies heavily depending on the scene. It is also much more effective at higher detail (smaller cluster sizes). Sometimes it is only 1%, but I have seen it go up to around 15%.

My engine does not generate clusters if they are outside the frustum or occluded, which reduces opportunities for culling. 
In a standard game engine with offline generated content the cull rate would likely be higher.

Monday, September 5, 2016

Signed Distance Field Volume Compression

Here is a voxelized Stanford dragon mesh.
The longest axis is 512 here.
Uncompressed, the volume is ~200 MB (f32).
I spent some time working on a compression method.
It supports random access without prior decompression, like S3TC/DXT.
It turns out SDFs are very compressible if you think about which information really matters.
Lossy, but not really observable.
New size: 3.39 MB in memory.
On disk with zstd (level 22): ~800 KB.

Here are some papers to investigate in the future, although these do not store SDFs; they are more like SVOs and only store filled/not filled.

And this one for storing color information.

Monday, August 29, 2016

some links

Normal compression with SFM: better quality and faster decode than octahedral mapping, which is what I am currently using. Here is a link to IQ's shadertoy.

D3D11 Extensions: I need barycentric coords. AMD has a D3D11 extension for it. For Nvidia a geometry shader is required, but it looks like they have an NvAPI fast geometry shader that might work.

AMD Polaris: The reduced cost for small triangles is what most interests me here

GPUOpen: AMD open source with hair, shadows, gpu compute, etc.

Screen Space Reflections: implementation details

C survey: undefined behavior yadda yadda

compilers blog

math stuff

LZSSE: faster decompression than lz4

small lz4 -- smaller lz4 compatible files

corner wang tiles

fractal stuff

hg_sdf + pouet

povray: list of shapes supported has some interesting shapes

custom vertex fetch: see sebbbi's post. You can manually fetch vertex data instead of relying on fixed function. Can use this to encode extra bits of data into any unused bits in your indices.  Runs well on AMD, but appears to perform very poorly on Nvidia.

Timing from turanszkij's post:

GPU             Method                      ShadowPass   ZPrepass   OpaquePass   All GPU
NVidia GTX 960  InputLayout                 4.52 ms      0.37 ms    6.12 ms      15.68 ms
NVidia GTX 960  CustomFetch (typed buffer)  18.89 ms     1.31 ms    8.68 ms      33.58 ms
NVidia GTX 960  CustomFetch (RAW buffer 1)  18.29 ms     1.35 ms    8.62 ms      33.03 ms
NVidia GTX 960  CustomFetch (RAW buffer 2)  18.42 ms     1.32 ms    8.61 ms      33.18 ms
AMD RX 470      InputLayout                 7.43 ms      0.29 ms    3.06 ms      14.01 ms
AMD RX 470      CustomFetch (typed buffer)  7.41 ms      0.31 ms    3.12 ms      14.08 ms
AMD RX 470      CustomFetch (RAW buffer 1)  7.50 ms      0.29 ms    3.07 ms      14.09 ms
AMD RX 470      CustomFetch (RAW buffer 2)  7.56 ms      0.28 ms    3.09 ms      14.15 ms

Sunday, August 21, 2016

Summed Area Table

For my future reference:)

A summed-area table (SAT) can be used to query the sum of values over a rectangular region.

From this you can also derive the average value, by dividing by the # of pixels in the rectangle.

It can be used as an alternative to mip mapping.

One advantage over mip mapping is that the query region can be an arbitrary rectangle, whereas mip mapping is limited to squares.

A disadvantage is that it requires more and more precision as you approach the lower right (the final value is the sum of all previous values).
Thus a SAT generally requires more memory.
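A minimal sketch of the build and query (my own code, using a zero-padded border row/column and 64-bit sums to sidestep the precision issue for integer data):

```cpp
#include <cstdint>
#include <vector>

struct SAT {
    int w, h;
    std::vector<uint64_t> sum; // (w+1) x (h+1); row 0 and column 0 are zero padding

    SAT(const std::vector<uint32_t>& img, int width, int height)
        : w(width), h(height), sum((width + 1) * (height + 1), 0) {
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x)
                sum[(y + 1) * (w + 1) + (x + 1)] =
                    img[y * w + x]
                    + sum[y * (w + 1) + (x + 1)]   // total above
                    + sum[(y + 1) * (w + 1) + x]   // total to the left
                    - sum[y * (w + 1) + x];        // corner counted twice
    }

    // Sum over the inclusive pixel rectangle [x0,x1] x [y0,y1].
    uint64_t Query(int x0, int y0, int x1, int y1) const {
        return sum[(y1 + 1) * (w + 1) + (x1 + 1)]
             - sum[y0 * (w + 1) + (x1 + 1)]
             - sum[(y1 + 1) * (w + 1) + x0]
             + sum[y0 * (w + 1) + x0];
    }
};
```

Dividing Query by the rectangle's pixel count gives the average mentioned above.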

Tuesday, June 28, 2016

std::min/max prevent autovectorization in vs2015

a < b ? a : b;    <-- auto vectorizes
std::min(a,b)   <-- does not

Another bug report for VS: std::min/max break autovectorization

VS's autovectorizer requires massaging to get anything out of it.
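A minimal pair of loops showing the difference (my own example; both compute the same result, only the first vectorizes in VS2015):

```cpp
#include <algorithm>

void MinTernary(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = a[i] < b[i] ? a[i] : b[i]; // auto-vectorizes to minps
}

void MinStd(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = std::min(a[i], b[i]); // stays scalar in VS2015
}
```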

Another quirk: during type conversion, don't skip steps.
For example:

float->i8   // this skips the step of converting to i32
float->i32  // an instruction exists for this

So if you convert float directly to i8, autovectorization fails.
Instead you must convert to i32, and then to i8, now autovectorization succeeds.
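In code, the workaround looks like this (my own sketch; both functions produce the same values, but only the stepped version vectorizes in VS2015):

```cpp
#include <cstdint>

void ConvertDirect(const float* src, int8_t* dst, int n) {
    for (int i = 0; i < n; ++i)
        dst[i] = static_cast<int8_t>(src[i]); // float -> i8 directly: no vectorization
}

void ConvertStepped(const float* src, int8_t* dst, int n) {
    for (int i = 0; i < n; ++i) {
        int32_t v = static_cast<int32_t>(src[i]); // float -> i32 (cvttps2dq exists)
        dst[i] = static_cast<int8_t>(v);          // then narrow i32 -> i8
    }
}
```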

Friday, June 24, 2016

vs2015, std::floor/trunc/ceil, and the resulting assembly

VS2015 generates inefficient code for these functions.

float floored = std::floor(some_float);

So here is what VS generates with /AVX2 switch thrown:

00007FF6EE961016  vmovss      xmm1,dword ptr [bob]  
00007FF6EE96101C  vcvttss2si  ecx,xmm1  
00007FF6EE961020  cmp         ecx,80000000h  
00007FF6EE961026  je          main+4Bh (07FF6EE96104Bh)  
00007FF6EE961028  vxorps      xmm0,xmm0,xmm0  
00007FF6EE96102C  vcvtsi2ss   xmm0,xmm0,ecx  
00007FF6EE961030  vucomiss    xmm0,xmm1  
00007FF6EE961034  je          main+4Bh (07FF6EE96104Bh)  
00007FF6EE961036  vunpcklps   xmm1,xmm1,xmm1  
00007FF6EE96103A  vmovmskps   eax,xmm1  
00007FF6EE96103E  and         eax,1  
00007FF6EE961041  sub         ecx,eax  
00007FF6EE961043  vxorps      xmm1,xmm1,xmm1  
00007FF6EE961047  vcvtsi2ss   xmm1,xmm1,ecx  

Not good.

With AVX enabled I'd expect to see roundss used.

Here is a custom implementation of floor using intrinsics.

float floor_avx(float a) {
    __m128 o = _mm_setzero_ps(); // pass-through for the upper lanes; leaving it uninitialized is UB
    return _mm_cvtss_f32(_mm_floor_ss(o, _mm_set_ss(a)));
}

And the assembly:

00007FF7461C1016  vmovss      xmm1,dword ptr [bob]  
00007FF7461C101C  vmovaps     xmm2,xmm1  
00007FF7461C1020  vmovups     xmm1,xmmword ptr [rsp+20h]  
00007FF7461C1026  vroundss    xmm3,xmm1,xmm2,1  
There seem to be a few extra moves here for whatever reason, but at least it is in the ballpark of reasonable.

 The same problem exists for std::trunc, std::ceil, and applies to both float and double.

Anyway, I reported this on Connect (floor/ceil/trunc), although my experience with Connect in the past has not been great.

Well, hopefully they fix this one.

Here is what std::trunc generates: it calls a function instead of using roundss.

00007FF750091016  vmovss      xmm0,dword ptr [bob]
00007FF75009101C  call        qword ptr [__imp_truncf (07FF750092108h)]

(Edit: VS2017 is better, but still misses some optimizations with std::trunc and std::round)
godbolt link for x64

Wednesday, June 1, 2016

AVX2, how to Pack Left

If you have an input array, and an output array, and you only want to write those elements which pass a condition, what is the most efficient way to do this with AVX2?
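In scalar form the operation is trivial; here is a reference version (my own sketch, handy for validating the SIMD implementations):

```cpp
// Copy only the elements whose condition flag passes, packed to the left
// of the output. Returns the number of elements written.
int PackLeft(const float* src, const bool* pass, float* dst, int n) {
    int out = 0;
    for (int i = 0; i < n; ++i)
        if (pass[i])
            dst[out++] = src[i];
    return out;
}
```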

Here is a visualization of the problem:
Here is my solution, using compressed indices. It requires a 769-byte LUT, so it is best suited for cases where you have a good-sized array of data to work on.

//Generate Move mask via: _mm256_movemask_ps(_mm256_castsi256_ps(mask)); etc
__m256i MoveMaskToIndices(int moveMask) {
    u8 *adr = g_pack_left_table_u8x3 + moveMask * 3;
    __m256i indices = _mm256_set1_epi32(*reinterpret_cast<u32*>(adr)); //lower 24 bits hold our LUT entry

    //shift each lane so its 3-bit index lands in the top 3 bits
    __m256i m = _mm256_sllv_epi32(indices, _mm256_setr_epi32(29, 26, 23, 20, 17, 14, 11, 8));

    //now shift it right to get 3 bits at bottom
    __m256i shufmask = _mm256_srli_epi32(m, 29);
    return shufmask;
}
//The rest of this code builds the LUT
u32 get_nth_bits(int a) {
    u32 out = 0;
    int c = 0;
    for (int i = 0; i < 8; ++i) {
        auto set = (a >> i) & 1;
        if (set) {
            out |= (i << (c * 3)); //pack the index of each set bit into 3 bits
            ++c;
        }
    }
    return out;
}
u8 g_pack_left_table_u8x3[256 * 3 + 1];

void BuildPackMask() {
    for (int i = 0; i < 256; ++i) {
        *reinterpret_cast<u32*>(&g_pack_left_table_u8x3[i * 3]) = get_nth_bits(i);
    }
}
On stackoverflow, Peter Cordes came up with a clever solution: it avoids the need for a LUT by taking advantage of the BMI2 (bit manipulation) instruction set. I had not used the BMI instructions before, so this was new to me.
This code is x64-only, but you can port it to x86 by using the vector shift approach I used above, with 3-bit indices instead of 8-bit.
// Uses 64bit pdep / pext to save a step in unpacking.
__m256 compress256(__m256 src, unsigned int mask /* from movmskps */) {
  uint64_t expanded_mask = _pdep_u64(mask, 0x0101010101010101);  // unpack each bit to a byte
  expanded_mask *= 0xFF;    // mask |= mask<<1 | mask<<2 | ... | mask<<7;
  // ABC... -> AAAAAAAABBBBBBBBCCCCCCCC...: replicate each bit to fill its byte

  const uint64_t identity_indices = 0x0706050403020100;    // the identity shuffle for vpermps, packed to one index per byte
  uint64_t wanted_indices = _pext_u64(identity_indices, expanded_mask);

  __m128i bytevec = _mm_cvtsi64_si128(wanted_indices);
  __m256i shufmask = _mm256_cvtepu8_epi32(bytevec);

  return _mm256_permutevar8x32_ps(src, shufmask);
}