Tuesday, June 28, 2016

std::min/max prevent autovectorization in vs2015

a < b ? a : b;    <-- auto vectorizes
std::min(a,b)   <-- does not

Another bug report for VS: std::min/max break autovectorization

VS's autovectorizer requires massaging to get anything out of it.

Another quirk:  during type conversion, don't skip steps.
For example.

float->i8  //this is skipping the step of converting to i32
float->i32 //An instruction exists for this,

So if you convert float directly to i8, autovectorization fails.
Instead you must convert to i32, and then to i8, now autovectorization succeeds.

Friday, June 24, 2016

vs2015, std::floor/trunc/ceil, and the resulting assembly

 VS2015 generates inefficient code for these instructions

float floored = std::floor(some_float);

So here is what VS generates with /AVX2 switch thrown:

00007FF6EE961016  vmovss      xmm1,dword ptr [bob]  
00007FF6EE96101C  vcvttss2si  ecx,xmm1  
00007FF6EE961020  cmp         ecx,80000000h  
00007FF6EE961026  je          main+4Bh (07FF6EE96104Bh)  
00007FF6EE961028  vxorps      xmm0,xmm0,xmm0  
00007FF6EE96102C  vcvtsi2ss   xmm0,xmm0,ecx  
00007FF6EE961030  vucomiss    xmm0,xmm1  
00007FF6EE961034  je          main+4Bh (07FF6EE96104Bh)  
00007FF6EE961036  vunpcklps   xmm1,xmm1,xmm1  
00007FF6EE96103A  vmovmskps   eax,xmm1  
00007FF6EE96103E  and         eax,1  
00007FF6EE961041  sub         ecx,eax  
00007FF6EE961043  vxorps      xmm1,xmm1,xmm1  
00007FF6EE961047  vcvtsi2ss   xmm1,xmm1,ecx  

Not good.

With AVX enabled I'd expect to see roundss used.

Here is a custom implementation of floor using intrinsics.

float floor_avx(float a) {
    __m128 o;
    return _mm_cvtss_f32(_mm_floor_ss(o, _mm_set_ss(a)));

And the assembly:

00007FF7461C1016  vmovss      xmm1,dword ptr [bob]  
00007FF7461C101C  vmovaps     xmm2,xmm1  
00007FF7461C1020  vmovups     xmm1,xmmword ptr [rsp+20h]  
00007FF7461C1026  vroundss    xmm3,xmm1,xmm2,1  
There seems to be a few extra moves here for whatever reason, but at least it is in the ballpark of reasonable.

 The same problem exists for std::trunc, std::ceil, and applies to both float and double.

Anyway I reported this on Connect(floor/ceil/trunc), although my experience in the past with Connect has not been great..

Well, hopefully they fix this one..

Here is what std::trunc generates: It calls a function, instead of using roundss

00007FF750091016  vmovss      xmm0,dword ptr [bob]
00007FF75009101C  call        qword ptr [__imp_truncf (07FF750092108h)]

(Edit: VS2017 is better, but still misses some optimizations with std::trunc and std::round)
godbolt link for x64

Wednesday, June 1, 2016

AVX2, how to Pack Left

If you have an input array, and an output array, and you only want to write those elements which pass a condition, what is the most efficient way to do this with AVX2?

Here is a visualization of the problem:
Here is my solution, using compressed indices. It requires a LUT sized 769 bytes, so it is best suited for cases where you have a good sized array of data to work on.

//Generate Move mask via: _mm256_movemask_ps(_mm256_castsi256_ps(mask)); etc
__m256i MoveMaskToIndices(int moveMask) {
    u8 *adr = g_pack_left_table_u8x3 + moveMask * 3;
    __m256i indices = _mm256_set1_epi32(*reinterpret_cast<u32*>(adr));//lower 24 bits has our LUT

    __m256i m = _mm256_sllv_epi32(indices, _mm256_setr_epi32(29, 26, 23, 20, 17, 14, 11, 8));

    //now shift it right to get 3 bits at bottom
    __m256i shufmask = _mm256_srli_epi32(m, 29);
    return shufmask;
//The rest of this code to build the LUT
u32 get_nth_bits(int a) {
    u32 out = 0;
    int c = 0;
    for (int i = 0; i < 8; ++i) {
        auto set = (a >> i) & 1;
        if (set) {
            out |= (i << (c * 3));
    return out;
u8 g_pack_left_table_u8x3[256 * 3 + 1];

void BuildPackMask() {
    for (int i = 0; i < 256; ++i) {
        *reinterpret_cast<u32*>(&g_pack_left_table_u8x3[i * 3]) = get_nth_bits(i);
On stackoverflow Peter Cordes came up with a solution that is clever, it avoids the requirement for a LUT by taking advantage of the new BMI(bit manipulation) instruction set. I had not used the BMI instructions before, so this was new to me.
 This code is x64 only, but you can port to x86 by using the vector shift approach I used ^, and the 3 bit indices instead of 8 bit.
// Uses 64bit pdep / pext to save a step in unpacking.
__m256 compress256(__m256 src, unsigned int mask /* from movmskps */)
  uint64_t expanded_mask = _pdep_u64(mask, 0x0101010101010101);  // unpack each bit to a byte
  expanded_mask *= 0xFF;    // mask |= mask<<1 | mask<<2 | ... | mask<<7;
  // ABC... -> AAAAAAAABBBBBBBBCCCCCCCC...: replicate each bit to fill its byte

  const uint64_t identity_indices = 0x0706050403020100;    // the identity shuffle for vpermps, packed to one index per byte
  uint64_t wanted_indices = _pext_u64(identity_indices, expanded_mask);

  __m128i bytevec = _mm_cvtsi64_si128(wanted_indices);
  __m256i shufmask = _mm256_cvtepu8_epi32(bytevec);

  return _mm256_permutevar8x32_ps(src, shufmask);