Friday, June 24, 2016

VS2015, std::floor/trunc/ceil, and the resulting assembly

VS2015 generates inefficient code for these functions. For example:

float floored = std::floor(some_float);

So here is what VS generates with the /arch:AVX2 switch enabled:

00007FF6EE961016  vmovss      xmm1,dword ptr [bob]  
00007FF6EE96101C  vcvttss2si  ecx,xmm1  
00007FF6EE961020  cmp         ecx,80000000h  
00007FF6EE961026  je          main+4Bh (07FF6EE96104Bh)  
00007FF6EE961028  vxorps      xmm0,xmm0,xmm0  
00007FF6EE96102C  vcvtsi2ss   xmm0,xmm0,ecx  
00007FF6EE961030  vucomiss    xmm0,xmm1  
00007FF6EE961034  je          main+4Bh (07FF6EE96104Bh)  
00007FF6EE961036  vunpcklps   xmm1,xmm1,xmm1  
00007FF6EE96103A  vmovmskps   eax,xmm1  
00007FF6EE96103E  and         eax,1  
00007FF6EE961041  sub         ecx,eax  
00007FF6EE961043  vxorps      xmm1,xmm1,xmm1  
00007FF6EE961047  vcvtsi2ss   xmm1,xmm1,ecx  

Not good.
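
For anyone who doesn't want to decode the listing, here is my rough reading of it written back as C++. This is a sketch, not compiler output: the name floor_as_generated is made up, and the explicit range check stands in for the 80000000h sentinel that vcvttss2si returns for NaN and out-of-range inputs (the plain cast would be undefined behaviour there). Both je targets land just past the listing, which I take to mean "leave the input as it is".

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical C++ rendering of the VS2015 sequence for std::floor(float).
float floor_as_generated(float x) {
    // Stand-in for "cmp ecx,80000000h / je": NaN, infinities, and anything
    // outside the int32 range are returned unchanged (they are already integral or NaN).
    if (!(x > -2147483648.0f && x < 2147483648.0f)) {
        return x;
    }
    std::int32_t t = static_cast<std::int32_t>(x);  // vcvttss2si: truncate toward zero
    float back = static_cast<float>(t);             // vcvtsi2ss: convert back
    if (back == x) {                                // vucomiss / je: already an integer
        return x;
    }
    // vunpcklps / vmovmskps / and / sub: subtract the sign bit, so negative
    // non-integers are rounded down one further than truncation.
    t -= std::signbit(x) ? 1 : 0;
    return static_cast<float>(t);                   // final vcvtsi2ss
}
```

So that is two conversions, two compares, and two branches to do what a single rounding instruction could do.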

With AVX enabled I'd expect to see a single roundss (vroundss) used instead of the convert, compare, and branch sequence.

Here is a custom implementation of floor using intrinsics.

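#include <smmintrin.h> // SSE4.1 intrinsics (_mm_floor_ss)

float floor_avx(float a) {
    __m128 o; // left uninitialized; it only supplies the upper three lanes, which are discarded
    return _mm_cvtss_f32(_mm_floor_ss(o, _mm_set_ss(a)));
}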

And the assembly:

00007FF7461C1016  vmovss      xmm1,dword ptr [bob]  
00007FF7461C101C  vmovaps     xmm2,xmm1  
00007FF7461C1020  vmovups     xmm1,xmmword ptr [rsp+20h]  
00007FF7461C1026  vroundss    xmm3,xmm1,xmm2,1  

There seem to be a few extra moves here for whatever reason (the load from [rsp+20h] looks like the uninitialized o being read from the stack), but at least it is in the ballpark of reasonable.

The same problem exists for std::trunc and std::ceil, and it applies to both float and double.
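
The intrinsic workaround carries over to the other two functions. A minimal sketch, assuming SSE4.1 or AVX is enabled at compile time: the names ceil_sse41 and trunc_sse41 are hypothetical, and the <cmath> fallback is only there so the snippet still builds when those instruction sets are not switched on. I have not checked what VS2015 emits for these.

```cpp
#if defined(__AVX__) || defined(__SSE4_1__)
#include <smmintrin.h>  // SSE4.1: _mm_ceil_ss, _mm_round_ss

// Hypothetical helper: ceil through (v)roundss.
inline float ceil_sse41(float a) {
    __m128 v = _mm_set_ss(a);
    // The upper lanes are copied from the first operand; passing v for both
    // operands avoids reading an uninitialized register.
    return _mm_cvtss_f32(_mm_ceil_ss(v, v));
}

// Hypothetical helper: trunc through (v)roundss with round-toward-zero.
inline float trunc_sse41(float a) {
    __m128 v = _mm_set_ss(a);
    return _mm_cvtss_f32(
        _mm_round_ss(v, v, _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC));
}
#else
#include <cmath>

// Fallback when SSE4.1/AVX is not enabled: defer to the standard library.
inline float ceil_sse41(float a)  { return std::ceil(a); }
inline float trunc_sse41(float a) { return std::trunc(a); }
#endif
```

Using the same register for both operands may also get rid of the extra stack load seen with floor_avx, though that is a guess on my part.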

Anyway, I reported this on Connect (floor/ceil/trunc), although my past experience with Connect has not been great.

Well, hopefully they fix this one.

Here is what std::trunc generates. It calls a function instead of using roundss:

00007FF750091016  vmovss      xmm0,dword ptr [bob]
00007FF75009101C  call        qword ptr [__imp_truncf (07FF750092108h)]

(Edit: VS2017 is better, but still misses some optimizations with std::trunc and std::round)
godbolt link for x64

