Saturday, October 28, 2017

AVX2 Gather

Masked Gather vs Unmasked Gather

  AVX2 has masked gather instructions (_mm_mask_i32gather_epi32 etc.). These take two additional parameters: a mask, and a source vector that supplies the default value for each lane whose mask is false.
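  As a rough illustration of the two forms, here is a minimal sketch (not the benchmark code). The function names gather4 and masked_gather4 and the parameters keep and fallback are made up for this example, and the target attribute assumes GCC or Clang; with other compilers you would enable AVX2 through compiler flags instead.

```c
#include <immintrin.h>
#include <stdint.h>

/* Unmasked gather: out[i] = table[idx[i]] for all four lanes.
   (Hypothetical helper name, for illustration only.) */
__attribute__((target("avx2")))
void gather4(const int32_t *table, const int32_t idx[4], int32_t out[4])
{
    __m128i vindex = _mm_loadu_si128((const __m128i *)idx);
    /* Scale is 4 because each element is 4 bytes wide. */
    __m128i r = _mm_i32gather_epi32((const int *)table, vindex, 4);
    _mm_storeu_si128((__m128i *)out, r);
}

/* Masked gather: lanes with keep[i] > 0 are loaded from table[idx[i]],
   all other lanes receive fallback and are not read from memory.
   (Hypothetical helper name and parameters, for illustration only.) */
__attribute__((target("avx2")))
void masked_gather4(const int32_t *table, const int32_t idx[4],
                    const int32_t keep[4], int32_t fallback, int32_t out[4])
{
    __m128i vindex = _mm_loadu_si128((const __m128i *)idx);
    __m128i vkeep  = _mm_loadu_si128((const __m128i *)keep);
    /* The instruction only looks at the top bit of each mask lane;
       the compare yields all-ones (top bit set) where keep[i] > 0. */
    __m128i mask = _mm_cmpgt_epi32(vkeep, _mm_setzero_si128());
    /* The first extra parameter: values used for masked-off lanes. */
    __m128i src = _mm_set1_epi32(fallback);
    __m128i r = _mm_mask_i32gather_epi32(src, (const int *)table,
                                         vindex, mask, 4);
    _mm_storeu_si128((__m128i *)out, r);
}
```

  Calling masked_gather4 with keep = {1,0,1,0} and fallback = -1 should leave -1 in lanes 1 and 3 and the gathered table values in lanes 0 and 2.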

  I was hoping masked gathers would be accelerated, such that when most of the lanes were masked off, the gather would complete sooner, but this does not appear to be the case.

   The performance of masked and unmasked gathers was very similar, but masked gathers were consistently slower than unmasked gathers.

 Load vs Gather vs Software Gather

To compare gather with load, I created a buffer and ran through it in linear order, summing the values.
 I forced the gathers to load from the same indices the load was operating on: indices (0,1,2,3,4,5,6,7), incremented by 8 on each loop iteration.
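
The two loops being compared look roughly like the sketch below. This is a reconstruction under assumptions, not the original benchmark: the names sum_load, sum_gather and hsum8 are invented here, the buffer is assumed to hold 32-bit integers, n is assumed to be a multiple of 8 and small enough for 32-bit indices, and the target attribute assumes GCC or Clang. Timing code is left out.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Add up the eight 32-bit lanes of v (hypothetical helper). */
__attribute__((target("avx2")))
static int32_t hsum8(__m256i v)
{
    int32_t lanes[8];
    _mm256_storeu_si256((__m256i *)lanes, v);
    int32_t s = 0;
    for (int i = 0; i < 8; ++i)
        s += lanes[i];
    return s;
}

/* Sum the buffer with plain 256-bit loads, 8 elements per iteration. */
__attribute__((target("avx2")))
int32_t sum_load(const int32_t *buf, size_t n)
{
    __m256i acc = _mm256_setzero_si256();
    for (size_t i = 0; i < n; i += 8) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(buf + i));
        acc = _mm256_add_epi32(acc, v);
    }
    return hsum8(acc);
}

/* Sum the same elements in the same order, but fetch them with a gather.
   The index vector starts at (0,...,7) and advances by 8 each iteration. */
__attribute__((target("avx2")))
int32_t sum_gather(const int32_t *buf, size_t n)
{
    __m256i acc = _mm256_setzero_si256();
    __m256i idx = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    const __m256i step = _mm256_set1_epi32(8);
    for (size_t i = 0; i < n; i += 8) {
        /* Scale is 4 because each element is 4 bytes wide. */
        __m256i v = _mm256_i32gather_epi32((const int *)buf, idx, 4);
        acc = _mm256_add_epi32(acc, v);
        idx = _mm256_add_epi32(idx, step);
    }
    return hsum8(acc);
}
```

Both functions touch exactly the same addresses, so any difference in run time comes from the instruction used to fetch the data rather than from the access pattern.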

Software gather loaded each index using scalar loads instead of the hardware intrinsics.
Gather was generally ~1.2-1.5x faster than software gather.
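
For reference, a software gather can be sketched as below, next to the hardware version it replaces. The names software_gather8 and hardware_gather8 are invented for this example, 32-bit elements are assumed, and the target attribute assumes GCC or Clang. The original test may have assembled the vector differently.

```c
#include <immintrin.h>
#include <stdint.h>

/* Software gather: eight scalar loads assembled into one vector.
   (Hypothetical helper name, for illustration only.) */
__attribute__((target("avx2")))
void software_gather8(const int32_t *buf, const int32_t idx[8],
                      int32_t out[8])
{
    __m256i v = _mm256_setr_epi32(buf[idx[0]], buf[idx[1]],
                                  buf[idx[2]], buf[idx[3]],
                                  buf[idx[4]], buf[idx[5]],
                                  buf[idx[6]], buf[idx[7]]);
    _mm256_storeu_si256((__m256i *)out, v);
}

/* Hardware gather: one intrinsic fetches all eight elements.
   (Hypothetical helper name, for illustration only.) */
__attribute__((target("avx2")))
void hardware_gather8(const int32_t *buf, const int32_t idx[8],
                      int32_t out[8])
{
    __m256i vindex = _mm256_loadu_si256((const __m256i *)idx);
    /* Scale is 4 because each element is 4 bytes wide. */
    __m256i v = _mm256_i32gather_epi32((const int *)buf, vindex, 4);
    _mm256_storeu_si256((__m256i *)out, v);
}
```

Both functions should produce identical output for the same indices; only the way the elements are fetched differs.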

 Performance depended on the cache level that the buffer fit into.

Buffer fits in L1

Load is ~10x faster than Gather

Buffer fits in L2

Load is ~3.5x faster than Gather

Buffer greater than L2

Load tapers off to ~2.x faster than Gather

This was all run on a Haswell; newer chips might perform differently.
