Saturday, October 28, 2017
AVX2 Gather
Masked Gather vs Unmasked Gather
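AVX2 has masked gather instructions (_mm_mask_i32gather_epi32 etc.). These take two additional parameters: a mask, and a source vector that supplies the default value for each lane whose mask is false. A lane counts as true when the highest bit of its mask element is set.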
I was hoping masked gathers would be accelerated, such that when most of the lanes were masked off, the gather would complete sooner, but this does not appear to be the case.
The performance of masked and unmasked gathers was very similar, but masked gathers were consistently slower than unmasked gathers.
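For reference, here is a minimal sketch of the two calls being compared. It uses the 256-bit forms (_mm256_i32gather_epi32 and _mm256_mask_i32gather_epi32) rather than the 128-bit one named above. The wrapper names gather8 and masked_gather8 are made up for illustration, and the target("avx2") attribute is a GCC/Clang convenience so the snippet builds without -mavx2. This is not the code that was benchmarked.

```c
#include <immintrin.h>
#include <stdint.h>

/* Unmasked gather: out[i] = table[idx[i]] for all 8 lanes.
   gather8 is a hypothetical wrapper name, not part of any library. */
__attribute__((target("avx2")))
void gather8(const int32_t *table, const int32_t idx[8], int32_t out[8])
{
    __m256i vindex = _mm256_loadu_si256((const __m256i *)idx);
    /* Scale 4 = sizeof(int32_t): indices are element indices, not byte offsets. */
    __m256i r = _mm256_i32gather_epi32(table, vindex, 4);
    _mm256_storeu_si256((__m256i *)out, r);
}

/* Masked gather: lanes whose mask element has its top bit set (e.g. -1)
   are loaded from table[idx[i]]; all other lanes get `fallback` instead.
   masked_gather8 is likewise a hypothetical wrapper name. */
__attribute__((target("avx2")))
void masked_gather8(const int32_t *table, const int32_t idx[8],
                    const int32_t mask[8], int32_t fallback, int32_t out[8])
{
    __m256i vindex = _mm256_loadu_si256((const __m256i *)idx);
    __m256i vmask  = _mm256_loadu_si256((const __m256i *)mask);
    __m256i src    = _mm256_set1_epi32(fallback); /* default for masked-off lanes */
    __m256i r = _mm256_mask_i32gather_epi32(src, table, vindex, vmask, 4);
    _mm256_storeu_si256((__m256i *)out, r);
}
```

Masked-off lanes are never read from memory, so their indices may point anywhere, which is the main practical reason to use the masked form even though it was not faster here.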
Load vs Gather vs Software Gather
To compare gather with load, I created a buffer and ran through it in linear order, summing the values.
I forced the gathers to load from the same indices the load was operating on: indices (0,1,2,3,4,5,6,7), incremented by 8 on each loop iteration.
Software gather loaded each index using scalar loads instead of the hardware intrinsics.
Gather was generally ~1.2-1.5x faster than software gather.
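The three variants can be sketched roughly as follows. This is a reconstruction from the description above, not the original benchmark code. The function names are hypothetical, the buffer is assumed to hold int32 values with a length that is a multiple of 8, and sums are kept in 32 bits so they wrap the same way in every variant. As before, target("avx2") is a GCC/Clang attribute used so the snippet builds without -mavx2.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Add up the 8 lanes of an accumulator (wraps modulo 2^32). */
__attribute__((target("avx2")))
static uint32_t hsum8(__m256i v)
{
    uint32_t lanes[8];
    _mm256_storeu_si256((__m256i *)lanes, v);
    uint32_t s = 0;
    for (int i = 0; i < 8; ++i) s += lanes[i];
    return s;
}

/* Plain vector load: 8 contiguous values per iteration. */
__attribute__((target("avx2")))
uint32_t sum_load(const int32_t *buf, size_t count)
{
    assert(count % 8 == 0);
    __m256i acc = _mm256_setzero_si256();
    for (size_t i = 0; i < count; i += 8)
        acc = _mm256_add_epi32(acc, _mm256_loadu_si256((const __m256i *)(buf + i)));
    return hsum8(acc);
}

/* Hardware gather over the same addresses: indices start at 0..7
   and every lane advances by 8 per iteration. */
__attribute__((target("avx2")))
uint32_t sum_gather(const int32_t *buf, size_t count)
{
    assert(count % 8 == 0);
    __m256i acc = _mm256_setzero_si256();
    __m256i vindex = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    const __m256i step = _mm256_set1_epi32(8);
    for (size_t i = 0; i < count; i += 8) {
        acc = _mm256_add_epi32(acc, _mm256_i32gather_epi32(buf, vindex, 4));
        vindex = _mm256_add_epi32(vindex, step);
    }
    return hsum8(acc);
}

/* Software gather: one scalar load per index, then pack into a vector. */
__attribute__((target("avx2")))
uint32_t sum_software_gather(const int32_t *buf, size_t count)
{
    assert(count % 8 == 0);
    __m256i acc = _mm256_setzero_si256();
    int32_t idx[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    for (size_t i = 0; i < count; i += 8) {
        __m256i v = _mm256_setr_epi32(buf[idx[0]], buf[idx[1]], buf[idx[2]], buf[idx[3]],
                                      buf[idx[4]], buf[idx[5]], buf[idx[6]], buf[idx[7]]);
        acc = _mm256_add_epi32(acc, v);
        for (int k = 0; k < 8; ++k) idx[k] += 8;
    }
    return hsum8(acc);
}
```

All three return the same sum for the same buffer, so any timing difference comes from how the values are fetched. An optimizing compiler may notice that the software-gather indices are contiguous and simplify the loop, so a real benchmark would want to keep the indices opaque to the compiler.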
Performance depended on which cache level the buffer fit into.
Buffer fits in L1
Load is ~10x faster than Gather
Buffer fits in L2
Load is ~3.5x faster than Gather
Buffer greater than L2
Load tapers off to somewhere between 2x and 3x faster than Gather
This was all run on a Haswell CPU; newer chips might perform differently.
Labels:
AVX