Given an input array and an output array, what is the most efficient way to write only the elements that pass a condition, using AVX2?
Here is a visualization of the problem:
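In code terms, the operation is a plain filter (often called "left packing" or "stream compaction"). Here is a minimal scalar sketch that any vectorized version can be checked against. The function name and the "greater than zero" condition are my own placeholders for illustration; any predicate works the same way.

```c
#include <stddef.h>

/* Scalar reference: copy every element of `in` that passes the condition
 * to the front of `out`, preserving order. Returns the number written.
 * The predicate (x > 0) is only an example, not part of the problem. */
static size_t filter_positive_scalar(const float *in, float *out, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (in[i] > 0.0f) {
            out[count++] = in[i];   /* surviving elements are packed to the left */
        }
    }
    return count;
}
```

The branch inside the loop is what makes this awkward to vectorize: the write position of each element depends on how many earlier elements passed.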
On Stack Overflow, Peter Cordes came up with a clever solution: it avoids the need for a LUT by taking advantage of the newer BMI2 (bit manipulation) instructions, specifically PDEP and PEXT. I had not used the BMI instructions before, so this was new to me.
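The idea can be sketched as follows. This is my reading of the technique rather than a verbatim copy of the answer: the function names, the "greater than zero" predicate, and the GCC/Clang `target` attribute (used so the file builds without extra compiler flags) are assumptions of this sketch. PDEP spreads the 8 mask bits to one per byte, a multiply widens each to a full byte, and PEXT then pulls the matching byte indices out of 0x0706050403020100, which gives the shuffle control without any table.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang-specific: enable AVX2 + BMI2 for these functions only. */
#define TARGET_AVX2_BMI2 __attribute__((target("avx2,bmi2")))

/* Pack the lanes of `src` selected by the low 8 bits of `mask` to the left.
 * x86-64 only, because it uses the 64-bit forms of PDEP/PEXT. */
TARGET_AVX2_BMI2
static __m256 compress256(__m256 src, unsigned mask)
{
    /* One mask bit per byte, then widen each set bit 0x01 to 0xFF. */
    uint64_t expanded = _pdep_u64(mask, 0x0101010101010101ULL);
    expanded *= 0xFF;

    /* Identity shuffle: byte i holds index i. Keep only the selected bytes. */
    const uint64_t identity = 0x0706050403020100ULL;
    uint64_t wanted = _pext_u64(identity, expanded);

    /* Zero-extend the 8 byte indices to 8 dword indices for VPERMPS. */
    __m128i bytes = _mm_cvtsi64_si128((long long)wanted);
    __m256i idx = _mm256_cvtepu8_epi32(bytes);
    return _mm256_permutevar8x32_ps(src, idx);
}

/* Example driver: keep elements > 0. `out` must hold at least n floats. */
TARGET_AVX2_BMI2
static size_t filter_positive_avx2(const float *in, float *out, size_t n)
{
    size_t count = 0, i = 0;
    const __m256 zero = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(in + i);
        unsigned mask =
            (unsigned)_mm256_movemask_ps(_mm256_cmp_ps(v, zero, _CMP_GT_OQ));
        /* Always store all 8 lanes; only the first popcount(mask) are kept.
         * This stays in bounds because count <= i and i + 8 <= n. */
        _mm256_storeu_ps(out + count, compress256(v, mask));
        count += (size_t)__builtin_popcount(mask);
    }
    for (; i < n; ++i) {            /* scalar tail for the last n % 8 elements */
        if (in[i] > 0.0f) out[count++] = in[i];
    }
    return count;
}
```

Note that the store writes a full vector every iteration and simply lets the next iteration overwrite the unused lanes, which avoids a masked store. One caveat worth measuring on your own hardware: PDEP and PEXT are known to be slow on some older AMD cores, where a LUT may still win.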
This code is x64-only, but you can port it to x86 by using the vector shift approach I used above, with 3-bit indices instead of 8-bit ones.