<-- This is the execution engine for Haswell.
Ports 0 and 1 can both execute FMA/FMul.
I'm going to write down general gflops ratings for commonly used CPUs, broken down by how those numbers are calculated. This is mostly for my own future reference.
Haswell i7 4770k at 3.5 GHz.
8(AVX) * 2(FMA) * 2(two FMA ports) * 4(cores) * 3.5(GHz) = 448 gflops
Kaby Lake i7 7700k: nothing much has changed here, but it is clocked at 4.2 GHz.
It does have faster div/sqrt, and fadd can run on two ports, but that is not reflected in the flops rating.
8(AVX) * 2(FMA) * 2(two FMA ports) * 4(cores) * 4.2(GHz) = 537.6 gflops
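The arithmetic in the two ratings above can be sketched as a small Python helper. The function name `peak_gflops` and its parameter names are made up for illustration, and the result is only the theoretical single-precision peak, not a measured number.

```python
def peak_gflops(lanes, flops_per_op, ports, cores, ghz):
    """Theoretical peak = lanes * flops per op * ports * cores * clock in GHz."""
    return lanes * flops_per_op * ports * cores * ghz

# Haswell i7 4770k: 8 float lanes (AVX), an FMA counts as 2 flops, 2 FMA ports.
haswell = peak_gflops(lanes=8, flops_per_op=2, ports=2, cores=4, ghz=3.5)

# Kaby Lake part: same layout, only the clock is higher.
kabylake = peak_gflops(lanes=8, flops_per_op=2, ports=2, cores=4, ghz=4.2)

print(f"Haswell:   {haswell:.1f} gflops")
print(f"Kaby Lake: {kabylake:.1f} gflops")
```

Every factor is a plain multiplier, so swapping in a different core count or clock is a one-argument change.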
AMD chips support AVX (and AVX2 on Zen), but internally they only execute 128 bits at a time.
Xbox One Jaguar AMD CPU:
4(fake AVX) * 2(ports) * 8(cores) * 1.75(GHz) = 112 gflops
AMD Zen CPU: the exact clock speed isn't known, but a demonstration had it at 3.4 GHz.
It supports AVX2, but breaks it into 2x4 SSE internally (half the throughput of Intel).
4(fake AVX2) * 2(FMA) * 2(two FMA ports I *think*) * 8(cores) * 3.4(GHz) = 435.2 gflops
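To make the "half throughput" comparison concrete, the sketch below drops core count and clock and compares flops per cycle per core, using only the factors from the formulas above. The dictionary layout and names are illustrative, and the Zen port count carries the same uncertainty as in the text.

```python
# (lanes, flops per op, ports), taken from the rating formulas above.
# The Zen port count is the same guess as in the text, not a confirmed figure.
chips = {
    "Haswell/Kaby Lake": (8, 2, 2),
    "Zen": (4, 2, 2),
    "Jaguar": (4, 1, 2),  # no FMA factor in the Jaguar formula
}

# Multiply the three factors to get flops per cycle per core.
per_cycle = {
    name: lanes * flops * ports
    for name, (lanes, flops, ports) in chips.items()
}

for name, value in per_cycle.items():
    print(f"{name}: {value} flops/cycle/core")
```

On these assumptions Zen lands at half of Haswell per core per cycle, and only gets close on the total rating because the formula gives it twice the cores.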
Intel Skylake Xeon added AVX512 support; unfortunately, it appears AVX512 will not show up in consumer CPUs until 2018/19.
I believe Intel will be upping the core count to either 6 or 8 for the K line by then.
Future Intel K chip with AVX512:
16(AVX512) * 2(FMA) * 2(two FMA ports) * 6-8(cores) * 3.5-4.2(GHz) = between 1344 and 2150 gflops
Now, Haswell can only decode 4 instructions per clock, so keeping it fed with 2 FMAs per cycle is not always going to be possible.
FMA has a latency of 5 cycles, so with two FMA ports you need 5 * 2 = 10 independent FMAs in flight to maximize throughput.
With Kaby Lake/Skylake, FMA latency is 4 cycles, so only 8 are required.
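The in-flight counts above are just latency multiplied by the number of FMA ports. A minimal sketch, with a function name invented for illustration:

```python
def fmas_in_flight(latency_cycles, fma_ports):
    """Independent FMAs needed so every FMA port can issue on every cycle."""
    return latency_cycles * fma_ports

haswell_needed = fmas_in_flight(latency_cycles=5, fma_ports=2)
skylake_needed = fmas_in_flight(latency_cycles=4, fma_ports=2)

print(haswell_needed)  # 10
print(skylake_needed)  # 8
```

In practice this means something like a dot product wants that many independent accumulators. A single dependent chain of FMAs has to wait out the full latency between each one.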
Hyperthreading can help, but again, with only 4 instructions decoded per cycle, decoding might bottleneck it.
On Haswell, port 5 can also execute integer vector ops, so if you mix int/float work it might be possible to compute above the "gflops" rating, although the extra work would be integer math.