Skip to main content

C++ SIMD

The support for these instructions is wide but not universal. Both Intel and AMD support the
compatible version of FMA, called FMA 3, in their CPUs released since 2012-2013. See hardware
 support section for more info.
Another caveat, the latency of FMA is not great, 4-5 CPU cycles on modern CPUs. If you’reyou are
computing dot product or similar, have an inner loop which updates the accumulator, the loop will
 throttle to 4-5 cycles per iteration due to data dependency chain. To resolve, unroll the loop by a
 small factor like 4, use 4 independent accumulators, and sum them after the loop. This way each
 iteration of the loop handles 4 vectors independently, and the code should saturate the throughput
 instead of stalling on latency. See this stackoverflow answer for the sample code which computes
 dot product of two FP32 vectors.

http://const.me/articles/simd/simd.pdf