C++ SIMD
The support for these instructions is wide but not universal. Both Intel and AMD support the
compatible version of FMA, called FMA 3, in their CPUs released since 2012-2013. See hardware
support section for more info.
Another caveat, the latency of FMA is not great, 4-5 CPU cycles on modern CPUs. If you’reyou are
computing dot product or similar, have an inner loop which updates the accumulator, the loop will
throttle to 4-5 cycles per iteration due to data dependency chain. To resolve, unroll the loop by a
small factor like 4, use 4 independent accumulators, and sum them after the loop. This way each
iteration of the loop handles 4 vectors independently, and the code should saturate the throughput
instead of stalling on latency. See this stackoverflow answer for the sample code which computes
dot product of two FP32 vectors.