# C++ SIMD

Support for these instructions is wide but not universal. Both Intel and AMD support the compatible version of FMA, called FMA3, in their CPUs released since 2012-2013. See the hardware support section for more info.

Another caveat: the latency of FMA is not great, 4-5 CPU cycles on modern CPUs. If you are computing a dot product or similar, with an inner loop which updates a single accumulator, the loop will be limited to 4-5 cycles per iteration by the data dependency chain, because each FMA has to wait for the result of the previous one. To resolve this, unroll the loop by a small factor like 4, use 4 independent accumulators, and sum them after the loop. This way each iteration of the loop handles 4 vectors independently, and the code should saturate the throughput instead of stalling on latency. See this Stack Overflow answer for sample code which computes the dot product of two FP32 vectors.
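The unrolling described above can be sketched roughly as follows. This is a minimal illustration written for this section, not the code from the referenced answer. The function name `dotProduct` is hypothetical, and the choice of 8-wide AVX vectors is an assumption. The intrinsics path is only compiled when the `__AVX__` and `__FMA__` macros are defined (for example with `-mavx -mfma` on GCC or Clang); otherwise the function falls back to a plain scalar loop.

```cpp
#include <cstddef>
#if defined(__AVX__) && defined(__FMA__)
#include <immintrin.h>
#endif

// Hypothetical helper: dot product of two FP32 arrays of the same length.
float dotProduct(const float* a, const float* b, std::size_t count)
{
    std::size_t i = 0;
    float result = 0.0f;
#if defined(__AVX__) && defined(__FMA__)
    // Four independent accumulators, so consecutive FMAs do not wait on each other.
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps();
    __m256 acc3 = _mm256_setzero_ps();

    // Each iteration consumes 4 vectors of 8 floats from each input.
    for (; i + 32 <= count; i += 32)
    {
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),      _mm256_loadu_ps(b + i),      acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8),  _mm256_loadu_ps(b + i + 8),  acc1);
        acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 16), _mm256_loadu_ps(b + i + 16), acc2);
        acc3 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 24), _mm256_loadu_ps(b + i + 24), acc3);
    }

    // Sum the four accumulators after the loop.
    const __m256 sum = _mm256_add_ps(_mm256_add_ps(acc0, acc1), _mm256_add_ps(acc2, acc3));

    // Horizontal sum of the 8 lanes into a single float.
    __m128 r = _mm_add_ps(_mm256_castps256_ps128(sum), _mm256_extractf128_ps(sum, 1));
    r = _mm_add_ps(r, _mm_movehl_ps(r, r));
    r = _mm_add_ss(r, _mm_movehdup_ps(r));
    result = _mm_cvtss_f32(r);
#endif
    // Scalar loop for the remaining elements (or all of them without AVX + FMA).
    for (; i < count; i++)
        result += a[i] * b[i];
    return result;
}
```

Because the additions happen in a different order than in a sequential loop, the result may differ from a scalar sum in the last few bits for general inputs. Whether 4 accumulators is the best factor depends on the CPU, so treat it as a starting point to measure rather than a fixed rule.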

<http://const.me/articles/simd/simd.pdf>