But aligning is so simple that I don't mind doing it anyway.
If the code gets any faster with alignment, it makes sense in an innermost loop with a million iterations. Otherwise the padding bloats your exe, pollutes the data cache, and thus may slow down the whole program.
Thanks for the macro, jj.
Thanks for a timing test idea:
Case 1: align the data to 16 bytes for the SSE code, so you can easily use mulps, divps etc. with variables in memory.
Case 2: you are not able to use aligned memory data with SIMD, so instead you do lots of movups loads before the inner loop, and the inner loop uses all 16 xmm regs in 64-bit mode, so every mulps etc. is reg to reg and all variables are kept in xmm regs.
Then run both versions several million times and compare the timings.
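A rough sketch of that comparison, written in C with SSE intrinsics rather than assembly, so it only approximates the idea: the compiler, not you, picks the registers and decides whether a load is folded into the arithmetic instruction. The function names, the `x = x * k + c` kernel, and the `clock()` timer are all made up for illustration; a real test would be written in assembly with a proper cycle counter.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps ~ movups, _mm_mul_ps ~ mulps */

/* Hypothetical kernel signature: iterate x = x * k + c, return lane 0. */
typedef float (*kernel_fn)(const float *k, const float *c, long n);

/* Variant 1: k and c live in 16-byte-aligned memory and are re-read on
   every iteration. The volatile pointers stop the compiler from hoisting
   the loads out of the loop; whether it emits movaps or a memory operand
   is the compiler's choice. */
float run_memory_operands(const float *k, const float *c, long n)
{
    assert(((uintptr_t)k % 16) == 0 && ((uintptr_t)c % 16) == 0);
    const volatile __m128 *pk = (const volatile __m128 *)k;
    const volatile __m128 *pc = (const volatile __m128 *)c;
    __m128 x = _mm_set1_ps(1.0f);
    for (long i = 0; i < n; ++i)
        x = _mm_add_ps(_mm_mul_ps(x, *pk), *pc);
    return _mm_cvtss_f32(x);
}

/* Variant 2: k and c may be unaligned, so they are loaded once with
   unaligned loads before the loop; the loop body then works on local
   vector values only, which the compiler normally keeps in xmm regs. */
float run_register_operands(const float *k, const float *c, long n)
{
    const __m128 vk = _mm_loadu_ps(k);
    const __m128 vc = _mm_loadu_ps(c);
    __m128 x = _mm_set1_ps(1.0f);
    for (long i = 0; i < n; ++i)
        x = _mm_add_ps(_mm_mul_ps(x, vk), vc);
    return _mm_cvtss_f32(x);
}

/* Coarse CPU-time measurement of one kernel run; stores the kernel's
   result so the work cannot be optimised away. */
double seconds_for(kernel_fn fn, const float *k, const float *c,
                   long n, float *result)
{
    clock_t start = clock();
    *result = fn(k, c, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

To try it, declare `k` and `c` as `_Alignas(16) float [4]` for variant 1, point variant 2 at addresses offset by one float into a larger buffer, and call `seconds_for` on each with `n` set to several million. Check the generated assembly before trusting the numbers, since the optimiser may not produce the instruction mix you had in mind.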