Hi there
I took the idea suggested in the above posts and extended the s2s macro to support xmm and ymm registers. Unfortunately, I don't have a CPU that supports AVX-512, so I haven't added any code for zmm registers; however, it is very easy to do by adding a single line of code.
An example of the current implementation:
local R1:RECT, R2:RECT
s2s R1, R2, xmm0
This copies the entire structure in a single pass.
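To make it concrete, for a 16-byte RECT this boils down to one unaligned load plus one unaligned store, roughly like this (a simplified sketch of the expansion, assuming R1 is the source and R2 the destination; the macro's actual argument order and output may differ):

  ; rough sketch of what "s2s R1, R2, xmm0" generates for a 16-byte RECT
  vlddqu  xmm0, xmmword ptr R1            ; unaligned 128-bit load from the source
  vmovdqu xmmword ptr R2, xmm0            ; unaligned 128-bit store to the destination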
If the structure is larger, additional registers can be added to the argument list:
s2s Struc1, Struc2, xmm0, xmm1, xmm2, xmm3
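With a 64-byte structure, for example, these four xmm registers cover the whole copy, roughly like this (simplified sketch, assuming a 64-byte Struc1/Struc2 with Struc1 as the source):

  ; rough sketch for a 64-byte structure
  vlddqu  xmm0, xmmword ptr Struc1[0]
  vlddqu  xmm1, xmmword ptr Struc1[16]
  vlddqu  xmm2, xmmword ptr Struc1[32]
  vlddqu  xmm3, xmmword ptr Struc1[48]
  vmovdqu xmmword ptr Struc2[0],  xmm0
  vmovdqu xmmword ptr Struc2[16], xmm1
  vmovdqu xmmword ptr Struc2[32], xmm2
  vmovdqu xmmword ptr Struc2[48], xmm3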
Finally, if the structure size does not divide evenly into ymm/xmm-sized chunks, general purpose registers should be added to handle the remainder:
s2s Struc1, Struc2, xmm0, xmm1, xmm2, xmm3, rcx, rdx
All registers listed in the arguments are split so that all of their sub-registers can be used as well. This way, you don't have to worry about register sizes at all.
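For example, if 14 bytes remain after the xmm-sized chunks, a single rcx in the list is enough, because ecx and cx can take care of the tail. Roughly (the offsets are invented for the example):

  ; rough sketch of a 14-byte tail copy using rcx and its sub-registers
  mov rcx, qword ptr Struc1[64]           ; 8 bytes
  mov qword ptr Struc2[64], rcx
  mov ecx, dword ptr Struc1[72]           ; 4 bytes
  mov dword ptr Struc2[72], ecx
  mov cx,  word ptr Struc1[76]            ; 2 bytes
  mov word ptr Struc2[76], cx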
A last small optimization that was implemented is register rotation, which in some cases avoids timing penalties.
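In short, the registers from the argument list are used in a rotating fashion instead of reusing the same register back to back. A simplified illustration of the idea (not the macro's literal output):

  ; without rotation: every load/store pair reuses xmm0
  vlddqu  xmm0, xmmword ptr Struc1[0]
  vmovdqu xmmword ptr Struc2[0], xmm0
  vlddqu  xmm0, xmmword ptr Struc1[16]
  vmovdqu xmmword ptr Struc2[16], xmm0

  ; with rotation: consecutive chunks use different registers,
  ; so the loads and stores can be scheduled more freely
  vlddqu  xmm0, xmmword ptr Struc1[0]
  vlddqu  xmm1, xmmword ptr Struc1[16]
  vmovdqu xmmword ptr Struc2[0],  xmm0
  vmovdqu xmmword ptr Struc2[16], xmm1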
I'm aware of the cache-line split issue, but since this macro is intended for unaligned structures, I don't think it matters much here.
I've attached the s2s macro code along with the supporting macros it needs.
As you read the code, you'll notice that I used the vlddqu/vmovdqu instruction pair. This is related to this topic: http://masm32.com/board/index.php?topic=8376.msg91681#msg91681 and may change in the future.
Biterider
PS: to avoid AVX/SSE transition penalties, I updated the instructions to AVX only (new upload).
See method 4 in https://software.intel.com/en-us/articles/avoiding-avx-sse-transition-penalties.
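In practice this means sticking to the VEX-encoded (v-prefixed) forms everywhere; the legacy SSE encodings are the ones that can trigger the transition penalty when the upper halves of the ymm registers are dirty. As an illustration (both lines perform the same unaligned 128-bit load):

  movdqu  xmm0, xmmword ptr Struc1        ; legacy SSE encoding - can cause an AVX/SSE transition penalty
  vlddqu  xmm0, xmmword ptr Struc1        ; VEX-encoded (AVX) form - no transition, upper ymm bits are zeroed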
On my system, I get a 6-15x speedup over the best rep movs algorithm, depending on the structure size.