    movaps [rdx+0],xmm4     ; [0 4 8 C]
    movaps [rdx+16],xmm5    ; [1 5 9 D]
    movaps [rdx+32],xmm6    ; [2 6 A E]
    movaps [rdx+48],xmm2    ; [3 7 B F]
    ...
    movaps xmm0,xmm4        ; [0 4 8 C]
    movaps xmm1,xmm5        ; [1 5 9 D]
    movaps xmm3,xmm2        ; [3 7 B F]
    movaps xmm2,xmm6        ; [2 6 A E]
Removed xmm6 (to make it simpler).
    movaps   xmm4,xmm0      ; [0 1 2 3]
    movaps   xmm5,xmm2      ; [8 9 A B]
    unpcklps xmm0,xmm1      ; [0 4 1 5]
    unpcklps xmm5,xmm3      ; [8 C 9 D]
    unpckhps xmm4,xmm1      ; [2 6 3 7]
    unpckhps xmm2,xmm3      ; [A E B F]
    movaps   xmm1,xmm0      ; [0 4 1 5]
    movaps   xmm6,xmm4      ; [2 6 3 7]
    movlhps  xmm0,xmm5      ; [0 4 8 C]
    movlhps  xmm6,xmm2      ; [2 6 A E]
    movhlps  xmm5,xmm1      ; [1 5 9 D]
    movhlps  xmm2,xmm4      ; [3 7 B F];
    movaps   xmm0,xmm4      ; [0 4 8 C]
    movaps   xmm1,xmm5      ; [1 5 9 D]
    movaps   xmm3,xmm2      ; [3 7 B F]
    movaps   xmm2,xmm6      ; [2 6 A E]
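The bracketed lane comments can be checked without running any SIMD code. The following is only a rough sketch in plain C: it models each register as four integer lanes and replays the twelve shuffle instructions above (up to the final movhlps). The `vec4` type and the function names are made up for this illustration, and it assumes the input rows start in xmm0..xmm3 as [0 1 2 3], [4 5 6 7], [8 9 A B], [C D E F].

```c
#include <string.h>

/* Hypothetical model: one XMM register as four lanes, lane[0] is the lowest. */
typedef struct { int lane[4]; } vec4;

/* unpcklps d,s: interleave the low halves -> [d0 s0 d1 s1] */
static vec4 unpcklps(vec4 d, vec4 s) {
    vec4 r = {{ d.lane[0], s.lane[0], d.lane[1], s.lane[1] }};
    return r;
}

/* unpckhps d,s: interleave the high halves -> [d2 s2 d3 s3] */
static vec4 unpckhps(vec4 d, vec4 s) {
    vec4 r = {{ d.lane[2], s.lane[2], d.lane[3], s.lane[3] }};
    return r;
}

/* movlhps d,s: low half of s replaces the high half of d -> [d0 d1 s0 s1] */
static vec4 movlhps(vec4 d, vec4 s) {
    vec4 r = {{ d.lane[0], d.lane[1], s.lane[0], s.lane[1] }};
    return r;
}

/* movhlps d,s: high half of s replaces the low half of d -> [s2 s3 d2 d3] */
static vec4 movhlps(vec4 d, vec4 s) {
    vec4 r = {{ s.lane[2], s.lane[3], d.lane[2], d.lane[3] }};
    return r;
}

static int same(vec4 v, int a, int b, int c, int d) {
    int want[4] = { a, b, c, d };
    return memcmp(v.lane, want, sizeof want) == 0;
}

/* Replays the sequence; returns 1 if the final lanes match the comments. */
int model_transpose_ok(void) {
    vec4 x0 = {{ 0x0, 0x1, 0x2, 0x3 }}, x1 = {{ 0x4, 0x5, 0x6, 0x7 }};
    vec4 x2 = {{ 0x8, 0x9, 0xA, 0xB }}, x3 = {{ 0xC, 0xD, 0xE, 0xF }};
    vec4 x4, x5, x6;

    x4 = x0;                 /* movaps   xmm4,xmm0 */
    x5 = x2;                 /* movaps   xmm5,xmm2 */
    x0 = unpcklps(x0, x1);   /* [0 4 1 5] */
    x5 = unpcklps(x5, x3);   /* [8 C 9 D] */
    x4 = unpckhps(x4, x1);   /* [2 6 3 7] */
    x2 = unpckhps(x2, x3);   /* [A E B F] */
    x1 = x0;                 /* movaps   xmm1,xmm0 */
    x6 = x4;                 /* movaps   xmm6,xmm4 */
    x0 = movlhps(x0, x5);    /* [0 4 8 C] */
    x6 = movlhps(x6, x2);    /* [2 6 A E] */
    x5 = movhlps(x5, x1);    /* [1 5 9 D] */
    x2 = movhlps(x2, x4);    /* [3 7 B F] */

    return same(x0, 0x0, 0x4, 0x8, 0xC) && same(x5, 0x1, 0x5, 0x9, 0xD)
        && same(x6, 0x2, 0x6, 0xA, 0xE) && same(x2, 0x3, 0x7, 0xB, 0xF);
}
```

In this model the transposed rows end up in xmm0, xmm5, xmm6 and xmm2, which is where the lane comments place them.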
Well, I'm writing vector call tests at the moment where values are kept in registers over multiple calls, so the thinking and implementation are a bit different.
As for AVX, in this case there doesn't seem to be (as you pointed out) any speed improvement, apart from saving registers. There is also VMOVHLPS, which can be used in the same way.
Simpler and faster:

    vunpckhps xmm4,xmm2,xmm3
    vunpcklps xmm2,xmm2,xmm3
    vunpckhps xmm3,xmm0,xmm1
    vunpcklps xmm1,xmm0,xmm1
    vmovlhps  xmm0,xmm1,xmm2
    vmovhlps  xmm1,xmm2,xmm1
    vmovlhps  xmm2,xmm3,xmm4
    vmovhlps  xmm3,xmm4,xmm3
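For anyone who wants to try the same eight-shuffle pattern from C, here is a minimal sketch using the SSE intrinsics from `<xmmintrin.h>`. It is not the code from this thread: the function names and the row-major `float[16]` layout are assumptions made for the example. Built with AVX enabled (for example `-mavx` on GCC or Clang), compilers typically emit the three-operand VEX forms shown above, but register allocation is up to the compiler.

```c
#include <xmmintrin.h>  /* SSE: _mm_unpacklo_ps, _mm_movelh_ps, ... */

/* Transposes a row-major 4x4 float matrix in place (hypothetical helper). */
void transpose4x4(float m[16]) {
    __m128 r0 = _mm_loadu_ps(m + 0);    /* [0 1 2 3] */
    __m128 r1 = _mm_loadu_ps(m + 4);    /* [4 5 6 7] */
    __m128 r2 = _mm_loadu_ps(m + 8);    /* [8 9 A B] */
    __m128 r3 = _mm_loadu_ps(m + 12);   /* [C D E F] */

    __m128 t4 = _mm_unpackhi_ps(r2, r3);   /* [A E B F] */
    __m128 t2 = _mm_unpacklo_ps(r2, r3);   /* [8 C 9 D] */
    __m128 t3 = _mm_unpackhi_ps(r0, r1);   /* [2 6 3 7] */
    __m128 t1 = _mm_unpacklo_ps(r0, r1);   /* [0 4 1 5] */

    /* _mm_movelh_ps(a,b) = [a0 a1 b0 b1]; _mm_movehl_ps(a,b) = [b2 b3 a2 a3] */
    _mm_storeu_ps(m + 0,  _mm_movelh_ps(t1, t2));   /* [0 4 8 C] */
    _mm_storeu_ps(m + 4,  _mm_movehl_ps(t2, t1));   /* [1 5 9 D] */
    _mm_storeu_ps(m + 8,  _mm_movelh_ps(t3, t4));   /* [2 6 A E] */
    _mm_storeu_ps(m + 12, _mm_movehl_ps(t4, t3));   /* [3 7 B F] */
}

/* Self-check: fills the matrix with 0..15 and verifies m[r][c] == 4*c + r. */
int transpose4x4_ok(void) {
    float m[16];
    for (int i = 0; i < 16; ++i)
        m[i] = (float)i;
    transpose4x4(m);
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            if (m[r * 4 + c] != (float)(c * 4 + r))
                return 0;
    return 1;
}
```

The unaligned loads and stores are only there so the sketch works on any buffer; with 16-byte aligned data, `_mm_load_ps` and `_mm_store_ps` would correspond to the movaps used elsewhere in the thread.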
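    movaps   xmm4,xmm2      ; [8 9 A B]
    unpckhps xmm4,xmm3      ; [A E B F]
    unpcklps xmm2,xmm3      ; [8 C 9 D]
    movaps   xmm3,xmm0      ; [0 1 2 3]
    unpckhps xmm3,xmm1      ; [2 6 3 7]
    unpcklps xmm0,xmm1      ; [0 4 1 5]
    movaps   xmm1,xmm2      ; [8 C 9 D]
    movhlps  xmm1,xmm0      ; [1 5 9 D]
    movlhps  xmm0,xmm2      ; [0 4 8 C]
    movaps   xmm2,xmm3      ; [2 6 3 7]
    movlhps  xmm2,xmm4      ; [2 6 A E]
    movhlps  xmm4,xmm3      ; [3 7 B F]
    movaps   xmm3,xmm4      ; [3 7 B F]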