Hi Guys
How do I swap the positions of quadwords (Real8 values) between two different SSE registers using SSE2?
Say I have this:
[DataShuffle1: R$ 1, 0]
[DataShuffle2: R$ 3, 2]
movupd xmm0 X$DataShuffle1
movupd xmm1 X$DataShuffle2
So the positions of the Real8 values in the xmm registers become (high, low): xmm0 = 0, 1 and xmm1 = 2, 3.
How do I switch the values 1 and 2, so that the new positions become xmm0 = 0, 2 and xmm1 = 1, 3?
I tried pshufd, pxor, etc., but couldn't get the low quadword of xmm0 to trade places with the high quadword of xmm1. By the way, I'd like to do the swap using only those two registers, xmm0 and xmm1.
PSHUFD is the right instruction. A swap with two xmm regs is possible only if you use memory.
MOVLHPS (https://www.felixcloutier.com/x86/movlhps)
Hi Guys
Thanks, JJ and Nidud. I guess it worked :thumbsup:
But, as JJ said, I needed to add an extra register (or a memory pointer). What I did was:
; Y ------ (a^2-b^2)*Y*Atan2_FactorA9 <--- xmm0 -9.05387052965142591e-4 0.000321690515963660872
; Y^4 ------ c^2 <--- xmm1 6.719502e-13 2.803170505524641028443
movhlps xmm2 xmm1
movlhps xmm2 xmm0
movsd xmm0 xmm2
pshufd xmm2 xmm2 SSE_SWAP_QWORDS ; SSE_SWAP_QWORDS = 78
movlhps xmm1 xmm2
I thought it could be done with fewer instructions, but it took 5. At least it is working. I'll try to finish the routine this is for (the one from atan2) before doing further tests to confirm it works.
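The five-instruction sequence above can be mirrored with SSE2 intrinsics in C to check what each step does. This is only a sketch: the function names are made up for illustration, and the scratch register is assumed to start as zero (in the assembly its initial contents do not matter). The second function is a hypothetical shorter variant that is not from the thread: it uses shufpd with immediate 2 to replace the pshufd + movlhps pair, bringing the count down to four.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also brings in the SSE ones) */

/* Mirrors the five-instruction sequence: swaps the low double of *x0
   with the high double of *x1, using t as the scratch register (xmm2). */
static void swap5_lo0_hi1(__m128d *x0, __m128d *x1)
{
    __m128 t = _mm_setzero_ps();  /* assumption: scratch starts as zero */

    /* movhlps xmm2 xmm1 : t.low = x1.high */
    t = _mm_movehl_ps(t, _mm_castpd_ps(*x1));
    /* movlhps xmm2 xmm0 : t.high = x0.low */
    t = _mm_movelh_ps(t, _mm_castpd_ps(*x0));
    /* movsd xmm0 xmm2 : x0.low = t.low (the old x1.high) */
    *x0 = _mm_move_sd(*x0, _mm_castps_pd(t));
    /* pshufd xmm2 xmm2 78 : swap the two quadwords of t */
    t = _mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(t), 78));
    /* movlhps xmm1 xmm2 : x1.high = t.low (the old x0.low) */
    *x1 = _mm_castps_pd(_mm_movelh_ps(_mm_castpd_ps(*x1), t));
}

/* Hypothetical four-instruction variant (not from the thread). */
static void swap4_lo0_hi1(__m128d *x0, __m128d *x1)
{
    __m128 t = _mm_setzero_ps();  /* assumption: scratch starts as zero */

    t = _mm_movehl_ps(t, _mm_castpd_ps(*x1));   /* t.low  = x1.high */
    t = _mm_movelh_ps(t, _mm_castpd_ps(*x0));   /* t.high = x0.low  */
    *x0 = _mm_move_sd(*x0, _mm_castps_pd(t));   /* x0.low = t.low   */
    /* shufpd xmm1 xmm2 2 : keep x1.low, take t.high as the new x1.high */
    *x1 = _mm_shuffle_pd(*x1, _mm_castps_pd(t), 2);
}
```

Called with x0 holding (high 0, low 1) and x1 holding (high 2, low 3), both functions should leave x0 as (0, 2) and x1 as (1, 3).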
.686p
.model flat, stdcall
.xmm
.code
Null1 OWORD 22222222222222223333333333333333h
Two3 OWORD 00000000000000001111111111111111h
start:
movups xmm1, Null1
movups xmm0, Two3
movaps xmm2, xmm0
int 3
movhlps xmm0, xmm1 ; xmm0 low = xmm1 high -> xmm0 = 0 2 (high low)
movlhps xmm1, xmm2 ; xmm1 high = old xmm0 low -> xmm1 = 1 3 (high low)
ret
end start
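For anyone who wants to check the listing above without a debugger, here is a rough equivalent using SSE2 intrinsics in C. It is a sketch under assumptions: the function name is invented for illustration, and doubles stand in for the hex patterns used in the listing.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also brings in the SSE ones) */

/* Swaps the low double of *x0 with the high double of *x1 using one
   temporary copy, following the movaps / movhlps / movlhps sequence. */
static void swap3_lo0_hi1(__m128d *x0, __m128d *x1)
{
    __m128 r0  = _mm_castpd_ps(*x0);
    __m128 r1  = _mm_castpd_ps(*x1);
    __m128 tmp = r0;                 /* movaps  xmm2, xmm0 : save old xmm0  */

    r0 = _mm_movehl_ps(r0, r1);      /* movhlps xmm0, xmm1 : x0.low = x1.high */
    r1 = _mm_movelh_ps(r1, tmp);     /* movlhps xmm1, xmm2 : x1.high = old x0.low */

    *x0 = _mm_castps_pd(r0);
    *x1 = _mm_castps_pd(r1);
}
```

With x0 = (high 0, low 1) and x1 = (high 2, low 3), this should give x0 = (0, 2) and x1 = (1, 3), matching the comments in the listing.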
Great !!! :thumbsup: :thumbsup: :thumbsup: :thumbsup:
Thanks, JJ and Nidud. This version works fine, and it works both ways: high-to-low and low-to-high swaps :azn: :azn:
I'll check it in the atan2 routine and review the math before going further and testing for speed again. I hope that with these changes it can be sped up a bit more, to a timing of less than 5000 clocks. :smiley:
I'm using a variation of JJ's binary search, and it only performs around 6-8 iterations before finding the correct pointer into the table. That seems faster than the previous version, but I haven't tested it yet. I'm amazed that the changes I made to ucrtbase could result in a speed gain of more than 12 times. The organization of atan2 was a true mess. I simply don't understand how M$ didn't see that when releasing a version of atan2 that was presumed to be fast. From what I saw, it seems to be a very old routine that was adapted to work on Windows.