I am just doing some work with 64 bit FP in SSE and I wondered if there are faster ways of doing simple calculations. These work OK and seem to clock fast enough, but I thought there may be a faster way on later hardware.
movsd xmm0, REAL8 PTR [r12]
addsd xmm0, REAL8 PTR [r12]
; -----------------
movsd xmm0, REAL8 PTR [r12]
subsd xmm0, REAL8 PTR [r13]
Quote from: hutch-- on August 30, 2018, 10:03:20 PM
In a loop, unroll the ***SD (scalar double) instructions to ***PD (packed double). Newer hardware is more likely to have many execution units, so you can unroll to several or more MULPDs and DIVPDs, as long as they don't have dependencies on previous operations. That's worth testing.
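A minimal sketch of that kind of unrolling, assuming the data is streamed from arrays. The procedure name AddArrays, the register assignments (Win64 calling convention) and the requirement that the count is a nonzero multiple of 4 are assumptions for illustration, not code from this thread.

```asm
.code
; AddArrays (hypothetical): dst[i] = a[i] + b[i] for REAL8 arrays
; rcx = dst, rdx = a, r8 = b, r9 = element count
; (count is assumed to be a nonzero multiple of 4)
AddArrays PROC
  nextblock:
    movupd xmm0, XMMWORD PTR [rdx]      ; a[i],   a[i+1]
    movupd xmm1, XMMWORD PTR [rdx+16]   ; a[i+2], a[i+3]
    movupd xmm2, XMMWORD PTR [r8]       ; b[i],   b[i+1]
    movupd xmm3, XMMWORD PTR [r8+16]    ; b[i+2], b[i+3]
    addpd  xmm0, xmm2                   ; two independent packed adds,
    addpd  xmm1, xmm3                   ; neither waits on the other
    movupd XMMWORD PTR [rcx], xmm0      ; store 4 results
    movupd XMMWORD PTR [rcx+16], xmm1
    add rdx, 32                         ; advance all pointers by 4 doubles
    add r8, 32
    add rcx, 32
    sub r9, 4
    jnz nextblock
    ret
AddArrays ENDP
```

movupd is used so the arrays do not have to be 16-byte aligned; whether this beats the scalar version would need to be clocked on the target hardware.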
Thanks Magnus,
What I am doing is a set of arithmetic calculations using single (scalar) values so in this context the packed multiple value instructions are not much use to me. The vector instructions are the right way to go with streamed data for graphics as well as other floating point applications but I cannot see a way to use them for arithmetic calculations.
Quote from: hutch-- on August 31, 2018, 03:13:01 PM
Well, as long as you don't need more than ADDSD, SUBSD, MULSD, DIVSD and SQRTSD, and don't see a need to unroll loops or to combine 128-bit moves with the arithmetic, it works fine.
But if you don't want to use an external SIMD library for floating-point operations in SSE, here is an example of great code that makes use of packed instructions,
even if you only want a simple program that lets you enter values for a triangle and calculate the unknown sides with trigonometry:
http://masm32.com/board/index.php?topic=4118.msg49276#msg49276
Magnus,
You may like this one then, sequential add then average.
; ----------------------------------------------
; result is in the user specified SSE register
; also saves & uses xmm15 for integer conversion
; and overwrites rax with the argument count
; ----------------------------------------------
sseavrg MACRO ssereg, args:VARARG
LOCAL cntr
cntr = 1
FOR arg,<args>
IF cntr eq 1
movsd ssereg, arg
ELSE
addsd ssereg, arg
ENDIF
cntr = cntr + 1
ENDM
cntr = cntr - 1
IFNDEF xmmbak@@@@@@
xmmbak@@@@@@ equ <1>
.data
align 16
xmmx XMMWORD 0.0
.code
ENDIF
movdqa xmmx, xmm15 ;; save xmm15
mov rax, cntr
cvtsi2sd xmm15, rax
divsd ssereg, xmm15
movdqa xmm15, xmmx ;; restore xmm15
ENDM
; --------------------------------------------
Disassembly.
.text:000000014000101e 488D8530FFFFFF lea rax, [rbp-0xd0]
.text:0000000140001025 48898550FFFFFF mov qword ptr [rbp-0xb0], rax
.text:000000014000102c 488D052D100000 lea rax, [0x140002060]
.text:0000000140001033 F20F1000 movsd xmm0, qword ptr [rax]
.text:0000000140001037 F20F118570FFFFFF movsd qword ptr [rbp-0x90], xmm0
.text:000000014000103f F20F10A570FFFFFF movsd xmm4, qword ptr [rbp-0x90]
.text:0000000140001047 F20F58A570FFFFFF addsd xmm4, qword ptr [rbp-0x90]
.text:000000014000104f F20F58A570FFFFFF addsd xmm4, qword ptr [rbp-0x90]
.text:0000000140001057 F20F58A570FFFFFF addsd xmm4, qword ptr [rbp-0x90]
.text:000000014000105f F20F58A570FFFFFF addsd xmm4, qword ptr [rbp-0x90]
.text:0000000140001067 F20F58A570FFFFFF addsd xmm4, qword ptr [rbp-0x90]
.text:000000014000106f F20F58A570FFFFFF addsd xmm4, qword ptr [rbp-0x90]
.text:0000000140001077 F20F58A570FFFFFF addsd xmm4, qword ptr [rbp-0x90]
.text:000000014000107f F20F58A570FFFFFF addsd xmm4, qword ptr [rbp-0x90]
.text:0000000140001087 F20F58A570FFFFFF addsd xmm4, qword ptr [rbp-0x90]
.text:000000014000108f 66440F7F3DD80F0000 movdqa xmmword ptr [0x140002070], xmm15
.text:0000000140001098 48C7C00A000000 mov rax, 0xa
.text:000000014000109f F24C0F2AF8 cvtsi2sd xmm15, rax
.text:00000001400010a4 F2410F5EE7 divsd xmm4, xmm15
.text:00000001400010a9 66440F6F3DBE0F0000 movdqa xmm15, xmmword ptr [0x140002070]
.text:00000001400010b2 F20F11A570FFFFFF movsd qword ptr [rbp-0x90], xmm4
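The ten addsd instructions above form one serial dependency chain, since each add waits on the previous result in xmm4. A minimal sketch of splitting such a sum into two independent chains follows. The variable names v1 to v4 and their values are hypothetical, and reordering floating-point additions can change the rounding in the last bits, so the result may differ slightly from the sequential sum.

```asm
.data
  align 16
  v1 REAL8 1.0          ; hypothetical sample values
  v2 REAL8 2.0
  v3 REAL8 3.0
  v4 REAL8 4.0
.code
  movsd xmm4, v1        ; chain A starts
  movsd xmm5, v2        ; chain B starts
  addsd xmm4, v3        ; chain A: v1 + v3
  addsd xmm5, v4        ; chain B: v2 + v4, does not wait on chain A
  addsd xmm4, xmm5      ; combine: xmm4 = v1 + v2 + v3 + v4
```

With only a handful of values the gain is likely small; this is something to put to the clock rather than assume.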
nice macro Hutch
mov rax,years
MOVSD xmm0,capital ;money in the bank
MOVSD xmm1,rent ;interest rate expressed in percent
MULSD xmm1,reciprocalof100
ADDSD xmm1,one ;for example, a 10% rate becomes 1.10
LBL:MULSD xmm0,xmm1 ;every year the money is multiplied by 1.10 (grows by 10%)
sub rax,1 ;loop number of years
jne LBL
MOVSD newcapital,xmm0
I have given the macro a slight tweak by making the variable name in the data section a macro local, so you get a unique ID for every macro call, which makes it safer for multithreading. Note that two threads running the same call site would still share that save slot.
; ----------------------------------------------
; result is in the user specified SSE register
; also saves & uses xmm15 for integer conversion
; and overwrites rax with the argument count
; ----------------------------------------------
sseavrg MACRO ssereg, args:VARARG
LOCAL cntr,xmmx
.data
align 16
xmmx XMMWORD 0.0
.code
cntr = 1
FOR arg,<args>
IF cntr eq 1
movsd ssereg, arg
ELSE
addsd ssereg, arg
ENDIF
cntr = cntr + 1
ENDM
cntr = cntr - 1
movdqa xmmx, xmm15 ;; save xmm15
mov rax, cntr
cvtsi2sd xmm15, rax
divsd ssereg, xmm15
movdqa xmm15, xmmx ;; restore xmm15
ENDM
; --------------------------------------------
Hey,
I haven't tried this, but if you're doing a pair of additions like in your example and you have FMA, you could try:
movsd xmm0, REAL8 PTR [r12]
addsd xmm0, REAL8 PTR [r12]
; -----------------
movsd xmm1, REAL8 PTR [r12]
movsd xmm2, REAL8 PTR [r13]
vfmadd132sd xmm1,xmm2,REAL8 PTR valueOf1
I'm not sure if it will make any practical difference, and it might even be slower. As I recall, though, addsd can only go through port 1 with 3-cycle latency, so the second addition would be delayed, whereas the FMA can go through ports 0, 1, 2 and 3 with a memory operand. Using a scale factor of 1 basically turns it into an add.
Although I guess by the same reasoning, you could just as well load the two FP64's into low/high lanes and use a single addpd :)
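A rough sketch of that low/high lane idea, using the same r12 and r13 pointers as the example. The choice of xmm0 and xmm1 and the final movhlps extraction are assumptions for illustration, and the extra loads and shuffle may well cost more than the single add saves.

```asm
movsd   xmm0, QWORD PTR [r12]   ; low lane = [r12], high lane zeroed
movhpd  xmm0, QWORD PTR [r12]   ; high lane = [r12]
movsd   xmm1, QWORD PTR [r12]   ; low lane = [r12]
movhpd  xmm1, QWORD PTR [r13]   ; high lane = [r13]
addpd   xmm0, xmm1              ; low = [r12] + [r12], high = [r12] + [r13]
movhlps xmm1, xmm0              ; copy the high-lane sum into the low lane of xmm1
```

After this, the low lane of xmm0 holds the first sum and the low lane of xmm1 holds the second, so both can be used with the scalar ***SD instructions as before.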
Thanks John, it makes sense in a streaming task, where you would just put the variations to the clock to see which is faster, but you may get some surprises with hardware variations. In the applications where I have used SSE2, it's the accuracy of the result that matters. Most SSE2 maths I have seen is 32 bit, to get the speed up for graphics applications, whereas the 64 bit version seems best suited for calculations.