The MASM Forum

General => The Laboratory => Topic started by: hutch-- on August 30, 2018, 10:03:20 PM

Title: SSE question.
Post by: hutch-- on August 30, 2018, 10:03:20 PM
I am just doing some work with 64-bit FP in SSE and I wondered if there are faster ways of doing simple calculations. These work OK and seem to clock fast enough, but I thought there may be a faster way on later hardware.

    movsd xmm0, REAL8 PTR [r12]
    addsd xmm0, REAL8 PTR [r12]
  ; -----------------
    movsd xmm0, REAL8 PTR [r12]
    subsd xmm0, REAL8 PTR [r13]
Title: Re: SSE question.
Post by: daydreamer on August 30, 2018, 11:28:33 PM
In a loop, unroll the ***SD (scalar double) instructions to ***PD (packed double). Newer hardware is more likely to have many execution units, so you can unroll to several MULPDs or DIVPDs as long as they don't have dependencies on previous operations. That's worth testing.
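
A minimal sketch of the packed-double idea above, assuming a streaming case rather than single-value arithmetic. Here rsi and rdi are hypothetical pointers to two REAL8 arrays and rcx holds an element count assumed to be a non-zero multiple of 4; none of these names come from the original posts. Each iteration issues two ADDPDs that do not depend on each other, so they can overlap on hardware with enough execution units.

```masm
  ; Assumed setup (hypothetical): rsi -> REAL8 array a, rdi -> REAL8 array b,
  ; rcx = number of elements, a non-zero multiple of 4.
  ; Computes a[i] = a[i] + b[i], four doubles per iteration.
  addloop:
    movupd xmm0, XMMWORD PTR [rsi]        ; a[i],   a[i+1]
    movupd xmm1, XMMWORD PTR [rsi+16]     ; a[i+2], a[i+3]
    movupd xmm2, XMMWORD PTR [rdi]        ; b[i],   b[i+1]
    movupd xmm3, XMMWORD PTR [rdi+16]     ; b[i+2], b[i+3]
    addpd  xmm0, xmm2                     ; first pair of sums
    addpd  xmm1, xmm3                     ; second pair, independent of the first
    movupd XMMWORD PTR [rsi], xmm0        ; store the results back into a
    movupd XMMWORD PTR [rsi+16], xmm1
    add rsi, 32                           ; advance both pointers by 4 doubles
    add rdi, 32
    sub rcx, 4
    jnz addloop
```

Whether this beats the scalar version depends on the hardware, so it is worth clocking both.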


Title: Re: SSE question.
Post by: hutch-- on August 31, 2018, 03:13:01 PM
Thanks Magnus,

What I am doing is a set of arithmetic calculations using single (scalar) values so in this context the packed multiple value instructions are not much use to me. The vector instructions are the right way to go with streamed data for graphics as well as other floating point applications but I cannot see a way to use them for arithmetic calculations.
Title: Re: SSE question.
Post by: daydreamer on August 31, 2018, 10:36:18 PM
Well, as long as you don't need more than ADDSD, SUBSD, MULSD, DIVSD and SQRTSD, and don't see a need to unroll loops or to combine 128-bit moves with the arithmetic, it works fine.
But if you don't want to use an external SIMD library to perform FPU operations in SSE, here is an example of great code that makes use of packed instructions, even if you only want a simple program that lets you enter values for a triangle and calculate the unknown sides with trig:
http://masm32.com/board/index.php?topic=4118.msg49276#msg49276
Title: Re: SSE question.
Post by: hutch-- on September 01, 2018, 12:02:16 AM
Magnus,

You may like this one then, sequential add then average.

  ; ----------------------------------------------
  ; result is in the user specified SSE register
  ; also saves & uses xmm15 for integer conversion
  ; overwrites rax with the argument count
  ; ----------------------------------------------
    sseavrg MACRO ssereg, args:VARARG
      LOCAL cntr
      cntr = 1
      FOR arg,<args>
        IF cntr eq 1
          movsd ssereg, arg
        ELSE
          addsd ssereg, arg
        ENDIF
        cntr = cntr + 1
      ENDM
      cntr = cntr - 1
      IFNDEF xmmbak@@@@@@
        xmmbak@@@@@@ equ <1>
        .data
          align 16
          xmmx XMMWORD 0.0
        .code
      ENDIF
      movdqa xmmx, xmm15         ;; save xmm15
      mov rax, cntr
      cvtsi2sd xmm15, rax
      divsd ssereg, xmm15
      movdqa xmm15, xmmx         ;; restore xmm15
    ENDM
  ; --------------------------------------------

Disassembly.

.text:000000014000101e 488D8530FFFFFF             lea rax, [rbp-0xd0]
.text:0000000140001025 48898550FFFFFF             mov qword ptr [rbp-0xb0], rax
.text:000000014000102c 488D052D100000             lea rax, [0x140002060]
.text:0000000140001033 F20F1000                   movsd xmm0, qword ptr [rax]
.text:0000000140001037 F20F118570FFFFFF           movsd qword ptr [rbp-0x90], xmm0
.text:000000014000103f F20F10A570FFFFFF           movsd xmm4, qword ptr [rbp-0x90]
.text:0000000140001047 F20F58A570FFFFFF           addsd xmm4, qword ptr [rbp-0x90]
.text:000000014000104f F20F58A570FFFFFF           addsd xmm4, qword ptr [rbp-0x90]
.text:0000000140001057 F20F58A570FFFFFF           addsd xmm4, qword ptr [rbp-0x90]
.text:000000014000105f F20F58A570FFFFFF           addsd xmm4, qword ptr [rbp-0x90]
.text:0000000140001067 F20F58A570FFFFFF           addsd xmm4, qword ptr [rbp-0x90]
.text:000000014000106f F20F58A570FFFFFF           addsd xmm4, qword ptr [rbp-0x90]
.text:0000000140001077 F20F58A570FFFFFF           addsd xmm4, qword ptr [rbp-0x90]
.text:000000014000107f F20F58A570FFFFFF           addsd xmm4, qword ptr [rbp-0x90]
.text:0000000140001087 F20F58A570FFFFFF           addsd xmm4, qword ptr [rbp-0x90]
.text:000000014000108f 66440F7F3DD80F0000         movdqa xmmword ptr [0x140002070], xmm15
.text:0000000140001098 48C7C00A000000             mov rax, 0xa
.text:000000014000109f F24C0F2AF8                 cvtsi2sd xmm15, rax
.text:00000001400010a4 F2410F5EE7                 divsd xmm4, xmm15
.text:00000001400010a9 66440F6F3DBE0F0000         movdqa xmm15, xmmword ptr [0x140002070]
.text:00000001400010b2 F20F11A570FFFFFF           movsd qword ptr [rbp-0x90], xmm4
Title: Re: SSE question.
Post by: daydreamer on September 01, 2018, 12:57:01 AM
Nice macro, Hutch.

    mov rax,years                ;loop count, must be at least 1
    MOVSD xmm0,capital           ;money in the bank
    MOVSD xmm1,rent              ;interest rate expressed in percent
    MULSD xmm1,reciprocalof100   ;percent to fraction, e.g. 10 becomes 0.10
    ADDSD xmm1,one               ;for example a 10% rate becomes 1.10
  LBL:
    MULSD xmm0,xmm1              ;every year the money grows by a factor of 1.10
    sub rax,1                    ;count down the number of years
    jne LBL
    MOVSD newcapital,xmm0
Title: Re: SSE question.
Post by: hutch-- on September 01, 2018, 10:17:29 AM
I have given the macro a slight tweak by making the variable name in the data section a macro local so you get a unique ID for every macro call which makes it safer for multithreading.

  ; ----------------------------------------------
  ; result is in the user specified SSE register
  ; also saves & uses xmm15 for integer conversion
  ; overwrites rax with the argument count
  ; ----------------------------------------------
    sseavrg MACRO ssereg, args:VARARG
      LOCAL cntr,xmmx
      .data
        align 16
        xmmx XMMWORD 0.0
      .code
      cntr = 1
      FOR arg,<args>
        IF cntr eq 1
          movsd ssereg, arg
        ELSE
          addsd ssereg, arg
        ENDIF
        cntr = cntr + 1
      ENDM
      cntr = cntr - 1
      movdqa xmmx, xmm15         ;; save xmm15
      mov rax, cntr
      cvtsi2sd xmm15, rax
      divsd ssereg, xmm15
      movdqa xmm15, xmmx         ;; restore xmm15
    ENDM
  ; --------------------------------------------
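
A usage sketch for the macro, with three hypothetical REAL8 variables (v1, v2 and v3 are illustrative names, not from the original code). The call is assumed to sit inside a procedure, and the expansion overwrites rax with the argument count.

```masm
    .data
      v1 REAL8 2.0                  ; hypothetical test values
      v2 REAL8 4.0
      v3 REAL8 9.0
    .code
      ; sum the three values into xmm4, then divide by the argument count
      sseavrg xmm4, v1, v2, v3      ; xmm4 = (2.0 + 4.0 + 9.0) / 3 = 5.0
```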
Title: Re: SSE question.
Post by: johnsa on November 01, 2018, 02:55:49 AM
Hey,

Haven't tried this, but if you're doing a pair of additions like in your example and you have FMA, you could try:

movsd xmm0, REAL8 PTR [r12]
addsd xmm0, REAL8 PTR [r12]
; -----------------
movsd xmm1, REAL8 PTR [r12]
movsd xmm2, REAL8 PTR [r13]
vfmadd132sd xmm1,xmm2,REAL8 PTR valueOf1

I'm not sure if it will make any practical difference. It might even be slower, but as I recall addsd can only go through port 1 with 3 cycle latency, so the 2nd addition would be delayed, whereas the FMA can go through 0,1,2,3 with a memory operand, so using a scale factor of 1 basically turns it into an add.
Title: Re: SSE question.
Post by: johnsa on November 01, 2018, 03:24:53 AM
Although I guess by the same reasoning, you could just as well load the two FP64's into low/high lanes and use a single addpd :)
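
A sketch of that idea, reusing the r12 and r13 pointers from the FMA example: MOVSD fills the low lane, MOVHPD fills the high lane, and one ADDPD produces both sums. The choice of xmm0 to xmm2 is arbitrary, and no timing claim is made for it.

```masm
    movsd    xmm0, REAL8 PTR [r12]     ; low lane  = [r12] (the load zeroes the high lane)
    movhpd   xmm0, QWORD PTR [r12]     ; high lane = [r12]
    movsd    xmm1, REAL8 PTR [r12]     ; low lane  = [r12]
    movhpd   xmm1, QWORD PTR [r13]     ; high lane = [r13]
    addpd    xmm0, xmm1                ; low = [r12]+[r12], high = [r12]+[r13]
    movapd   xmm2, xmm0
    unpckhpd xmm2, xmm2                ; xmm2 low lane = the [r12]+[r13] sum
```

The extra MOVHPD and UNPCKHPD work may well cancel any gain for just two additions, so this is only worth keeping if the clock says so.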
Title: Re: SSE question.
Post by: hutch-- on November 01, 2018, 09:25:13 AM
Thanks John, it makes sense in a streaming task where you would just put the variations to the clock to see which is faster, but you may get some surprises with hardware variations. In the applications I have used SSE2 in, it's the result accuracy that matters. Most SSE2 maths I have seen is 32-bit to get the speed up for graphics applications, whereas the 64-bit version seems to be best suited for calculations.