SSE question.

Started by hutch--, August 30, 2018, 10:03:20 PM

hutch--

I am just doing some work with 64 bit FP in SSE and I wondered if there are faster ways of doing simple calculations. These work OK and seem to clock fast enough, but I thought there may be a faster way on later hardware.

    movsd xmm0, REAL8 PTR [r12]
    addsd xmm0, REAL8 PTR [r12]
  ; -----------------
    movsd xmm0, REAL8 PTR [r12]
    subsd xmm0, REAL8 PTR [r13]

daydreamer

Quote from: hutch-- on August 30, 2018, 10:03:20 PM
I am just doing some work with 64 bit FP in SSE and I wondered if there are faster ways of doing simple calculations. These work OK and seem to clock fast enough, but I thought there may be a faster way on later hardware.

    movsd xmm0, REAL8 PTR [r12]
    addsd xmm0, REAL8 PTR [r12]
  ; -----------------
    movsd xmm0, REAL8 PTR [r12]
    subsd xmm0, REAL8 PTR [r13]

In a loop, unroll the ***SD (scalar double) instructions to ***PD (packed double). Newer hardware is more likely to have many execution units, so you can unroll to several MULPDs, DIVPDs and so on, as long as they have no dependencies on previous operations. That's worth testing.
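For anyone who wants to try the idea from a compiler, here is a minimal sketch in C with SSE2 intrinsics, assuming the data is an array of doubles to be summed. The name sum_packed is made up for illustration and is not from this thread. Each _mm_add_pd corresponds to one ADDPD, and the two accumulators do not depend on each other, so the adds are free to overlap.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics: __m128d, _mm_add_pd, ... */
#include <stddef.h>

/* Sum n doubles using two independent packed accumulators.
   sum_packed is a hypothetical helper name, not from the thread. */
double sum_packed(const double *v, size_t n)
{
    assert(v != NULL || n == 0);

    __m128d acc0 = _mm_setzero_pd();   /* partial sums for elements 0,1 of each group */
    __m128d acc1 = _mm_setzero_pd();   /* partial sums for elements 2,3 of each group */
    size_t i = 0;

    /* Four doubles per iteration; the two packed adds are independent. */
    for (; i + 4 <= n; i += 4) {
        acc0 = _mm_add_pd(acc0, _mm_loadu_pd(v + i));
        acc1 = _mm_add_pd(acc1, _mm_loadu_pd(v + i + 2));
    }

    /* Combine the two accumulators, then the two lanes. */
    __m128d acc = _mm_add_pd(acc0, acc1);
    double lanes[2];
    _mm_storeu_pd(lanes, acc);
    double total = lanes[0] + lanes[1];

    /* Scalar tail for the remaining 0..3 elements. */
    for (; i < n; ++i)
        total += v[i];
    return total;
}
```

Note that the packed version adds the elements in a different order from a plain scalar loop, so the last bits of the result can differ when the values are not exactly representable.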


my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

hutch--

Thanks Magnus,

What I am doing is a set of arithmetic calculations using single (scalar) values so in this context the packed multiple value instructions are not much use to me. The vector instructions are the right way to go with streamed data for graphics as well as other floating point applications but I cannot see a way to use them for arithmetic calculations.

daydreamer

Quote from: hutch-- on August 31, 2018, 03:13:01 PM
What I am doing is a set of arithmetic calculations using single (scalar) values so in this context the packed multiple value instructions are not much use to me. The vector instructions are the right way to go with streamed data for graphics as well as other floating point applications but I cannot see a way to use them for arithmetic calculations.
Well, as long as you don't need more than ADDSD, SUBSD, MULSD, DIVSD and SQRTSD, and don't see a need to unroll loops or to take advantage of 128-bit moves together with the arithmetic calculations, it works fine.
But if you don't want to use an external SIMD library to perform FPU-style operations in SSE, here is an example of great code that makes use of packed instructions,
even if you only want a simple program that lets you enter values for a triangle and calculates the unknown sides with trigonometry:
http://masm32.com/board/index.php?topic=4118.msg49276#msg49276

hutch--

Magnus,

You may like this one then, sequential add then average.

  ; ----------------------------------------------
  ; result is in the user specified SSE register
  ; also saves & uses xmm15 for integer conversion
  ; rax is overwritten with the argument count
  ; ----------------------------------------------
    sseavrg MACRO ssereg, args:VARARG
      LOCAL cntr
      cntr = 1
      FOR arg,<args>
        IF cntr eq 1
          movsd ssereg, arg
        ELSE
          addsd ssereg, arg
        ENDIF
        cntr = cntr + 1
      ENDM
      cntr = cntr - 1
      IFNDEF xmmbak@@@@@@
        xmmbak@@@@@@ equ <1>
        .data
          align 16
          xmmx XMMWORD 0.0
        .code
      ENDIF
      movdqa xmmx, xmm15         ;; save xmm15
      mov rax, cntr
      cvtsi2sd xmm15, rax
      divsd ssereg, xmm15
      movdqa xmm15, xmmx         ;; restore xmm15
    ENDM
  ; --------------------------------------------

Disassembly.

.text:000000014000101e 488D8530FFFFFF             lea rax, [rbp-0xd0]
.text:0000000140001025 48898550FFFFFF             mov qword ptr [rbp-0xb0], rax
.text:000000014000102c 488D052D100000             lea rax, [0x140002060]
.text:0000000140001033 F20F1000                   movsd xmm0, qword ptr [rax]
.text:0000000140001037 F20F118570FFFFFF           movsd qword ptr [rbp-0x90], xmm0
.text:000000014000103f F20F10A570FFFFFF           movsd xmm4, qword ptr [rbp-0x90]
.text:0000000140001047 F20F58A570FFFFFF           addsd xmm4, qword ptr [rbp-0x90]
.text:000000014000104f F20F58A570FFFFFF           addsd xmm4, qword ptr [rbp-0x90]
.text:0000000140001057 F20F58A570FFFFFF           addsd xmm4, qword ptr [rbp-0x90]
.text:000000014000105f F20F58A570FFFFFF           addsd xmm4, qword ptr [rbp-0x90]
.text:0000000140001067 F20F58A570FFFFFF           addsd xmm4, qword ptr [rbp-0x90]
.text:000000014000106f F20F58A570FFFFFF           addsd xmm4, qword ptr [rbp-0x90]
.text:0000000140001077 F20F58A570FFFFFF           addsd xmm4, qword ptr [rbp-0x90]
.text:000000014000107f F20F58A570FFFFFF           addsd xmm4, qword ptr [rbp-0x90]
.text:0000000140001087 F20F58A570FFFFFF           addsd xmm4, qword ptr [rbp-0x90]
.text:000000014000108f 66440F7F3DD80F0000         movdqa xmmword ptr [0x140002070], xmm15
.text:0000000140001098 48C7C00A000000             mov rax, 0xa
.text:000000014000109f F24C0F2AF8                 cvtsi2sd xmm15, rax
.text:00000001400010a4 F2410F5EE7                 divsd xmm4, xmm15
.text:00000001400010a9 66440F6F3DBE0F0000         movdqa xmm15, xmmword ptr [0x140002070]
.text:00000001400010b2 F20F11A570FFFFFF           movsd qword ptr [rbp-0x90], xmm4
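For comparison, the arithmetic the sseavrg macro performs can be sketched in C. This is only an illustration under the assumption that the values sit in an array; the name average is made up and the macro itself takes its arguments at assembly time. On x86-64, compilers typically emit scalar SSE2 instructions for double arithmetic, so the generated code follows the same movsd / addsd / cvtsi2sd / divsd pattern as the disassembly.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential add, then divide by the count, mirroring the macro's
   MOVSD / ADDSD ... / CVTSI2SD / DIVSD sequence.
   average is a hypothetical helper name. */
double average(const double *values, size_t count)
{
    assert(values != NULL && count > 0);

    double sum = values[0];               /* movsd  ssereg, first arg  */
    for (size_t i = 1; i < count; ++i)
        sum += values[i];                 /* addsd  ssereg, next arg   */

    return sum / (double)count;           /* cvtsi2sd, then divsd      */
}
```

Because each addition waits on the previous one, the chain of addsd instructions is limited by add latency rather than throughput, which is the dependency issue raised earlier in the thread.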

daydreamer

Nice macro, Hutch.

mov rax,years               ;number of years, must be at least 1
MOVSD xmm0,capital          ;money in the bank
MOVSD xmm1,rent             ;interest rate expressed in percent
MULSD xmm1,reciprocalof100  ;percent to fraction, e.g. 10 becomes 0.10
ADDSD xmm1,one              ;for example 10% interest becomes 1.10
LBL:MULSD xmm0,xmm1         ;every year the money grows by a factor of 1.10
sub rax,1                   ;loop for the number of years
jne LBL
MOVSD newcapital,xmm0
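The same compound interest loop can be sketched in C for checking results. The names compound, rate_percent and years are made up for illustration. One difference from the assembly: the sub/jne loop there runs the multiply before testing the counter, so it assumes at least one year, while this sketch tests first and handles zero years.

```c
#include <assert.h>

/* Compound a balance once per year.
   compound, rate_percent and years are hypothetical names for illustration. */
double compound(double capital, double rate_percent, unsigned years)
{
    assert(rate_percent > -100.0);              /* keep the growth factor positive */

    double factor = rate_percent * 0.01 + 1.0;  /* 10 percent becomes 1.10 */

    /* Testing the count first means zero years performs no multiplications. */
    while (years-- > 0)
        capital *= factor;                      /* one multiply per year */
    return capital;
}
```

For example, compound(1000.0, 10.0, 2) gives approximately 1210, subject to the usual floating point rounding.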

hutch--

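I have given the macro a slight tweak by making the variable name in the data section a macro local, so you get a unique ID for every macro call. Each expansion now has its own save slot for xmm15, which makes it safer for multithreading, although two threads running the same expansion still share that slot.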

  ; ----------------------------------------------
  ; result is in the user specified SSE register
  ; also saves & uses xmm15 for integer conversion
  ; rax is overwritten with the argument count
  ; ----------------------------------------------
    sseavrg MACRO ssereg, args:VARARG
      LOCAL cntr,xmmx
      .data
        align 16
        xmmx XMMWORD 0.0
      .code
      cntr = 1
      FOR arg,<args>
        IF cntr eq 1
          movsd ssereg, arg
        ELSE
          addsd ssereg, arg
        ENDIF
        cntr = cntr + 1
      ENDM
      cntr = cntr - 1
      movdqa xmmx, xmm15         ;; save xmm15
      mov rax, cntr
      cvtsi2sd xmm15, rax
      divsd ssereg, xmm15
      movdqa xmm15, xmmx         ;; restore xmm15
    ENDM
  ; --------------------------------------------

johnsa

Hey,

I haven't tried this, but if you're doing a pair of additions like in your example and you have FMA, you could try doing:

movsd xmm0, REAL8 PTR [r12]
addsd xmm0, REAL8 PTR [r12]
; -----------------
movsd xmm1, REAL8 PTR [r12]
movsd xmm2, REAL8 PTR [r13]
vfmadd132sd xmm1,xmm2,REAL8 PTR valueOf1   ; xmm1 = xmm1 * 1.0 + xmm2 (vfmsub132sd would give xmm1 * 1.0 - xmm2)

I'm not sure if it will make any practical difference, and it might even be slower, but as I recall addsd can only go through port 1 with 3-cycle latency, so the second addition would be delayed, whereas the FMA can go through ports 0 and 1, with its memory operand loaded through port 2 or 3. Using a scale factor of 1 basically turns it into an add.

johnsa

Although I guess by the same reasoning, you could just as well load the two FP64s into the low/high lanes and use a single addpd :)
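A minimal sketch of that low/high lane idea in C with SSE2 intrinsics, assuming the four inputs are already available as scalars. The name add_pair is made up for illustration. Whether it beats two scalar adds depends on how cheaply the values can be placed into the lanes, so it needs timing on the target hardware.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>

/* Compute out[0] = a0 + b0 and out[1] = a1 + b1 with one packed add.
   add_pair is a hypothetical helper name. */
void add_pair(double a0, double b0, double a1, double b1, double out[2])
{
    assert(out != NULL);

    /* _mm_set_pd takes the high lane first, then the low lane. */
    __m128d a = _mm_set_pd(a1, a0);
    __m128d b = _mm_set_pd(b1, b0);

    _mm_storeu_pd(out, _mm_add_pd(a, b));   /* a single ADDPD */
}
```

To get a subtraction in one lane, as in the add/sub pair from the first post, pass the negated value for that lane, since a + (-b) gives the same result as a - b.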

hutch--

Thanks John, it makes sense in a streaming task, where you would just put the variations to the clock to see which is faster, but you may get some surprises with hardware variations. In the applications I have used SSE2 in, it's the accuracy of the result that matters. Most SSE2 maths I have seen is 32 bit to get the speed up for graphics applications, whereas the 64 bit version seems to be best suited for calculations.