Save MMX, SSE, AVX registers to memory

Started by JK, July 26, 2020, 06:02:01 AM

hutch--

Here is a quick toy before I have to go back to configuring Win10 on the new box.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include64\masm64rt.inc

    .code

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

entry_point proc

    USING r12
    LOCAL psrc  :QWORD
    LOCAL pdst  :QWORD
    LOCAL asrc  :QWORD
    LOCAL adst  :QWORD
    LOCAL tcnt  :QWORD

    SaveRegs

    memsize equ <1024*1024*1024*4>

    HighPriority

    mov psrc, alloc(memsize+1024)                       ; 4 gig  + 1k
    alignup rax, 512                                    ; align the memory
    mov asrc, rax                                       ; save address in ptr

    conout "  ptr aligned src ",str$(asrc),lf           ; display address

    mov pdst, alloc(memsize+1024)
    alignup rax, 512
    mov adst, rax

    conout "  ptr aligned dst ",str$(adst),lf

    rcall GetTickCount
    mov r12, rax

  ; |||||||||||||||||||||||||||||||||||||||||

    rcall aligned_data_copy,asrc,adst,memsize           ; call block copy proc

  ; |||||||||||||||||||||||||||||||||||||||||

    rcall GetTickCount
    sub rax, r12
    mov r12, rax

    conout "  -----------------------",lf
    conout "   4 gig copy in ",str$(r12)," ms",lf       ; show milliseconds
    conout "  -----------------------",lf,lf

    mfree psrc                                          ; free memory
    mfree pdst

    NormalPriority

    waitkey
    RestoreRegs
    .exit

entry_point endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

  YMMSTACK

aligned_data_copy proc ;;;; src:QWORD,dst:QWORD,bcnt:QWORD

    shr r8, 8                                           ; div by 256

  @@:
    vmovdqa ymm0, YMMWORD PTR [rcx]
    vmovdqa ymm1, YMMWORD PTR [rcx+32]
    vmovdqa ymm2, YMMWORD PTR [rcx+64]
    vmovdqa ymm3, YMMWORD PTR [rcx+96]

    vmovdqa ymm4, YMMWORD PTR [rcx+128]
    vmovdqa ymm5, YMMWORD PTR [rcx+160]
    vmovdqa ymm6, YMMWORD PTR [rcx+192]
    vmovdqa ymm7, YMMWORD PTR [rcx+224]

    vmovdqa YMMWORD PTR [rdx], ymm0
    vmovdqa YMMWORD PTR [rdx+32], ymm1
    vmovdqa YMMWORD PTR [rdx+64], ymm2
    vmovdqa YMMWORD PTR [rdx+96], ymm3

    vmovdqa YMMWORD PTR [rdx+128], ymm4
    vmovdqa YMMWORD PTR [rdx+160], ymm5
    vmovdqa YMMWORD PTR [rdx+192], ymm6
    vmovdqa YMMWORD PTR [rdx+224], ymm7

    add rcx, 256
    add rdx, 256

    sub r8, 1
    jnz @B

    ret

aligned_data_copy endp

  STACKFRAME

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    end

Result on my Haswell.

  ptr aligned src 5368799744
  ptr aligned dst 9663869440
  -----------------------
   4 gig copy in 2125 ms
  -----------------------

Press any key to continue...
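
A side note that ties back to the topic title: under the Windows x64 calling convention XMM6 to XMM15 are callee-saved, and the unrolled loop above overwrites ymm6 and ymm7. As far as the posted code shows, that is harmless in this toy because entry_point keeps nothing in those registers, but a reusable routine would normally preserve them. The following is only a sketch of one way to do that. The name aligned_data_copy_abi is made up for illustration, and it assumes the same conditions as the post: rcx = src, rdx = dst, r8 = a non-zero byte count that is a multiple of 256, with both pointers 32-byte aligned.

```
aligned_data_copy_abi proc      ; hypothetical name, not from the post

    sub rsp, 32                                 ; room for two 16-byte registers
    vmovdqu XMMWORD PTR [rsp], xmm6             ; unaligned form, so no assumption
    vmovdqu XMMWORD PTR [rsp+16], xmm7          ; about stack alignment is needed

    shr r8, 8                                   ; bytes -> 256-byte blocks

  @@:
    vmovdqa ymm0, YMMWORD PTR [rcx]
    vmovdqa ymm1, YMMWORD PTR [rcx+32]
    vmovdqa ymm2, YMMWORD PTR [rcx+64]
    vmovdqa ymm3, YMMWORD PTR [rcx+96]
    vmovdqa ymm4, YMMWORD PTR [rcx+128]
    vmovdqa ymm5, YMMWORD PTR [rcx+160]
    vmovdqa ymm6, YMMWORD PTR [rcx+192]
    vmovdqa ymm7, YMMWORD PTR [rcx+224]

    vmovdqa YMMWORD PTR [rdx], ymm0
    vmovdqa YMMWORD PTR [rdx+32], ymm1
    vmovdqa YMMWORD PTR [rdx+64], ymm2
    vmovdqa YMMWORD PTR [rdx+96], ymm3
    vmovdqa YMMWORD PTR [rdx+128], ymm4
    vmovdqa YMMWORD PTR [rdx+160], ymm5
    vmovdqa YMMWORD PTR [rdx+192], ymm6
    vmovdqa YMMWORD PTR [rdx+224], ymm7

    add rcx, 256
    add rdx, 256
    sub r8, 1
    jnz @B

    vmovdqu xmm6, XMMWORD PTR [rsp]             ; restore the callee-saved low halves
    vmovdqu xmm7, XMMWORD PTR [rsp+16]
    add rsp, 32
    vzeroupper                                  ; clear upper YMM halves for the caller
    ret

aligned_data_copy_abi endp
```

Only the low 128 bits need saving because the upper halves of the YMM registers are volatile under that convention. An alternative with no save and restore at all would be to unroll by six and stay within ymm0 to ymm5.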


hutch--

As I expected, the unroll did not make it any faster, but a different instruction choice did.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

  YMMSTACK

aligned_data_copy proc

  ; src = rcx
  ; dst = rdx
  ; cnt = r8

    shr r8, 5                           ; div by 32

  @@:
    vmovntdqa ymm0, YMMWORD PTR [rcx]
    vmovntdq YMMWORD PTR [rdx], ymm0
    add rcx, 32
    add rdx, 32
    sub r8, 1
    jnz @B

    ret

aligned_data_copy endp

  STACKFRAME

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
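
For anyone reusing the non-temporal version outside a single-threaded test: vmovntdq stores are weakly ordered, so the usual advice is to issue an sfence after the loop before other code or another thread relies on the copied data. A minimal sketch under the same assumptions as the post (rcx = src, rdx = dst, r8 = a non-zero byte count that is a multiple of 32, both pointers 32-byte aligned, AVX2 available). The procedure name is hypothetical.

```
aligned_data_copy_nt proc       ; hypothetical name, not from the post

    shr r8, 5                               ; bytes -> 32-byte blocks

  @@:
    vmovntdqa ymm0, YMMWORD PTR [rcx]       ; 32-byte load with non-temporal hint (AVX2)
    vmovntdq YMMWORD PTR [rdx], ymm0        ; streaming store that bypasses the cache
    add rcx, 32
    add rdx, 32
    sub r8, 1
    jnz @B

    sfence                                  ; order the streaming stores before later stores
    vzeroupper                              ; clear upper YMM halves for the caller
    ret

aligned_data_copy_nt endp
```

As far as I know, the non-temporal hint on the load side only has an effect on write-combining memory, so on ordinary heap memory the gain most likely comes from the streaming stores, not from vmovntdqa.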

JK

Thanks for all your input!

Quote: You may consider this for the registers.

Sometimes I want code that can be assembled for both 32 and 64 bit, and I want to be able to see at first glance that it is such code. "RAX" obviously must be 64 bit. "EAX" could be either common (32 and 64 bit) or 32 bit only, so it is ambiguous in this respect. If I name it "CAX" (as I did), I can tell at once that it is common code (C for common). It is just a personal preference.
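
To illustrate the idea, here is a minimal sketch using MASM text equates. Only the name CAX comes from my code. The other names (ccx, cdx) and the BUILD64 symbol are made up for this example, and it assumes the 64-bit build defines BUILD64 on the command line (for example ml64 /DBUILD64) while the 32-bit build does not.

```
; Hypothetical sketch: choose the register width once, then write common code.
; BUILD64 is an assumed symbol, defined only when assembling for 64 bit.

IFDEF BUILD64
    cax TEXTEQU <rax>
    ccx TEXTEQU <rcx>
    cdx TEXTEQU <rdx>
ELSE
    cax TEXTEQU <eax>
    ccx TEXTEQU <ecx>
    cdx TEXTEQU <edx>
ENDIF

    mov cax, ccx            ; becomes mov rax, rcx or mov eax, ecx
    add cax, cdx            ; becomes add rax, rdx or add eax, edx
```

Anything written with the c-names is common code by construction, while a plain EAX or RAX in the source signals that the line is meant for one width only.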


HSE

In ObjAsm, a dual 32/64-bit framework, xax, xcx, xsi, xdi, etc. are used.