Save MMX, SSE, AVX registers to memory

hutch-- · July 27, 2020, 03:06:49 PM

Here is a quick toy before I have to go back to configuring Win10 on the new box.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

include \masm32\include64\masm64rt.inc

.code

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

entry_point proc

USING r12
LOCAL psrc :QWORD
LOCAL pdst :QWORD
LOCAL asrc :QWORD
LOCAL adst :QWORD
LOCAL tcnt :QWORD

SaveRegs

memsize equ <1024*1024*1024*4>

HighPriority

mov psrc, alloc(memsize+1024) ; 4 gig + 1k
alignup rax, 512 ; round the address up to a 512-byte boundary
mov asrc, rax ; save address in ptr

conout " ptr aligned src ",str$(asrc),lf ; display address

mov pdst, alloc(memsize+1024)
alignup rax, 512
mov adst, rax

conout " ptr aligned dst ",str$(adst),lf

rcall GetTickCount
mov r12, rax

; |||||||||||||||||||||||||||||||||||||||||

rcall aligned_data_copy,asrc,adst,memsize ; call block copy proc

; |||||||||||||||||||||||||||||||||||||||||

rcall GetTickCount
sub rax, r12
mov r12, rax

conout " -----------------------",lf
conout " 4 gig copy in ",str$(r12)," ms",lf ; show milliseconds
conout " -----------------------",lf,lf

mfree psrc ; free memory
mfree pdst

NormalPriority

waitkey
RestoreRegs
.exit

entry_point endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

YMMSTACK

aligned_data_copy proc ;;;; src:QWORD,dst:QWORD,bcnt:QWORD

shr r8, 8 ; byte count / 256 = number of 256-byte blocks

@@:
vmovdqa ymm0, YMMWORD PTR [rcx]
vmovdqa ymm1, YMMWORD PTR [rcx+32]
vmovdqa ymm2, YMMWORD PTR [rcx+64]
vmovdqa ymm3, YMMWORD PTR [rcx+96]

vmovdqa ymm4, YMMWORD PTR [rcx+128]
vmovdqa ymm5, YMMWORD PTR [rcx+160]
vmovdqa ymm6, YMMWORD PTR [rcx+192]
vmovdqa ymm7, YMMWORD PTR [rcx+224]

vmovdqa YMMWORD PTR [rdx], ymm0
vmovdqa YMMWORD PTR [rdx+32], ymm1
vmovdqa YMMWORD PTR [rdx+64], ymm2
vmovdqa YMMWORD PTR [rdx+96], ymm3

vmovdqa YMMWORD PTR [rdx+128], ymm4
vmovdqa YMMWORD PTR [rdx+160], ymm5
vmovdqa YMMWORD PTR [rdx+192], ymm6
vmovdqa YMMWORD PTR [rdx+224], ymm7

add rcx, 256
add rdx, 256

sub r8, 1
jnz @B

ret

aligned_data_copy endp

STACKFRAME

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end

Result on my Haswell.

ptr aligned src 5368799744
ptr aligned dst 9663869440
-----------------------
4 gig copy in 2125 ms
-----------------------

Press any key to continue...
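
For scale, the timing above can be converted to a copy rate. This is only a back-of-the-envelope figure, assuming "4 gig" means the 4 × 1024³ bytes set by memsize and taking the 2125 ms reading at face value (GetTickCount typically advances in steps of 10 to 16 ms, so the last digits are not significant):

```latex
\[
\text{rate}
  = \frac{4 \times 1024^{3}\ \text{bytes}}{2.125\ \text{s}}
  = \frac{4\,294\,967\,296}{2.125}\ \text{B/s}
  \approx 2.02 \times 10^{9}\ \text{B/s}
  \approx 1.88\ \text{GiB/s}
\]
```

Every byte is read once and written once, so the memory traffic behind that figure is at least twice the copy rate.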

hutch-- · July 27, 2020, 06:08:05 PM

As I expected, the unroll did not make it any faster, but a different instruction choice did: the non-temporal vmovntdqa / vmovntdq pair in place of vmovdqa.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

YMMSTACK

aligned_data_copy proc

; src = rcx
; dst = rdx
; cnt = r8

shr r8, 5 ; div by 32

@@:
vmovntdqa ymm0, YMMWORD PTR [rcx] ; 32-byte aligned load with non-temporal hint
vmovntdq YMMWORD PTR [rdx], ymm0 ; 32-byte aligned non-temporal (streaming) store
add rcx, 32
add rdx, 32
sub r8, 1
jnz @B

ret

aligned_data_copy endp

STACKFRAME

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
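
A quick check of the loop counts in the two versions, assuming the same memsize of 4 × 1024³ = 2³² bytes arrives in r8. The first version moves 256 bytes per pass (shr r8, 8), this one moves 32 bytes per pass (shr r8, 5):

```latex
\[
N_{256} = \frac{2^{32}}{2^{8}} = 2^{24} = 16\,777\,216,
\qquad
N_{32} = \frac{2^{32}}{2^{5}} = 2^{27} = 134\,217\,728
\]
```

Both shifts discard any remainder, so a byte count that is not a multiple of the block size leaves the tail uncopied. A count smaller than one block makes r8 zero, in which case sub r8, 1 wraps around and the loop does not stop where intended. Neither case applies to the 4 gig test above.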

jj2007 · July 27, 2020, 06:50:16 PM

Apropos: has anybody tried xsave/xrstor?

nidud · July 27, 2020, 11:35:08 PM

deleted

JK · July 28, 2020, 09:07:48 PM

Thanks for all your input!

Quote: You may consider this for the registers.

Sometimes I want code that can be assembled for both 32 and 64 bit, and I want to be able to see at first glance that it is such code. "RAX" obviously must be 64 bit. "EAX" could be common to both (32 and 64 bit) or 32 bit only, so it's ambiguous in this respect. If I name it "CAX" (as I did), I can tell at once that it is common code (C for common). It's just a personal preference.
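
A minimal sketch of how such common names could be set up with text equates. This is an assumption for illustration, not JK's actual definitions: only CAX comes from the post, ccx and cdx are made-up companions, and TARGET64 is a hypothetical symbol the user would define for the 64 bit build (for example with the assembler's /D option).

```masm
; Sketch only: TARGET64 is a hypothetical, user-defined symbol.
; Define it when assembling for 64 bit, leave it undefined for 32 bit.
IFDEF TARGET64
    cax TEXTEQU <rax>       ; common names expand to 64-bit registers
    ccx TEXTEQU <rcx>
    cdx TEXTEQU <rdx>
ELSE
    cax TEXTEQU <eax>       ; same names expand to 32-bit registers
    ccx TEXTEQU <ecx>
    cdx TEXTEQU <edx>
ENDIF

; Usage inside a code section:
;   xor cax, cax            ; becomes xor rax, rax or xor eax, eax
```

The text equate is substituted before the line is assembled, so the same source line yields the native register width for whichever target is being built.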

JK

HSE · July 29, 2020, 12:30:33 AM

ObjAsm, a dual 32/64-bit framework, uses xax, xcx, xsi, xdi, etc.

The MASM Forum
