According to Sysinternals "coreinfo" utility my CPU supports AVX, the debugger shows YMM0 to YMM7 (32 bit) and YMM0 to Ymm15 (64 bit) are available. Why does the following code fail at saving YMM0 to memory? The code assembles, links and runs in 32 and 64 bit, but i keep getting a GPF when saving YMM0 (exception_illegal_instruction).
ifdef _WIN64
cax equ <rax>
cbx equ <rbx>
ccx equ <rcx>
cdx equ <rdx>
csi equ <rsi>
cdi equ <rdi>
csp equ <rsp>
cbp equ <rbp>
else
cax equ <eax>
cbx equ <ebx>
ccx equ <ecx>
cdx equ <edx>
csi equ <esi>
cdi equ <edi>
csp equ <esp>
cbp equ <ebp>
endif
ifdef _WIN64
option win64:15
OPTION STACKBASE : RBP
else
.686P
.model flat, stdcall ; 32 bit memory model
.xmm
OPTION STACKBASE : EBP
endif
;assemble console 32
;assemble console 64
include <windows.inc>
includelib kernel32.lib
.data
mem __m256 <>
.code
start proc
;*************************************************************************************
; mmx, sse, avx storage test
;*************************************************************************************
local ttt:__m256
lea cax, mem
; lea cax, ttt ;doesn´t work either
movq [cax], mm0 ;mmx
movdqu [cax], xmm0 ;xmm
vmovdqu [cax], ymm0 ;ymm -> gpf (exception_illegal_instruction)
; vmovdqu32 [cax], ymm0 ;ymm -> same here
invoke ExitProcess, 0
ret
start endp
end start
Thanks
JK
I can't really help you with UASM but the obvious would be to try each technique separately to see if it works correctly then test the macro to see if the problem is there.
I have never used AVX, but i want to learn about it. So lesson #1 for me: how to save and load YMM0. The posted code obviously doesn´t make much sense, but it demonstrates my problem: i can save MM0 (MMx Registers), i can save XMM0 (XMMx Registers), but i cannot save YMM0 (YMMx registers).
Running the code fails at "vmovdqu [rax], ymm0" with a GPF (exception_illegal_instruction).
So, what is wrong?
- reading the Intel docs the instruction (vmovdqu) seems to be correct
- the assembler (UASM in this case, but should be the same for MASM) accepts it
- the debugger disassemly shows, what i coded (vmovdqu [rax], ymm0)
- the exception (exception_illegal_instruction) means, my CPU doesn´t know this instruction (why?)
- on the other hand "coreinfo" (Sysinternals proved to be a reliable source for useful tools) tells me, that my CPU supports AVX
- the debugger (x86dbg) shows YMMx registers in the registers area
Questions:
- is this code "vmovdqu [rax], ymm0" for saving YMM0 to a memory location pointed to by RAX correct? (MASM or UASM shouldn´t be different)
- did i miss assembler (MASM, UASM) or linker (MS link.exe) switches, which are necessary for AVX to work on Windows 10?
- could it be, that my laptop CPU (Intel(R) Pentium(R) CPU 4417U @ 2.30GHz) doesn´t support AXV (despite of what i said above) and how else could i test it?
JK
Works just fine with UAsm and AsmC, in 32- and 64-bit mode, but ML64 chokes:
include \Masm32\MasmBasic\Res\JBasic.inc ; ## console demo, builds in 32- or 64-bit mode with UAsm and AsmC
.DATA
dd 123
MyMem OWORD 1234567890ABCDEF1234567890ABCDEFh
OWORD 1234567890ABCDEF1234567890ABCDEFh
OWORD 1234567890ABCDEF1234567890ABCDEFh
OWORD 1234567890ABCDEF1234567890ABCDEFh
.CODE
Init ; OPT_64 1 ; put 0 for 32 bit, 1 for 64 bit assembly
PrintLine Chr$("This program was assembled with ", @AsmUsed$(1), " in ", jbit$, "-bit format.")
mov rax, offset MyMem
int 3
vmovdqu ymm0, [rax]
movq [rax], mm0 ;mmx
movdqu [rax], xmm0 ;xmm
vmovdqu [rax], ymm0 ;ymm -> gpf (exception_illegal_instruction)
Inkey "all is fine"
EndOfCode
vmovdqu YMMWORD ptr[rax], ymm0
Try this with ML and UAsm/AsmC:
mov rax, offset MyMem
vmovdqu YMMWORD ptr ymm0, [rax] ; Masm doesn't like it
vmovdqu YMMWORD ptr [rax], ymm0
:biggrin:
So you have managed to choke MASM again ? :tongue:
Thanks for your help! "Ymmword ptr" shouldn´t be necessary, because the 2. operand (YMM0) is a register of known size - it doesn´t do any harm on the other side.
@Jochen
running your exe makes the debugger pop up, just like with my code. So we can take two items from the list. The code is ok, it must be something else, probably my CPU or my system don´t support AVX, despite of other information.
Windows 10 does support AVX, so how could i know for sure, if my CPU supports it?
JK
Quote from: JK on July 26, 2020, 09:21:16 PM
Thanks for your help! "Ymmword ptr" shouldn´t be necessary, because the 2. operand (YMM0) is a register of known size - it doesn´t do any harm on the other side.
JK
ML64.exe disagrees with you, it wants to know the data type for the destination operand. :biggrin:
Ok- good to know, thanks!
Quote from: hutch-- on July 26, 2020, 09:13:28 PM
:biggrin:
So you have managed to choke MASM again ? :tongue:
Does
vmovdqu YMMWORD ptr ymm0, [rax] work with your version ol ML64.exe? If yes, which version do you have?
Quote from: JK on July 26, 2020, 09:21:16 PM
Thanks for your help! "Ymmword ptr" shouldn´t be necessary, because the 2. operand (YMM0) is a register of known size
Agreed. Indeed, UAsm and AsmC don't need the redundant size info.
:biggrin:
vmovdqu ymm0, YMMWORD PTR [rax]
You have to remember, MASM is an assembler, its not trying to be a compiler. :tongue:
Quote from: hutch-- on July 27, 2020, 01:12:57 AM
:biggrin:
vmovdqu ymm0, YMMWORD PTR [rax]
You have to remember, MASM is an assembler, its not trying to be a compiler. :tongue:
Congrats, you got it :biggrin:
Perhaps you can help me with another problem:
.DATA
align 64
buffer LABEL byte
ORG $+30000-1
db ?
.CODE
mov rax, offset buffer
fxsave YMMWORD ptr [rax] ; save FPU and xmm regs (but not ymm)
That works like a charm with UAsm and AsmC but ML says "error A2189:invalid combination with segment alignment : 64" :sad:
fxsave
I think it must be a 64 bit pointer, pitty YMM is not in the list.
:biggrin:
Don't stretch it too much, I am already suffering from hardware diagnostic burnout followed by win10 configuration burnout and I have just finished dragging through sending a defective mother board back to Shenzhen in China to a vendor who did their best to make it difficult to do.
RE : The original post, I would set up an aligned procedure, allocate locals of the right size and copy the data to them.
If you do these things the right way, you won't have to keep finding ways to choke MASM, its so "friendly" it will tell you when it chokes itself on 0BADCODEh.
Here is a quick toy before I have to go back to configuring Win10 on the new box.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include64\masm64rt.inc
.code
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
entry_point proc
USING r12
LOCAL psrc :QWORD
LOCAL pdst :QWORD
LOCAL asrc :QWORD
LOCAL adst :QWORD
LOCAL tcnt :QWORD
SaveRegs
memsize equ <1024*1024*1024*4>
HighPriority
mov psrc, alloc(memsize+1024) ; 4 gig + 1k
alignup rax, 512 ; align the memory
mov asrc, rax ; save address in ptr
conout " ptr aligned src ",str$(asrc),lf ; display address
mov pdst, alloc(memsize+1024)
alignup rax, 512
mov adst, rax
conout " ptr aligned dst ",str$(adst),lf
rcall GetTickCount
mov r12, rax
; |||||||||||||||||||||||||||||||||||||||||
rcall aligned_data_copy,asrc,adst,memsize ; call block copy proc
; |||||||||||||||||||||||||||||||||||||||||
rcall GetTickCount
sub rax, r12
mov r12, rax
conout " -----------------------",lf
conout " 4 gig copy in ",str$(r12)," ms",lf ; show milliseconds
conout " -----------------------",lf,lf
mfree psrc ; free memory
mfree pdst
NormalPriority
waitkey
RestoreRegs
.exit
entry_point endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
YMMSTACK
aligned_data_copy proc ;;;; src:QWORD,dst:QWORD,bcnt:QWORD
shr r8, 8 ; div by 256
@@:
vmovdqa ymm0, YMMWORD PTR [rcx]
vmovdqa ymm1, YMMWORD PTR [rcx+32]
vmovdqa ymm2, YMMWORD PTR [rcx+64]
vmovdqa ymm3, YMMWORD PTR [rcx+96]
vmovdqa ymm4, YMMWORD PTR [rcx+128]
vmovdqa ymm5, YMMWORD PTR [rcx+160]
vmovdqa ymm6, YMMWORD PTR [rcx+192]
vmovdqa ymm7, YMMWORD PTR [rcx+224]
vmovdqa YMMWORD PTR [rdx], ymm0
vmovdqa YMMWORD PTR [rdx+32], ymm1
vmovdqa YMMWORD PTR [rdx+64], ymm2
vmovdqa YMMWORD PTR [rdx+96], ymm3
vmovdqa YMMWORD PTR [rdx+128], ymm4
vmovdqa YMMWORD PTR [rdx+160], ymm5
vmovdqa YMMWORD PTR [rdx+192], ymm6
vmovdqa YMMWORD PTR [rdx+224], ymm7
add rcx, 256
add rdx, 256
sub r8, 1
jnz @B
ret
aligned_data_copy endp
STACKFRAME
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end
Result on my Haswell.
ptr aligned src 5368799744
ptr aligned dst 9663869440
-----------------------
4 gig copy in 2125 ms
-----------------------
Press any key to continue...
As I expected, the unroll did not make it any faster but a different instruction choice did.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
YMMSTACK
aligned_data_copy proc
; src = rcx
; dst = rdx
; cnt = r8
shr r8, 5 ; div by 32
@@:
vmovntdqa ymm0, YMMWORD PTR [rcx]
vmovntdq YMMWORD PTR [rdx], ymm0
add rcx, 32
add rdx, 32
sub r8, 1
jnz @B
ret
aligned_data_copy endp
STACKFRAME
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
A propos: has anybody tried xsave/xrstor (https://studfile.net/preview/1583057/page:15/)?
deleted
Thanks for all your input!
QuoteYou may consider this for the registers.
Sometimes i want to have code, which can be assembled for 32 and 64 bit.
And i want to be able to see at first glance, that this is such code. "RAX" obviously must be 64 bit. "EAX" could be both (32 and 64 bit) or 32 bit only - it´s ambiguous in this respect. If i name it "CAX" (as i did) i can tell at once, it
is common code (C for common) - just a personal preference.
JK
In ObjAsm (http://masm32.com/board/index.php?board=43.0), a dual 32/64 framework, is used xax, xcx, xsi, xdi, etc