I was writing a 64-bit Windows program and noticed something strange:
The following code assembles successfully with ML64, but UASM64 reports
"Error A2031: invalid addressing mode with current CPU setting"
--------------------------------------------------
array DD 20h DUP (1)
... ...
mov rcx,2
mov eax,array[rcx*4+8]
--------------------------------------------------
Does anyone know what's going on?
Hi,wanker742126
lea rdx,array ;mov rdx,offset array
mov rcx,2
mov eax,[rdx+rcx*4]
rdx: the first address of your array
rcx: index
4: scale (one DD = 4 bytes)
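The arithmetic behind both forms can be sketched in a few lines of Python. This is only an illustration of base + index*scale + displacement; ARRAY_BASE is a made-up address, not something taken from the original program.

```python
# Minimal sketch of base + index*scale + displacement addressing.
# ARRAY_BASE is a hypothetical address chosen only for illustration.
ARRAY_BASE = 0x1000
ELEMENT_SIZE = 4  # one DD (DWORD) occupies 4 bytes

def effective_address(base, index, scale, disp=0):
    """Return base + index*scale + disp, the way the CPU forms the address."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 only supports scale factors 1, 2, 4 and 8")
    return base + index * scale + disp

# array[rcx*4+8] with rcx = 2, as in the original question
ea = effective_address(ARRAY_BASE, index=2, scale=ELEMENT_SIZE, disp=8)
element = (ea - ARRAY_BASE) // ELEMENT_SIZE
print(hex(ea), element)
```

With rcx = 2 the byte offset is 2*4+8 = 16, so the instruction reads the fifth DWORD of the array (element index 4).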
Hello Six_L
I know it can be written like this, but I would like to know why ML64 assembles it successfully while UASM64 does not. Do I need to change some CPU setting?
Quote from: wanker742126 on September 04, 2024, 10:17:47 PM
I know it can be written like this, but I would like to know why ML64 assembles it successfully while UASM64 does not. Do I need to change some CPU setting?
No. It's valid syntax, and if UASM refuses it, it's a bug.
Hi,wanker742126
Quote: why ML64 can be successfully assembled, but UASM64 cannot. Is it necessary to make any settings on the CPU?
1. ML64 is written by Microsoft; UASM64 is written by John Hankinson and Branislav Habus, _japheth (our forum member). They are different assemblers.
2. ML64's assembly syntax is not ML32-compatible and does not support HLL features, but UASM64's does.
3. ML64 only works on MS OS; UASM64 works on MS OS, Linux and OSX.
Thanks for your answer, six_L and _japheth.
Maybe I have to wait until UASM supports this syntax. Or maybe try using macros.
Hi,wanker742126
Quote: Maybe I have to wait until UASM supports this syntax.
Maybe that will never happen, because register addressing is faster than memory addressing.
;this is a simple workaround, tested on Linux
;I was trying to create a macro, but the displacement was not added to the array offset
;so I put the displacement in a register
mov rcx,2
mov edx,8
mov eax,array[ecx*sizeof dword+edx]
;+8 displacement is not being added
mov eax,array[rcx*sizeof dword+8+rdx] ;<---
USE32
mov eax,dword ptr [array+8+ecx*4] ;<---
USE64
Today I found that ADD, SUB, XOR, MUL, and DIV all accept this array syntax; only MOV does not.
The following program assembles successfully with UASM64:
OPTION CASEMAP:NONE
OPTION WIN64:3
INCLUDELIB KERNEL32.LIB
INCLUDELIB USER32.LIB
MessageBoxA PROTO :QWORD,:QWORD,:QWORD,:DWORD
ExitProcess PROTO :DWORD
;*********************************************************************
.CONST
array DD 20h DUP (1)
szCaption DB "TITLE",0
szText DB "64-BIT ASSEMBLY.",0
;*********************************************************************
.CODE
;---------------------------------------------------------------------
main PROC
mov rcx,2
add eax,array[rcx*4+8]
sub eax,array[rcx*4+8]
xor eax,array[rcx*4+8]
mul array[rcx*4+8]
div array[rcx*4+8]
invoke MessageBoxA,0,OFFSET szText,OFFSET szCaption,0
invoke ExitProcess,0
main ENDP
;*********************************************************************
END main
But the program below fails to assemble, and the error is reported on the MOV line:
OPTION CASEMAP:NONE
OPTION WIN64:3
INCLUDELIB KERNEL32.LIB
INCLUDELIB USER32.LIB
MessageBoxA PROTO :QWORD,:QWORD,:QWORD,:DWORD
ExitProcess PROTO :DWORD
;*********************************************************************
.CONST
array DD 20h DUP (1)
szCaption DB "TITLE",0
szText DB "64-BIT ASSEMBLY.",0
;*********************************************************************
.CODE
;---------------------------------------------------------------------
main PROC
mov rcx,2
mov eax,array[rcx*4+8]
add eax,array[rcx*4+8]
sub eax,array[rcx*4+8]
xor eax,array[rcx*4+8]
mul array[rcx*4+8]
div array[rcx*4+8]
invoke MessageBoxA,0,OFFSET szText,OFFSET szCaption,0
invoke ExitProcess,0
main ENDP
;*********************************************************************
END main
You are digging up assembler bugs, not learning assembly language.
Come on!
The Taiwanese frog is sitting in a well and talking about the vastness of the sky.
I wrote a set of macros that can somewhat resolve the issue of assembling instructions like mov eax, array[rcx*4+8] using UASM64.EXE. The principle is quite simple: analyze such instructions and convert them into the corresponding machine code. However, the implementation is rather complicated. Nonetheless, I managed to complete it.
Below is a demonstration program:
OPTION CASEMAP:NONE
OPTION WIN64:3
INCLUDELIB KERNEL32.LIB
INCLUDELIB USER32.LIB
EXTRN MessageBoxA:PROC
EXTRN ExitProcess:PROC
INCLUDE MYFIX_ENG.INC
;*********************************************************
.DATA
button DD 10h,20h,30h,40h
n DD 4
szCaption DB "TEST MYFIX_ENG.INC",0
szText DB "PLEASE PRESS ANY BUTTON.",0
;*********************************************************
.CODE
;---------------------------------------------------------
main PROC
again: xor rcx,rcx
mov rdx,OFFSET szText
lea r8,szCaption
movzx rax,WORD PTR n
mova r9d,button[rax*4]
call MessageBoxA
dec n
jnz again
xor ecx,ecx
call ExitProcess
main ENDP
;*********************************************************
END
The mova macro used here is defined in MYFIX_ENG.INC. I've tested it, and it works fine, but there may still be oversights. If you encounter any issues, please let me know on the forum or via email: wkochih@gmail.com. Since MYFIX_ENG.INC has over 400 lines, I've compressed it into an archive. Here is MYFIX_ENG (https://mega.nz/file/sM1yXQIB#JcAJMwC5IMSLU9j2VONeRry0VukWByS5fmwhk5Jw4lU)
Hello,
Your MYFIX_ENG.INC file is pretty large. Are you sure it works with any image base? I'm asking because I vaguely remember that RIP-relative addressing is limited to direct addressing - that means, if indirect addressing is used ( based or indexed or both ), the displacement ( = offset ) is always "zero-based", never RIP-relative. As a consequence, base/indexed addressing with displacements fully work in the first 4 GB of the address space only. However, it's long ago that I examined this stuff, maybe I'm wrong...
My MYFIX_ENG.INC can only be used in these forms:
MOV reg1,VAR[reg2*n]
MOV VAR[reg1*n],reg2
MOV Var[reg*n],imm
In all three cases, reg1, reg2, and reg must be general-purpose registers; they cannot be RIP. In fact, RIP cannot be changed through MOV. I also admit that MYFIX_ENG.INC is very large, with more than 400 lines, so I did not post the entire code on the forum and provided a MEGA download link instead. Maybe you experts can simplify it :).
Quote from: wanker742126 on September 14, 2024, 10:31:34 AM
My MYFIX_ENG.INC can only be used in
MOV reg1,VAR[reg2*n]
Ok, I did a little test:
.x64
.model flat
include MYFIX_ENG.INC
.data
db 50h dup (?)
array dd 0,1,2,3,4,5,6,7
.code
start:
xor ecx, ecx
mova eax, array[rcx*4]
mov eax, array[rcx*4]
ret
end start
Since UASM refuses indexed addressing with symbolic displacement, I used JWasm to assemble it. DUMPBIN then displays the following relocations:
Offset Type Applied To Index Name
-------- ---------------- ----------------- -------- ------
00000005 ADDR32 00000050 4 .data
0000000C ADDR32 00000000 9 array
So there's indeed a difference, but the important thing is that both relocs are ADDR32. This inevitably makes MS link complain:
Microsoft (R) Incremental Linker Version 9.00.30729.01
Copyright (C) Microsoft Corporation. All rights reserved.
MOV64b.obj : error LNK2017: 'ADDR32' relocation to '.data' invalid without /LARGEADDRESSAWARE:NO
MOV64b.obj : error LNK2017: 'ADDR32' relocation to 'array' invalid without /LARGEADDRESSAWARE:NO
LINK90 : fatal error LNK1165: link failed because of fixup errors
Of course, adding /LARGEADDRESSAWARE:NO to link's cmdline makes the errors disappear, but the image is then restricted to run in the first 4 GB of the address space.
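A rough Python sketch of why an absolute 32-bit fixup only works for low addresses. It assumes the usual x86-64 rule that a 32-bit displacement is sign-extended to 64 bits, so strictly speaking only the lowest 2 GB (and the very top of the address space) are reachable that way. The 0x00400000 base and the 0x50 offset are hypothetical values for illustration; 0x140000000 is the usual default base for 64-bit executables.

```python
# Sketch: can an absolute address be used as the 32-bit displacement of
# [disp32 + index*scale]? In 64-bit mode the CPU sign-extends disp32.
DEFAULT_IMAGE_BASE_X64 = 0x140000000  # usual default base for 64-bit EXEs

def fits_disp32(address):
    """True if the address survives truncation to 32 bits plus sign extension."""
    low32 = address & 0xFFFFFFFF
    sign_extended = low32 - (1 << 32) if low32 & 0x80000000 else low32
    return (sign_extended & 0xFFFFFFFFFFFFFFFF) == address

low_address = 0x00400000 + 0x50               # hypothetical low image base
high_address = DEFAULT_IMAGE_BASE_X64 + 0x50  # default 64-bit image base

print(fits_disp32(low_address))    # representable as disp32
print(fits_disp32(high_address))   # not representable
```

An image loaded at the default 64-bit base therefore cannot reach array through an absolute disp32, which is consistent with the linker rejecting the ADDR32 relocations unless the image is kept in low memory.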
In fact, I just hope that UASM64.EXE can assemble the following three forms:
MOV reg1,VAR[reg2*n]
MOV VAR[reg1*n],reg2
MOV Var[reg*n],imm
I never expected to be able to use the entire 64-bit address space, so I misunderstood your question all along.
You are wanting something that simply isn't possible.
You're either violating PE and address-space constraints or trying to hack in 32bit code-gen into 64bit mode.
var[base+idx*scale] addressing is fully supported in 32bit mode.
It should NEVER be used in 64bit code: even when the relocation is generated, it won't be compatible with PE without adding LARGEADDRESSAWARE:NO, which is a terrible idea for many reasons.
You won't find this addressing form generated by any 64bit compiler (e.g. C/C++) for this reason. It also prevents the addressing from being fully RIP relative. The accepted practice in 64bit code is to use LEA and then register_base+idx*scale:
lea rdi, myArray ; RIP relative.
mov eax, [rdi+rcx*4]
I believe NASM can generate RIP relative references like:
mov eax, [rel myArray + rcx*4]
but I'm not sure if that actually solves the L.A.W. problem as it's simply not encodable in x64. I think the only form that would work is with a fixed constant like:
mov eax, [rel myArray + 10]
as the ONLY addressing mode with RIP that x64 can handle is [RIP + ofs]
See:
https://stackoverflow.com/questions/48124293/can-rip-be-used-with-another-register-with-rip-relative-addressing
https://stackoverflow.com/questions/34058101/referencing-the-contents-of-a-memory-location-x86-addressing-modes
Quote: var[base+idx*scale] addressing is fully supported in 32bit mode.
it should NEVER be used in 64bit code as even when the relocation is generated it won't be compatible with PE without adding LARGEADDRESSAWARE:NO - which is a terrible idea for many reasons.
I have the following instruction in my 64-bit heapsort implementation (https://lucho.ddns.net/x64uasm.eng/heapSort.s), tested and working in both ELF (Unix-like) and PE (Windows) variants without problems:
CMP R8,[RDI+8*R10+8]
which both GAS and UASM encode as
4E 3B 44 D7 08
Here, of course, RDI is the base, R10 is the index, 8 is the scale, and the other 8 is the offset (or displacement - don't know which is the correct term here).
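For anyone who wants to check the encoding by hand, here is a small Python sketch that picks the five bytes apart. It handles only this one form (REX.W, opcode 3B, ModRM with a SIB byte and an 8-bit displacement) and is not a general disassembler; the function name is made up for the example.

```python
# Sketch: decode REX + opcode + ModRM + SIB + disp8 of one instruction.
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
         "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]

def decode_cmp_r64_mem(code):
    """Decode 'REX.W 3B /r' (CMP r64, r/m64) with a SIB byte and disp8."""
    rex, opcode, modrm, sib, disp8 = code
    assert rex & 0xF0 == 0x40 and opcode == 0x3B
    rex_r, rex_x, rex_b = (rex >> 2) & 1, (rex >> 1) & 1, rex & 1
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    assert mod == 0b01 and rm == 0b100       # disp8 follows, SIB byte present
    scale = 1 << (sib >> 6)                  # 2-bit field: 1, 2, 4 or 8
    index = ((sib >> 3) & 7) | (rex_x << 3)  # REX.X extends the index field
    base = (sib & 7) | (rex_b << 3)          # REX.B extends the base field
    dest = reg | (rex_r << 3)                # REX.R extends the reg field
    disp = disp8 - 256 if disp8 & 0x80 else disp8
    return f"cmp {REG64[dest]},[{REG64[base]}+{REG64[index]}*{scale}+{disp}]"

text = decode_cmp_r64_mem(bytes.fromhex("4E3B44D708"))
print(text)
```

Nothing in these bytes refers to a symbol: the 08 is a plain constant, so no relocation is needed and the image base does not matter.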
Yep that form is fine,
[base+idx*scale+constantofs]
The problem is that if you want to reference a symbol, like an array, then you can't use a constant in 64bit; it needs to be RIP relative, which you can't encode, unless you use LARGEADDRESSAWARE:NO so that the symbol's relative distance is restricted in the address space. Also, you can't omit the base register, so it needs a pair.
My "Hello, world!" for Windows (https://lucho.ddns.net/x64uasm.eng/hellow6u.s) includes the following instruction:
LEA R9,WRITTEN
which "UASM -win64 -q -mf -Fl -Sa -zcw -Zd -Zi8" and LINK encode as
4c 8d 0d d3 3f 00 00
which machine code Cygwin's "objdump -d" disassembles as
lea 0x3fd3(%rip),%r9
Whether I give the linker /LARGEADDRESSAWARE or /LARGEADDRESSAWARE:NO, it produces the above machine code.
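The reason the linker switch makes no difference here can be sketched in Python: the seven bytes hold a displacement relative to RIP, not an absolute address. The decoder below handles only this one form (REX.W, opcode 8D, ModRM mod=00 rm=101), and the instruction address 0x140001000 is a hypothetical value picked for the example.

```python
import struct

def decode_lea_rip_relative(code, address):
    """Decode 'REX.W 8D /r' with ModRM mod=00, rm=101 (RIP + disp32).

    `address` is where the instruction starts.
    Returns (destination register number, target address).
    """
    rex, opcode, modrm = code[0], code[1], code[2]
    assert rex & 0xF8 == 0x48 and opcode == 0x8D   # REX.W set, LEA
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    assert mod == 0b00 and rm == 0b101             # RIP-relative, no SIB byte
    reg |= ((rex >> 2) & 1) << 3                   # REX.R extends the reg field
    (disp32,) = struct.unpack_from("<i", code, 3)  # signed, little-endian
    next_rip = address + len(code)                 # RIP points past the instruction
    return reg, next_rip + disp32

code = bytes.fromhex("4c8d0dd33f0000")
reg, target = decode_lea_rip_relative(code, address=0x140001000)  # hypothetical address
print(reg, hex(target))
```

Register 9 is R9, and the target is always 0x3fd3 bytes past the end of the instruction wherever the image is loaded, so this form needs no ADDR32 fixup.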
By the way, what's the real disadvantage of being "large address unaware" besides the 4 GB limit for a single process?
Quote from: lucho on March 29, 2025, 07:46:27 PM
Whether I give the linker /LARGEADDRESSAWARE or /LARGEADDRESSAWARE:NO, it produces the above machine code.
By the way, what's the real disadvantage of being "large address unaware" besides the 4 GB limit for a single process?
With 64-bit code, a bare /LARGEADDRESSAWARE is useless, as it is the linker's default for a 64-bit exe.
Quote from: lucho on March 29, 2025, 07:46:27 PM
what's the real disadvantage of being "large address unaware" besides the 4 GB limit for a single process?
Having more than 4GB of address space is the only valid argument for not sticking with 32-bit code.
Quote from: jj2007 on March 29, 2025, 10:51:30 PM
Having more than 4GB of address space is the only valid argument for not sticking with 32-bit code.
More registers?
I can count the occasions when I ran out of registers on the fingers of one hand. And even then it wasn't a tight loop.
Having 16 instead of 8 integer registers and the lack of the 4 GB RAM limit are very important.
The 4 GB limit for a single process (or maybe a single thread, I don't know) is not so important.
Show me one piece of your software where eight registers were not enough. And another one where 4GB of address space were not enough.
Quote from: jj2007 on March 29, 2025, 10:51:30 PM
Quote from: lucho on March 29, 2025, 07:46:27 PM
what's the real disadvantage of being "large address unaware" besides the 4 GB limit for a single process?
Having more than 4GB of address space is the only valid argument for not sticking with 32-bit code.
Couldn't agree more. :thumbsup:
Quote from: jj2007 on April 01, 2025, 11:47:11 PM
Show me one piece of your software where eight registers were not enough.
Here you are:
; Multiply the 64-bit unsigned numbers in registers EDX:ECX (Y1:Y0) and EBX:EAX
; (Z1:Z0) and return the 128-bit result in registers EDX:ECX:EBX:EAX (A:B:C:D).
; Algorithm: Peter Norton, "Advanced Assembly Language", 1991, pp. 229-230
.code
_um64x64 proc uses ESI EDI EBP
MOV ESI,EAX ; Z0
MOV EDI,EDX ; Y1
PUSH EDX ; save Y1
MUL ECX ; Z0 * Y0
XCHG EAX,ESI ; ESI = D = Low (Z0 * Y0), EAX = Z0
XCHG EDX,EDI ; EDI = C = High(Z0 * Y0), EDX = Y1
MUL EDX ; Z0 * Y1, high word is at most 0xFFFFFFFE
XOR EBP,EBP ; A = 0
ADD EDI,EAX ; C = High(Z0 * Y0) + Low(Z0 * Y1)
ADC EDX,EBP ; High(Z0 * Y1), the sum is at most 0xFFFFFFFF, CF=0
XCHG EDX,ECX ; ECX = B = High(Z0 * Y1), EDX = Y0
MOV EAX,EBX ; Z1
MUL EDX ; Z1 * Y0
ADD EDI,EAX ; C = High(Z0 * Y0) + Low (Z0 * Y1) + Low(Z1 * Y0)
ADC ECX,EDX ; B = High(Z0 * Y1) + High(Z1 * Y0)
ADC EBP,EBP ; A
XCHG EAX,EDI ; EAX = C
XCHG EAX,EBX ; EAX = Z1, EBX = C
POP EDX ; restore Y1
MUL EDX ; Z1 * Y1
ADD ECX,EAX ; B = High(Z0 * Y1) + High(Z1 * Y0) + Low(Z1 * Y1)
ADC EDX,EBP ; A = High(Z1 * Y1)
XCHG EAX,ESI ; EAX = D
RET
_um64x64 endp
end
Compare the above with the elegant 64-bit implementation (https://lucho.ddns.net/x64uasm.eng/umul128.s) of the same algorithm.
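For readers who want to follow the arithmetic without tracking registers, here is a Python sketch of the same schoolbook scheme: four 32x32->64 partial products summed into the limbs A:B:C:D. It mirrors the partial-product layout in the comments of the 32-bit listing, not its register scheduling, and the helper names are made up for the example.

```python
MASK32 = 0xFFFFFFFF

def um64x64(y, z):
    """Multiply two unsigned 64-bit ints using only 32x32->64 products.

    Returns the 128-bit product as four 32-bit limbs (a, b, c, d), a highest.
    """
    y1, y0 = y >> 32, y & MASK32
    z1, z0 = z >> 32, z & MASK32

    p00 = z0 * y0   # contributes to D and C
    p01 = z0 * y1   # contributes to C and B
    p10 = z1 * y0   # contributes to C and B
    p11 = z1 * y1   # contributes to B and A

    d = p00 & MASK32
    c_sum = (p00 >> 32) + (p01 & MASK32) + (p10 & MASK32)
    c = c_sum & MASK32
    b_sum = (c_sum >> 32) + (p01 >> 32) + (p10 >> 32) + (p11 & MASK32)
    b = b_sum & MASK32
    a = ((b_sum >> 32) + (p11 >> 32)) & MASK32
    return a, b, c, d

def limbs_to_int(a, b, c, d):
    """Reassemble the four limbs into one Python integer."""
    return (a << 96) | (b << 64) | (c << 32) | d

y, z = 0xFFFFFFFFFFFFFFFF, 0xFEDCBA9876543210
print(limbs_to_int(*um64x64(y, z)) == y * z)
```

In 64-bit code a single MUL of two 64-bit registers yields the whole 128-bit product in RDX:RAX, which is why the partial-product bookkeeping disappears.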
Quote: And another one where 4GB of address space were not enough.
When I claim that 4 GB of address space is not enough, I mean that the total RAM usage of today's software that resides at the same time in memory exceeds 4 GB, not that 4 GB are not enough for a single process or thread, which is rarely so indeed.
Ok, you won: if I ever need a 128-bit result in four dword registers, I'll switch to 64-bit code for that particular task :thumbsup:
With dramatically reduced specifications for the required accuracy, I usually find a simple solution:
include \masm32\MasmBasic\MasmBasic.inc
num64_1 QWORD 1111111111111111111
num64_2 QWORD 2222222222222222222
.DATA?
result64 REAL10 ?
Init
fild num64_1
fild num64_2
deb 4, "On FPU", ST(0), ST(1)
fmul
fstp result64
Print Str$("\nThe result is %Jf", result64)
EndOfCode
On FPU
ST(0) 2222222222222222222.
ST(1) 1111111111111111111.
The result is 2.469135802469135802e+36
Which is, of course, much less accurate than the 2,4691358024691358019753086419753e+36 provided by Windows Calc...
Quote from: lucho on April 03, 2025, 01:40:25 AM
When I claim that 4 GB of address space is not enough, I mean that the total RAM usage of today's software that resides at the same time in memory exceeds 4 GB
Sorry, but total RAM use of all running processes is irrelevant. They all have their separate address space. Total RAM used by all processes together can exceed 4GB due to swapping and paging.
I'm old-school enough to remember that you NEVER used memory, always registers, and that carries on for me today.
Having 4GB of address space means having 2GB of usable space (usually). My database is not there yet (just over 1GB) but it's grown to that size in just over 2 years so I am (fingers crossed) future-proof.
It's no big deal, 32- or 64-bit. There are a few nifty little tricks in 64-bit but if you're satisfied with 32-bit then don't change. The MASM64 SDK is all but useless and unfortunately the 32-bit SDK is missing anything from the last 10-15 years (Windows 7 was released in 2009).
Quote from: sinsi on April 03, 2025, 07:28:48 PM
the 32-bit SDK is missing anything from the last 10-15 years
True but not a major obstacle. You can add the handful of important new functions by hand.
Quote from: jj2007 on April 03, 2025, 06:58:03 PM
Which is, of course, much less accurate than the 2,4691358024691358019753086419753e+36 provided by Windows Calc...
Even this 2469135802469135801975308641975300000 is not the same as the correct result of 2469135802469135801975308641975308642.
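The size of the gap can be sketched in Python, which has exact big integers. The helper below rounds an integer to 64 significant bits, which is the significand width of REAL10; it assumes the FPU runs at full 64-bit precision in round-to-nearest mode, and the function name is made up for the example.

```python
def round_to_significand(value, bits=64):
    """Round a positive integer to `bits` significant bits (round half to even)."""
    excess = value.bit_length() - bits
    if excess <= 0:
        return value                      # already fits, nothing is lost
    quotient, remainder = divmod(value, 1 << excess)
    half = 1 << (excess - 1)
    if remainder > half or (remainder == half and quotient & 1):
        quotient += 1
    return quotient << excess

exact = 1111111111111111111 * 2222222222222222222
as_real10 = round_to_significand(exact, 64)

print(exact)               # exact product
print(as_real10)           # nearest value with a 64-bit significand
print(exact.bit_length())  # bits needed for the exact product
```

The exact product needs 121 bits, so a 64-bit significand has to drop the low 57 bits; that leaves roughly 19 reliable decimal digits, which matches the FPU output quoted earlier.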
Quote: Sorry, but total RAM use of all running processes is irrelevant. They all have their separate address space. Total RAM used by all processes together can exceed 4GB due to swapping and paging.
Agreed, of course. Never claimed that the 4GB barrier is important for a single process or thread.
Note however that the total virtual address space (incl. swapping/paging) for 32-bit CPUs is 4GB.
And it remains 4GB even if "physical address extension" is used to extend the physical RAM size.