mytest proc uses rsi rdi rbx gets ignored, apparently this feature doesn't exist in the SDK:
include \masm64\include64\masm64rt.inc ; *** test of the uses feature ***
.code
mytest proc uses rsi rdi rbx arg1, arg2, arg3, arg4, arg5
Local loc1, loc2, loc3
mov eax, arg1
mov loc1, eax
mov edx, arg5
mov loc3, edx
ret
mytest endp
entry_point proc
Local pContent:QWORD, ticks:QWORD ; OPT_Assembler ml64
lea rax, entry_point
conout "The entry point is at ", hex$(rax), lf
INT 3
invoke mytest, 11111111h, 22222222h, 33333333h, 44444444h, 55555555h
invoke ExitProcess, 0 ; terminate process
entry_point endp
end
000000013F8F107C | CC | int3 |
000000013F8F107D | 48:C7C1 11111111 | mov rcx,11111111 |
000000013F8F1084 | 48:C7C2 22222222 | mov rdx,22222222 |
000000013F8F108B | 49:C7C0 33333333 | mov r8,33333333 |
000000013F8F1092 | 49:C7C1 44444444 | mov r9,44444444 |
000000013F8F1099 | 48:C74424 20 55555555 | mov [rsp+20],55555555 |
000000013F8F10A2 | E8 59FFFFFF | call <sub_13F8F1000> |
...
000000013F8F1000 | C8 8000 00 | enter 80,0 |
000000013F8F1004 | 48:83EC 70 | sub rsp,70 |
000000013F8F1008 | 48:894D 10 | mov [rbp+10],rcx |
000000013F8F100C | 48:8955 18 | mov [rbp+18],rdx |
000000013F8F1010 | 4C:8945 20 | mov [rbp+20],r8 |
000000013F8F1014 | 4C:894D 28 | mov [rbp+28],r9 |
000000013F8F1018 | 8B45 10 | mov eax,[rbp+10] |
000000013F8F101B | 8945 9C | mov [rbp-64],eax |
000000013F8F101E | 8B55 30 | mov edx,[rbp+30] |
000000013F8F1021 | 8955 94 | mov [rbp-6C],edx |
000000013F8F1024 | C9 | leave |
000000013F8F1025 | C3 | ret |
If you feel that the | vertical | lines are a little bit misaligned: not my fault, it's a forum quirk. When posting, they look ok, but once you edit the post to correct a typo, they get misaligned.
Admin' EDIT: I straightened it up for you :smiley:
The SDK has a customized prologue and epilogue.
Hutch's design is to use macros:
mytest proc arg1, arg2, arg3, arg4, arg5
USING rsi rdi rbx
Local loc1, loc2, loc3
SaveRegs
mov eax, arg1
mov loc1, eax
mov edx, arg5
mov loc3, edx
RestoreRegs
ret
mytest endp
But of course you can do this:
option PROLOGUE:Prologuedef
option EPILOGUE:Epiloguedef
mytest proc uses rsi rdi rbx arg1, arg2, arg3, arg4, arg5
Local loc1, loc2, loc3
mov eax, arg1
mov loc1, eax
mov edx, arg5
mov loc3, edx
ret
mytest endp
Search results for: "register preservation macros"
https://masm32.com/board/index.php?topic=7278.0
https://masm32.com/board/index.php?topic=7285.0 look especially here for explanation of "USING" versus "uses"
The search tool is your friend. :smiley:
If you look around, hutch was also experimenting with other similar variations to save and restore registers...
Interesting, thanks :thumbsup:
However, when I use prologuedef, this is the outcome:
000000013F841000 | 55 | push rbp |
000000013F841001 | 48:8BEC | mov rbp,rsp |
000000013F841004 | 48:83C4 F0 | add rsp,FFFFFFFFFFFFFFF0 |
000000013F841008 | 56 | push rsi |
000000013F841009 | 57 | push rdi |
000000013F84100A | 53 | push rbx |
000000013F84100B | 8B45 10 | mov eax,[rbp+10] |
000000013F84100E | 8945 FC | mov [rbp-4],eax |
000000013F841011 | 8B55 30 | mov edx,[rbp+30] |
000000013F841014 | 8955 F4 | mov [rbp-C],edx |
000000013F841017 | 5B | pop rbx |
000000013F841018 | 5F | pop rdi |
000000013F841019 | 5E | pop rsi |
000000013F84101A | C9 | leave |
000000013F84101B | C3 | ret |
push rbp etc is faster than enter, good, but after the three pushes rsp is aligned 8, not aligned 16 :rolleyes:
I also tested USING rsi rdi rbx - no effect :sad:
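On the alignment point: a minimal hand-written sketch of how three pushes can be combined with a 16-aligned stack. This is not what PrologueDef or the SDK macros emit. It assumes plain ml64 without masm64rt.inc, so that a PROC without arguments or locals gets no automatic prologue; mytest_sketch is a hypothetical name.

```masm
; Sketch only, assuming plain ml64 (no masm64rt.inc): PROC emits no prologue here
.code
mytest_sketch proc              ; hypothetical name
    push rbp                    ; on entry rsp = 16n+8, after this push rsp = 16n
    mov  rbp, rsp
    push rsi                    ; rsp = 16n-8
    push rdi                    ; rsp = 16n-16
    push rbx                    ; rsp = 16n-24, i.e. aligned 8 only
    sub  rsp, 28h               ; 32 bytes shadow space + 8 bytes padding: aligned 16 again
    ; ... body; calling API functions is safe from here on ...
    lea  rsp, [rbp-18h]         ; point rsp at the saved rbx (three pushes = 24 bytes)
    pop  rbx
    pop  rdi
    pop  rsi
    pop  rbp
    ret
mytest_sketch endp
end
```

With an odd number of pushed registers, the extra 8 bytes in the sub rsp are what restores the alignment; with an even number, a multiple of 16 would do.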
P.S., got it:
000000013FFC1000 | C8 8000 00 | enter 80,0 |
000000013FFC1004 | 48:81EC 90000000 | sub rsp,90 |
000000013FFC100B | 48:894D 10 | mov [rbp+10],rcx |
000000013FFC100F | 48:8975 80 | mov [rbp-80],rsi |
000000013FFC1013 | 48:897D 88 | mov [rbp-78],rdi |
000000013FFC1017 | 48:895D 90 | mov [rbp-70],rbx |
000000013FFC101B | 8B45 10 | mov eax,[rbp+10] |
000000013FFC101E | 8985 7CFFFFFF | mov [rbp-84],eax |
000000013FFC1024 | 8995 74FFFFFF | mov [rbp-8C],edx |
000000013FFC102A | 48:8B75 80 | mov rsi,[rbp-80] |
000000013FFC102E | 48:8B7D 88 | mov rdi,[rbp-78] |
000000013FFC1032 | 48:8B5D 90 | mov rbx,[rbp-70] |
000000013FFC1036 | C9 | leave |
000000013FFC1037 | C3 | ret |
Still the horribly slow enter 80, 0, but the rest is ok. However, you need to use three macros to achieve that!
mytest proc arg1
USING rsi, rdi, rbx
Local loc1, loc2, loc3
SaveRegs
mov eax, arg1
mov loc1, eax
mov loc3, edx
RestoreRegs
ret
mytest endp
You have to use SaveRegs and RestoreRegs as well as USING. It doesn't work automatically by just declaring USING... as in the first example that HSE posted.
Probably not a perfect solution (the Masm64 SDK is still in beta, after all), but it does seem to work.
If you think this is an "issue", maybe start a "Masm64 SDK Known Issues" thread? If such a thread is created, we can pin it to the top of the board for any other issues that come up to be added to that thread.
The x64 ABI says that the stack pointer doesn't need to be 16-aligned in leaf functions. Naturally, what for? (I'm learning that now :smiley: )
Quote from: HSE on August 28, 2023, 12:00:16 PM
The x64 ABI says that the stack pointer doesn't need to be 16-aligned in leaf functions. Naturally, what for? (I'm learning that now :smiley: )
That's correct. Not a problem for Hutch's version of the prologue, which is always aligned 16:
mytest proc arg1
USING rsi, rdi, rbx
Local loc1, loc2, loc3
SaveRegs
mov eax, arg1
mov loc1, eax
invoke MessageBoxA, 0, chr$("A message box"), chr$("Hi:"), MB_OK
mov loc3, edx
RestoreRegs
ret
mytest endp
It's pretty inefficient, though:
000000013F3E1024 | 31C9 | xor ecx,ecx |
000000013F3E1026 | 90 | nop |
000000013F3E1027 | 90 | nop |
000000013F3E1028 | 90 | nop |
000000013F3E1029 | 90 | nop |
000000013F3E102A | 90 | nop |
000000013F3E102B | 48:8B15 38200000 | mov rdx,[13F3E306A] | 000000013F3E306A:&"A message box"
000000013F3E1032 | 4C:8B05 3D200000 | mov r8,[13F3E3076] | 000000013F3E3076:&"Hi:"
000000013F3E1039 | 45:31C9 | xor r9d,r9d |
000000013F3E103C | 90 | nop |
000000013F3E103D | 90 | nop |
000000013F3E103E | 90 | nop |
000000013F3E103F | 90 | nop |
000000013F3E1040 | FF15 EA0F0000 | call [<&MessageBoxA>] |
Nine bytes less by using xor instead of mov reg, 0.
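For reference, a small sketch of where the nine bytes come from. The byte counts follow the disassembly in this thread (48:C7C1 plus imm32 for mov rcx, imm; 31C9 and 45:31C9 for the two xor forms); encoding_demo is a hypothetical name, and plain ml64 without the SDK include is assumed.

```masm
; Sketch only, assuming plain ml64: compare the sizes in the listing file (ml64 /c /Fl)
.code
encoding_demo proc              ; hypothetical name
    mov rcx, 0                  ; 7 bytes: REX.W + C7 C1 + imm32
    xor ecx, ecx                ; 2 bytes; a write to ecx zero-extends into rcx
    mov r9, 0                   ; 7 bytes: REX.W+B + C7 C1 + imm32
    xor r9d, r9d                ; 3 bytes; the REX prefix is needed to reach r9d
    ret
encoding_demo endp
end
```

That is 5 bytes saved for rcx and 4 for r9, matching the five and four nops above. One difference to keep in mind: xor modifies the flags, mov does not.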
Honestly, this is pretty awful - I attach a testbed:
- Masm64 SDK does not crash, but it needs three macros to emulate some proc uses rsi rdi rbx, and the encoding is inefficient (mov rcx, 0 is 5 bytes longer than the equivalent xor ecx, ecx)
- Masm64 default prologue does not take care of alignment and crashes
- unless you put the right number and type of locals there
include \masm64\include64\masm64rt.inc ; *** test of the uses feature ***
.code
mytestOk proc arg1 ; uses default Masm64 SDK prologue (Hutch)
USING rsi, rdi, rbx
Local loc1, loc2, loc3
SaveRegs
mov eax, arg1
mov loc1, eax
invoke MessageBoxA, 0, chr$("This message box works"), chr$("Hi:"), MB_OK
mov loc3, edx
RestoreRegs
ret
mytestOk endp
OPTION PROLOGUE:PrologueDef ; from here on, let MASM handle the stack frame
OPTION EPILOGUE:EpilogueDef
mytest proc uses rsi rdi rbx arg1
Local loc1, loc2, loc3
mov eax, arg1
mov loc1, eax
invoke MessageBoxA, 0, chr$("You won't see this one"), chr$("Hi:"), MB_OK
mov loc3, edx
ret
mytest endp
entry_point proc
Local pContent:QWORD, ticks:QWORD ; take away the locals to see a crash
invoke mytestOk, 11111111h ; this one works fine
invoke mytest, 11111111h ; this one crashes because of misalignment
invoke ExitProcess, 0
entry_point endp
end
Quote from: jj2007 on August 28, 2023, 07:06:00 PM
Masm64 SDK does not crash,
The Masm64 SDK cannot crash. It is a collection of include files, libraries, macros, examples, etc. Unless you are talking about code from a specific macro that crashes.
If you are talking about ml64 crashing, no one here has any control over how ml64 operates, nor did hutch. For that, you need to register a complaint with Microsoft.
Hutch had gone to great lengths to make the Masm64 SDK as easy to use as the Masm32 SDK. For some of ml64's shortcomings hutch has written macros. i.e., "USING", "SaveRegs", "RestoreRegs", and others.
Seems that you are trying to compare how ml64 operates versus uasm64. While uasm might be a great tool, it is not quite a drop-in replacement for ml64. There are too many differences. Uasm has options that ml64 does not. uasm uses prototypes, ml64 does not. That does not put the Masm64 SDK at fault though, as you seem to be implying but rather ml64 itself.
It appears that you had not read through hutch's posts regarding the efforts he had put into the Masm64 SDK, what he did to improve writing code for ml64, to make life easier for the programmer. Some macros might just be inefficient, but hutch did what he had to do to make it work. You had plenty of time while he was still alive to offer up suggestions while he was testing out different versions of the macros for saving and restoring the 64 bit registers and keeping the stack properly aligned and balanced, etc.
Saying that the Masm64 SDK is faulty for ml64's shortcomings is laughable. Imagine what hutch would say at that suggestion... besides Masm64 SDK is still in beta (i.e., a Work In Progress, and yet unfinished)
Just my thoughts at the moment. :azn:
Quote from: zedd151 on August 29, 2023, 12:11:37 AM
Quote from: jj2007 on August 28, 2023, 07:06:00 PM
Masm64 SDK does not crash,
The Masm64 SDK cannot crash. It is a collection of include files...
I apologise for my sloppy wording. Of course, I meant "code written with the Masm64 SDK does not crash".
Quote from: jj2007 on August 29, 2023, 12:38:28 AM
I apologise for my sloppy wording. Of course, I meant "code written with the Masm64 SDK does not crash".
I kind of figured that is what you meant, but thanks for the clarification. ml64 is not a nice tool to use, all we can do is write proper code for it. hutch did a pretty good job in what he has achieved with the Masm64 SDK.
Once stoo23 is able to get hutch's Masm64 stuff, there will most likely be some updated macros, perhaps a more efficient set for preserving the registers. We can only hope.
I am looking into doing more 64 bit programming, btw. Using ml64 of course. First I will try some more 32->64 conversions, with multiple procedures as opposed to the more simple conversions that I had done in the past, which largely were very easily converted. :biggrin:
As a side note:
One thing that really bothers me about the SDK is ".if rax { 4" style of ".if" notation. Probably a macro that uses textequ might be able to restore using "<", ">" so the code looks more normal. But needs further investigation, in another thread.
Hello,
Hutch did a great job to create and maintain the Masm64 package.
Here is a known method to preserve the non-volatile registers without specifying USES:
include \masm32\include64\masm64rt.inc
.code
start PROC
invoke main,10,20,30,40
invoke ExitProcess,0
start ENDP
main PROC a:QWORD,b:QWORD,c:QWORD,d:QWORD
LOCAL .rsi:QWORD
LOCAL .rdi:QWORD
LOCAL .rbx:QWORD
mov .rsi,rsi
mov .rdi,rdi
mov .rbx,rbx
xor rbx,rbx
xor rsi,rsi
xor rdi,rdi
mov rsi,.rsi
mov rdi,.rdi
mov rbx,.rbx
mov rax,1
ret
main ENDP
END
Yes, that's one option, Erol, but do you realise that you replaced sometest proc uses rsi rdi rbx args with 12 lines of additional code? IMHO the prologue macro can handle this, without any user intervention.
Hi Jochen,
That's true. Hutch used this method in the Masm64 package. An example is \masm64\m64lib\stdout.asm. Maybe I am wrong but the direct register write might be faster than the push\pop pair.
Quote from: Vortex on August 29, 2023, 05:01:59 AM
the direct register write might be faster than the push\pop pair.
That should be tested, see attachment (pure Masm64 SDK code). Both methods require writing to and reading from memory.
Quote from: jj2007 on August 29, 2023, 05:19:50 AM
That should be tested
pushing took 1887 ms
moving took 1888 ms
pushing took 1622 ms ; <----
moving took 2153 ms ; <----
pushing took 1607 ms
moving took 1872 ms
pushing took 1607 ms
moving took 1887 ms
second run
pushing took 1872 ms
moving took 1887 ms
pushing took 1607 ms ; <----
moving took 2137 ms ; <----
pushing took 1623 ms
moving took 1872 ms
pushing took 1607 ms
moving took 1872 ms
third run
pushing took 1888 ms
moving took 1903 ms
pushing took 1623 ms ; <----
moving took 2152 ms ; <----
pushing took 1623 ms
moving took 1887 ms
pushing took 1607 ms
moving took 1888 ms
Odd, always the second iteration...
Hi Vortex,
Quote from: Vortex on August 29, 2023, 04:12:01 AM
Here is a known method to preserve the non-volatile registers without specifying USES:
I remember we helped Hutch do the same thing with the macros. :thumbsup:
I think the idea behind this method is that you can trash the register in the first part of the procedure, and later use the original value without needing to store it twice. Registers stored by "uses" are a little hard to find from inside the procedure.
Hi HSE,
You are right, registers stored by "uses" are not easy to find.
By the way, it looks like the push/pop pair is faster than the mov instruction; I tested Jochen's code.
Quote from: Vortex on August 29, 2023, 06:45:58 AM
it looks like the push/pop pair is faster than the mov instruction,
:biggrin: :biggrin:
pushing took 1063 ms
moving took 953 ms
pushing took 906 ms
moving took 828 ms
pushing took 938 ms
moving took 844 ms
pushing took 906 ms
moving took 953 ms
Here are my results :
pushing took 1825 ms
moving took 1810 ms
pushing took 1544 ms
moving took 2059 ms
pushing took 1529 ms
moving took 1857 ms
pushing took 1638 ms
moving took 1809 ms
Different processors, different results.
Quote from: zedd151 on August 29, 2023, 05:38:57 AM
Odd, always the second iteration...
I added an align 16, now it looks stable on my machine:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
pushing took 1592 ms
moving took 1841 ms
pushing took 1575 ms
moving took 1825 ms
pushing took 1591 ms
moving took 1825 ms
pushing took 1592 ms
moving took 1856 ms
I also added a Masm64 SDK-compatible PrintCpu macro for Héctor ;-)
Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz
pushing took 1716 ms
moving took 2091 ms
pushing took 1794 ms
moving took 2090 ms
pushing took 1779 ms
moving took 2090 ms
pushing took 1763 ms
moving took 1934 ms
Press any key to continue . . .
Looks better
We need some AMD's
Quote from: zedd151 on August 29, 2023, 07:22:29 AM
We need some AMD's
Your wish is my command:
AMD Athlon Gold 3150U with Radeon Graphics
pushing took 1828 ms
moving took 1844 ms
pushing took 1875 ms
moving took 1984 ms
pushing took 1906 ms
moving took 1875 ms
pushing took 1875 ms
moving took 1906 ms
Quote from: jj2007 on August 29, 2023, 07:29:28 AM
AMD Athlon Gold 3150U with Radeon Graphics
Not a big variance. A little flip-flopping, though. I would call it about even for your AMD.
AMD Ryzen 5 3400G
pushing took 1375 ms
moving took 1390 ms
pushing took 1375 ms
moving took 1407 ms
pushing took 1453 ms
moving took 1437 ms
pushing took 1469 ms
moving took 1641 ms
pushing took 1453 ms
moving took 1390 ms
pushing took 1391 ms
moving took 1390 ms
pushing took 1391 ms
moving took 1390 ms
pushing took 1391 ms
moving took 1406 ms
pushing took 1625 ms
moving took 1359 ms
pushing took 1359 ms
moving took 1375 ms
pushing took 1375 ms
moving took 1375 ms
pushing took 1360 ms
moving took 1359 ms
So it seems that AMD CPUs take exactly the same amount of cycles, while Intel CPUs are slightly faster with push & pop.
This is remarkable, since the hype around the x64 ABI is based on the idea that moving stuff is faster than pushing :cool:
x64 Architecture (https://learn.microsoft.com/en-us/windows-hardware/drivers/debugger/x64-architecture) is an interesting read. Did you know that you can align 16 the stack with a simple, short and spl, 0F0h?
48:83E4 F0 | and rsp,FFFFFFFFFFFFFFF0 | OK
83E4 F0 | and esp,FFFFFFF0 | not recommended, clears upper dword
66:83E4 F0 | and sp,FFF0 | OK
40:32E4 | xor spl,spl | align stack 256
40:80E4 F0 | and spl,F0 | OK
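A minimal sketch of how the and spl, 0F0h trick could sit inside a procedure. It assumes plain ml64 without the SDK include (no automatic prologue), and align_sketch is a hypothetical name. The old stack pointer has to be kept somewhere, here in rbp, because the and cannot be undone by arithmetic.

```masm
; Sketch only, assuming plain ml64 (no masm64rt.inc)
.code
align_sketch proc               ; hypothetical name
    push rbp
    mov  rbp, rsp               ; remember the incoming stack pointer
    sub  rsp, 20h               ; 32 bytes shadow space for a callee
    and  spl, 0F0h              ; clear the low 4 bits of rsp: aligned 16, whatever it was before
    ; ... call something here ...
    mov  rsp, rbp               ; undoes both the sub and the and
    pop  rbp
    ret
align_sketch endp
end
```

Since the and can only lower rsp, the 32 bytes reserved before it stay available.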
Another interesting bit:
Quote: "The caller reserves space on the stack for arguments passed in registers"
It doesn't say anything about our dear habit to put a sub rsp, 80h somewhere on top of the proc. It just says for arguments passed in registers, i.e. rcx, rdx, r8 and r9. At least that's what I read into this phrase - xmm0 is a register, right?
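To make the quoted sentence concrete, here is a hedged sketch of a caller that reserves exactly that space, plus room for a fifth argument. The offsets match the disassembly at the top of the thread (fifth argument written to [rsp+20h] before the call). callee5 and caller_sketch are hypothetical names, and plain ml64 without the SDK include is assumed.

```masm
; Sketch only, assuming plain ml64 (no masm64rt.inc)
.code
callee5 proc                    ; hypothetical leaf: returns arg1 + arg5
    mov eax, ecx                ; arg1 arrives in rcx
    add eax, dword ptr [rsp+28h] ; arg5: 8 bytes return address + 32 bytes shadow space
    ret
callee5 endp

caller_sketch proc              ; hypothetical name
    sub rsp, 38h                ; entry rsp = 16n+8; 32 shadow + 8 arg5 + 16 padding: aligned 16
    mov ecx, 1
    mov edx, 2
    mov r8d, 3
    mov r9d, 4
    mov dword ptr [rsp+20h], 5  ; fifth argument sits just above the shadow space
    call callee5
    add rsp, 38h
    ret
caller_sketch endp
end
```

The 32 bytes at [rsp] to [rsp+1Fh] belong to the callee, which may spill rcx, rdx, r8 and r9 there, as the SDK prologue does with mov [rbp+10], rcx and friends.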
Quote from: jj2007 on August 29, 2023, 07:09:07 PM
while Intel CPUs are slightly faster with push & pop.
Not exactly. Here the results are the same number: 5.59 cycles, and the variance is so big (179 and 160 cycles^2) that it's not possible to say very much.
The picture is from mov, but push/pop looks the same.
(https://i.postimg.cc/HVQ6SqXk/mov-stack.jpg) (https://postimg.cc/HVQ6SqXk)
What's your actual code? Here is mine:
method1:
push rsi
push rdi
push rbx
nop
pop rbx
pop rdi
pop rsi
ret
method2:
mov [rbp+16], rsi
mov [rbp+24], rdi
mov [rbp+32], rbx
nop
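mov rbx, [rbp+32]
mov rdi, [rbp+24] ; restore from the slot written above (was [rbp+32])
mov rsi, [rbp+16] ; restore from the slot written above (was [rbp+32])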
ret
...
REPEAT 4
mov ticks, rv(GetTickCount)
mov ecx, iterations
align 16
@@:
call method1
dec ecx
jns @B
sub rv(GetTickCount), ticks
invoke __imp__cprintf, cfm$("pushing took %i ms\n"), rax
mov ticks, rv(GetTickCount)
mov ecx, iterations
align 16
@@:
call method2
dec ecx
jns @B
sub rv(GetTickCount), ticks
invoke __imp__cprintf, cfm$("moving took %i ms\n"), rax
ENDM
function_under_glass5 macro
push rsi
push rdi
push rbx
nop
pop rbx
pop rdi
pop rsi
endm
function_under_glass6 macro
mov _rsi, rsi
mov _rdi, rdi
mov _rbx, rbx
nop
mov rbx, _rbx
mov rdi, _rdi
mov rsi, _rsi
endm
There is no call in this test.
.while !ZERO?
function_under_glass6
dec ebx
.endw
Quote from: HSE on August 29, 2023, 11:08:32 PM
There is no call in this test.
Interesting. "nc" stands for "no call":
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
*
pushing took 983 ms
moving took 1139 ms
nc pushing took 1030 ms
nc moving took 936 ms
pushing took 1045 ms
moving took 1139 ms
nc pushing took 1014 ms
nc moving took 952 ms
pushing took 998 ms
moving took 1201 ms
nc pushing took 1014 ms
nc moving took 936 ms
pushing took 967 ms
moving took 1263 ms
nc pushing took 1030 ms
nc moving took 951 ms
IMHO there should be a call, since we are talking about the best way to implement "uses rsi rdi". Anyway, it's an interesting result :cool:
Quote from: jj2007 on August 29, 2023, 11:27:19 PM
IMHO there should be a call, since we are talking about the best way to implement "uses rsi rdi".
Correct. There was too much variation in your test. I tried to split the problem to see where the variation is. It looks like access to the stack has that variation.