I am playing with variants of coding styles, and would appreciate some timings. So far, this consists of a simple loop calling an empty function in three variants:
- ordinary 32-bit assembly
- 64-bit, args constructed by pushing & poppin'
- 64-bit, args constructed through a permanent stack frame/spill space ("compiler style")
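For readers without the attachment, here is a minimal ml64 sketch of the kind of loop being timed. It assumes the "ticks" are GetTickCount milliseconds, which the output does not confirm. The names EmptyProc, NextCall and main, and returning the elapsed time as the exit code, are made up for illustration; the real test source is in the attachment.

```asm
; build (assumed): ml64 /c loop.asm
;                  link /subsystem:console /entry:main loop.obj kernel32.lib
extrn GetTickCount:proc
extrn ExitProcess:proc

.code
EmptyProc proc                  ; hypothetical stand-in for the empty test function
    ret
EmptyProc endp

main proc
    push rbx                    ; non-volatile registers used below
    push rsi
    sub rsp, 40                 ; 32 bytes shadow space + 8 to keep rsp 16-byte aligned
    call GetTickCount
    mov esi, eax                ; start time in ms
    mov ebx, 100000000          ; total calls, same count as in the results
NextCall:
    call EmptyProc
    dec ebx
    jnz NextCall
    call GetTickCount
    sub eax, esi                ; elapsed ms
    mov ecx, eax                ; returned as exit code (echo %errorlevel%)
    call ExitProcess
main endp
end
```

With GetTickCount, the resolution is typically around 15-16 ms, so small differences between runs should not be over-interpreted.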
First results:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
*** three variants ***
This code was assembled with ml in 32-bit mode
compiler style is OFF
** 124 ticks
** 109 ticks
** 125 ticks
** 109 ticks
** 109 ticks
70 bytes for proc
1126 bytes for calling
total calls: 100000000
This code was assembled with ml64 in 64-bit mode
compiler style is ON
** 187 ticks
** 203 ticks
** 203 ticks
** 203 ticks
** 187 ticks
106 bytes for proc
2157 bytes for calling
total calls: 100000000
This code was assembled with ml64 in 64-bit mode
compiler style is OFF
** 141 ticks
** 156 ticks
** 156 ticks
** 156 ticks
** 156 ticks
106 bytes for proc
1827 bytes for calling
total calls: 100000000
In case you want to build it yourself: the latest version of RichMasm is required (http://masm32.com/board/index.php?topic=94.0); lines 7 (OPT_64) and 22 (jbCompStyle) are the interesting ones. There is an int 3 in jp7, too 8)
AMD Athlon(tm) II X2 220 Processor
*** three variants ***
This code was assembled with ml in 32-bit mode
compiler style is OFF
** 125 ticks
** 125 ticks
** 125 ticks
** 125 ticks
** 141 ticks
70 bytes for proc
1126 bytes for calling
total calls: 100000000
This code was assembled with ml64 in 64-bit mode
compiler style is ON
** 218 ticks
** 234 ticks
** 235 ticks
** 234 ticks
** 250 ticks
106 bytes for proc
2157 bytes for calling
total calls: 100000000
This code was assembled with ml64 in 64-bit mode
compiler style is OFF
** 172 ticks
** 172 ticks
** 172 ticks
** 188 ticks
** 171 ticks
106 bytes for proc
1827 bytes for calling
total calls: 100000000
EDIT: How about testing in x64 enter/leave and rsp sub/add ?
Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz
*** three variants ***
This code was assembled with ml in 32-bit mode
compiler style is OFF
** 187 ticks
** 125 ticks
** 141 ticks
** 125 ticks
** 125 ticks
70 bytes for proc
1126 bytes for calling
total calls: 100000000
This code was assembled with ml64 in 64-bit mode
compiler style is ON
** 203 ticks
** 250 ticks
** 312 ticks
** 266 ticks
** 219 ticks
106 bytes for proc
2157 bytes for calling
total calls: 100000000
This code was assembled with ml64 in 64-bit mode
compiler style is OFF
** 172 ticks
** 203 ticks
** 172 ticks
** 172 ticks
** 172 ticks
106 bytes for proc
1827 bytes for calling
total calls: 100000000
Press any key to continue...
Intel(R) Core(TM) i3 CPU 540 @ 3.07GHz
*** three variants ***
This code was assembled with ml in 32-bit mode
compiler style is OFF
** 125 ticks
** 110 ticks
** 125 ticks
** 109 ticks
** 125 ticks
70 bytes for proc
1126 bytes for calling
total calls: 100000000
This code was assembled with ml64 in 64-bit mode
compiler style is ON
** 203 ticks
** 203 ticks
** 204 ticks
** 187 ticks
** 187 ticks
106 bytes for proc
2157 bytes for calling
total calls: 100000000
This code was assembled with ml64 in 64-bit mode
compiler style is OFF
** 156 ticks
** 141 ticks
** 156 ticks
** 156 ticks
** 141 ticks
106 bytes for proc
1827 bytes for calling
total calls: 100000000
Thanks to everybody :t
I am still struggling with the technicalities. Now I mov immediates into their slots (instead of push imm, pop reg), which saves a few cycles in 64cs (64-bit compiler style), but the ranking is still 32 > 64pushpop > 64compstyle (fastest first).
Quote from: TWell on July 26, 2016, 12:00:30 AM
EDIT: How about testing in x64 enter/leave and rsp sub/add ?
Saw your edit only now, sorry. If I remember well, we tested that for 32-bit code in the Lab; enter was slow, leave was fast.
P.S.: Made a few tests, and for a naked procedure, enter is about 15% slower than push rbp + mov rbp, rsp.
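The two frame setups being compared can be put side by side in a small ml64 sketch. The procedure names FrameEnter and FramePush and the main entry point are hypothetical, not taken from the attachment. Both prologues occupy 4 bytes (enter 0, 0 is C8 00 00 00; push rbp plus mov rbp, rsp is 1 + 3 bytes), so any timing difference comes from execution, not from code size.

```asm
; build (assumed): ml64 /c frames.asm
;                  link /subsystem:console /entry:main frames.obj kernel32.lib
extrn ExitProcess:proc

.code
FrameEnter proc                 ; hypothetical name: frame built with enter
    enter 0, 0                  ; same effect as: push rbp / mov rbp, rsp
    leave                       ; same effect as: mov rsp, rbp / pop rbp
    ret
FrameEnter endp

FramePush proc                  ; hypothetical name: frame built manually
    push rbp
    mov rbp, rsp
    leave
    ret
FramePush endp

main proc
    sub rsp, 40                 ; shadow space + alignment
    call FrameEnter
    call FramePush
    xor ecx, ecx                ; exit code 0
    call ExitProcess
main endp
end
```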
New version attached. It allows using the two styles (push+pop vs mov slot, arg) in the same program.
The timings are practically identical for both styles. Building the stack by pushing arguments creates much shorter code, though, so I'll probably leave the default on this variant.
Overall, however, the 32-bit code is a lot shorter and a lot faster than its 64-bit counterpart.
With the latest MasmBasic package (http://masm32.com/board/index.php?topic=94.0), the attached source (when opened in RichMasm):
- builds the two versions by just changing OPT_64 1 to OPT_64 0;
- when RichMasm finds a non-commented int 3 in the source, it builds the "project" and launches the debugger (if you comment out the int 3 with F4, RichMasm launches the exe instead of the debugger);
- by default, it uses \Masm32\OllyDbg\ollydbg.exe for OPT_64 0 and \Masm32\x64Dbg\release\x64\x64dbg.exe for OPT_64 1;
- default paths can be changed in \Masm32\MasmBasic\Res\RichMasm.ini, see Deb32. Cl32 is the debugger's class name and serves to activate the debugger in console applications (i.e. the console window automatically goes to the background; if you do a lot of debugging, this is a really valuable feature...). Caution: the installer will overwrite RichMasm.ini, so keep a copy.
In the *.asc source, there is a bookmark "pick your favourite assembler" to the right, a bit more than halfway down; use it to test other assemblers. The Watcom family handles 32- and 64-bit code alike, while ML needs two different executables. To handle this, if you set OPT_Assembler ML, RichMasm will use ML.exe for 32-bit builds but \masm32\bin\ml64 when using OPT_64 1. With this little trick, you really need to change only one option to switch between 32- and 64-bit builds. Btw, in my tests ML64, AsmC and HJWasm behaved identically, except for the high level constructs, of course.
I have used the Olly clone "x64Dbg" for my first steps in 64-bit land; fearless provides a plugin (direct link: https://dl.dropboxusercontent.com/u/17077376/x64dbg%20Plugins/StepInt3.dp64) that lets you skip the int 3 exception by continuing to hit F7/F8/F9, just as in Olly. In the original x64dbg version, you can only hit Ctrl F8 to continue, which is a major nuisance for me and at least one more x64dbg user.
In RichMasm, the menu File/New Masm source offers in line 5:
Dual 32/64 bit console/GUI templates that compile both as 64- and 32-bit applications
They both build by simply hitting F6. No additional 64-bit libraries required (if that doesn't work for some reason, please shout 8)).
Finally, timings on a core i5:
This code was assembled with ml64 in 64-bit mode
callback, compiler style OFF:
** 546 ticks
** 531 ticks
** 530 ticks
callback, compiler style ON:
** 468 ticks
** 453 ticks
** 468 ticks
NO callback, compiler style OFF:
** 436 ticks
** 437 ticks
** 437 ticks
NO callback, compiler style ON:
** 406 ticks
** 390 ticks
** 405 ticks
145 bytes for p7cb1s0
158 bytes for p7cb1s1
105 bytes for p7cb0s0
132 bytes for p7cb0s1
5314 bytes for calling
This code was assembled with JWasm in 32-bit mode
callback, compiler style OFF:
** 405 ticks
** 390 ticks
** 406 ticks
callback, compiler style ON:
** 390 ticks
** 390 ticks
** 390 ticks
NO callback, compiler style OFF:
** 359 ticks
** 359 ticks
** 374 ticks
NO callback, compiler style ON:
** 359 ticks
** 359 ticks
** 358 ticks
82 bytes for p7cb1s0
82 bytes for p7cb1s1
64 bytes for p7cb0s0
64 bytes for p7cb0s1
3065 bytes for calling
JJ,
On my Win10 Professional 64 bit, the first zip crashed on all examples. In the second zip the 32 bit version worked but the 64 bit version crashed.
callback, compiler style OFF:
** 360 ticks
** 359 ticks
** 328 ticks
callback, compiler style ON:
** 360 ticks
** 328 ticks
** 312 ticks
NO callback, compiler style OFF:
** 297 ticks
** 344 ticks
** 312 ticks
NO callback, compiler style ON:
** 328 ticks
** 344 ticks
** 297 ticks
82 bytes for p7cb1s0
82 bytes for p7cb1s1
64 bytes for p7cb0s0
64 bytes for p7cb0s1
3065 bytes for calling
total calls: 480000000, sum: -1143986176.000000
Try this form of manual stack frame.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include64\masm64rt.inc
.code
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
entry_point proc
.stackspace 8
fn testme,chr$("The Message"),chr$("Title")
void(ExitProcess,0)
ret
entry_point endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
testme proc
push rbp ; preserve base pointer
mov rbp, rsp ; stack pointer into rbp
sub rsp, 128 ; allocate LOCAL stack space
mov [rbp+16], rcx ; load registers onto stack
mov [rbp+24], rdx
; -----------------------------
; write data to LOCAL variables
; -----------------------------
mov QWORD PTR [rsp-64], 0 ; the handle
mov QWORD PTR [rsp-72], MB_OK ; style
fn MessageBox,[rsp-64], \ ; handle
[rbp+16], \ ; message
[rbp+24], \ ; title
[rsp-72] ; style
mov rsp, rbp ; restore stack pointer
pop rbp ; restore base pointer
ret
testme endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end
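A variant of the same frame without the masm64rt macros, as a sketch only. Here the two locals are addressed relative to rbp, so they sit inside the 128 bytes reserved by sub rsp, 128 rather than below rsp. The names szText and szTitle, the main entry point and the plain call sequence are assumptions for illustration; MB_OK is defined locally as 0.

```asm
; build (assumed): ml64 /c frame.asm
;                  link /subsystem:windows /entry:main frame.obj user32.lib kernel32.lib
extrn MessageBoxA:proc
extrn ExitProcess:proc

MB_OK equ 0

.data
szText  db "The Message", 0
szTitle db "Title", 0

.code
testme proc
    push rbp                        ; preserve base pointer
    mov rbp, rsp                    ; stack pointer into rbp
    sub rsp, 128                    ; locals + 32 bytes shadow space for callees
    mov [rbp+16], rcx               ; home the register args in the caller's
    mov [rbp+24], rdx               ; shadow space
    mov qword ptr [rbp-8], 0        ; local: the handle
    mov qword ptr [rbp-16], MB_OK   ; local: style
    mov rcx, [rbp-8]                ; handle
    mov rdx, [rbp+16]               ; message
    mov r8, [rbp+24]                ; title
    mov r9, [rbp-16]                ; style
    call MessageBoxA
    mov rsp, rbp                    ; restore stack pointer
    pop rbp                         ; restore base pointer
    ret
testme endp

main proc
    sub rsp, 40                     ; shadow space + 16-byte alignment
    lea rcx, szText
    lea rdx, szTitle
    call testme
    xor ecx, ecx                    ; exit code 0
    call ExitProcess
main endp
end
```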
Quote from: hutch-- on July 28, 2016, 11:24:31 AM
On my Win10 Professional 64 bit, the first zip crashed on all examples. In the second zip the 32 bit version worked but the 64 bit version crashed.
Hutch,
Just built & tested it on my spare Win10 Home machine:
This code was assembled with ml64 in 64-bit mode
callback, compiler style OFF:
** 1140 ticks
** 985 ticks
** 1000 ticks
callback, compiler style ON:
** 828 ticks
** 844 ticks
** 828 ticks
NO callback, compiler style OFF:
** 875 ticks
** 875 ticks
** 890 ticks
NO callback, compiler style ON:
** 797 ticks
** 797 ticks
** 797 ticks
145 bytes for p7cb1s0
158 bytes for p7cb1s1
105 bytes for p7cb0s0
132 bytes for p7cb0s1
5314 bytes for calling
Does it give you an exception code?
Quote from: hutch-- on July 28, 2016, 04:18:36 PM
Try this form of manual stack frame.
testme proc
push rbp ; preserve base pointer
mov rbp, rsp ; stack pointer into rbp
sub rsp, 128 ; allocate LOCAL stack space
mov [rbp+16], rcx ; load registers onto stack
mov [rbp+24], rdx
; -----------------------------
; write data to LOCAL variables
; -----------------------------
mov QWORD PTR [rsp-64], 0 ; the handle
mov QWORD PTR [rsp-72], MB_OK ; style
fn MessageBox,[rsp-64], \ ; handle
[rbp+16], \ ; message
[rbp+24], \ ; title
[rsp-72] ; style
mov rsp, rbp ; restore stack pointer
pop rbp ; restore base pointer
ret
testme endp
That is similar to my construction:
include \Masm32\MasmBasic\Res\JBasic.inc
.code
testme proc pText:DefSize, pTitle:DefSize
jinvoke MessageBox, 0, pText, pTitle, MB_OK
ret
testme endp
j@start
int 3
jinvoke testme, Chr$("The Message"), Chr$("Title")
j@end
Which translates to this in "normal" mode:
0000000140001000 | C3 | ret |
0000000140001001 | 55 | push rbp | start testme
0000000140001002 | 48 8B EC | mov rbp, rsp | create stack frame
0000000140001005 | 48 8B F6 | mov rsi, rsi | ignore, for testing
0000000140001008 | 6A 00 | push 0 | MB_OK
000000014000100A | 4C 8B 0C 24 | mov r9, qword ptr ss:[rsp] | set register arg
000000014000100E | 4C 8B 45 18 | mov r8, qword ptr ss:[rbp+18] | [rbp+18]:"Title"
0000000140001012 | 41 50 | push r8 | set register arg
0000000140001014 | 48 8B 55 10 | mov rdx, qword ptr ss:[rbp+10] | rdx:"Title", [rbp+10]:"The Message"
0000000140001018 | 52 | push rdx | rdx:"Title"
0000000140001019 | 6A 00 | push 0 | set register arg
000000014000101B | 48 8B 0C 24 | mov rcx, qword ptr ss:[rsp] | rcx:"The Message"
000000014000101F | FF 15 0B 11 00 00 | call qword ptr ds:[<&MessageBoxA> |
0000000140001025 | 48 83 C4 20 | add rsp, 20 | correct the stack
0000000140001029 | C9 | leave |
000000014000102A | C3 | ret |
000000014000102B | 48 83 EC 58 | sub rsp, 58 |
000000014000102F | E8 C4 00 00 00 | call skeldualconsole64.1400010F8 | load libraries
0000000140001034 | CC | int3 |
0000000140001035 | 53 | push rbx | fill stack
0000000140001036 | 53 | push rbx |
0000000140001037 | 48 8B F6 | mov rsi, rsi | ignore, for testing
000000014000103A | 48 8D 15 CF 0F 00 00 | lea rdx, qword ptr ds:[140002010] | rdx:"Title", 140002010:"Title"
0000000140001041 | 52 | push rdx | rdx:"Title"
0000000140001042 | 48 8D 0D BB 0F 00 00 | lea rcx, qword ptr ds:[140002004] | rcx:"The Message", 140002004:"The Message"
0000000140001049 | 51 | push rcx | rcx:"The Message"
000000014000104A | E8 B2 FF FF FF | call skeldualconsole64.140001001 | call testme
000000014000104F | 48 83 C4 20 | add rsp, 20 | correct the stack
Same but "compiler style":
0000000140001001 | 55 | push rbp | start testme
0000000140001002 | 48 8B EC | mov rbp, rsp | create stack frame
0000000140001005 | 48 89 4D 10 | mov qword ptr ss:[rbp+10], rcx | put args into their memory slots
0000000140001009 | 48 89 55 18 | mov qword ptr ss:[rbp+18], rdx | [rbp+18]:"Title", rdx:"Title"
000000014000100D | 4C 89 45 20 | mov qword ptr ss:[rbp+20], r8 |
0000000140001011 | 4C 89 4D 28 | mov qword ptr ss:[rbp+28], r9 | [rbp+28]:"0'|w", r9:"0'|w"
0000000140001015 | 48 8B F6 | mov rsi, rsi | ignore, for testing
0000000140001018 | 49 C7 C1 00 00 00 00 | mov r9, 0 | that one could be shorter
000000014000101F | 4C 89 4C 24 18 | mov qword ptr ss:[rsp+18], r9 | [rsp+18]:"Title", r9:"0'|w"
0000000140001024 | 4C 8B 45 18 | mov r8, qword ptr ss:[rbp+18] | [rbp+18]:"Title"
0000000140001028 | 4C 89 44 24 10 | mov qword ptr ss:[rsp+10], r8 | [rsp+10]:"The Message"
000000014000102D | 48 8B 55 10 | mov rdx, qword ptr ss:[rbp+10] | rdx:"Title", [rbp+10]:"The Message"
0000000140001031 | 48 89 54 24 08 | mov qword ptr ss:[rsp+8], rdx | rdx:"Title"
0000000140001036 | 48 C7 C1 00 00 00 00 | mov rcx, 0 | rcx:"The Message"
000000014000103D | 48 89 0C 24 | mov qword ptr ss:[rsp], rcx | rcx:"The Message"
0000000140001041 | FF 15 E9 10 00 00 | call qword ptr ds:[<&MessageBoxA> |
0000000140001047 | C9 | leave |
0000000140001048 | C3 | ret |
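On the "that one could be shorter" remark in the listing: writing to a 32-bit register zero-extends into the full 64-bit register, so a zero can be loaded with fewer bytes. A small ml64 sketch of the alternatives follows; the main entry point is an assumption, and the byte counts are the standard encodings, not output from the attached project.

```asm
; build (assumed): ml64 /c zero.asm
;                  link /subsystem:console /entry:main zero.obj kernel32.lib
extrn ExitProcess:proc

.code
main proc
    sub rsp, 40                 ; shadow space + alignment
    mov r9, 0                   ; 7 bytes, the form seen in the listing
    mov r9d, 0                  ; 6 bytes, upper half of r9 is zeroed as well
    xor r9d, r9d                ; 3 bytes, also clears r9 (but changes the flags)
    mov rcx, 0                  ; 7 bytes, the form seen in the listing
    xor ecx, ecx                ; 2 bytes, clears rcx; doubles as exit code 0
    call ExitProcess
main endp
end
```

Whether the shorter forms change the timings noticeably would have to be measured; they mainly reduce the "bytes for calling" figure.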