Timings for 32-bit vs two variants of 64-bit assembly

Started by jj2007, July 25, 2016, 11:55:59 PM

jj2007

I am playing with variants of coding styles, and would appreciate some timings. So far, this consists of a simple loop calling an empty function in three variants:
- ordinary 32-bit assembly
- 64-bit, args constructed by pushing & poppin'
- 64-bit, args constructed through a permanent stack frame/spill space ("compiler style")
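
For readers unfamiliar with the third variant, here is a minimal sketch of what "reserve the spill space once" looks like under the standard Microsoft x64 calling convention. It is not the code generated by the test program's macros. It assumes plain ml64 without include files, and the names msg, ttl and main, as well as the build line, are made up for illustration.

```asm
; Assumed build line (file name is hypothetical):
;   ml64 demo.asm /link /subsystem:windows /entry:main user32.lib kernel32.lib
extern MessageBoxA : proc
extern ExitProcess : proc

.data
msg db "The Message", 0
ttl db "Title", 0

.code
main proc
    sub  rsp, 28h           ; 32 bytes of spill (shadow) space + 8 to realign rsp to 16
    xor  ecx, ecx           ; arg 1: hWnd = NULL
    lea  rdx, msg           ; arg 2: text
    lea  r8, ttl            ; arg 3: caption
    xor  r9d, r9d           ; arg 4: uType = MB_OK (0)
    call MessageBoxA        ; the callee may spill rcx/rdx/r8/r9 into the reserved 32 bytes
    xor  ecx, ecx           ; exit code 0
    call ExitProcess        ; does not return, so no add rsp / ret is needed here
main endp
end
```

The point of this style is that rsp is adjusted once per procedure instead of once per call, at the price of longer mov-based argument setup.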

First results:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz

*** three variants ***
This code was assembled with ml in 32-bit mode

compiler style is OFF
** 124 ticks
** 109 ticks
** 125 ticks
** 109 ticks
** 109 ticks
70      bytes for proc
1126    bytes for calling
total calls: 100000000
This code was assembled with ml64 in 64-bit mode

compiler style is ON
** 187 ticks
** 203 ticks
** 203 ticks
** 203 ticks
** 187 ticks
106     bytes for proc
2157    bytes for calling
total calls: 100000000
This code was assembled with ml64 in 64-bit mode

compiler style is OFF
** 141 ticks
** 156 ticks
** 156 ticks
** 156 ticks
** 156 ticks
106     bytes for proc
1827    bytes for calling
total calls: 100000000


In case you want to build it yourself: the latest version of RichMasm is required; lines 7 (OPT_64) and 22 (jbCompStyle) are the interesting ones. There is an int 3 in jp7, too 8)

TWell


AMD Athlon(tm) II X2 220 Processor

*** three variants ***
This code was assembled with ml in 32-bit mode

compiler style is OFF
** 125 ticks
** 125 ticks
** 125 ticks
** 125 ticks
** 141 ticks
70      bytes for proc
1126    bytes for calling
total calls: 100000000
This code was assembled with ml64 in 64-bit mode

compiler style is ON
** 218 ticks
** 234 ticks
** 235 ticks
** 234 ticks
** 250 ticks
106     bytes for proc
2157    bytes for calling
total calls: 100000000
This code was assembled with ml64 in 64-bit mode

compiler style is OFF
** 172 ticks
** 172 ticks
** 172 ticks
** 188 ticks
** 171 ticks
106     bytes for proc
1827    bytes for calling
total calls: 100000000
EDIT: How about testing in x64 enter/leave and rsp sub/add ?

mabdelouahab


Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz

*** three variants ***
This code was assembled with ml in 32-bit mode

compiler style is OFF
** 187 ticks
** 125 ticks
** 141 ticks
** 125 ticks
** 125 ticks
70      bytes for proc
1126    bytes for calling
total calls: 100000000
This code was assembled with ml64 in 64-bit mode

compiler style is ON
** 203 ticks
** 250 ticks
** 312 ticks
** 266 ticks
** 219 ticks
106     bytes for proc
2157    bytes for calling
total calls: 100000000
This code was assembled with ml64 in 64-bit mode

compiler style is OFF
** 172 ticks
** 203 ticks
** 172 ticks
** 172 ticks
** 172 ticks
106     bytes for proc
1827    bytes for calling
total calls: 100000000
Press any key to continue...

Yuri


Intel(R) Core(TM) i3 CPU         540  @ 3.07GHz

*** three variants ***
This code was assembled with ml in 32-bit mode

compiler style is OFF
** 125 ticks
** 110 ticks
** 125 ticks
** 109 ticks
** 125 ticks
70      bytes for proc
1126    bytes for calling
total calls: 100000000
This code was assembled with ml64 in 64-bit mode

compiler style is ON
** 203 ticks
** 203 ticks
** 204 ticks
** 187 ticks
** 187 ticks
106     bytes for proc
2157    bytes for calling
total calls: 100000000
This code was assembled with ml64 in 64-bit mode

compiler style is OFF
** 156 ticks
** 141 ticks
** 156 ticks
** 156 ticks
** 141 ticks
106     bytes for proc
1827    bytes for calling
total calls: 100000000

jj2007

Thanks to everybody :t

I am still struggling with the technicalities. Now I mov immediates directly into their slot (instead of push imm, pop reg), which saves a few cycles in the 64-bit compiler-style variant, but the performance ranking is still 32-bit > 64-bit push/pop > 64-bit compiler style.

Quote from: TWell on July 26, 2016, 12:00:30 AM
EDIT: How about testing in x64 enter/leave and rsp sub/add ?

Saw your edit only now, sorry. If I remember correctly, we tested that for 32-bit code in the Lab; enter was slow, leave was fast.

P.S.: Made a few tests, and for a naked procedure, enter is about 15% slower than push rbp + mov rbp, rsp
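
For reference, a minimal sketch of the two prologues being compared. With a frame size of 0 and nesting level 0, enter does the same work as the push/mov pair, and leave undoes either one. The proc names are made up, and the snippet assumes plain ml64 (assemble with ml64 /c); it only shows the instruction sequences and is not the timing code itself.

```asm
.code

frame_enter proc            ; hypothetical name
    enter 0, 0              ; single instruction: push rbp, then mov rbp, rsp
    leave                   ; mov rsp, rbp, then pop rbp
    ret
frame_enter endp

frame_manual proc           ; hypothetical name
    push rbp                ; save the caller's frame pointer
    mov  rbp, rsp           ; establish the new frame
    leave                   ; same epilogue as above
    ret
frame_manual endp

end
```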

jj2007

New version attached. It allows using the two styles (push+pop vs mov slot, arg) in the same program.
The timings are practically identical for both styles. Building the stack by pushing arguments creates much shorter code, though, so I'll probably leave the default on this variant.

Overall, however, the 32-bit code is a lot shorter and a lot faster than its 64-bit counterpart.

With the latest MasmBasic package, the attached source (when opened in RichMasm):
- builds the two versions by just changing OPT_64 1 to OPT_64 0;
- when RichMasm finds a non-commented int 3 in the source, it builds the "project" and launches the debugger (if you comment out the int 3 with F4, RichMasm launches the exe instead of the debugger);
- by default,
       for OPT_64 0 it uses \Masm32\OllyDbg\ollydbg.exe
       for OPT_64 1 it uses \Masm32\x64Dbg\release\x64\x64dbg.exe
- default paths can be changed in \Masm32\MasmBasic\Res\RichMasm.ini, see Deb32. Cl32 is the debugger's class name and serves to activate the debugger in console applications, i.e. the console window automatically goes to the background (if you do a lot of debugging, this is a really valuable feature...). Caution: the installer will overwrite RichMasm.ini, so keep a copy.

In the *.asc source, there is a bookmark "pick your favourite assembler" to the right, a bit more than half way down; use it to test other assemblers. The Watcom family handles 32- and 64-bit code alike, while ML needs two different executables. To handle this, if you set OPT_Assembler ML, RichMasm will use ML.exe for 32-bit builds but \masm32\bin\ml64 when using OPT_64 1. With this little trick, you really need to change only one option to switch between 32- and 64-bit builds. Btw, in my tests ML64, AsmC and HJWasm behaved identically, except for the high level constructs, of course.

I have used the Olly clone "x64Dbg" for my first steps in 64-bit land; fearless provides a plugin (direct link) that makes it possible to skip the int 3 exception by continuing to hit F7/F8/F9, just as in Olly. In the original x64dbg version, you can only hit Ctrl F8 to continue, which is a major nuisance for me and at least one more x64dbg user.

In RichMasm, the menu File/New Masm source offers, in line 5, dual 32/64-bit console/GUI templates that compile both as 64-bit and 32-bit applications.

They both build by simply hitting F6. No additional 64-bit libraries required (if that doesn't work for some reason, please shout 8)).

Finally, timings on a core i5:
This code was assembled with ml64 in 64-bit mode

callback, compiler style OFF:
** 546 ticks
** 531 ticks
** 530 ticks
callback, compiler style ON:
** 468 ticks
** 453 ticks
** 468 ticks
NO callback, compiler style OFF:
** 436 ticks
** 437 ticks
** 437 ticks
NO callback, compiler style ON:
** 406 ticks
** 390 ticks
** 405 ticks

145     bytes for p7cb1s0
158     bytes for p7cb1s1
105     bytes for p7cb0s0
132     bytes for p7cb0s1
5314    bytes for calling


This code was assembled with JWasm in 32-bit mode

callback, compiler style OFF:
** 405 ticks
** 390 ticks
** 406 ticks
callback, compiler style ON:
** 390 ticks
** 390 ticks
** 390 ticks
NO callback, compiler style OFF:
** 359 ticks
** 359 ticks
** 374 ticks
NO callback, compiler style ON:
** 359 ticks
** 359 ticks
** 358 ticks

82      bytes for p7cb1s0
82      bytes for p7cb1s1
64      bytes for p7cb0s0
64      bytes for p7cb0s1
3065    bytes for calling

hutch--

JJ,

On my Win10 Professional 64 bit, the first zip crashed on all examples. In the second zip the 32 bit version worked but the 64 bit version crashed.


callback, compiler style OFF:
** 360 ticks
** 359 ticks
** 328 ticks
callback, compiler style ON:
** 360 ticks
** 328 ticks
** 312 ticks
NO callback, compiler style OFF:
** 297 ticks
** 344 ticks
** 312 ticks
NO callback, compiler style ON:
** 328 ticks
** 344 ticks
** 297 ticks

82      bytes for p7cb1s0
82      bytes for p7cb1s1
64      bytes for p7cb0s0
64      bytes for p7cb0s1
3065    bytes for calling
total calls: 480000000, sum: -1143986176.000000

hutch--

Try this form of manual stack frame.


; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include64\masm64rt.inc

    .code

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

entry_point proc

    .stackspace 8

    fn testme,chr$("The Message"),chr$("Title")

    void(ExitProcess,0)

    ret

entry_point endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

testme proc

    push rbp                        ; preserve base pointer
    mov rbp, rsp                    ; stack pointer into rbp
    sub rsp, 128                    ; allocate LOCAL stack space

    mov [rbp+16], rcx               ; load registers onto stack
    mov [rbp+24], rdx

  ; -----------------------------
  ; write data to LOCAL variables
  ; -----------------------------
    mov QWORD PTR [rbp-64], 0       ; the handle (rbp-relative: inside the 128 reserved bytes)
    mov QWORD PTR [rbp-72], MB_OK   ; style

    fn MessageBox,[rbp-64], \       ; handle
                  [rbp+16], \       ; message
                  [rbp+24], \       ; title
                  [rbp-72]          ; style

    mov rsp, rbp                    ; restore stack pointer
    pop rbp                         ; restore base pointer

    ret

testme endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

   end

jj2007

Quote from: hutch-- on July 28, 2016, 11:24:31 AM
On my Win10 Professional 64 bit, the first zip crashed on all examples. In the second zip the 32 bit version worked but the 64 bit version crashed.

Hutch,

Just built & tested it on my spare Win10 Home machine:
This code was assembled with ml64 in 64-bit mode

callback, compiler style OFF:
** 1140 ticks
** 985 ticks
** 1000 ticks
callback, compiler style ON:
** 828 ticks
** 844 ticks
** 828 ticks
NO callback, compiler style OFF:
** 875 ticks
** 875 ticks
** 890 ticks
NO callback, compiler style ON:
** 797 ticks
** 797 ticks
** 797 ticks

145     bytes for p7cb1s0
158     bytes for p7cb1s1
105     bytes for p7cb0s0
132     bytes for p7cb0s1
5314    bytes for calling


Does it give you an exception code?

Quote from: hutch-- on July 28, 2016, 04:18:36 PM
Try this form of manual stack frame.

testme proc

    push rbp                        ; preserve base pointer
    mov rbp, rsp                    ; stack pointer into rbp
    sub rsp, 128                    ; allocate LOCAL stack space

    mov [rbp+16], rcx               ; load registers onto stack
    mov [rbp+24], rdx

  ; -----------------------------
  ; write data to LOCAL variables
  ; -----------------------------
    mov QWORD PTR [rbp-64], 0       ; the handle (rbp-relative: inside the 128 reserved bytes)
    mov QWORD PTR [rbp-72], MB_OK   ; style

    fn MessageBox,[rbp-64], \       ; handle
                  [rbp+16], \       ; message
                  [rbp+24], \       ; title
                  [rbp-72]          ; style

    mov rsp, rbp                    ; restore stack pointer
    pop rbp                         ; restore base pointer

    ret
testme endp

That is similar to my construction:

include \Masm32\MasmBasic\Res\JBasic.inc
.code
testme proc pText:DefSize, pTitle:DefSize
  jinvoke MessageBox, 0, pText, pTitle, MB_OK
  ret
testme endp
j@start
  int 3
  jinvoke testme, Chr$("The Message"), Chr$("Title")
j@end


Which translates to this in "normal" mode:
0000000140001000  | C3                        | ret                               |
0000000140001001  | 55                        | push rbp                          | start testme
0000000140001002  | 48 8B EC                  | mov rbp, rsp                      | create stack frame
0000000140001005  | 48 8B F6                  | mov rsi, rsi                      | ignore, for testing
0000000140001008  | 6A 00                     | push 0                            | MB_OK
000000014000100A  | 4C 8B 0C 24               | mov r9, qword ptr ss:[rsp]        | set register arg
000000014000100E  | 4C 8B 45 18               | mov r8, qword ptr ss:[rbp+18]     | [rbp+18]:"Title"
0000000140001012  | 41 50                     | push r8                           | set register arg
0000000140001014  | 48 8B 55 10               | mov rdx, qword ptr ss:[rbp+10]    | rdx:"Title", [rbp+10]:"The Message"
0000000140001018  | 52                        | push rdx                          | rdx:"Title"
0000000140001019  | 6A 00                     | push 0                            | set register arg
000000014000101B  | 48 8B 0C 24               | mov rcx, qword ptr ss:[rsp]       | rcx:"The Message"
000000014000101F  | FF 15 0B 11 00 00         | call qword ptr ds:[<&MessageBoxA> |
0000000140001025  | 48 83 C4 20               | add rsp, 20                       | correct the stack
0000000140001029  | C9                        | leave                             |
000000014000102A  | C3                        | ret                               |
000000014000102B  | 48 83 EC 58               | sub rsp, 58                       |
000000014000102F  | E8 C4 00 00 00            | call skeldualconsole64.1400010F8  | load libraries
0000000140001034  | CC                        | int3                              |
0000000140001035  | 53                        | push rbx                          | fill stack
0000000140001036  | 53                        | push rbx                          |
0000000140001037  | 48 8B F6                  | mov rsi, rsi                      | ignore, for testing
000000014000103A  | 48 8D 15 CF 0F 00 00      | lea rdx, qword ptr ds:[140002010] | rdx:"Title", 140002010:"Title"
0000000140001041  | 52                        | push rdx                          | rdx:"Title"
0000000140001042  | 48 8D 0D BB 0F 00 00      | lea rcx, qword ptr ds:[140002004] | rcx:"The Message", 140002004:"The Message"
0000000140001049  | 51                        | push rcx                          | rcx:"The Message"
000000014000104A  | E8 B2 FF FF FF            | call skeldualconsole64.140001001  | call testme
000000014000104F  | 48 83 C4 20               | add rsp, 20                       | correct the stack


Same but "compiler style":
0000000140001001  | 55                        | push rbp                          | start testme
0000000140001002  | 48 8B EC                  | mov rbp, rsp                      | create stack frame
0000000140001005  | 48 89 4D 10               | mov qword ptr ss:[rbp+10], rcx    | put args into their memory slots
0000000140001009  | 48 89 55 18               | mov qword ptr ss:[rbp+18], rdx    | [rbp+18]:"Title", rdx:"Title"
000000014000100D  | 4C 89 45 20               | mov qword ptr ss:[rbp+20], r8     |
0000000140001011  | 4C 89 4D 28               | mov qword ptr ss:[rbp+28], r9     | [rbp+28]:"0'|w", r9:"0'|w"
0000000140001015  | 48 8B F6                  | mov rsi, rsi                      | ignore, for testing
0000000140001018  | 49 C7 C1 00 00 00 00      | mov r9, 0                         | that one could be shorter
000000014000101F  | 4C 89 4C 24 18            | mov qword ptr ss:[rsp+18], r9     | [rsp+18]:"Title", r9:"0'|w"
0000000140001024  | 4C 8B 45 18               | mov r8, qword ptr ss:[rbp+18]     | [rbp+18]:"Title"
0000000140001028  | 4C 89 44 24 10            | mov qword ptr ss:[rsp+10], r8     | [rsp+10]:"The Message"
000000014000102D  | 48 8B 55 10               | mov rdx, qword ptr ss:[rbp+10]    | rdx:"Title", [rbp+10]:"The Message"
0000000140001031  | 48 89 54 24 08            | mov qword ptr ss:[rsp+8], rdx     | rdx:"Title"
0000000140001036  | 48 C7 C1 00 00 00 00      | mov rcx, 0                        | rcx:"The Message"
000000014000103D  | 48 89 0C 24               | mov qword ptr ss:[rsp], rcx       | rcx:"The Message"
0000000140001041  | FF 15 E9 10 00 00         | call qword ptr ds:[<&MessageBoxA> |
0000000140001047  | C9                        | leave                             |
0000000140001048  | C3                        | ret                               |
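
The 7-byte mov r9, 0 marked "that one could be shorter" in the listing (and the equally long mov rcx, 0) is commonly replaced by a 32-bit xor, because writing a 32-bit register zero-extends into the full 64-bit register. A minimal sketch, assuming plain ml64 (assemble with ml64 /c) and a made-up proc name; note that xor changes the flags while mov does not:

```asm
.code

zero_args proc              ; hypothetical name, for illustration only
    xor r9d, r9d            ; r9 = 0 in 3 bytes instead of 7
    xor ecx, ecx            ; rcx = 0 in 2 bytes instead of 7
    ret
zero_args endp

end
```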