Passing args on the stack: what is fastest?

Started by jj2007, December 09, 2015, 05:43:42 AM

jj2007

Testing various ways to pass one arg on the stack, and to preserve regs:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
467     cycles for 100 * pop retadd, pop arg, push retadd
483     cycles for 100 * pop retadd, pop arg, jmp retadd
565     cycles for 100 * mov eax, arg/ret
873     cycles for 100 * push esi edi ebx ecx
2183    cycles for 100 * pushad

466     cycles for 100 * pop retadd, pop arg, push retadd
484     cycles for 100 * pop retadd, pop arg, jmp retadd
566     cycles for 100 * mov eax, arg/ret
874     cycles for 100 * push esi edi ebx ecx
2178    cycles for 100 * pushad
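
The test source is not reproduced in this thread, so the following is only a guess at what the three argument-fetching variants named above look like. The proc names are invented for illustration, and a default MASM32 SDK install path is assumed for the include. A minimal sketch:

```asm
include \masm32\include\masm32rt.inc   ; assumption: default MASM32 SDK location

.code

; "pop retadd, pop arg, push retadd"
ArgPopPush proc                 ; hypothetical name
    pop edx                     ; return address
    pop eax                     ; the single argument
    push edx                    ; put the return address back
    ret                         ; plain ret: the argument is already off the stack
ArgPopPush endp

; "pop retadd, pop arg, jmp retadd"
ArgPopJmp proc                  ; hypothetical name
    pop edx                     ; return address
    pop eax                     ; the single argument
    jmp edx                     ; return without pushing anything back
ArgPopJmp endp

; "mov eax, arg/ret"
ArgMovRet proc                  ; hypothetical name
    mov eax, [esp+4]            ; read the argument where it sits
    ret 4                       ; ret removes the argument
ArgMovRet endp

start:
    push 123
    call ArgPopPush             ; eax = 123, stack balanced
    push 456
    call ArgPopJmp              ; eax = 456, stack balanced
    push 789
    call ArgMovRet              ; eax = 789, stack balanced
    print str$(eax), 13, 10
    exit
end start
```

All three leave the stack balanced for the caller; they differ only in how the return address and the argument are taken off it.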

Siekmanski

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

463     cycles for 100 * pop retadd, pop arg, push retadd
478     cycles for 100 * pop retadd, pop arg, jmp retadd
518     cycles for 100 * mov eax, arg/ret
777     cycles for 100 * push esi edi ebx ecx
2188    cycles for 100 * pushad

464     cycles for 100 * pop retadd, pop arg, push retadd
480     cycles for 100 * pop retadd, pop arg, jmp retadd
552     cycles for 100 * mov eax, arg/ret
776     cycles for 100 * push esi edi ebx ecx
2185    cycles for 100 * pushad

464     cycles for 100 * pop retadd, pop arg, push retadd
478     cycles for 100 * pop retadd, pop arg, jmp retadd
536     cycles for 100 * mov eax, arg/ret
776     cycles for 100 * push esi edi ebx ecx
2186    cycles for 100 * pushad

463     cycles for 100 * pop retadd, pop arg, push retadd
479     cycles for 100 * pop retadd, pop arg, jmp retadd
555     cycles for 100 * mov eax, arg/ret
776     cycles for 100 * push esi edi ebx ecx
2185    cycles for 100 * pushad

464     cycles for 100 * pop retadd, pop arg, push retadd
479     cycles for 100 * pop retadd, pop arg, jmp retadd
548     cycles for 100 * mov eax, arg/ret
777     cycles for 100 * push esi edi ebx ecx
2185    cycles for 100 * pushad

11      bytes for pop retadd, pop arg, push retadd
11      bytes for pop retadd, pop arg, jmp retadd
15      bytes for mov eax, arg/ret
31      bytes for push esi edi ebx ecx
27      bytes for pushad

Creative coders use backward thinking techniques as a strategy.

Grincheux

Quote
AMD Athlon(tm) II X2 250 Processor (SSE3)

631   cycles for 100 * pop retadd, pop arg, push retadd
426   cycles for 100 * pop retadd, pop arg, jmp retadd
433   cycles for 100 * mov eax, arg/ret
970   cycles for 100 * push esi edi ebx ecx
1489   cycles for 100 * pushad

703   cycles for 100 * pop retadd, pop arg, push retadd
426   cycles for 100 * pop retadd, pop arg, jmp retadd
428   cycles for 100 * mov eax, arg/ret
976   cycles for 100 * push esi edi ebx ecx
1476   cycles for 100 * pushad

668   cycles for 100 * pop retadd, pop arg, push retadd
433   cycles for 100 * pop retadd, pop arg, jmp retadd
426   cycles for 100 * mov eax, arg/ret
971   cycles for 100 * push esi edi ebx ecx
1486   cycles for 100 * pushad

699   cycles for 100 * pop retadd, pop arg, push retadd
425   cycles for 100 * pop retadd, pop arg, jmp retadd
425   cycles for 100 * mov eax, arg/ret
967   cycles for 100 * push esi edi ebx ecx
1487   cycles for 100 * pushad

659   cycles for 100 * pop retadd, pop arg, push retadd
434   cycles for 100 * pop retadd, pop arg, jmp retadd
426   cycles for 100 * mov eax, arg/ret
971   cycles for 100 * push esi edi ebx ecx
1518   cycles for 100 * pushad

11   bytes for pop retadd, pop arg, push retadd
11   bytes for pop retadd, pop arg, jmp retadd
15   bytes for mov eax, arg/ret
31   bytes for push esi edi ebx ecx
27   bytes for pushad


--- ok ---

Grincheux

Quote
426   cycles for 100 * pop retadd, pop arg, jmp retadd

My Athlon is the fastest!

Grincheux

Quote
i5-2450M = Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
i7-4930K = Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
X2 250   = AMD Athlon(tm) II X2 250 Processor (SSE3)

cycles for 100 *                   i5-2450M i7-4930K   X2 250
pop retadd, pop arg, push retadd        467      463      631
pop retadd, pop arg, jmp retadd         483      478      426
mov eax, arg/ret                        565      518      433
push esi edi ebx ecx                    873      777      970
pushad                                 2183     2188     1489

pop retadd, pop arg, push retadd        466      464      703
pop retadd, pop arg, jmp retadd         484      480      426
mov eax, arg/ret                        566      552      428
push esi edi ebx ecx                    874      776      976
pushad                                 2178     2185     1476

pop retadd, pop arg, push retadd          -      464      668
pop retadd, pop arg, jmp retadd           -      478      433
mov eax, arg/ret                          -      536      426
push esi edi ebx ecx                      -      776      971
pushad                                    -     2186     1486

pop retadd, pop arg, push retadd          -      463      699
pop retadd, pop arg, jmp retadd           -      479      425
mov eax, arg/ret                          -      555      425
push esi edi ebx ecx                      -      776      967
pushad                                    -     2185     1487

pop retadd, pop arg, push retadd          -      464      659
pop retadd, pop arg, jmp retadd           -      479      434
mov eax, arg/ret                          -      548      426
push esi edi ebx ecx                      -      777      971
pushad                                    -     2185     1518

bytes for                          i5-2450M i7-4930K   X2 250
pop retadd, pop arg, push retadd         11       11       11
pop retadd, pop arg, jmp retadd          11       11       11
mov eax, arg/ret                         15       15       15
push esi edi ebx ecx                     31       31       31
pushad                                   27       27       27


jj2007

Quote from: Grincheux on December 09, 2015, 07:39:49 AM
My Athlon is the fastest!

It seems so :t

However, Intel is faster for
pop edx  ; ret addr
pop eax  ; arg
push edx ; ret addr

TWell

Older AMD
AMD Athlon(tm) II X2 220 Processor (SSE3) 2.8 GHz

643     cycles for 100 * pop retadd, pop arg, push retadd
432     cycles for 100 * pop retadd, pop arg, jmp retadd
429     cycles for 100 * mov eax, arg/ret
969     cycles for 100 * push esi edi ebx ecx
1478    cycles for 100 * pushad

633     cycles for 100 * pop retadd, pop arg, push retadd
426     cycles for 100 * pop retadd, pop arg, jmp retadd
428     cycles for 100 * mov eax, arg/ret
969     cycles for 100 * push esi edi ebx ecx
1477    cycles for 100 * pushad

653     cycles for 100 * pop retadd, pop arg, push retadd
428     cycles for 100 * pop retadd, pop arg, jmp retadd
434     cycles for 100 * mov eax, arg/ret
978     cycles for 100 * push esi edi ebx ecx
1500    cycles for 100 * pushad

632     cycles for 100 * pop retadd, pop arg, push retadd
426     cycles for 100 * pop retadd, pop arg, jmp retadd
429     cycles for 100 * mov eax, arg/ret
968     cycles for 100 * push esi edi ebx ecx
1493    cycles for 100 * pushad

632     cycles for 100 * pop retadd, pop arg, push retadd
427     cycles for 100 * pop retadd, pop arg, jmp retadd
428     cycles for 100 * mov eax, arg/ret
975     cycles for 100 * push esi edi ebx ecx
1497    cycles for 100 * pushad

11      bytes for pop retadd, pop arg, push retadd
11      bytes for pop retadd, pop arg, jmp retadd
15      bytes for mov eax, arg/ret
31      bytes for push esi edi ebx ecx
27      bytes for pushad

dedndave

prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

1071    cycles for 100 * pop retadd, pop arg, push retadd
699     cycles for 100 * pop retadd, pop arg, jmp retadd
876     cycles for 100 * mov eax, arg/ret
1575    cycles for 100 * push esi edi ebx ecx
3614    cycles for 100 * pushad

1083    cycles for 100 * pop retadd, pop arg, push retadd
699     cycles for 100 * pop retadd, pop arg, jmp retadd
878     cycles for 100 * mov eax, arg/ret
1539    cycles for 100 * push esi edi ebx ecx
3578    cycles for 100 * pushad

1062    cycles for 100 * pop retadd, pop arg, push retadd
701     cycles for 100 * pop retadd, pop arg, jmp retadd
919     cycles for 100 * mov eax, arg/ret
1559    cycles for 100 * push esi edi ebx ecx
3593    cycles for 100 * pushad

1059    cycles for 100 * pop retadd, pop arg, push retadd
741     cycles for 100 * pop retadd, pop arg, jmp retadd
865     cycles for 100 * mov eax, arg/ret
1547    cycles for 100 * push esi edi ebx ecx
3515    cycles for 100 * pushad

1082    cycles for 100 * pop retadd, pop arg, push retadd
701     cycles for 100 * pop retadd, pop arg, jmp retadd
866     cycles for 100 * mov eax, arg/ret
1715    cycles for 100 * push esi edi ebx ecx
3526    cycles for 100 * pushad

jj2007

Thanks :icon14:

So it seems
pop edx  ; ret addr
pop eax  ; arg
push edx ; ret addr

is good on Core ix but not so good on anything else. The Lingo-style jmp edx is not really an option, as you can rarely preserve edx until the final ret.
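
To illustrate the edx problem: jmp edx only works while nothing between the pops and the jump overwrites edx, and edx is one of the registers (with eax and ecx) that Win32 API calls need not preserve. The sketch below uses invented proc names and assumes a default MASM32 SDK install. It shows a leaf routine where jmp edx is safe, and a routine that calls an API and therefore parks the return address on the stack again:

```asm
include \masm32\include\masm32rt.inc   ; assumption: default MASM32 SDK location

.code

; Leaf routine: nothing touches edx, so jmp edx is fine
DoubleIt proc                   ; hypothetical name
    pop edx                     ; return address
    pop eax                     ; argument
    add eax, eax
    jmp edx
DoubleIt endp

; Routine that calls an API: edx cannot be trusted afterwards
TicksPlus proc                  ; hypothetical name
    pop edx                     ; return address
    pop eax                     ; argument
    push edx                    ; return address goes back on the stack first
    push eax                    ; keep the argument across the call
    invoke GetTickCount         ; result in eax; ecx and edx may be changed
    pop ecx                     ; argument again
    add eax, ecx
    ret                         ; ordinary ret, the stack holds the return address
TicksPlus endp

start:
    push 21
    call DoubleIt               ; eax = 42
    print str$(eax), 13, 10
    push 1000
    call TicksPlus              ; eax = tick count + 1000
    print str$(eax), 13, 10
    exit
end start
```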

Grincheux

When we have 3 or 4 parameters or more, is it quicker to pass a structure?

TouEnMasm


Intel(R) Core(TM) i3-4150 CPU @ 3.50GHz (SSE4)

424     cycles for 100 * pop retadd, pop arg, push retadd
451     cycles for 100 * pop retadd, pop arg, jmp retadd
520     cycles for 100 * mov eax, arg/ret
804     cycles for 100 * push esi edi ebx ecx
2234    cycles for 100 * pushad

424     cycles for 100 * pop retadd, pop arg, push retadd
442     cycles for 100 * pop retadd, pop arg, jmp retadd
512     cycles for 100 * mov eax, arg/ret
797     cycles for 100 * push esi edi ebx ecx
2198    cycles for 100 * pushad

422     cycles for 100 * pop retadd, pop arg, push retadd
442     cycles for 100 * pop retadd, pop arg, jmp retadd
526     cycles for 100 * mov eax, arg/ret
791     cycles for 100 * push esi edi ebx ecx
2184    cycles for 100 * pushad

420     cycles for 100 * pop retadd, pop arg, push retadd
439     cycles for 100 * pop retadd, pop arg, jmp retadd
519     cycles for 100 * mov eax, arg/ret
792     cycles for 100 * push esi edi ebx ecx
2188    cycles for 100 * pushad

422     cycles for 100 * pop retadd, pop arg, push retadd
439     cycles for 100 * pop retadd, pop arg, jmp retadd
520     cycles for 100 * mov eax, arg/ret
794     cycles for 100 * push esi edi ebx ecx
2185    cycles for 100 * pushad

11      bytes for pop retadd, pop arg, push retadd
11      bytes for pop retadd, pop arg, jmp retadd
15      bytes for mov eax, arg/ret
31      bytes for push esi edi ebx ecx
27      bytes for pushad
Fa is a musical note to play with CL

hutch--

With up to 3 arguments, register passing is usually a lot faster, as it has no stack overhead at all. Basically it's a roll-your-own version of fastcall.
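
A sketch of what such a roll-your-own register convention might look like. The choice of eax, ecx and edx, and the proc name, are assumptions made for illustration rather than a convention stated in this thread; a default MASM32 SDK install is assumed for the include:

```asm
include \masm32\include\masm32rt.inc   ; assumption: default MASM32 SDK location

.code

; Arguments arrive in eax, ecx, edx; result is returned in eax
Sum3 proc                       ; hypothetical name
    add eax, ecx
    add eax, edx
    ret                         ; nothing on the stack to clean up
Sum3 endp

start:
    mov eax, 1                  ; first argument
    mov ecx, 2                  ; second argument
    mov edx, 3                  ; third argument
    call Sum3                   ; eax = 6
    print str$(eax), 13, 10
    exit
end start
```

The callee touches the stack only for the return address, but the caller has to agree on the register assignment for every such routine.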

ragdog


AMD Athlon(tm) II P360 Dual-Core Processor (SSE3)

496     cycles for 100 * pop retadd, pop arg, push retadd
424     cycles for 100 * pop retadd, pop arg, jmp retadd
425     cycles for 100 * mov eax, arg/ret
733     cycles for 100 * push esi edi ebx ecx
1231    cycles for 100 * pushad

470     cycles for 100 * pop retadd, pop arg, push retadd
426     cycles for 100 * pop retadd, pop arg, jmp retadd
428     cycles for 100 * mov eax, arg/ret
733     cycles for 100 * push esi edi ebx ecx
1231    cycles for 100 * pushad

642     cycles for 100 * pop retadd, pop arg, push retadd
424     cycles for 100 * pop retadd, pop arg, jmp retadd
425     cycles for 100 * mov eax, arg/ret
733     cycles for 100 * push esi edi ebx ecx
1236    cycles for 100 * pushad

475     cycles for 100 * pop retadd, pop arg, push retadd
425     cycles for 100 * pop retadd, pop arg, jmp retadd
425     cycles for 100 * mov eax, arg/ret
733     cycles for 100 * push esi edi ebx ecx
1231    cycles for 100 * pushad

471     cycles for 100 * pop retadd, pop arg, push retadd
426     cycles for 100 * pop retadd, pop arg, jmp retadd
428     cycles for 100 * mov eax, arg/ret
738     cycles for 100 * push esi edi ebx ecx
1231    cycles for 100 * pushad

11      bytes for pop retadd, pop arg, push retadd
11      bytes for pop retadd, pop arg, jmp retadd
15      bytes for mov eax, arg/ret
31      bytes for push esi edi ebx ecx
27      bytes for pushad


--- ok ---

jj2007

Quote from: hutch-- on December 30, 2015, 08:02:15 PM
With up to 3 arguments, register passing is usually a lot faster, as it has no stack overhead at all. Basically it's a roll-your-own version of fastcall.

But you must move your args into the regs, unless they are already there. In practice, there is not much difference; see the last two entries of each run below.

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
470     cycles for 100 * pop retadd, pop arg, push retadd
490     cycles for 100 * pop retadd, pop arg, jmp retadd
569     cycles for 100 * mov eax, arg/ret
879     cycles for 100 * push esi edi ebx ecx
2199    cycles for 100 * pushad
473     cycles for 100 * popretadd, 2 args
470     cycles for 100 * 2 args via reg

470     cycles for 100 * pop retadd, pop arg, push retadd
486     cycles for 100 * pop retadd, pop arg, jmp retadd
526     cycles for 100 * mov eax, arg/ret
875     cycles for 100 * push esi edi ebx ecx
2189    cycles for 100 * pushad
472     cycles for 100 * popretadd, 2 args
472     cycles for 100 * 2 args via reg
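
The code behind the two new entries is not shown in this post, so the following is only an assumption about what "popretadd, 2 args" and "2 args via reg" compare. The proc and variable names are invented, and a default MASM32 SDK install is assumed. It makes the point that the caller pays either way, with two pushes or with two moves:

```asm
include \masm32\include\masm32rt.inc   ; assumption: default MASM32 SDK location

.data
valA dd 30                      ; hypothetical test values
valB dd 12

.code

; Two args on the stack, fetched with pops
SubStack proc                   ; hypothetical name
    pop edx                     ; return address
    pop eax                     ; first arg (pushed last)
    pop ecx                     ; second arg
    push edx                    ; return address back on the stack
    sub eax, ecx
    ret
SubStack endp

; Same job, args expected in eax and ecx
SubRegs proc                    ; hypothetical name
    sub eax, ecx
    ret
SubRegs endp

start:
    push valB                   ; caller side: two pushes ...
    push valA
    call SubStack               ; eax = 18
    print str$(eax), 13, 10

    mov eax, valA               ; ... versus two moves
    mov ecx, valB
    call SubRegs                ; eax = 18
    print str$(eax), 13, 10
    exit
end start
```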

Siekmanski

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

464     cycles for 100 * pop retadd, pop arg, push retadd
477     cycles for 100 * pop retadd, pop arg, jmp retadd
555     cycles for 100 * mov eax, arg/ret
778     cycles for 100 * push esi edi ebx ecx
2186    cycles for 100 * pushad
465     cycles for 100 * popretadd, 2 args
465     cycles for 100 * 2 args via reg

464     cycles for 100 * pop retadd, pop arg, push retadd
478     cycles for 100 * pop retadd, pop arg, jmp retadd
544     cycles for 100 * mov eax, arg/ret
778     cycles for 100 * push esi edi ebx ecx
2185    cycles for 100 * pushad
465     cycles for 100 * popretadd, 2 args
466     cycles for 100 * 2 args via reg

464     cycles for 100 * pop retadd, pop arg, push retadd
478     cycles for 100 * pop retadd, pop arg, jmp retadd
541     cycles for 100 * mov eax, arg/ret
777     cycles for 100 * push esi edi ebx ecx
2183    cycles for 100 * pushad
463     cycles for 100 * popretadd, 2 args
464     cycles for 100 * 2 args via reg

464     cycles for 100 * pop retadd, pop arg, push retadd
478     cycles for 100 * pop retadd, pop arg, jmp retadd
541     cycles for 100 * mov eax, arg/ret
778     cycles for 100 * push esi edi ebx ecx
2184    cycles for 100 * pushad
465     cycles for 100 * popretadd, 2 args
464     cycles for 100 * 2 args via reg

464     cycles for 100 * pop retadd, pop arg, push retadd
479     cycles for 100 * pop retadd, pop arg, jmp retadd
551     cycles for 100 * mov eax, arg/ret
778     cycles for 100 * push esi edi ebx ecx
2185    cycles for 100 * pushad
464     cycles for 100 * popretadd, 2 args
465     cycles for 100 * 2 args via reg

11      bytes for pop retadd, pop arg, push retadd
11      bytes for pop retadd, pop arg, jmp retadd
15      bytes for mov eax, arg/ret
31      bytes for push esi edi ebx ecx
27      bytes for pushad
13      bytes for popretadd, 2 args
15      bytes for 2 args via reg