Testing various ways to pass one arg on the stack, and to preserve regs:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
467 cycles for 100 * pop retadd, pop arg, push retadd
483 cycles for 100 * pop retadd, pop arg, jmp retadd
565 cycles for 100 * mov eax, arg/ret
873 cycles for 100 * push esi edi ebx ecx
2183 cycles for 100 * pushad
466 cycles for 100 * pop retadd, pop arg, push retadd
484 cycles for 100 * pop retadd, pop arg, jmp retadd
566 cycles for 100 * mov eax, arg/ret
874 cycles for 100 * push esi edi ebx ecx
2178 cycles for 100 * pushad
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
463 cycles for 100 * pop retadd, pop arg, push retadd
478 cycles for 100 * pop retadd, pop arg, jmp retadd
518 cycles for 100 * mov eax, arg/ret
777 cycles for 100 * push esi edi ebx ecx
2188 cycles for 100 * pushad
464 cycles for 100 * pop retadd, pop arg, push retadd
480 cycles for 100 * pop retadd, pop arg, jmp retadd
552 cycles for 100 * mov eax, arg/ret
776 cycles for 100 * push esi edi ebx ecx
2185 cycles for 100 * pushad
464 cycles for 100 * pop retadd, pop arg, push retadd
478 cycles for 100 * pop retadd, pop arg, jmp retadd
536 cycles for 100 * mov eax, arg/ret
776 cycles for 100 * push esi edi ebx ecx
2186 cycles for 100 * pushad
463 cycles for 100 * pop retadd, pop arg, push retadd
479 cycles for 100 * pop retadd, pop arg, jmp retadd
555 cycles for 100 * mov eax, arg/ret
776 cycles for 100 * push esi edi ebx ecx
2185 cycles for 100 * pushad
464 cycles for 100 * pop retadd, pop arg, push retadd
479 cycles for 100 * pop retadd, pop arg, jmp retadd
548 cycles for 100 * mov eax, arg/ret
777 cycles for 100 * push esi edi ebx ecx
2185 cycles for 100 * pushad
11 bytes for pop retadd, pop arg, push retadd
11 bytes for pop retadd, pop arg, jmp retadd
15 bytes for mov eax, arg/ret
31 bytes for push esi edi ebx ecx
27 bytes for pushad
Quote
AMD Athlon(tm) II X2 250 Processor (SSE3)
631 cycles for 100 * pop retadd, pop arg, push retadd
426 cycles for 100 * pop retadd, pop arg, jmp retadd
433 cycles for 100 * mov eax, arg/ret
970 cycles for 100 * push esi edi ebx ecx
1489 cycles for 100 * pushad
703 cycles for 100 * pop retadd, pop arg, push retadd
426 cycles for 100 * pop retadd, pop arg, jmp retadd
428 cycles for 100 * mov eax, arg/ret
976 cycles for 100 * push esi edi ebx ecx
1476 cycles for 100 * pushad
668 cycles for 100 * pop retadd, pop arg, push retadd
433 cycles for 100 * pop retadd, pop arg, jmp retadd
426 cycles for 100 * mov eax, arg/ret
971 cycles for 100 * push esi edi ebx ecx
1486 cycles for 100 * pushad
699 cycles for 100 * pop retadd, pop arg, push retadd
425 cycles for 100 * pop retadd, pop arg, jmp retadd
425 cycles for 100 * mov eax, arg/ret
967 cycles for 100 * push esi edi ebx ecx
1487 cycles for 100 * pushad
659 cycles for 100 * pop retadd, pop arg, push retadd
434 cycles for 100 * pop retadd, pop arg, jmp retadd
426 cycles for 100 * mov eax, arg/ret
971 cycles for 100 * push esi edi ebx ecx
1518 cycles for 100 * pushad
11 bytes for pop retadd, pop arg, push retadd
11 bytes for pop retadd, pop arg, jmp retadd
15 bytes for mov eax, arg/ret
31 bytes for push esi edi ebx ecx
27 bytes for pushad
--- ok ---
Quote
426 cycles for 100 * pop retadd, pop arg, jmp retadd
My Athlon is the fastest!
Quote
Columns:
i5  = Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
i7  = Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
AMD = AMD Athlon(tm) II X2 250 Processor (SSE3)
A dash means no run was posted for that CPU.

                                                      i5    i7   AMD
cycles for 100 * pop retadd, pop arg, push retadd    467   463   631
cycles for 100 * pop retadd, pop arg, jmp retadd     483   478   426
cycles for 100 * mov eax, arg/ret                    565   518   433
cycles for 100 * push esi edi ebx ecx                873   777   970
cycles for 100 * pushad                             2183  2188  1489

cycles for 100 * pop retadd, pop arg, push retadd    466   464   703
cycles for 100 * pop retadd, pop arg, jmp retadd     484   480   426
cycles for 100 * mov eax, arg/ret                    566   552   428
cycles for 100 * push esi edi ebx ecx                874   776   976
cycles for 100 * pushad                             2178  2185  1476

cycles for 100 * pop retadd, pop arg, push retadd      -   464   668
cycles for 100 * pop retadd, pop arg, jmp retadd       -   478   433
cycles for 100 * mov eax, arg/ret                      -   536   426
cycles for 100 * push esi edi ebx ecx                  -   776   971
cycles for 100 * pushad                                -  2186  1486

cycles for 100 * pop retadd, pop arg, push retadd      -   463   699
cycles for 100 * pop retadd, pop arg, jmp retadd       -   479   425
cycles for 100 * mov eax, arg/ret                      -   555   425
cycles for 100 * push esi edi ebx ecx                  -   776   967
cycles for 100 * pushad                                -  2185  1487

cycles for 100 * pop retadd, pop arg, push retadd      -   464   659
cycles for 100 * pop retadd, pop arg, jmp retadd       -   479   434
cycles for 100 * mov eax, arg/ret                      -   548   426
cycles for 100 * push esi edi ebx ecx                  -   777   971
cycles for 100 * pushad                                -  2185  1518

bytes for pop retadd, pop arg, push retadd            11    11    11
bytes for pop retadd, pop arg, jmp retadd             11    11    11
bytes for mov eax, arg/ret                            15    15    15
bytes for push esi edi ebx ecx                        31    31    31
bytes for pushad                                      27    27    27
Quote from: Grincheux on December 09, 2015, 07:39:49 AM
My Athlon is the fastest!
It seems so :t
However, Intel is faster for
pop edx ; ret addr
pop eax ; arg
push edx ; ret addr
Older AMD:
AMD Athlon(tm) II X2 220 Processor (SSE3), 2.8 GHz
643 cycles for 100 * pop retadd, pop arg, push retadd
432 cycles for 100 * pop retadd, pop arg, jmp retadd
429 cycles for 100 * mov eax, arg/ret
969 cycles for 100 * push esi edi ebx ecx
1478 cycles for 100 * pushad
633 cycles for 100 * pop retadd, pop arg, push retadd
426 cycles for 100 * pop retadd, pop arg, jmp retadd
428 cycles for 100 * mov eax, arg/ret
969 cycles for 100 * push esi edi ebx ecx
1477 cycles for 100 * pushad
653 cycles for 100 * pop retadd, pop arg, push retadd
428 cycles for 100 * pop retadd, pop arg, jmp retadd
434 cycles for 100 * mov eax, arg/ret
978 cycles for 100 * push esi edi ebx ecx
1500 cycles for 100 * pushad
632 cycles for 100 * pop retadd, pop arg, push retadd
426 cycles for 100 * pop retadd, pop arg, jmp retadd
429 cycles for 100 * mov eax, arg/ret
968 cycles for 100 * push esi edi ebx ecx
1493 cycles for 100 * pushad
632 cycles for 100 * pop retadd, pop arg, push retadd
427 cycles for 100 * pop retadd, pop arg, jmp retadd
428 cycles for 100 * mov eax, arg/ret
975 cycles for 100 * push esi edi ebx ecx
1497 cycles for 100 * pushad
11 bytes for pop retadd, pop arg, push retadd
11 bytes for pop retadd, pop arg, jmp retadd
15 bytes for mov eax, arg/ret
31 bytes for push esi edi ebx ecx
27 bytes for pushad
Prescott with Hyper-Threading:
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
1071 cycles for 100 * pop retadd, pop arg, push retadd
699 cycles for 100 * pop retadd, pop arg, jmp retadd
876 cycles for 100 * mov eax, arg/ret
1575 cycles for 100 * push esi edi ebx ecx
3614 cycles for 100 * pushad
1083 cycles for 100 * pop retadd, pop arg, push retadd
699 cycles for 100 * pop retadd, pop arg, jmp retadd
878 cycles for 100 * mov eax, arg/ret
1539 cycles for 100 * push esi edi ebx ecx
3578 cycles for 100 * pushad
1062 cycles for 100 * pop retadd, pop arg, push retadd
701 cycles for 100 * pop retadd, pop arg, jmp retadd
919 cycles for 100 * mov eax, arg/ret
1559 cycles for 100 * push esi edi ebx ecx
3593 cycles for 100 * pushad
1059 cycles for 100 * pop retadd, pop arg, push retadd
741 cycles for 100 * pop retadd, pop arg, jmp retadd
865 cycles for 100 * mov eax, arg/ret
1547 cycles for 100 * push esi edi ebx ecx
3515 cycles for 100 * pushad
1082 cycles for 100 * pop retadd, pop arg, push retadd
701 cycles for 100 * pop retadd, pop arg, jmp retadd
866 cycles for 100 * mov eax, arg/ret
1715 cycles for 100 * push esi edi ebx ecx
3526 cycles for 100 * pushad
Thanks :icon14:
So it seems
pop edx ; ret addr
pop eax ; arg
push edx ; ret addr
is good on Core i-series CPUs but not so good on anything else. The Lingo-style jmp edx is not really an option, as you can rarely keep edx intact until the procedure returns.
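To make the two variants concrete, here is a minimal MASM sketch of a one-argument procedure written both ways. The label names and the empty bodies are assumptions for illustration only; the actual test code is not shown in this thread.

```asm
.686
.model flat
.code

; Caller for either variant (hypothetical):
;     push 123
;     call OneArgPushRet      ; or: call OneArgJmpRet

OneArgPushRet:
    pop edx                   ; edx = return address
    pop eax                   ; eax = the argument, stack is now clean
    push edx                  ; put the return address back
    ; ... body: edx is free to be overwritten here ...
    ret                       ; plain ret, no "ret 4" needed

OneArgJmpRet:
    pop edx                   ; edx = return address
    pop eax                   ; eax = the argument
    ; ... body: must leave edx untouched ...
    jmp edx                   ; return by jumping to the saved address

end
```

In the second form, any API call inside the body may trash edx (it is a volatile register under the usual Win32 convention), so the return address would have to be saved somewhere else first.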
When we have 3 or 4 parameters or more, is it quicker to pass a structure?
Intel(R) Core(TM) i3-4150 CPU @ 3.50GHz (SSE4)
424 cycles for 100 * pop retadd, pop arg, push retadd
451 cycles for 100 * pop retadd, pop arg, jmp retadd
520 cycles for 100 * mov eax, arg/ret
804 cycles for 100 * push esi edi ebx ecx
2234 cycles for 100 * pushad
424 cycles for 100 * pop retadd, pop arg, push retadd
442 cycles for 100 * pop retadd, pop arg, jmp retadd
512 cycles for 100 * mov eax, arg/ret
797 cycles for 100 * push esi edi ebx ecx
2198 cycles for 100 * pushad
422 cycles for 100 * pop retadd, pop arg, push retadd
442 cycles for 100 * pop retadd, pop arg, jmp retadd
526 cycles for 100 * mov eax, arg/ret
791 cycles for 100 * push esi edi ebx ecx
2184 cycles for 100 * pushad
420 cycles for 100 * pop retadd, pop arg, push retadd
439 cycles for 100 * pop retadd, pop arg, jmp retadd
519 cycles for 100 * mov eax, arg/ret
792 cycles for 100 * push esi edi ebx ecx
2188 cycles for 100 * pushad
422 cycles for 100 * pop retadd, pop arg, push retadd
439 cycles for 100 * pop retadd, pop arg, jmp retadd
520 cycles for 100 * mov eax, arg/ret
794 cycles for 100 * push esi edi ebx ecx
2185 cycles for 100 * pushad
11 bytes for pop retadd, pop arg, push retadd
11 bytes for pop retadd, pop arg, jmp retadd
15 bytes for mov eax, arg/ret
31 bytes for push esi edi ebx ecx
27 bytes for pushad
With up to 3 arguments, register passing is usually a lot faster, as it has no stack overhead at all. Basically it's a roll-your-own version of fastcall.
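A minimal sketch of what such a register-passing procedure can look like. Which register carries which argument is purely an assumption of this example (Microsoft's own fastcall uses ecx and edx for the first two arguments); caller and callee just have to agree. The name Sum3Reg is made up.

```asm
.686
.model flat
.code

; Hypothetical caller:
;     mov eax, 10
;     mov ecx, 20
;     mov edx, 30
;     call Sum3Reg            ; result comes back in eax

Sum3Reg:
    add eax, ecx              ; eax = arg1 + arg2
    add eax, edx              ; eax = arg1 + arg2 + arg3
    ret                       ; nothing was pushed, so nothing to clean up

end
```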
AMD Athlon(tm) II P360 Dual-Core Processor (SSE3)
496 cycles for 100 * pop retadd, pop arg, push retadd
424 cycles for 100 * pop retadd, pop arg, jmp retadd
425 cycles for 100 * mov eax, arg/ret
733 cycles for 100 * push esi edi ebx ecx
1231 cycles for 100 * pushad
470 cycles for 100 * pop retadd, pop arg, push retadd
426 cycles for 100 * pop retadd, pop arg, jmp retadd
428 cycles for 100 * mov eax, arg/ret
733 cycles for 100 * push esi edi ebx ecx
1231 cycles for 100 * pushad
642 cycles for 100 * pop retadd, pop arg, push retadd
424 cycles for 100 * pop retadd, pop arg, jmp retadd
425 cycles for 100 * mov eax, arg/ret
733 cycles for 100 * push esi edi ebx ecx
1236 cycles for 100 * pushad
475 cycles for 100 * pop retadd, pop arg, push retadd
425 cycles for 100 * pop retadd, pop arg, jmp retadd
425 cycles for 100 * mov eax, arg/ret
733 cycles for 100 * push esi edi ebx ecx
1231 cycles for 100 * pushad
471 cycles for 100 * pop retadd, pop arg, push retadd
426 cycles for 100 * pop retadd, pop arg, jmp retadd
428 cycles for 100 * mov eax, arg/ret
738 cycles for 100 * push esi edi ebx ecx
1231 cycles for 100 * pushad
11 bytes for pop retadd, pop arg, push retadd
11 bytes for pop retadd, pop arg, jmp retadd
15 bytes for mov eax, arg/ret
31 bytes for push esi edi ebx ecx
27 bytes for pushad
--- ok ---
Quote from: hutch-- on December 30, 2015, 08:02:15 PM
With up to 3 arguments, register passing is usually a lot faster, as it has no stack overhead at all. Basically it's a roll-your-own version of fastcall.
But you must move your args into the regs, unless they are already there. In practice, there is not much difference; see the last two entries of each run below.
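To illustrate the point, here is a hedged sketch of the two call sequences when the arguments start out in memory. The names arg1, arg2, TwoArgsStack, TwoArgsReg and the two caller labels are made up for this example; this is not the code behind the timings below.

```asm
.686
.model flat

.data
arg1 dd 111                   ; hypothetical arguments held in memory
arg2 dd 222

.code

TwoArgsStack:                 ; callee pops its own arguments
    pop edx                   ; return address
    pop eax                   ; first argument (pushed last)
    pop ecx                   ; second argument
    push edx                  ; restore the return address
    add eax, ecx
    ret

TwoArgsReg:                   ; expects eax and ecx to be loaded already
    add eax, ecx
    ret

CallViaStack:
    push arg2                 ; memory read plus stack write per argument
    push arg1
    call TwoArgsStack
    ret

CallViaRegs:
    mov eax, arg1             ; one memory read per argument
    mov ecx, arg2
    call TwoArgsReg
    ret

end
```

Either way each argument is read from memory once; the stack version additionally writes it to the stack and pops it back, while the register version skips that round trip.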
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
470 cycles for 100 * pop retadd, pop arg, push retadd
490 cycles for 100 * pop retadd, pop arg, jmp retadd
569 cycles for 100 * mov eax, arg/ret
879 cycles for 100 * push esi edi ebx ecx
2199 cycles for 100 * pushad
473 cycles for 100 * popretadd, 2 args
470 cycles for 100 * 2 args via reg
470 cycles for 100 * pop retadd, pop arg, push retadd
486 cycles for 100 * pop retadd, pop arg, jmp retadd
526 cycles for 100 * mov eax, arg/ret
875 cycles for 100 * push esi edi ebx ecx
2189 cycles for 100 * pushad
472 cycles for 100 * popretadd, 2 args
472 cycles for 100 * 2 args via reg
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
464 cycles for 100 * pop retadd, pop arg, push retadd
477 cycles for 100 * pop retadd, pop arg, jmp retadd
555 cycles for 100 * mov eax, arg/ret
778 cycles for 100 * push esi edi ebx ecx
2186 cycles for 100 * pushad
465 cycles for 100 * popretadd, 2 args
465 cycles for 100 * 2 args via reg
464 cycles for 100 * pop retadd, pop arg, push retadd
478 cycles for 100 * pop retadd, pop arg, jmp retadd
544 cycles for 100 * mov eax, arg/ret
778 cycles for 100 * push esi edi ebx ecx
2185 cycles for 100 * pushad
465 cycles for 100 * popretadd, 2 args
466 cycles for 100 * 2 args via reg
464 cycles for 100 * pop retadd, pop arg, push retadd
478 cycles for 100 * pop retadd, pop arg, jmp retadd
541 cycles for 100 * mov eax, arg/ret
777 cycles for 100 * push esi edi ebx ecx
2183 cycles for 100 * pushad
463 cycles for 100 * popretadd, 2 args
464 cycles for 100 * 2 args via reg
464 cycles for 100 * pop retadd, pop arg, push retadd
478 cycles for 100 * pop retadd, pop arg, jmp retadd
541 cycles for 100 * mov eax, arg/ret
778 cycles for 100 * push esi edi ebx ecx
2184 cycles for 100 * pushad
465 cycles for 100 * popretadd, 2 args
464 cycles for 100 * 2 args via reg
464 cycles for 100 * pop retadd, pop arg, push retadd
479 cycles for 100 * pop retadd, pop arg, jmp retadd
551 cycles for 100 * mov eax, arg/ret
778 cycles for 100 * push esi edi ebx ecx
2185 cycles for 100 * pushad
464 cycles for 100 * popretadd, 2 args
465 cycles for 100 * 2 args via reg
11 bytes for pop retadd, pop arg, push retadd
11 bytes for pop retadd, pop arg, jmp retadd
15 bytes for mov eax, arg/ret
31 bytes for push esi edi ebx ecx
27 bytes for pushad
13 bytes for popretadd, 2 args
15 bytes for 2 args via reg
pre-P4 (SSE1)
608 cycles for 100 * pop retadd, pop arg, push retadd
517 cycles for 100 * pop retadd, pop arg, jmp retadd
711 cycles for 100 * mov eax, arg/ret
1519 cycles for 100 * push esi edi ebx ecx
2150 cycles for 100 * pushad
809 cycles for 100 * popretadd, 2 args
504 cycles for 100 * 2 args via reg
610 cycles for 100 * pop retadd, pop arg, push retadd
518 cycles for 100 * pop retadd, pop arg, jmp retadd
711 cycles for 100 * mov eax, arg/ret
1519 cycles for 100 * push esi edi ebx ecx
2145 cycles for 100 * pushad
810 cycles for 100 * popretadd, 2 args
504 cycles for 100 * 2 args via reg
613 cycles for 100 * pop retadd, pop arg, push retadd
517 cycles for 100 * pop retadd, pop arg, jmp retadd
711 cycles for 100 * mov eax, arg/ret
1521 cycles for 100 * push esi edi ebx ecx
2146 cycles for 100 * pushad
811 cycles for 100 * popretadd, 2 args
504 cycles for 100 * 2 args via reg
608 cycles for 100 * pop retadd, pop arg, push retadd
517 cycles for 100 * pop retadd, pop arg, jmp retadd
711 cycles for 100 * mov eax, arg/ret
1538 cycles for 100 * push esi edi ebx ecx
2148 cycles for 100 * pushad
814 cycles for 100 * popretadd, 2 args
504 cycles for 100 * 2 args via reg
608 cycles for 100 * pop retadd, pop arg, push retadd
530 cycles for 100 * pop retadd, pop arg, jmp retadd
711 cycles for 100 * mov eax, arg/ret
1518 cycles for 100 * push esi edi ebx ecx
2148 cycles for 100 * pushad
810 cycles for 100 * popretadd, 2 args
516 cycles for 100 * 2 args via reg
11 bytes for pop retadd, pop arg, push retadd
11 bytes for pop retadd, pop arg, jmp retadd
15 bytes for mov eax, arg/ret
31 bytes for push esi edi ebx ecx
27 bytes for pushad
13 bytes for popretadd, 2 args
15 bytes for 2 args via reg
--- ok ---
As usual, AMD says "up yours"
AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G (SSE4)
539 cycles for 100 * pop retadd, pop arg, push retadd
752 cycles for 100 * pop retadd, pop arg, jmp retadd
610 cycles for 100 * mov eax, arg/ret
1070 cycles for 100 * push esi edi ebx ecx
1998 cycles for 100 * pushad
718 cycles for 100 * popretadd, 2 args
330 cycles for 100 * 2 args via reg
548 cycles for 100 * pop retadd, pop arg, push retadd
610 cycles for 100 * pop retadd, pop arg, jmp retadd
615 cycles for 100 * mov eax, arg/ret
1091 cycles for 100 * push esi edi ebx ecx
1996 cycles for 100 * pushad
735 cycles for 100 * popretadd, 2 args
338 cycles for 100 * 2 args via reg
557 cycles for 100 * pop retadd, pop arg, push retadd
557 cycles for 100 * pop retadd, pop arg, jmp retadd
631 cycles for 100 * mov eax, arg/ret
1078 cycles for 100 * push esi edi ebx ecx
2010 cycles for 100 * pushad
739 cycles for 100 * popretadd, 2 args
339 cycles for 100 * 2 args via reg
546 cycles for 100 * pop retadd, pop arg, push retadd
722 cycles for 100 * pop retadd, pop arg, jmp retadd
649 cycles for 100 * mov eax, arg/ret
1085 cycles for 100 * push esi edi ebx ecx
1982 cycles for 100 * pushad
760 cycles for 100 * popretadd, 2 args
339 cycles for 100 * 2 args via reg
568 cycles for 100 * pop retadd, pop arg, push retadd
746 cycles for 100 * pop retadd, pop arg, jmp retadd
614 cycles for 100 * mov eax, arg/ret
1078 cycles for 100 * push esi edi ebx ecx
1998 cycles for 100 * pushad
735 cycles for 100 * popretadd, 2 args
356 cycles for 100 * 2 args via reg
11 bytes for pop retadd, pop arg, push retadd
11 bytes for pop retadd, pop arg, jmp retadd
15 bytes for mov eax, arg/ret
31 bytes for push esi edi ebx ecx
27 bytes for pushad
13 bytes for popretadd, 2 args
15 bytes for 2 args via reg
AMD Athlon(tm) II X2 220 Processor (SSE3)
636 cycles for 100 * pop retadd, pop arg, push retadd
783 cycles for 100 * pop retadd, pop arg, jmp retadd
430 cycles for 100 * mov eax, arg/ret
968 cycles for 100 * push esi edi ebx ecx
1478 cycles for 100 * pushad
856 cycles for 100 * popretadd, 2 args
428 cycles for 100 * 2 args via reg
632 cycles for 100 * pop retadd, pop arg, push retadd
426 cycles for 100 * pop retadd, pop arg, jmp retadd
429 cycles for 100 * mov eax, arg/ret
969 cycles for 100 * push esi edi ebx ecx
1477 cycles for 100 * pushad
856 cycles for 100 * popretadd, 2 args
429 cycles for 100 * 2 args via reg
632 cycles for 100 * pop retadd, pop arg, push retadd
981 cycles for 100 * pop retadd, pop arg, jmp retadd
431 cycles for 100 * mov eax, arg/ret
968 cycles for 100 * push esi edi ebx ecx
1480 cycles for 100 * pushad
856 cycles for 100 * popretadd, 2 args
431 cycles for 100 * 2 args via reg
632 cycles for 100 * pop retadd, pop arg, push retadd
427 cycles for 100 * pop retadd, pop arg, jmp retadd
428 cycles for 100 * mov eax, arg/ret
969 cycles for 100 * push esi edi ebx ecx
1477 cycles for 100 * pushad
857 cycles for 100 * popretadd, 2 args
472 cycles for 100 * 2 args via reg
632 cycles for 100 * pop retadd, pop arg, push retadd
426 cycles for 100 * pop retadd, pop arg, jmp retadd
428 cycles for 100 * mov eax, arg/ret
968 cycles for 100 * push esi edi ebx ecx
1477 cycles for 100 * pushad
857 cycles for 100 * popretadd, 2 args
428 cycles for 100 * 2 args via reg
11 bytes for pop retadd, pop arg, push retadd
11 bytes for pop retadd, pop arg, jmp retadd
15 bytes for mov eax, arg/ret
31 bytes for push esi edi ebx ecx
27 bytes for pushad
13 bytes for popretadd, 2 args
15 bytes for 2 args via reg
AMD Athlon(tm) II X2 250 Processor (SSE3)
1463 cycles for 100 * pop retadd, pop arg, push retadd
428 cycles for 100 * pop retadd, pop arg, jmp retadd
438 cycles for 100 * mov eax, arg/ret
979 cycles for 100 * push esi edi ebx ecx
1489 cycles for 100 * pushad
857 cycles for 100 * popretadd, 2 args
496 cycles for 100 * 2 args via reg
1438 cycles for 100 * pop retadd, pop arg, push retadd
426 cycles for 100 * pop retadd, pop arg, jmp retadd
428 cycles for 100 * mov eax, arg/ret
970 cycles for 100 * push esi edi ebx ecx
1490 cycles for 100 * pushad
869 cycles for 100 * popretadd, 2 args
428 cycles for 100 * 2 args via reg
1331 cycles for 100 * pop retadd, pop arg, push retadd
1218 cycles for 100 * pop retadd, pop arg, jmp retadd
428 cycles for 100 * mov eax, arg/ret
979 cycles for 100 * push esi edi ebx ecx
1488 cycles for 100 * pushad
866 cycles for 100 * popretadd, 2 args
439 cycles for 100 * 2 args via reg
642 cycles for 100 * pop retadd, pop arg, push retadd
427 cycles for 100 * pop retadd, pop arg, jmp retadd
440 cycles for 100 * mov eax, arg/ret
984 cycles for 100 * push esi edi ebx ecx
1557 cycles for 100 * pushad
867 cycles for 100 * popretadd, 2 args
439 cycles for 100 * 2 args via reg
769 cycles for 100 * pop retadd, pop arg, push retadd
427 cycles for 100 * pop retadd, pop arg, jmp retadd
429 cycles for 100 * mov eax, arg/ret
979 cycles for 100 * push esi edi ebx ecx
1491 cycles for 100 * pushad
866 cycles for 100 * popretadd, 2 args
430 cycles for 100 * 2 args via reg
11 bytes for pop retadd, pop arg, push retadd
11 bytes for pop retadd, pop arg, jmp retadd
15 bytes for mov eax, arg/ret
31 bytes for push esi edi ebx ecx
27 bytes for pushad
13 bytes for popretadd, 2 args
15 bytes for 2 args via reg
--- ok ---