Author Topic: x86 Machine Code Statistics  (Read 1717 times)

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 6676
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Re: x86 Machine Code Statistics
« Reply #15 on: February 21, 2019, 03:10:09 PM »
> Microsoft compilers push ebp at the beginning but don't pop ebp at the end; they do a leave.  :t

That is a characteristic of 32 bit MASM as well.
hutch at movsd dot com
http://www.masm32.com    :biggrin:  :skrewy:

jj2007

  • Member
  • *****
  • Posts: 9697
  • Assembler is fun ;-)
    • MasmBasic
Re: x86 Machine Code Statistics
« Reply #16 on: February 21, 2019, 03:12:58 PM »
> Microsoft compilers push ebp at the beginning but don't pop ebp at the end; they do a leave.

Not all of them, apparently:

Code: [Select]
char* somefunc(char* x, char* y) {
  char *instr;
  _asm int 3;
  instr=strstr(x, y);
  printf("in the func: %s\n", instr);
  return strstr(x, y);
}
Code: [Select]
Microsoft (R) C/C++ Optimizing Compiler Version 19.00.24215.1 for x86
Code: [Select]
00F7B490      ┌$  55               push ebp                      ; Tmp.00F7B490(guessed Arg1,Arg2)
00F7B491      │.  8BEC             mov ebp, esp
00F7B493      │.  51               push ecx
00F7B494      │.  CC               int3
...
00F7B4C9      │.  8BE5             mov esp, ebp
00F7B4CB      │.  5D               pop ebp
00F7B4CC      └.  C3               retn

> That is a characteristic of 32 bit MASM as well.

Masm32 returns indeed with leave & ret. Test it:
Code: [Select]
include \masm32\include\masm32rt.inc

.code
start:
  int 3
  inkey str$(find$(1, "txTest", "Test"))
  exit

end start

AW

  • Member
  • *****
  • Posts: 2338
  • Let's Make ASM Great Again!
Re: x86 Machine Code Statistics
« Reply #17 on: February 21, 2019, 05:06:23 PM »
It is true, they are not doing leave anymore in C/C++, probably because they want to insert security checks in the epilogue. I don't have VC++ 6.0 installed to see how things were in those times. I would need to buy a floppy drive and install VC++ 6.0 on an XP machine to figure it out. A lot of trouble.  :(

TimoVJL

  • Member
  • ***
  • Posts: 452
Re: x86 Machine Code Statistics
« Reply #18 on: February 21, 2019, 06:16:03 PM »
With option -O2
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 12.00.8804 for 80x86
Code: [Select]
_somefunc:
00000000  56                       push esi
00000001  57                       push edi
00000002  CC                       int3
...
00000027  5F                       pop edi
00000028  5E                       pop esi
00000029  C3                       ret
Code: [Select]
_somefunc:
00000000  55                       push ebp
00000001  8BEC                     mov ebp, esp
00000003  51                       push ecx
00000004  CC                       int3
...
00000039  8BE5                     mov esp, ebp
0000003B  5D                       pop ebp
0000003C  C3                       ret
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 11.00.7022 for 80x86
Code: [Select]
00000000 55                   push ebp
00000001 8bec                 mov ebp, esp
00000003 51                   push ecx
00000004 53                   push ebx
00000005 56                   push esi
00000006 57                   push edi
00000007 cc                   int3
...
0000003c 5f                   pop edi
0000003d 5e                   pop esi
0000003e 5b                   pop ebx
0000003f 8be5                 mov esp, ebp
00000041 5d                   pop ebp
00000042 c3                   ret
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 9.00 for 80x86
Microsoft (R) 32-bit C/C++ Standard Compiler Version 10.00.6002 for 80x86
Code: [Select]
00000000 55                   push ebp
00000001 8bec                 mov ebp, esp
00000003 83ec04               sub esp, 4h
00000006 53                   push ebx
00000007 56                   push esi
00000008 57                   push edi
00000009 cc                   int3
...
00000043 5f                   pop edi
00000044 5e                   pop esi
00000045 5b                   pop ebx
00000046 c9                   leave
00000047 c3                   ret
Visual C++ 6, 5, 4.2/4, 2
May the source be with you

hutch--

Re: x86 Machine Code Statistics
« Reply #19 on: February 21, 2019, 06:36:18 PM »
Timo's 2nd example is how 32 bit CL.EXE has usually constructed a stack frame. Using LEAVE is usually a trait of MASM code added into a C executable. You used to see it in some 32 bit OS code at very low levels, where a reasonably obvious piece of code was written in MASM and linked into the application.

Don't laugh, but I see this stuff like working on a Model T Ford: cute and interesting, but why bother?  :P

jj2007

Re: x86 Machine Code Statistics
« Reply #20 on: February 21, 2019, 07:30:49 PM »
Apparently, speed-wise there is no difference:
Code: [Select]
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

576     cycles for 100 * mov+pop
570     cycles for 100 * leave

565     cycles for 100 * mov+pop
565     cycles for 100 * leave

26      bytes for mov+pop
24      bytes for leave
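
The two epilogues can also be checked for equivalence without a debugger. What follows is a minimal C sketch, not code from this thread: the `Cpu` struct and all function names are made up for illustration, and the stack is modelled as an array of dwords indexed by slot rather than by byte address. The opcode bytes are the ones visible in the listings above (`8BE5` + `5D` versus `C9`).

```c
#include <stdint.h>

/* Hypothetical toy machine: 64 stack slots, esp/ebp are slot indices. */
typedef struct {
    uint32_t mem[64];
    uint32_t esp;   /* stack grows towards lower indices */
    uint32_t ebp;
} Cpu;

/* Encodings as they appear in the disassembly listings. */
const uint8_t MOV_POP_BYTES[] = { 0x8B, 0xE5, 0x5D };  /* mov esp, ebp / pop ebp */
const uint8_t LEAVE_BYTES[]   = { 0xC9 };              /* leave */

void push(Cpu *c, uint32_t v) { c->mem[--c->esp] = v; }
uint32_t pop(Cpu *c)          { return c->mem[c->esp++]; }

/* push ebp / mov ebp, esp / sub esp, locals */
void prologue(Cpu *c, uint32_t local_slots) {
    push(c, c->ebp);
    c->ebp = c->esp;
    c->esp -= local_slots;
}

/* mov esp, ebp / pop ebp */
void epilogue_mov_pop(Cpu *c) {
    c->esp = c->ebp;
    c->ebp = pop(c);
}

/* leave: architecturally defined as ESP <- EBP, then EBP <- pop() */
void epilogue_leave(Cpu *c) {
    c->esp = c->ebp;
    c->ebp = pop(c);
}

/* Runs both epilogues on identical frames; returns 1 if both restore
   the caller's esp and ebp. local_slots must stay below 63. */
int epilogues_agree(uint32_t local_slots) {
    Cpu a = { {0}, 64, 0x12345678u };  /* arbitrary caller ebp */
    prologue(&a, local_slots);
    Cpu b = a;
    epilogue_mov_pop(&a);
    epilogue_leave(&b);
    return a.esp == b.esp && a.ebp == b.ebp
        && a.esp == 64 && a.ebp == 0x12345678u;
}
```

The only difference the model shows is the encoding size: three bytes against one, which matches the two-byte gap (26 versus 24) in the timing output.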

sinsi

  • Member
  • *****
  • Posts: 1184
Re: x86 Machine Code Statistics
« Reply #21 on: February 21, 2019, 07:47:06 PM »
The Intel 80386, part 9: Stack frame instructions

> Modern compilers avoid the ENTER instruction but keep the LEAVE instruction.
I can walk on water but stagger on beer bourbon.

hutch--

Re: x86 Machine Code Statistics
« Reply #22 on: February 21, 2019, 08:54:30 PM »
Thanks, that is a good article. In 64 bit I started with ENTER as it was simple. I later did the alternatives, which all work OK, but to be honest it was hard to tell the difference. Almost any high level code like system API code is so much slower than mnemonic code that the creation of a stack frame is trivial in comparison. In Win64 you have many more options: FASTCALL with registers for four or fewer arguments, no stack frame procedures, stack frame procedures and, if you need it, aligned procedures with no stack frame.

As always, put your effort into things that matter, with pure assembler algorithms, write them as no stack frame if possible and absolutely kick the guts out of the code to get it up to speed. The rest is folklore.

jj2007

Re: x86 Machine Code Statistics
« Reply #23 on: February 21, 2019, 09:24:20 PM »
The Intel 80386, part 9: Stack frame instructions

A little test for enter 20h, 10h inspired by the bonus chatter:
Code: [Select]
include \masm32\include\masm32rt.inc
.code

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

EnterTest proc _arg1, _arg2
  push esi
  enter 20h, 10h ; <<<<<<< there it is
  pop esi
  ret 2*4
EnterTest endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

start:
  or edx, -1
  .Repeat
    push edx
    sub edx, 11111111h
  .Until Zero?
  int 3
  invoke EnterTest, 12345678h, 87654321h
  exit
end start
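
For anyone who would rather not single-step it, the effect of `enter 20h, 10h` can be modelled in C. This is a rough sketch that follows the operation description in the Intel manual for 32-bit operand and stack size; the `Cpu` struct, `cpu_enter` and `run_enter_demo` are invented names. It assumes, unlike the real test, that ebp points just above the fifteen filler dwords pushed by the `.Repeat` loop, and it leaves out the `push esi`, the arguments and the return address.

```c
#include <stdint.h>

enum { STACK_DWORDS = 256 };

/* Hypothetical toy machine: mem[i] is the dword at byte address 4*i. */
typedef struct {
    uint32_t mem[STACK_DWORDS];
    uint32_t esp;   /* byte address */
    uint32_t ebp;   /* byte address */
} Cpu;

void cpu_push(Cpu *c, uint32_t v) {
    c->esp -= 4;
    c->mem[c->esp / 4] = v;
}

/* ENTER size, level */
void cpu_enter(Cpu *c, uint16_t size, uint8_t level) {
    uint32_t nesting = level % 32u;          /* only the low 5 bits count */
    cpu_push(c, c->ebp);                     /* save caller's frame pointer */
    uint32_t frame_temp = c->esp;
    if (nesting > 0) {
        /* copy nesting-1 frame pointers from the enclosing frame */
        for (uint32_t i = 1; i < nesting; ++i) {
            c->ebp -= 4;
            cpu_push(c, c->mem[c->ebp / 4]);
        }
        cpu_push(c, frame_temp);             /* pointer to the new frame */
    }
    c->ebp = frame_temp;
    c->esp -= size;                          /* reserve the locals */
}

/* Pushes the same fillers as the .Repeat loop (FFFFFFFFh down to 11111111h),
   points ebp just above them (an assumption), then executes ENTER. */
Cpu run_enter_demo(uint16_t size, uint8_t level, uint32_t *esp_before) {
    Cpu c = { {0}, 4 * STACK_DWORDS, 4 * STACK_DWORDS };
    for (uint32_t v = 0xFFFFFFFFu; v != 0; v -= 0x11111111u)
        cpu_push(&c, v);
    *esp_before = c.esp;
    cpu_enter(&c, size, level);
    return c;
}
```

In this model `enter 20h, 10h` moves esp down by 4 (saved ebp) + 15*4 (copied pointers) + 4 (new frame pointer) + 20h (locals) = 100 bytes, and under the stated assumption the first copied "frame pointer" at [ebp-4] is the filler FFFFFFFFh.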

AW

Re: x86 Machine Code Statistics
« Reply #24 on: February 22, 2019, 03:04:25 AM »
It was from disassembled code, so there is no exact correlation between them.
A subroutine can be called from several places, and a subroutine can have several ret's.

BTW: to see what an optimizing C/C++ compiler does, just give it a try:
x86-64 gcc 8.2 -O2 -mavx2 -ffast-math:
https://gcc.godbolt.org/
Code: [Select]
float scalarproduct(float *array1, float *array2, int length) {  /* the caller passes length = 20 */
  float sum = 0.0f;
  for (int i = 0; i < length; ++i) {
    sum += array1[i] * array2[i];
  }
  return sum;
}

x86-64 gcc 8.2 -O2 -mavx2 -ffast-math:
Code: [Select]
scalarproduct proc
        test    edx, edx
        jle     .L4
        lea     ecx, [rdx-1]
        xor     eax, eax
        vxorps  xmm0, xmm0, xmm0
        jmp     .L3
.L5:
        mov     rax, rdx
.L3:
        vmovss  xmm1, DWORD PTR [rdi+rax*4]
        vmulss  xmm1, xmm1, DWORD PTR [rsi+rax*4]
        lea     rdx, [rax+1]
        vaddss  xmm0, xmm0, xmm1
        cmp     rcx, rax
        jne     .L5
        ret
.L4:
        vxorps  xmm0, xmm0, xmm0
        ret
scalarproduct  endp

VS2017, AVX2, optimized for size (yes, it picked up the actual value of length, which is 20, from the caller):
Code: [Select]
scalarproduct PROC
vxorps xmm0, xmm0, xmm0
sub rcx, rdx
mov eax, 20
$LL8@scalarprod:
vmovss xmm1, DWORD PTR [rcx+rdx]
vmulss xmm2, xmm1, DWORD PTR [rdx]
lea rdx, QWORD PTR [rdx+4]
vaddss xmm0, xmm0, xmm2
sub rax, 1
jne SHORT $LL8@scalarprod
ret 0
scalarproduct ENDP

VS2017, AVX2, using Fast Floating Point, optimized for speed (again it picked up the value of length):
Code: [Select]
scalarproduct  PROC
vmovups ymm1, YMMWORD PTR [rcx]
vmulps ymm3, ymm1, YMMWORD PTR [rdx]
vmovups ymm1, YMMWORD PTR [rcx+32]
vmulps ymm1, ymm1, YMMWORD PTR [rdx+32]
vaddps ymm0, ymm1, ymm3
vhaddps ymm1, ymm0, ymm0
vhaddps ymm2, ymm1, ymm1
vextractf128 xmm0, ymm2, 1
vaddps xmm4, xmm0, xmm2
vmovss xmm0, DWORD PTR [rdx+68]
vmulss xmm3, xmm0, DWORD PTR [rcx+68]
vmovss xmm2, DWORD PTR [rcx+64]
vmovss xmm0, DWORD PTR [rcx+72]
vmulss xmm1, xmm0, DWORD PTR [rdx+72]
vfmadd132ss xmm2, xmm4, DWORD PTR [rdx+64]
vaddss xmm2, xmm3, xmm2
vaddss xmm3, xmm2, xmm1
vmovss xmm2, DWORD PTR [rcx+76]
vmulss xmm0, xmm2, DWORD PTR [rdx+76]
vaddss xmm0, xmm3, xmm0
vzeroupper
ret 0
scalarproduct  ENDP


Anyway, the first is built for Linux and the others for Windows, so there is no direct comparison.
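
The call site is not shown in this post. A minimal sketch of the situation, assuming a caller that passes a literal 20, could look like the code below. `demo_caller`, `demo_caller_lanes` and `scalarproduct_lanes` are made-up names, and the lane version is only a hand-written picture of the reassociation that fast floating point permits (eight independent partial sums, like one YMM register, plus a scalar tail). It is not what any compiler actually emits.

```c
/* Straightforward version, same loop as in the post. */
float scalarproduct(const float *array1, const float *array2, int length) {
    float sum = 0.0f;
    for (int i = 0; i < length; ++i) {
        sum += array1[i] * array2[i];
    }
    return sum;
}

/* Reassociated version: 8 partial sums, then a scalar remainder.
   With length = 20 this is 2 blocks of 8 plus 4 tail elements,
   the same split as in the speed-optimized listing. */
float scalarproduct_lanes(const float *a, const float *b, int length) {
    float lane[8] = { 0.0f };
    int i = 0;
    for (; i + 8 <= length; i += 8) {
        for (int k = 0; k < 8; ++k) {
            lane[k] += a[i + k] * b[i + k];
        }
    }
    float sum = 0.0f;
    for (int k = 0; k < 8; ++k) {
        sum += lane[k];             /* horizontal add of the lanes */
    }
    for (; i < length; ++i) {
        sum += a[i] * b[i];         /* scalar tail */
    }
    return sum;
}

/* Hypothetical caller: the constant 20 is visible to the optimizer,
   so it is free to drop the length check and unroll completely. */
float demo_caller(void) {
    float x[20], y[20];
    for (int i = 0; i < 20; ++i) {
        x[i] = (float)(i + 1);
        y[i] = 2.0f;
    }
    return scalarproduct(x, y, 20);
}

float demo_caller_lanes(void) {
    float x[20], y[20];
    for (int i = 0; i < 20; ++i) {
        x[i] = (float)(i + 1);
        y[i] = 2.0f;
    }
    return scalarproduct_lanes(x, y, 20);
}
```

With these small integer inputs both orders of summation are exact and agree. With arbitrary floats the reassociated sum can differ in the last bits, which is why the compiler only vectorizes this loop when fast floating point is enabled.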