Without source code it's hard to tell, but if you are using "mov rbx,OFFSET var", that is longer than "lea rbx,[var]" (3 bytes extra, since the mov carries a full 64-bit address).
With ML64, lea uses rip-relative addressing; the only catch is that var needs to be within +-2GB of the instruction.
This allows for proper position independent code too.
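For comparison, a rough sketch of the two encodings with 64-bit registers (var stands for any data label):
mov rbx,OFFSET var ;48 BB + 8-byte absolute address = 10 bytes, needs a base relocation
lea rbx,var ;48 8D 1D + 4-byte rip-relative displacement = 7 bytes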
Another cause of bloat is adjusting the stack for each API call, which can end up as code like:
sub rsp,20h
call API_1
add rsp,20h
sub rsp,20h
call API_2
add rsp,20h
sub rsp,20h
call API_3
add rsp,20h
...
If you write your own prologue, you can pass "maximum param bytes" in the PROC declaration and adjust the stack once per procedure instead of once per call.
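As a rough sketch, the three calls above with a single adjustment could look like this. It is written by hand rather than generated by a prologue macro, MyProc is a made-up name, and it assumes none of the APIs takes more than four parameters, so 20h bytes of shadow space is all the parameter area needed:
MyProc PROC
sub rsp,28h ;20h shadow space + 8 so rsp is 16-byte aligned at each call
call API_1
call API_2
call API_3
add rsp,28h ;release it once before returning
ret
MyProc ENDP
The extra 8 is only there because the return address leaves rsp misaligned on entry; if the prologue already pushes an odd number of registers, 20h would do.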
One big gotcha is what happens to the upper 32 bits of a register when you manipulate the lower 32 bits.
"sub eax,eax" will zero the top 32 bits of rax, so it is the same as "sub rax,rax" except one byte smaller (no REX prefix).
A common way to get two 16-bit numbers into a 32-bit register stops working when it is scaled up from 16/32 bits to 32/64 bits:
mov ax,high16
shl eax,16
mov ax,dx
;
mov eax,high32
shl rax,32
mov eax,edx ;oops, high32 of rax now 0
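The 16-bit version works because a write to ax leaves the rest of eax alone, whereas a write to eax clears bits 32-63 of rax. One possible way around it, assuming rdx is free to be modified, is to combine the halves with "or" instead of a partial-register mov:
mov eax,high32 ;also zero-extends into rax
shl rax,32 ;high32 now sits in bits 32-63
mov edx,edx ;zero the top half of rdx in case it holds junk
or rax,rdx ;rax = high32:low32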