Successive Writes vs PUSHes

hutch-- · June 14, 2017, 12:21:01 PM

The thing going against using push/pop in 64 bit is alignment. It may be a familiar technique from 32 bit and earlier, but with the Microsoft ABI you are then stuck with manual stack twiddling to get procedures to work. Arguments beyond the first 4 registers in the ABI are written to stack slots that are each 64 bit aligned, so BYTE, WORD, DWORD and QWORD values can all be stored without changing the RSP stack pointer.

It may be character building to play with manual stack adjustments just to get a procedure to run, but if reliable code is the target, the last thing you want is pissing around aligning the stack. For slightly more typing, you can create a local for each register you need to preserve, copy the register into the local on the way in and copy it back on the way out. You get direct register / memory writes both ways without messing up the stack. It looks like this.

LOCAL .r12 :QWORD
LOCAL .r13 :QWORD
LOCAL .r14 :QWORD
LOCAL .r15 :QWORD

mov .r12, r12
mov .r13, r13
mov .r14, r14
mov .r15, r15

; socket 2 'em

mov r12, .r12
mov r13, .r13
mov r14, .r14
mov r15, .r15

ret

Ah look mum, no stack twiddling. :P
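
For readers without the macro setup used above, here is a minimal sketch of the same idea in plain ml64 syntax. It assumes explicit [rsp+xx] slots instead of LOCAL names. SaveWithMov is a hypothetical name, and the unwind directives a production non-leaf procedure would normally carry are left out for brevity.

```asm
; Sketch only: plain ml64, no macros. SaveWithMov is a hypothetical name.
.code
SaveWithMov PROC
    sub rsp, 48h            ; 20h home space for callees + 20h save area + 8 padding
                            ; entry RSP is 16n+8, so RSP is now 16-byte aligned

    mov [rsp+20h], r12      ; save the non-volatile registers with plain stores
    mov [rsp+28h], r13
    mov [rsp+30h], r14
    mov [rsp+38h], r15

    ; ... body: free to use r12-r15 and to call other functions,
    ;     RSP does not move again until the exit code ...

    mov r12, [rsp+20h]      ; restore in any order, no pop sequence to mirror
    mov r13, [rsp+28h]
    mov r14, [rsp+30h]
    mov r15, [rsp+38h]

    add rsp, 48h            ; single adjustment on the way out
    ret
SaveWithMov ENDP
END
```

RSP is adjusted once on entry and once on exit, so the alignment is settled by one constant rather than by counting pushes.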

sinsi · June 14, 2017, 01:49:34 PM

Don't forget that in 64-bit you can use the spill space for any register storage, not just RCX/RDX etc.
I have seen Windows APIs store RBX (and even XMM0) at [rsp+8].
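
A minimal sketch of that use of the spill (home) space, assuming a routine that makes no calls, so the 32 bytes the caller reserved above the return address stay untouched by anyone else. TripleIt is a hypothetical name, and the unwind directives that would normally describe the RBX save are omitted.

```asm
; Sketch only: plain ml64. TripleIt is a hypothetical routine returning 3 * RCX.
.code
TripleIt PROC
    mov [rsp+8], rbx        ; [rsp] is the return address, [rsp+8] is the first home slot
    mov rbx, rcx            ; use RBX as scratch
    lea rax, [rbx+rbx*2]    ; rax = 3 * rcx
    mov rbx, [rsp+8]        ; restore RBX from the home slot
    ret
TripleIt ENDP
END
```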

My main bugbear with "mov [rsp+xx]" is the code size, e.g. CreateWindowEx has 12 parameters, using push cuts down on code size.
Not so much of a problem in 32-bit land.
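
To put numbers on the code size point, here is a sketch of a call that passes two stack arguments, with the instruction encodings noted in the comments. SixArgs is a hypothetical 6-argument function, not a real API.

```asm
; Sketch only: plain ml64. SixArgs is a hypothetical external function.
EXTERN SixArgs:PROC

.code
CallSixArgs PROC
    sub rsp, 38h                  ; 20h home space + 10h for args 5-6 + 8 to realign RSP
    mov ecx, 1                    ; arg 1
    mov edx, 2                    ; arg 2
    mov r8d, 3                    ; arg 3
    mov r9d, 4                    ; arg 4
    mov qword ptr [rsp+20h], 5    ; arg 5: 48 C7 44 24 20 05 00 00 00 = 9 bytes
    mov qword ptr [rsp+28h], 6    ; arg 6: 9 bytes again
    call SixArgs
    add rsp, 38h
    ret
CallSixArgs ENDP
END

; For comparison:
;   mov [rsp+28h], rax   is 48 89 44 24 28 = 5 bytes
;   push rax             is 50             = 1 byte
;   push 5               is 6A 05          = 2 bytes
```

So each stack argument costs 5 to 9 bytes as a store against 1 or 2 bytes as a push, which adds up over the eight stack arguments of a 12-parameter call.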

coder · June 14, 2017, 02:31:29 PM

Quote from: sinsi on June 14, 2017, 01:49:34 PM
Don't forget that in 64-bit you can use the spill space for any register storage, not just RCX/RDX etc.
I have seen Windows APIs store RBX (and even XMM0) at [rsp+8].

My main bugbear with "mov [rsp+xx]" is the code size, e.g. CreateWindowEx has 12 parameters, using push cuts down on code size.
Not so much of a problem in 32-bit land.

I am wondering where the segfaults originate when the stack is accidentally misaligned on entry. I suspect that the entry code of most 64-bit APIs uses SSE/AVX for internal housekeeping. I can't think of anything else that could trigger such faults.
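
One concrete way a misaligned stack can fault is an aligned SSE store to a stack slot: MOVAPS (like MOVDQA) raises a general protection fault when its memory operand is not 16-byte aligned. A minimal sketch follows, assuming the caller followed the ABI so that RSP is 16n+8 on entry. Both procedure names are hypothetical, and this shows only one possible cause, not what any particular API actually does.

```asm
; Sketch only: plain ml64. Both procedure names are hypothetical.
.code
AlignedStore PROC
    sub rsp, 18h                    ; 16n+8 minus 18h is a multiple of 16
    movaps xmmword ptr [rsp], xmm0  ; fine: the address is 16-byte aligned
    add rsp, 18h
    ret
AlignedStore ENDP

MisalignedStore PROC
    sub rsp, 10h                    ; RSP is still 16n+8
    movaps xmmword ptr [rsp], xmm0  ; faults: the address is not 16-byte aligned
    add rsp, 10h
    ret
MisalignedStore ENDP
END
```

The same fault appears in a correctly written callee if its caller arrives with RSP off by 8, because every offset the callee computed assumed the ABI alignment.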

coder · June 14, 2017, 02:44:04 PM

Thanks people for testing it out on your machines.
I was actually investigating 64-bit code but forgot that I was on 32-bit Win 7 in the last session, hence the 32-bit attachment as a temporary proof of concept.

My 64-bit AMD PC (64-bit code) shows that after 7 back-to-back writes, MOV performs far worse than single-byte PUSH in terms of speed. Maybe you can investigate more and share your findings. Thanks.

coder · June 14, 2017, 02:45:53 PM

Quote from: hutch-- on June 14, 2017, 12:21:01 PM
The thing going against using push/pop in 64 bit is alignment. It may be a familiar technique from 32 bit and earlier, but with the Microsoft ABI you are then stuck with manual stack twiddling to get procedures to work. Arguments beyond the first 4 registers in the ABI are written to stack slots that are each 64 bit aligned, so BYTE, WORD, DWORD and QWORD values can all be stored without changing the RSP stack pointer.

It may be character building to play with manual stack adjustments just to get a procedure to run, but if reliable code is the target, the last thing you want is pissing around aligning the stack. For slightly more typing, you can create a local for each register you need to preserve, copy the register into the local on the way in and copy it back on the way out. You get direct register / memory writes both ways without messing up the stack. It looks like this.

LOCAL .r12 :QWORD
LOCAL .r13 :QWORD
LOCAL .r14 :QWORD
LOCAL .r15 :QWORD

mov .r12, r12
mov .r13, r13
mov .r14, r14
mov .r15, r15

; socket 2 'em

mov r12, .r12
mov r13, .r13
mov r14, .r14
mov r15, .r15

ret

Ah look mum, no stack twiddling. :P

:t

coder · June 14, 2017, 03:39:54 PM

Since I am on a Linux server right now, I can't write MASM code. Here's a quick equivalent test (64-bit) in NASM syntax for Linux.

A colleague helped me test it on a different PC, with completely different results: MOV is faster all the way, no matter how many writes are performed, compared to PUSHes. I don't know...

admin@mint ~/nasm $ ./time64
PUSH = 425
MOV  = 284
admin@mint ~/nasm $ ./time64
PUSH = 423
MOV  = 282
admin@mint ~/nasm $ ./time64
PUSH = 426
MOV  = 281
admin@mint ~/nasm $ ./time64
PUSH = 419
MOV  = 284
admin@mint ~/nasm $ ./time64
PUSH = 426
MOV  = 282
admin@mint ~/nasm $ ./time64
PUSH = 423
MOV  = 284

hutch-- · June 14, 2017, 04:29:31 PM

> MOV is faster all the way, no matter how many writes are performed, compared to PUSHes.

This is normally the case, "mov" just transfers data where "push" must transfer data AND update the stack pointer.
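
A minimal sketch of that difference, written out as two hypothetical procedures with the same architectural effect. LEA is used for the pointer update because, like PUSH and POP, it leaves the flags alone. Whether the extra pointer update shows up in timings depends on the CPU.

```asm
; Sketch only: plain ml64. PushForm and MovForm are hypothetical names.
.code
PushForm PROC
    push rax                ; RSP = RSP - 8, then store RAX at [RSP]
    pop  rax                ; load RAX from [RSP], then RSP = RSP + 8
    ret
PushForm ENDP

MovForm PROC
    lea rsp, [rsp-8]        ; the stack pointer update that PUSH performs implicitly
    mov [rsp], rax          ; the data transfer
    mov rax, [rsp]          ; the data transfer of POP
    lea rsp, [rsp+8]        ; the stack pointer update that POP performs implicitly
    ret
MovForm ENDP
END
```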

jj2007 · June 14, 2017, 05:35:56 PM

Quote from: coder on June 14, 2017, 03:39:54 PM
on a different PC, with completely different results

Without indicating which CPU was used, this is not a very interesting result. What counts is the CPU and the type of memory used.

Quote from: hutch-- on June 14, 2017, 04:29:31 PM
"mov" just transfers data where "push" must transfer data AND update the stack pointer.

This is also true for rep stosd. Real world results depend on how the CPU handles that task. With the exception of that Linux result, every case so far has shown that push and mov are equally fast. Since the bottleneck is memory access, and not the adjustment of an internal CPU register that is probably handled in parallel, any other outcome would have been surprising.
All other

FORTRANS · June 15, 2017, 01:23:11 AM

Hi Jochen,

Quote from: jj2007 on June 14, 2017, 08:22:20 AM
Steve, I rarely use the timeit >results.txt version, but I wouldn't have suggested that without testing it OK, on Win7-64. Now I have also tested it on WinXP SP3 and Win 10.0.15063 - no problems. Which Windows version are you using?

Well I tested it with Windows 2000. I went and tested it on
Windows 98, where it also failed. I tested it with Windows XP
and it does work as expected.

I tried reassembling it, as that helps sometimes. But no change
in results this time. Executable files are the same size. But a
file compare program shows differences. ML 6.14 and Polink
were used.

Regards,

Steve N.

coder · June 15, 2017, 03:33:42 PM

Quote from: hutch-- on June 14, 2017, 04:29:31 PM
> MOV is faster all the way, no matter how many writes are performed, compared to PUSHes.

This is normally the case, "mov" just transfers data where "push" must transfer data AND update the stack pointer.

I think PUSH does more than that. Looking at the PUSH definition, I see lots of other chores involved. Worse with POP. The microcode involved is really not necessary. OTOH, MOV can take full advantage of out-of-order execution. This is especially true in 64-bit, which is why I see major players coming up with their own calling conventions that avoid PUSH/POP or anything else that moves the stack. But I have mixed results with 32-bit.

hutch-- · June 15, 2017, 04:33:10 PM

The only real problem with Win32 is the lack of registers in comparison to Win64. If you don't mind non-standard calling techniques and the argument count fits into 3 registers (eax, ecx and edx), you can set up a low calling overhead version of fastcall that frees you of all the overhead associated with push/call notation. If you still run a stack frame you can use locals for any extra registers you need to preserve, but as usual you have to weigh the advantage of a lower overhead calling system, as it hits diminishing returns as the procedure you call gets longer.

Once you call an operating system function, any advantage goes out the window. The real gain with low calling overhead techniques is with very short leaf procedures, where you can actually see a speed gain by dropping the calling overhead. If you don't mind cluttering the caller code, inlining a small algo is faster again.
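
A minimal sketch of such a register-only call in 32-bit MASM, assuming a private convention where eax, ecx and edx carry the three arguments and eax returns the result. Add3 and Caller are hypothetical names, and this is not the standard fastcall convention.

```asm
; Sketch only: 32-bit MASM (ml). Add3 and Caller are hypothetical names.
.686
.model flat
.code

; Leaf procedure: arguments arrive in eax, ecx and edx, result returned in eax.
; No stack frame, no pushes, nothing to clean up.
Add3 PROC
    add eax, ecx
    add eax, edx
    ret
Add3 ENDP

Caller PROC
    mov eax, 1              ; first argument
    mov ecx, 2              ; second argument
    mov edx, 3              ; third argument
    call Add3               ; eax = 6 on return
    ret
Caller ENDP
END
```

The whole calling overhead is the call/ret pair, which is why the gain is only visible when the procedure body is about this short.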

It may be worth the effort to look up the effects of running 32 bit code on a 64 bit processor as it is run in "legacy" mode and may affect the efficiency of the 32 bit code.

jj2007 · June 15, 2017, 05:08:36 PM

Quote from: coder on June 15, 2017, 03:33:42 PM
Looking at the PUSH definition, I see lots of other chores involved. Worse with POP.

More precisely? And how is that different to rep movsd, the fastest way to copy values mem to mem?

coder · June 15, 2017, 05:49:16 PM

Quote from: jj2007 on June 15, 2017, 05:08:36 PM
Quote from: coder on June 15, 2017, 03:33:42 PM
Looking at the PUSH definition, I see lots of other chores involved. Worse with POP.

More precisely? And how is that different to rep movsd, the fastest way to copy values mem to mem?

We are talking about the stack in a much more general sense, not specific or selected processors. Whether a processor implements dedicated circuitry to make certain instructions fast (like rep movsd) is really not my concern at this moment.

coder · June 15, 2017, 05:59:36 PM

Quote from: hutch-- on June 15, 2017, 04:33:10 PM
It may be worth the effort to look up the effects of running 32 bit code on a 64 bit processor as it is run in "legacy" mode and may affect the efficiency of the 32 bit code.

Many people still don't get the idea that legacy CISC-inspired instructions don't run all that well on modern 64-bit CPUs. Otherwise the manufacturers have to dedicate special implementations (circuitry/techniques) to such legacy instructions to keep them at least on par with RISC-style instructions in terms of speed and power consumption.

jj2007 · June 15, 2017, 06:02:03 PM

Quote
We are talking about the stack in a much more general sense, not specific or selected processors. Whether a processor implements dedicated circuitry to make certain instructions fast (like rep movsd) is really not my concern at this moment.

Thank you, that explains it all, of course. And why do our results show that push/pop and mov are equally fast?
