The MASM Forum

General => The Laboratory => Topic started by: coder on June 13, 2017, 12:59:16 PM

Title: Succesive Writes vs PUSHes
Post by: coder on June 13, 2017, 12:59:16 PM
I wonder how many successive memory writes can be considered fast before it hits something hard, compared to multiple PUSHes. For example:

Code: [Select]
sub rsp,20h
mov [rsp],reg-a
...
mov [rsp+n*8],reg-n

vs

Code: [Select]
push reg-a
...
push reg-n

My test code puts the threshold at 6 writes, at which point the MOVs become much slower than 6 successive PUSHes.

Can we confirm this?
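
A minimal sketch of the two variants being compared, written out for four registers in ml64 syntax. The procedure names and the choice of registers are assumptions for illustration and are not taken from the attached test code; the point is only that both sequences leave the same values in the same stack slots.

```asm
; ml64 (64-bit MASM) sketch. StoreWithMov and StoreWithPush are
; hypothetical names, not part of the attached test code.
.code

StoreWithMov proc
    sub  rsp, 20h            ; reserve 4 * 8 bytes in one step
    mov  [rsp+18h], rax      ; same slot the first PUSH below fills
    mov  [rsp+10h], rcx
    mov  [rsp+08h], rdx
    mov  [rsp],     r8       ; lowest address = slot of the last PUSH
    add  rsp, 20h            ; release the space again
    ret
StoreWithMov endp

StoreWithPush proc
    push rax                 ; each PUSH stores 8 bytes and lowers rsp by 8
    push rcx
    push rdx
    push r8
    add  rsp, 20h            ; drop the four values without reloading them
    ret
StoreWithPush endp

end
```

A timing loop would call each procedure many times and compare the tick counts; extending the test means adding one MOV and one PUSH per extra register and adjusting the 20h accordingly.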

Title: Re: Succesive Writes vs PUSHes
Post by: coder on June 13, 2017, 01:53:33 PM
Sorry, I thought I was on 64-bit. This test code is for 32-bit Windows. Add more MOVs and PUSHes to test beyond 7.



Title: Re: Succesive Writes vs PUSHes
Post by: aw27 on June 13, 2017, 03:12:57 PM
Quote
sorry, i thought I was on 64-bit. This test code is for 32-bit win. Add more MOV and PUSH to test beyond 7.

We must not use pushes on 64-bit because there is a need to keep the alignment (in many cases).
Title: Re: Succesive Writes vs PUSHes
Post by: coder on June 13, 2017, 08:27:00 PM
Quote
We must not use pushes on 64-bit because there is a need to keep the alignment (in many cases).

Indeed, this has a lot to do with 64-bit assembly. For example, it can explain the peculiar design of the MS 64-bit ABI, microcode, RISC vs CISC and so on. I am not really into calling conventions and high-level stuff.
Title: Re: Succesive Writes vs PUSHes
Post by: sinsi on June 13, 2017, 08:28:15 PM
Quote
We must not use pushes on 64-bit because there is a need to keep the alignment (in many cases).

Of course you can. If you follow the ABI your proc starts with a stack alignment of 8. Push a register, bingo you are aligned to 16.
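
The alignment argument above can be sketched as follows, assuming the Microsoft x64 convention (rsp is a multiple of 16 just before a CALL, so it is 8 off on entry). Helper and Caller are hypothetical names used only for this illustration.

```asm
; ml64 sketch of "push a register and you are aligned to 16".
.code

Helper proc                  ; hypothetical callee, does nothing
    ret
Helper endp

Caller proc
                             ; on entry rsp = 16n + 8 (CALL pushed the return address)
    push rbx                 ; rsp = 16n: aligned again, and rbx is preserved
    sub  rsp, 20h            ; 32-byte shadow space; a multiple of 16 keeps the alignment
    call Helper
    add  rsp, 20h
    pop  rbx
    ret
Caller endp

end
```

An odd number of 8-byte pushes restores 16-byte alignment; an even number needs an extra 8 bytes added to the SUB.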
Title: Re: Succesive Writes vs PUSHes
Post by: coder on June 13, 2017, 08:46:49 PM
Now my tests say 7 successive writes. Beyond that, you're better off using PUSHes all the way. The speed difference is significant.


 

Title: Re: Succesive Writes vs PUSHes
Post by: jj2007 on June 13, 2017, 11:03:14 PM
Interesting test! I've extended it a little bit:
Code: [Select]
1*2*6
MOV  = 422 ticks
PUSH = 405 ticks

1*2*6
MOV  = 406 ticks
PUSH = 405 ticks

2*2*6
MOV  = 593 ticks
PUSH = 593 ticks

2*2*6
MOV  = 593 ticks
PUSH = 593 ticks

4*2*6
MOV  = 1107 ticks
PUSH = 1030 ticks

8*2*6
MOV  = 1872 ticks
PUSH = 1856 ticks

16*2*6
MOV  = 3635 ticks
PUSH = 3635 ticks

32*2*6
MOV  = 7301 ticks
PUSH = 7347 ticks

Results are remarkably similar. This is on a Core i5, I wonder whether there are any differences on older CPUs.

Pure Masm32 attached. I've modified the source a little bit so that it assembles on a standard Masm32 installation.
Title: Re: Succesive Writes vs PUSHes
Post by: TWell on June 13, 2017, 11:28:16 PM
AMD E1
Code: [Select]
1*2*6

MOV  = 1344 ticks
PUSH = 1328 ticks

1*2*6

MOV  = 1359 ticks
PUSH = 1375 ticks

2*2*6

MOV  = 1891 ticks
PUSH = 1843 ticks

2*2*6

MOV  = 1907 ticks
PUSH = 1828 ticks

4*2*6

MOV  = 3000 ticks
PUSH = 2906 ticks

8*2*6

MOV  = 5172 ticks
PUSH = 5000 ticks

16*2*6

MOV  = 9406 ticks
PUSH = 9172 ticks

32*2*6

MOV  = 19516 ticks
PUSH = 19234 ticks
Title: Re: Succesive Writes vs PUSHes
Post by: felipe on June 13, 2017, 11:57:17 PM
I did the test, but I wasn't able to copy the output  :(. The results were in the middle of the two above, with the final time around 10000 ticks. I got a Celeron  :redface:.
Title: Re: Succesive Writes vs PUSHes
Post by: aw27 on June 14, 2017, 02:45:44 AM
Quote
We must not use pushes on 64-bit because there is a need to keep the alignment (in many cases).
Of course you can. If you follow the ABI your proc starts with a stack alignment of 8. Push a register, bingo you are aligned to 16.

I mean that you must know exactly what you are doing to keep the pushes balanced with the alignment. Also, in many cases the alignment is not important even if you call a function. However, this is not a good programming practice, but we can always play with bad programming practices and see what happens and learn by trial and error. :lol:
Title: Re: Succesive Writes vs PUSHes
Post by: FORTRANS on June 14, 2017, 04:21:08 AM
Hi Jochen,

I wonder whether there are any differences on older CPUs.

Code: [Select]
P-III

1*2*6
MOV  = 3225 ticks
PUSH = 2864 ticks

1*2*6
MOV  = 3235 ticks
PUSH = 2914 ticks

2*2*6
MOV  = 4266 ticks
PUSH = 4707 ticks

2*2*6
MOV  = 4316 ticks
PUSH = 4426 ticks

4*2*6
MOV  = 7531 ticks
PUSH = 8262 ticks

8*2*6
MOV  = 11797 ticks
PUSH = 11627 ticks

16*2*6
MOV  = 20930 ticks
PUSH = 20119 ticks

32*2*6
MOV  = 40478 ticks
PUSH = 38896 ticks

P-MMX

A:\>timeit
1*2*6
MOV  = 14684 ticks
PUSH = 10795 ticks

1*2*6
MOV  = 12925 ticks
PUSH = 10785 ticks

2*2*6
MOV  = 18535 ticks
PUSH = 14505 ticks

2*2*6
MOV  = 18510 ticks
PUSH = 14495 ticks

4*2*6
MOV  = 26685 ticks
PUSH = 21895 ticks

8*2*6
MOV  = 48355 ticks
PUSH = 41935 ticks

16*2*6
MOV  = 85131 ticks
PUSH = 69705 ticks

32*2*6
MOV  = 159150 ticks
PUSH = 128936 ticks

Pentium M

1*2*6
MOV  = 1402 ticks
PUSH = 1232 ticks

1*2*6
MOV  = 1202 ticks
PUSH = 1212 ticks

2*2*6
MOV  = 1682 ticks
PUSH = 1713 ticks

2*2*6
MOV  = 1682 ticks
PUSH = 1722 ticks

4*2*6
MOV  = 2704 ticks
PUSH = 2644 ticks

8*2*6
MOV  = 4677 ticks
PUSH = 4867 ticks

16*2*6
MOV  = 8592 ticks
PUSH = 8552 ticks

32*2*6
MOV  = 16434 ticks
PUSH = 15893 ticks

   You should consider having the program output redirectable
to a file for ease of use.

HTH,

Steve N.
Title: Re: Succesive Writes vs PUSHes
Post by: jj2007 on June 14, 2017, 05:51:46 AM
You should consider having the program output redirectable to a file for ease of use.

Thanks, Steve. From a DOS prompt:
Code: [Select]
timeit >results.txt
results.txt
Title: Re: Succesive Writes vs PUSHes
Post by: FORTRANS on June 14, 2017, 06:37:56 AM
Hi,

   Well...

F:\TEMP\TEST>timeit > a:x

F:\TEMP\TEST>dir a: /od
 Volume in drive A is TRANSPORT
 Volume Serial Number is E6D5-D015

 Directory of A:\

{Snip}
13-06-17  01:13p                 1,765 TIMEIT.TXT
13-06-17  03:00p                 3,072 TIMEIT.EXE
13-06-17  03:20p                     0 x
              16 File(s)        404,460 bytes
               2 Dir(s)         608,256 bytes free


   That doesn't seem to work.  (And it does work for DIR > y and
most other programs.)  I had to cut and paste the results from
the screen.

Cheers,

Steve N.
Title: Re: Succesive Writes vs PUSHes
Post by: jj2007 on June 14, 2017, 08:22:20 AM
   That doesn't seem to work.  (And it does work for DIR > y and
most other programs.)  I had to cut and paste the results from
the screen.

Steve, I rarely use the timeit >results.txt version, but I wouldn't have suggested that without testing it OK, on Win7-64. Now I have also tested it on WinXP SP3 and Win 10.0.15063 - no problems. Which Windows version are you using?
Title: Re: Succesive Writes vs PUSHes
Post by: felipe on June 14, 2017, 11:45:54 AM
I did the test, but i wasn't able to copy the output  :(. The results were in the middle of the two above, with the final time around the 10000 ticks. I got a Celeron  :redface:.

Code: [Select]
1*2*6
MOV  = 1109 ticks
PUSH = 1031 ticks

1*2*6
MOV  = 1141 ticks
PUSH = 1000 ticks

2*2*6
MOV  = 1313 ticks
PUSH = 1265 ticks

2*2*6
MOV  = 1344 ticks
PUSH = 1250 ticks

4*2*6
MOV  = 1734 ticks
PUSH = 1875 ticks

8*2*6
MOV  = 2828 ticks
PUSH = 3157 ticks

16*2*6
MOV  = 5984 ticks
PUSH = 6578 ticks

32*2*6
MOV  = 10547 ticks


 :t
Title: Re: Succesive Writes vs PUSHes
Post by: hutch-- on June 14, 2017, 12:21:01 PM
The thing going against using push/pop in 64 bit is alignment. It may be a familiar technique from 32 bit and earlier but with the Microsoft ABI, you are then stuck with manual stack twiddling to get procedures to work. The 64 bit alignment of arguments written to the stack after using the first 4 registers in the ABI can handle BYTE, WORD, DWORD and QWORD without having to change the RSP stack pointer as each is written to a stack memory location that is 64 bit aligned.

It may be character building to play with manual stack adjustments just to get a procedure to run but if reliable code is the target, the last thing you want is pissing around aligning the stack just to get a procedure to run. For slightly more typing, if you create a local for each register you need to preserve and later restore and copy the content into the register on the way in and vice versa on the way out, you get direct register / memory writes both ways without messing up the stack. It looks like this.

LOCAL .r12 :QWORD
LOCAL .r13 :QWORD
LOCAL .r14 :QWORD
LOCAL .r15 :QWORD

mov .r12, r12
mov .r13, r13
mov .r14, r14
mov .r15, r15

; socket 2 'em

mov r12, .r12
mov r13, .r13
mov r14, .r14
mov r15, .r15

ret

Ah look mum, no stack twiddling.  :P


Title: Re: Succesive Writes vs PUSHes
Post by: sinsi on June 14, 2017, 01:49:34 PM
Don't forget that in 64-bit you can use the spill space for any register storage, not just RCX/RDX etc.
I have seen Windows APIs store RBX (and even XMM0) at [rsp+8].

My main bugbear with "mov [rsp+xx]" is the code size, e.g. CreateWindowEx has 12 parameters, using push cuts down on code size.
Not so much of a problem in 32-bit land.
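
To put rough numbers on the code-size point, here is a sketch with the instruction bytes in the comments. The byte sequences follow the standard x86-64 encodings and assume the assembler picks the shortest form; SizeDemo is a hypothetical name.

```asm
; ml64 sketch: encoded size of stack stores versus pushes.
.code

SizeDemo proc
    sub  rsp, 28h
    mov  [rsp], rax               ; 48 89 04 24                  4 bytes
    mov  [rsp+20h], rax           ; 48 89 44 24 20               5 bytes
    mov  [rsp+20h], r12           ; 4C 89 64 24 20               5 bytes
    mov  qword ptr [rsp+20h], 0   ; 48 C7 44 24 20 00 00 00 00   9 bytes
    push rax                      ; 50                           1 byte
    push r12                      ; 41 54                        2 bytes
    push 0                        ; 6A 00                        2 bytes
    add  rsp, 40h                 ; undo the 28h plus three 8-byte pushes
    ret
SizeDemo endp

end
```

Offsets above 7Fh no longer fit in a one-byte displacement, which makes each MOV three bytes longer still.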
Title: Re: Succesive Writes vs PUSHes
Post by: coder on June 14, 2017, 02:31:29 PM
Don't forget that in 64-bit you can use the spill space for any register storage, not just RCX/RDX etc.
I have seen Windows APIs store RBX (and even XMM0) at [rsp+8].

My main bugbear with "mov [rsp+xx]" is the code size, e.g. CreateWindowEx has 12 parameters, using push cuts down on code size.
Not so much of a problem in 32-bit land.

I am wondering where the segfaults are originating from in case of some random / unintentional stack misalignment upon entries. I have this suspicion that most 64-bit API's entry codes use SSE/AVX for internal maintenance. I can't think of anything else that could trigger such faults.
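
One plausible source of such faults, sketched under the assumption that the callee saves a non-volatile XMM register with an aligned store: MOVAPS raises a general-protection fault when its memory operand is not 16-byte aligned. SaveXmm is a hypothetical procedure, not code taken from any Windows API.

```asm
; ml64 sketch: an aligned SSE store that depends on the caller's stack alignment.
.code

SaveXmm proc
    sub    rsp, 18h                     ; entry rsp = 16n + 8, so rsp is 16n' here
    movaps xmmword ptr [rsp], xmm6      ; faults if the caller entered with a misaligned stack
    ; ... body that may clobber xmm6 ...
    movaps xmm6, xmmword ptr [rsp]      ; restore the non-volatile register
    add    rsp, 18h
    ret
SaveXmm endp

end
```

If the caller had pushed one register too many before the CALL, rsp would be 8 off at the MOVAPS and the store would fault.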

 
Title: Re: Succesive Writes vs PUSHes
Post by: coder on June 14, 2017, 02:44:04 PM
Thanks people for testing it out on your machines.
I was actually investigating the 64-bit codes but forgot that I was on a 32-bit Win 7 the last session, hence the 32-bit attachment as a temporary proof-of-concept code.

My 64-bit AMD PC (64-bit code) shows that after 7 back-to-back writes, MOV performs far worse than single-byte PUSH in terms of speed. Maybe you can investigate more and share your findings. Thanks.
 

 
 
Title: Re: Succesive Writes vs PUSHes
Post by: coder on June 14, 2017, 02:45:53 PM
The thing going against using push/pop in 64 bit is alignment. It may be a familiar technique from 32 bit and earlier but with the Microsoft ABI, you are then stuck with manual stack twiddling to get procedures to work. The 64 bit alignment of arguments written to the stack after using the first 4 registers in the ABI can handle BYTE, WORD, DWORD and QWORD without having to change the RSP stack pointer as each is written to a stack memory location that is 64 bit aligned.

It may be character building to play with manual stack adjustments just to get a procedure to run but if reliable code is the target, the last thing you want is pissing around aligning the stack just to get a procedure to run. For slightly more typing, if you create a local for each register you need to preserve and later restore and copy the content into the register on the way in and vice versa on the way out, you get direct register / memory writes both ways without messing up the stack. It looks like this.

LOCAL .r12 :QWORD
LOCAL .r13 :QWORD
LOCAL .r14 :QWORD
LOCAL .r15 :QWORD

mov .r12, r12
mov .r13, r13
mov .r14, r14
mov .r15, r15

; socket 2 'em

mov r12, .r12
mov r13, .r13
mov r14, .r14
mov r15, .r15

ret

Ah look mum, no stack twiddling.  :P

 :t
Title: Re: Succesive Writes vs PUSHes
Post by: coder on June 14, 2017, 03:39:54 PM
Since I am on Linux server right now, can't write MASM code. Here's a quick equivalent test code (64-bit) in NASM syntax for Linux.

A colleague helped me test it on a different PC, with completely different results: MOV is faster all the way, no matter how many writes are performed, compared to PUSHes. I don't know...

Code: [Select]
admin@mint ~/nasm $ ./time64
PUSH = 425
MOV  = 284
admin@mint ~/nasm $ ./time64
PUSH = 423
MOV  = 282
admin@mint ~/nasm $ ./time64
PUSH = 426
MOV  = 281
admin@mint ~/nasm $ ./time64
PUSH = 419
MOV  = 284
admin@mint ~/nasm $ ./time64
PUSH = 426
MOV  = 282
admin@mint ~/nasm $ ./time64
PUSH = 423
MOV  = 284
Title: Re: Succesive Writes vs PUSHes
Post by: hutch-- on June 14, 2017, 04:29:31 PM
> MOV is faster all the way doesn't matter how many writes are performed if compared to PUSHes.

This is normally the case, "mov" just transfers data where "push" must transfer data AND update the stack pointer.
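
The two pieces of work can be written out explicitly. This is only an architectural sketch of what PUSH and POP do; it says nothing about how a particular CPU executes them internally. PushExpanded is a hypothetical name.

```asm
; ml64 sketch: PUSH/POP next to their two-step equivalents.
.code

PushExpanded proc
    push rax                 ; one instruction: rsp = rsp - 8, then store rax at [rsp]
    pop  rax                 ; one instruction: load rax from [rsp], then rsp = rsp + 8

    lea  rsp, [rsp-8]        ; same effect in two steps; LEA leaves the flags alone, like PUSH
    mov  [rsp], rax
    mov  rax, [rsp]          ; POP expanded the same way
    lea  rsp, [rsp+8]
    ret
PushExpanded endp

end
```

With a block of MOVs the pointer adjustment is paid once by a single SUB, while a run of PUSHes adjusts rsp on every instruction.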
Title: Re: Succesive Writes vs PUSHes
Post by: jj2007 on June 14, 2017, 05:35:56 PM
on different PC, with completely different results

Without indicating which CPU was used, this is not a very interesting result. What counts are CPU and the type of memory used.

"mov" just transfers data where "push" must transfer data AND update the stack pointer.

This is also true for rep stosd. Real-world results depend on how the CPU handles that task. With the exception of that Linux result, every case so far has shown that push and mov are equally fast. Since the bottleneck is memory access, and not the adjustment of an internal CPU register that is probably handled in parallel, any other outcome would have been surprising.
All other
Title: Re: Succesive Writes vs PUSHes
Post by: FORTRANS on June 15, 2017, 01:23:11 AM
Hi Jochen,

Steve, I rarely use the timeit >results.txt version, but I wouldn't have suggested that without testing it OK, on Win7-64. Now I have also tested it on WinXP SP3 and Win 10.0.15063 - no problems. Which Windows version are you using?

   Well I tested it with Windows 2000.  I went and tested it on
Windows 98, where it also failed.  I tested it with Windows XP
and it does work as expected.

   I tried reassembling it, as that helps sometimes.  But no change
in results this time.  Executable files are the same size.  But a
file compare program shows differences.  ML 6.14 and Polink
were used.

Regards,

Steve N.
Title: Re: Succesive Writes vs PUSHes
Post by: coder on June 15, 2017, 03:33:42 PM
> MOV is faster all the way doesn't matter how many writes are performed if compared to PUSHes.

This is normally the case, "mov" just transfers data where "push" must transfer data AND update the stack pointer.
I think PUSH does more than that. Looking at the PUSH definition, I see lots of other chores involved. It is worse with POP. The microcode involved is really not necessary. OTOH, MOV can take full advantage of out-of-order execution. This is especially true in 64-bit, hence I see major players coming up with their own calling conventions that avoid PUSH/POP or anything that moves the stack. But I have mixed results with 32-bit.

Title: Re: Succesive Writes vs PUSHes
Post by: hutch-- on June 15, 2017, 04:33:10 PM
The only real problem with Win32 is the lack of registers in comparison to Win64. If you don't mind non-standard calling techniques and the argument count can fit into 3 registers, eax, ecx and edx, you can set up a low calling overhead version of fastcall that frees you of all the overhead associated with push/call notation. If you still run a stack frame you can use locals for any extra registers you need to preserve but, as usual, you have to weigh the advantage of a lower overhead calling system, as it becomes a point of diminishing returns as the procedure you call gets longer.

Once you call an operating system function, any advantage goes out the window and the real gain with low calling overhead techniques is with very short leaf procedures where you can actually see a speed gain by dropping the calling overhead. It is also the case if you don't mind cluttering the caller code that inlining a small algo is faster again.
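
A minimal sketch of such a register-only call in 32-bit MASM. Sum3 and the eax/ecx/edx argument order are assumptions made for illustration; this is a private convention, not the compiler's standard fastcall.

```asm
; 32-bit MASM sketch: arguments in eax, ecx, edx, result in eax.
.686
.model flat, stdcall
option casemap:none

.code

Sum3 proc                    ; hypothetical leaf procedure: no arguments declared,
    add  eax, ecx            ; so MASM emits no stack frame for it
    add  eax, edx
    ret
Sum3 endp

start:
    mov  eax, 1
    mov  ecx, 2
    mov  edx, 3
    call Sum3                ; eax = 6, with no PUSH of arguments
    ret

end start
```

The only stack traffic left is the return address written by CALL, which is why the gain shows up mainly with very short leaf procedures.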

It may be worth the effort to look up the effects of running 32 bit code on a 64 bit processor as it is run in "legacy" mode and may affect the efficiency of the 32 bit code.
Title: Re: Succesive Writes vs PUSHes
Post by: jj2007 on June 15, 2017, 05:08:36 PM
Looking at the PUSH definition, I see lots of other chores involved. Worse with POP.

More precisely? And how is that different to rep movsd, the fastest way to copy values mem to mem?
Title: Re: Succesive Writes vs PUSHes
Post by: coder on June 15, 2017, 05:49:16 PM
Looking at the PUSH definition, I see lots of other chores involved. Worse with POP.

More precisely? And how is that different to rep movsd, the fastest way to copy values mem to mem?
We are talking about the stack in a much more general sense, not specific or selected processors. Whether a processor implements dedicated circuitry to make certain instructions fast (like rep movsd) is really not my concern at this moment.
Title: Re: Succesive Writes vs PUSHes
Post by: coder on June 15, 2017, 05:59:36 PM
It may be worth the effort to look up the effects of running 32 bit code on a 64 bit processor as it is run in "legacy" mode and may effect the efficiency of the 32 bit code.
Many people still don't get the idea that running legacy CISC-inspired instructions doesn't work quite well on modern 64-bit CPUs. Or the manufacturers have to dedicate special implementations (circuitry/techniques) to such legacy instructions so that they are at least on par with RISC-based instructions in terms of speed and power consumption.




Title: Re: Succesive Writes vs PUSHes
Post by: jj2007 on June 15, 2017, 06:02:03 PM
Quote
We are talking about stack in much more general sense, not specific or selected processors. Whether a processor implements a dedicated circuitry to make certain instructions fast (like rep movsd), is really not my concern at this moment.

Thank you, that explains it all, of course. And why do our results show that push/pop and mov are equally fast?
Title: Re: Succesive Writes vs PUSHes
Post by: coder on June 15, 2017, 06:18:01 PM
Quote
We are talking about stack in much more general sense, not specific or selected processors. Whether a processor implements a dedicated circuitry to make certain instructions fast (like rep movsd), is really not my concern at this moment.

Thank you, that explains it all, of course. And why do our results show that push/pop and mov are equally fast?
The codes are not equal. Your modified test code is more complicated and higher-level than it should be. When I asked for help with confirmation, I meant that the same code should be run under the same settings on different CPUs; otherwise the results are invalid. Secondly, certain processors do apply special implementations when dealing with the stack, for example LSD (look it up), stack engines etc. The purpose, however, is not to make it fast, but rather to put it on par with simple RISC-based instructions. Your results show all that: pound-for-pound, a 1-byte PUSH has some difficulty catching up with a multi-byte MOV.