Successive Writes vs PUSHes

Started by coder, June 13, 2017, 12:59:16 PM


coder

I wonder how many successive memory writes stay fast, compared to the same number of PUSHes, before performance hits a wall. For example:

sub rsp,20h
mov [rsp],reg-a
...
mov [rsp+n*8],reg-n


vs

push reg-a
...
push reg-n


My test code puts the limit at 6 writes; past that point the MOV version becomes much slower than the equivalent run of successive PUSHes.

Can we confirm this?


coder

Sorry, I thought I was on 64-bit. This test code is for 32-bit Windows. Add more MOVs and PUSHes to test beyond 7.
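
For concreteness, the two variants being compared could look like the 32-bit MASM sketch below. This is not the attached test code: the registers, the count of three stores, the file name and the build commands are all assumptions chosen for illustration, and there is no timing loop.

```asm
; Assumed build with a MASM32-style toolchain (hypothetical file name):
;   ml /c /coff movpush.asm
;   link /subsystem:console movpush.obj kernel32.lib
.386
.model flat, stdcall
option casemap:none

ExitProcess proto :dword        ; exported by kernel32

.code
start:
    ; Variant A: adjust ESP once, then store each register with MOV
    sub  esp, 12                ; room for three dwords
    mov  [esp+8], eax
    mov  [esp+4], ebx
    mov  [esp], ecx
    add  esp, 12                ; discard the three slots

    ; Variant B: three PUSHes leave the same stack layout
    push eax                    ; ends up at [esp+8]
    push ebx                    ; ends up at [esp+4]
    push ecx                    ; ends up at [esp]
    add  esp, 12                ; discard the three slots

    invoke ExitProcess, 0
end start
```

To test a longer run, extend both variants by the same number of registers and adjust the SUB/ADD constants to match.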




aw27

Quote from: coder on June 13, 2017, 01:53:33 PM
Sorry, I thought I was on 64-bit. This test code is for 32-bit Windows. Add more MOVs and PUSHes to test beyond 7.
We must not use pushes on 64-bit because there is a need to keep the alignment (in many cases).

coder

Quote from: aw27 on June 13, 2017, 03:12:57 PM
We must not use pushes on 64-bit because there is a need to keep the alignment (in many cases).
Indeed, this has a lot to do with 64-bit assembly. For example, it may explain some of the peculiar design choices in the MS 64-bit ABI, microcode, RISC vs CISC and so on. I am not really into calling conventions and high-level stuff.

sinsi

Quote from: aw27 on June 13, 2017, 03:12:57 PM
We must not use pushes on 64-bit because there is a need to keep the alignment (in many cases).
Of course you can. If you follow the ABI, your proc starts with the stack 8 bytes off a 16-byte boundary (because of the return address). Push a register and, bingo, you are aligned to 16.
🍺🍺🍺
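
A minimal ml64 sketch of that point, assuming the Microsoft x64 convention (RSP is 8 bytes short of a 16-byte boundary on entry, and every callee is given 32 bytes of shadow space). The choice of RBX, the use of GetTickCount as the callee, the file name and the build commands are illustrative assumptions.

```asm
; Assumed build (hypothetical file name):
;   ml64 /c align.asm
;   link /subsystem:console /entry:main align.obj kernel32.lib
extrn GetTickCount:proc
extrn ExitProcess:proc

.code
main proc
    ; On entry RSP = 16n+8: the caller's CALL pushed an 8-byte return address.
    push rbx                ; RSP = 16n, so the stack is 16-byte aligned again
    sub  rsp, 20h           ; 32 bytes of shadow space; alignment is unchanged
    call GetTickCount       ; callee sees the alignment the ABI asks for
    xor  ecx, ecx           ; exit code 0
    call ExitProcess        ; does not return
main endp
end
```

With an even number of pushes the stack is back at 16n+8, so the reservation then needs an extra 8 bytes (for example sub rsp, 28h) before any call.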

coder

Now my tests say 7 successive writes. Beyond that, you're better off using PUSHes all the way. The speed difference is significant.





jj2007

Interesting test! I've extended it a little bit:

1*2*6
MOV  = 422 ticks
PUSH = 405 ticks

1*2*6
MOV  = 406 ticks
PUSH = 405 ticks

2*2*6
MOV  = 593 ticks
PUSH = 593 ticks

2*2*6
MOV  = 593 ticks
PUSH = 593 ticks

4*2*6
MOV  = 1107 ticks
PUSH = 1030 ticks

8*2*6
MOV  = 1872 ticks
PUSH = 1856 ticks

16*2*6
MOV  = 3635 ticks
PUSH = 3635 ticks

32*2*6
MOV  = 7301 ticks
PUSH = 7347 ticks


Results are remarkably similar. This is on a Core i5, I wonder whether there are any differences on older CPUs.

Pure Masm32 attached. I've modified the source a little bit so that it assembles on a standard Masm32 installation.

TWell

AMD E1

1*2*6

MOV  = 1344 ticks
PUSH = 1328 ticks

1*2*6

MOV  = 1359 ticks
PUSH = 1375 ticks

2*2*6

MOV  = 1891 ticks
PUSH = 1843 ticks

2*2*6

MOV  = 1907 ticks
PUSH = 1828 ticks

4*2*6

MOV  = 3000 ticks
PUSH = 2906 ticks

8*2*6

MOV  = 5172 ticks
PUSH = 5000 ticks

16*2*6

MOV  = 9406 ticks
PUSH = 9172 ticks

32*2*6

MOV  = 19516 ticks
PUSH = 19234 ticks

felipe

I ran the test, but I wasn't able to copy the output  :(. The results fell between the two sets above, with the final time around 10000 ticks. I have a Celeron  :redface:.

aw27

Quote from: sinsi on June 13, 2017, 08:28:15 PM
Quote from: aw27 on June 13, 2017, 03:12:57 PM
We must not use pushes on 64-bit because there is a need to keep the alignment (in many cases).
Of course you can. If you follow the ABI, your proc starts with the stack 8 bytes off a 16-byte boundary (because of the return address). Push a register and, bingo, you are aligned to 16.
I mean that you must know exactly what you are doing to keep the pushes balanced with the alignment. Also, in many cases the alignment does not matter even if you call a function. Relying on that is not good programming practice, but we can always play with bad practices, see what happens, and learn by trial and error. :lol:

FORTRANS

Hi Jochen,

Quote from: jj2007 on June 13, 2017, 11:03:14 PM
I wonder whether there are any differences on older CPUs.

P-III

1*2*6
MOV  = 3225 ticks
PUSH = 2864 ticks

1*2*6
MOV  = 3235 ticks
PUSH = 2914 ticks

2*2*6
MOV  = 4266 ticks
PUSH = 4707 ticks

2*2*6
MOV  = 4316 ticks
PUSH = 4426 ticks

4*2*6
MOV  = 7531 ticks
PUSH = 8262 ticks

8*2*6
MOV  = 11797 ticks
PUSH = 11627 ticks

16*2*6
MOV  = 20930 ticks
PUSH = 20119 ticks

32*2*6
MOV  = 40478 ticks
PUSH = 38896 ticks

P-MMX

A:\>timeit
1*2*6
MOV  = 14684 ticks
PUSH = 10795 ticks

1*2*6
MOV  = 12925 ticks
PUSH = 10785 ticks

2*2*6
MOV  = 18535 ticks
PUSH = 14505 ticks

2*2*6
MOV  = 18510 ticks
PUSH = 14495 ticks

4*2*6
MOV  = 26685 ticks
PUSH = 21895 ticks

8*2*6
MOV  = 48355 ticks
PUSH = 41935 ticks

16*2*6
MOV  = 85131 ticks
PUSH = 69705 ticks

32*2*6
MOV  = 159150 ticks
PUSH = 128936 ticks

Pentium M

1*2*6
MOV  = 1402 ticks
PUSH = 1232 ticks

1*2*6
MOV  = 1202 ticks
PUSH = 1212 ticks

2*2*6
MOV  = 1682 ticks
PUSH = 1713 ticks

2*2*6
MOV  = 1682 ticks
PUSH = 1722 ticks

4*2*6
MOV  = 2704 ticks
PUSH = 2644 ticks

8*2*6
MOV  = 4677 ticks
PUSH = 4867 ticks

16*2*6
MOV  = 8592 ticks
PUSH = 8552 ticks

32*2*6
MOV  = 16434 ticks
PUSH = 15893 ticks


   You should consider having the program output redirectable
to a file for ease of use.

HTH,

Steve N.

jj2007

Quote from: FORTRANS on June 14, 2017, 04:21:08 AM
You should consider having the program output redirectable to a file for ease of use.

Thanks, Steve. From a DOS prompt:
timeit >results.txt
results.txt

FORTRANS

Hi,

   Well...

F:\TEMP\TEST>timeit > a:x

F:\TEMP\TEST>dir a: /od
Volume in drive A is TRANSPORT
Volume Serial Number is E6D5-D015

Directory of A:\

{Snip}
13-06-17  01:13p                 1,765 TIMEIT.TXT
13-06-17  03:00p                 3,072 TIMEIT.EXE
13-06-17  03:20p                     0 x
              16 File(s)        404,460 bytes
               2 Dir(s)         608,256 bytes free


   That doesn't seem to work.  (And it does work for DIR > y and
most other programs.)  I had to cut and paste the results from
the screen.

Cheers,

Steve N.

jj2007

Quote from: FORTRANS on June 14, 2017, 06:37:56 AM
   That doesn't seem to work.  (And it does work for DIR > y and
most other programs.)  I had to cut and paste the results from
the screen.

Steve, I rarely use the timeit >results.txt version, but I wouldn't have suggested it without first checking that it works, which I did on Win7-64. Now I have also tested it on WinXP SP3 and Win 10.0.15063 with no problems. Which Windows version are you using?

felipe

Quote from: felipe on June 13, 2017, 11:57:17 PM
I ran the test, but I wasn't able to copy the output  :(. The results fell between the two sets above, with the final time around 10000 ticks. I have a Celeron  :redface:.


1*2*6
MOV  = 1109 ticks
PUSH = 1031 ticks

1*2*6
MOV  = 1141 ticks
PUSH = 1000 ticks

2*2*6
MOV  = 1313 ticks
PUSH = 1265 ticks

2*2*6
MOV  = 1344 ticks
PUSH = 1250 ticks

4*2*6
MOV  = 1734 ticks
PUSH = 1875 ticks

8*2*6
MOV  = 2828 ticks
PUSH = 3157 ticks

16*2*6
MOV  = 5984 ticks
PUSH = 6578 ticks

32*2*6
MOV  = 10547 ticks



:t