push/pop vs mov?

Started by alikim, September 10, 2017, 10:57:40 PM


alikim

I'm looking at the disassembly of a program and I see a lot of instruction pairs like this (with nothing in between):
push 08
pop eax
What is the reason behind that? Is it more efficient than mov eax, 08?

hutch--

The push/pop pair is slower. It's useful for memory-to-memory copies in non-critical code, but a direct register write is faster for a couple of reasons: it's a single instruction, and it skips the two stack accesses that push/pop needs.

jj2007

What you see in the disassembly is the result of the m2m macro. It is indeed slower:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
71 ms for push 8, pop eax
35 ms for mov eax, 8

74 ms for push 8, pop eax
33 ms for mov eax, 8

79 ms for push 8, pop eax
33 ms for mov eax, 8

79 ms for push 8, pop eax
33 ms for mov eax, 8

77 ms for push 8, pop eax
33 ms for mov eax, 8

79 ms for push 8, pop eax
34 ms for mov eax, 8

79 ms for push 8, pop eax
33 ms for mov eax, 8

79 ms for push 8, pop eax
33 ms for mov eax, 8

79 ms for push 8, pop eax
33 ms for mov eax, 8

77 ms for push 8, pop eax
33 ms for mov eax, 8

Results are for 100 Million iterations


So if you have code that does 100 million iterations in a loop, and you don't want to wait roughly 40 milliseconds longer, you should use the longer (in bytes) mov eax, 8 instruction. Testbed is attached.
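
The attached testbed is not reproduced in the thread. As a rough idea of what such a comparison looks like, here is a minimal sketch. It assumes a standard MASM32 SDK install in \masm32 and a console build, and it uses GetTickCount for timing; the label and constant names are made up for illustration, and this is not the actual testbed.

```asm
include \masm32\include\masm32rt.inc   ; assumes the default MASM32 SDK path

ITERATIONS equ 100000000               ; 100 million, as in the results above

.code
start:
    ; --- time push 8 / pop eax ---
    invoke GetTickCount
    mov ebx, eax                       ; ebx = start time (survives API calls)
    mov ecx, ITERATIONS
LoopPushPop:
    push 8
    pop eax
    dec ecx
    jnz LoopPushPop
    invoke GetTickCount
    sub eax, ebx                       ; eax = elapsed milliseconds
    print str$(eax)
    print " ms for push 8, pop eax", 13, 10

    ; --- time mov eax, 8 ---
    invoke GetTickCount
    mov ebx, eax
    mov ecx, ITERATIONS
LoopMov:
    mov eax, 8
    dec ecx
    jnz LoopMov
    invoke GetTickCount
    sub eax, ebx
    print str$(eax)
    print " ms for mov eax, 8", 13, 10

    invoke ExitProcess, 0
end start
```

GetTickCount typically has a resolution of only 10 to 16 ms, so the numbers are coarse, and the loop overhead (dec/jnz) is included in both measurements.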

jimg

Apparently, it depends upon the computer, system, and other factors:

Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
25 ms for push 8, pop eax
24 ms for mov eax, 8

24 ms for push 8, pop eax
24 ms for mov eax, 8

25 ms for push 8, pop eax
24 ms for mov eax, 8

24 ms for push 8, pop eax
24 ms for mov eax, 8

24 ms for push 8, pop eax
24 ms for mov eax, 8

24 ms for push 8, pop eax
24 ms for mov eax, 8

24 ms for push 8, pop eax
24 ms for mov eax, 8

24 ms for push 8, pop eax
24 ms for mov eax, 8

24 ms for push 8, pop eax
25 ms for mov eax, 8

25 ms for push 8, pop eax
25 ms for mov eax, 8

Results are for 100 Million iterations

jj2007

Interesting. The push/pop pair writes the value to the stack and reads it back, while mov eax, 8 touches no memory at all. And still the same timings!

Here is a version that tests m2m and mrm, too:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
69 ms for push 8, pop eax
36 ms for mov eax, 8
82 ms for m2m dest, src
36 ms for mrm dest, src

72 ms for push 8, pop eax
35 ms for mov eax, 8
77 ms for m2m dest, src
35 ms for mrm dest, src

70 ms for push 8, pop eax
34 ms for mov eax, 8
78 ms for m2m dest, src
35 ms for mrm dest, src

70 ms for push 8, pop eax
34 ms for mov eax, 8
80 ms for m2m dest, src
38 ms for mrm dest, src

70 ms for push 8, pop eax
34 ms for mov eax, 8
80 ms for m2m dest, src
36 ms for mrm dest, src

Results are for 100 Million iterations
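
For readers who don't know the two macros timed above: as far as I recall from the MASM32 SDK's macros.asm, m2m copies through the stack and mrm copies through EAX. The sketch below uses hypothetical names (my_m2m, my_mrm) and is written from memory, so check it against your own copy of macros.asm.

```asm
; Hypothetical stand-ins for the MASM32 m2m and mrm macros. The "my_" prefix
; is only there to avoid clashing with the SDK's own definitions.

my_m2m MACRO dest, src      ; copy via the stack, no register touched
    push src
    pop dest
ENDM

my_mrm MACRO dest, src      ; copy via EAX, which gets overwritten
    mov eax, src
    mov dest, eax
ENDM

; Usage, assuming two DWORD variables var1 and var2:
;   my_m2m var1, var2       ; expands to: push var2 / pop var1
;   my_mrm var1, var2       ; expands to: mov eax, var2 / mov var1, eax
```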

jimg


Since I got this machine last spring, all of my timing tests have been basically unusable. Windows 10 Pro.

Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
25 ms for push 8, pop eax
24 ms for mov eax, 8
48 ms for m2m dest, src
24 ms for mrm dest, src

24 ms for push 8, pop eax
24 ms for mov eax, 8
48 ms for m2m dest, src
25 ms for mrm dest, src

25 ms for push 8, pop eax
24 ms for mov eax, 8
49 ms for m2m dest, src
25 ms for mrm dest, src

24 ms for push 8, pop eax
24 ms for mov eax, 8
48 ms for m2m dest, src
24 ms for mrm dest, src

24 ms for push 8, pop eax
24 ms for mov eax, 8
48 ms for m2m dest, src
25 ms for mrm dest, src

Results are for 100 Million iterations

hutch--

The newer hardware keeps behaving differently from the older stuff. This 6-core Haswell I am using idles in Noddy mode at about 1.2 GHz and has to be woken up and brought to the 3.3 GHz it is supposed to top out at before it gives anything like reliable timings. The only solution I have ever found is to run the benchmark for longer, at least a second or so. It is not always easy to design a test, and running very tight loops on a narrow range of instructions can give you very unreliable results.

What I prefer to do is design a block that has a wider range of instructions, then in the middle of that block test the actual code you are interested in and clock the difference, if any. Some of the older distinctions don't work the same way any longer, but it is very specific to the processor model. LEA was a dud on a PIV, came good again on the Core2 series and is OK on both my old i7 and the current Haswell. It is very useful for simple multiplication by small constants.
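
To illustrate the LEA remark, here are a few common multiply-by-constant forms. This is only a sketch of the technique; whether it beats imul on a given processor has to be timed, as discussed above.

```asm
; Multiplying EAX by small constants with LEA (scale factors 2, 4 and 8 only).
; Unlike imul, LEA does not change the flags.

    lea eax, [eax+eax*2]    ; eax = eax * 3
    lea eax, [eax+eax*4]    ; eax = eax * 5
    lea eax, [eax+eax*8]    ; eax = eax * 9

; Constants outside that pattern need a second instruction, e.g. times 10:
    lea eax, [eax+eax*4]    ; eax * 5
    add eax, eax            ; * 2, giving eax * 10
```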

Siekmanski

This is also the case for the AVX execution unit.

Some high-end Intel processors are able to turn off the upper 128 bits of the 256 bit execution units in order to save power when they are not used.
It takes approximately 14 µs to power up this upper half after an idle period.
The throughput of 256-bit vector instructions is much lower during this warm-up period because the processor uses the lower 128-bit units twice to execute a 256-bit operation.
It is possible to make the 256-bit units warm up in advance by executing a dummy 256-bit instruction at a suitable time before the 256-bit unit is needed.
The upper half of the 256-bit units will be turned off again after approximately 675 µs of no 256-bit instructions.

This phenomenon is described in Agner Fog's manual 3: "The microarchitecture of Intel, AMD and VIA CPUs".
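
A warm-up along those lines might look like the sketch below. It assumes the CPU and OS support AVX (to be checked via CPUID/XGETBV beforehand) and an assembler that knows the AVX mnemonics; the choice of vaddps as the dummy instruction is mine, not something the manual prescribes.

```asm
; Sketch: trigger the 256-bit unit warm-up ahead of time.
    vxorps ymm0, ymm0, ymm0     ; give ymm0 a defined value (all zero)
    vaddps ymm0, ymm0, ymm0     ; dummy 256-bit operation to start the warm-up

    ; ... other work here, long enough to cover the warm-up period
    ;     (about 14 microseconds according to the figures above) ...

    ; ... the real 256-bit code runs here at full throughput ...

    vzeroupper                  ; usual cleanup before going back to non-AVX code
```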

Mikl__

It is probably about reducing the size of the program:
mov eax, 8 = B8 08 00 00 00 = 5 bytes
push 8 = 6A 08 = 2 bytes
pop eax = 58 = 1 byte
2 + 1 < 5, isn't it?

alikim

For a program with a small stack, the stack is probably stored in CPU cache, so push/pop shouldn't be much slower than mov.

felipe

Quote from: alikim on September 12, 2017, 03:06:09 PM
For a program with a small stack, the stack is probably stored in CPU cache, so push/pop shouldn't be much slower than mov.

This would depend on the way the program is executed, I guess, i.e. whether it has been in memory, for how long, how many times it was executed, or similar. But taking the size of the program into consideration can be important too.

jj2007

Quote from: felipe on September 13, 2017, 11:15:43 AM
This would depend on the way the program is executed, I guess, i.e. whether it has been in memory, for how long, how many times it was executed, or similar. But taking the size of the program into consideration can be important too.

For me, it's a mystery that push 8, pop eax can be as fast as mov eax, 8, because the former sequence requires two memory accesses. One could argue that the CPU data cache can be as fast as a register move, but still, there are two moves involved. Besides, at least theoretically, a thread on another core could read that memory at the very moment the 8 sits in that location, so somehow it must be available in real memory. Or does the other thread get it from cache, too?

Re size: it may not seem important, but every byte wasted pollutes the instruction cache. If you have a size-optimised subroutine of 63 bytes and you replace one m2m with a mov eax, imm, you are at 65 bytes, and (with 64-byte cache lines) the processor needs two cache lines instead of one. That may affect performance, but you will hardly be able to measure it with a hello world proggie. If the program gets huge, though, that programming style will have an impact on overall performance. In disassembled compiler code, I often see mov eax, 0 or mov eax, -1: lousy implementations; even the Intel manuals recommend xor reg, reg to zero a register. And or reg, -1 is shorter than mov reg, -1 but equally fast.
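
For reference, the short forms mentioned above next to the long ones, with their 32-bit encodings. The xor opcode byte can be 33 or 31 depending on the assembler; the length is the same either way.

```asm
    mov eax, 0      ; B8 00 00 00 00    5 bytes
    xor eax, eax    ; 33 C0             2 bytes, same result

    mov eax, -1     ; B8 FF FF FF FF    5 bytes
    or  eax, -1     ; 83 C8 FF          3 bytes, same result
```

One thing to keep in mind: xor and or both change the flags, while mov leaves them alone, so the short forms cannot be dropped in between a compare and the jump that uses it.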