So here is my next question ... I am implementing a Forth system (for fun and to learn MASM), so I need a data stack, which is separate from the return stack. In my implementation, it is just an small part of a big chunk of memory which I have allocated using HeapAlloc.
By convention, I have [ebx] pointing to the start of my allocated memory, and [esi] as the offset into that memory where the stack starts, so [ebx][esi] refers to the start of the stack. In my implementation, the stack pointer (depth counter) is the first cell, followed by the data. So to push a value onto the stack, I increment the depth counter and then place the value at [start+depth]. 32-bit stacks need to move 4 addresses at a time, hence the SHL ecx, 2.
Note that counted strings are really just like stacks, with 8-bit cells. The process of appending to the string is the same.
This implementation for PUSH And POP works, but I am sure that it can be done much more efficiently. Please advise ... thanks!
; ---------------------------------------------------------------------------------------------------------
sPush proc ; val in eax, stack start in ESI
push edx
push ecx
mov ecx, [ebx][esi] ; get the stack pointer
inc ecx ; increment it
mov [ebx][esi], ecx ; put it back
lea edx, [ebx][esi]
shl ecx, 2 ; this stack is DWORDs
mov [edx][ecx], eax
pop ecx
pop edx
ret
sPush endp
; ---------------------------------------------------------------------------------------------------------
sPop proc ; stack start in esi, returns TOS in eax
push edx
push ecx
mov ecx, [ebx][esi] ; get depth counter (stack pointer)
cmp ecx, 0 ; empty?
je isEmpty
shl ecx, 2 ; stack is DWORDs
lea edx, [ebx][esi]
mov eax, [edx][ecx] ; get the TOS
shr ecx, 2
dec ecx ; decrement stack pointer
mov [ebx][esi], ecx ; put SP back
pop ecx
pop edx
ret
isEmpty:
mov eax, 0 ; for now ... this should throw an exception or something
pop ecx
pop edx
ret
sPop endp
why not just use a single pointer
each time you add something, you add 4 to it
and - that pointer doesn't have to be in the allocated block
it's only a long pointer (dword) - put it in global memory to simplify the code
follow the x86 stack as an example
the ESP register always points to the last item added to the stack
when something is PUSH'ed, the stack pointer is adjusted and the new data goes at the address of the new pointer
when someing is POP'ed, the data is pointed to by ESP, and after the POP, ESP is adjusted
Here is a simple example implementing Dave's idea: use dpush and dpop like you would use push & pop:
include \masm32\include\masm32rt.inc ; plain Masm32 for the fans of pure assembler
dstack equ dword ptr [ebx]
dpush MACRO arg
LOCAL oa
add ebx, 4
oa = (opattr arg) AND 127
if (oa eq 36) or (oa eq 48)
mov dword ptr [ebx], arg
else
ifdifi <eax>, <arg>
mov eax, arg
endif
mov [ebx], eax
endif
ENDM
dpop MACRO arg
LOCAL oa
oa = (opattr arg) AND 127
if oa eq 48
mov arg, [ebx]
else
mov eax, [ebx]
ifdifi <eax>, <arg>
mov arg, eax
endif
endif
sub ebx, 4
ENDM
.data?
dataStack LABEL DWORD
ORG $(1000h-DWORD)
dd ?
somedw dd ?
.code
start:
mov ebx, offset dataStack
dpush 123
dpop eax
dpush eax
print str$(eax), " = eax", 13, 10
shl dstack, 1 ; datastack*2
dpop somedw
print str$(somedw), " = somedw", 13, 10
dpush chr$("Masm32 is simple")
dpop edx
inkey edx
exit
end start
Output:
123 = eax
246 = somedw
Masm32 is simple
Simpler & faster way for dpush and dpop, also, doesn't trash eax
dpush MACRO arg
xchg ebx, esp
push arg
xchg ebx, esp
ENDM
dpop MACRO arg
xchg ebx, esp
pop arg
xchg ebx, esp
ENDM
Quote from: rrr314159 on October 14, 2015, 11:03:37 AM
Simpler & faster way for dpush and dpop, also, doesn't trash eax
Two valid arguments :biggrin:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
192 cycles for 100 * dpush/rpop
1181 cycles for 100 * rpush/rpop
191 cycles for 100 * dpush/rpop
1182 cycles for 100 * rpush/rpop
186 cycles for 100 * dpush/rpop
1180 cycles for 100 * rpush/rpop
191 cycles for 100 * dpush/rpop
1293 cycles for 100 * rpush/rpop
192 cycles for 100 * dpush/rpop
1189 cycles for 100 * rpush/rpop
28 bytes for dpush/rpop
25 bytes for rpush/rpop
XCHG is an instruction you usually try and avoid as it is slow. It wins size wise but at the price of performance if that is what matters.
@jj,
True should have checked the timing (instead of just glancing at cycle counts). pop (in particular) takes longer than I noticed: according to masm32 help, pop takes 6 cycles to mem. A mov is only 1 cycle. Of course 2 of them are needed to equal "pop" using eax, also an "add esp, 4", but still pop appears to be twice as slow as it ought to be. And, xchg takes 3. Even so, measured timing difference is even bigger than that, by a factor of 2! BTW I notice your test has no memory accesses, don't think it would change results.
Of course, it would make a big difference if you had to push/pop eax to preserve it - that might make the two methods about equal in speed.
Also, notice that lining them up (as in the timing test) makes the xchg's wait for each other, that wouldn't happen in normal usage; might make quite a difference ...
But it does make me think twice about casually throwing in push / pop pairs when convenient; apparently it's a lot faster to save regs to a memory location instead! Strange, someday I might investigate further; 2 lazy now.
[edit] re hutch comment, xchg is "slow" but still, seems pop (and push) are surprisingly slow. Also xchg is very slow when going to memory (because it locks it) but reg - reg only 3 cycles (in 486)
i think you'll find that XCHG reg32-reg32 is fairly fast (probably a 1-count)
if you XCHG reg-mem, then it's quite sluggish - it has an implied bus LOCK, and it's a read + write, too
size-wise, it's generally the same as MOV
with the exception - if it's XCHG EAX,reg32 - that's a one-byte
(NOP is actually XCHG EAX,EAX)
If you are worried about stack costs, if you have 3 or less arguments to pass to a procedure, put them in registers EAX, ECX & EDX, then forget the stack if you don't need other registers.
Thanks folks. I think I was being too "all-purpose" and trying to handle both my data stack and other stacks at the same time.
But i have decided to keep it simple and direct. Now it looks like this:
; ---------------------------------------------------------------------------------------------------------
fPush proc ; val in eax, esi ALWAYS points to TOS
add esi, 4 ; values on the stack are 32 bits
mov [esi], eax
ret
fPush endp
; ---------------------------------------------------------------------------------------------------------
fPop proc ; esi ALWAYS points to TOS, returns TOS in eax
cmp pDataStack, esi
jge isEmpty
mov eax, [esi]
sub esi, 4 ; values on the stack are 32 bits
ret
isEmpty: ; make sure it is setup correctly
xor eax, eax ; I should probably throw an exception or something down the road
mov esi, pDataStack
mov [esi], eax
ret
fPop endp
Quote from: rrr314159 on October 14, 2015, 01:58:51 PMBut it does make me think twice about casually throwing in push / pop pairs when convenient; apparently it's a lot faster to save regs to a memory location instead! Strange, someday I might investigate further; 2 lazy now.
Datz de problem with you lazy guys, always neglecting the beauties of speed contests in favour of long philosophical essays on the End of the World and why Wittgenstein was even wronger than Einstein :(
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
263 cycles for 100 * dpush/dpop with error check
1199 cycles for 100 * rpush/rpop
422 cycles for 100 * 5*push & pop, serial
370 cycles for 100 * 5*push & pop, interlaced
262 cycles for 100 * dpush/dpop with error check
1202 cycles for 100 * rpush/rpop
431 cycles for 100 * 5*push & pop, serial
380 cycles for 100 * 5*push & pop, interlaced
301 cycles for 100 * dpush/dpop with error check
1529 cycles for 100 * rpush/rpop
541 cycles for 100 * 5*push & pop, serial
378 cycles for 100 * 5*push & pop, interlaced
263 cycles for 100 * dpush/dpop with error check
1202 cycles for 100 * rpush/rpop
439 cycles for 100 * 5*push & pop, serial
376 cycles for 100 * 5*push & pop, interlaced
262 cycles for 100 * dpush/dpop with error check
1202 cycles for 100 * rpush/rpop
430 cycles for 100 * 5*push & pop, serial
375 cycles for 100 * 5*push & pop, interlaced
As you can clearly see,
pushpops are fast :greensml:
@OP: Sorry, we are in spamming mode here :biggrin:
Back on topic: Your solution is OK. Mine added a check for the data type passed, so you could really dpush and dpop whatever you needed. This does NOT add additional code, because the check is done at assembly time. As rrr rightly pointed out, eax would get trashed; here is a version that trashes edx instead (which is a
real trash register, unlike valuable eax ;)):
dpush MACRO arg
LOCAL oa
add ebx, 4
oa = (opattr arg) AND 127
if (oa eq 36) or (oa eq 48)
mov dword ptr [ebx], arg
else
ifdifi <edx>, <arg>
mov edx, arg
endif
mov [ebx], edx
endif
ENDM
dpop MACRO arg
LOCAL oa
oa = (opattr arg) AND 127
if oa eq 48
mov arg, [ebx]
else
mov edx, [ebx]
ifdifi <edx>, <arg>
mov arg, edx
endif
endif
sub ebx, 4
mov edx, @Line
cmp ebx, offset dataStack-4
jb dpopErr
ENDM
Full code attached, with error check. Here is a simple usage example:
include \masm32\include\masm32rt.inc
include dPushPop.asm ; see attachment
.data?
somedw dd ?
.code
start:
mov ebx, offset dataStack-4
dpush 123
dpop eax
dpush eax
print str$(eax), " = eax", 13, 10
shl dstack, 1 ; datastack*2
dpop somedw
print str$(somedw), " = somedw", 13, 10
dpush chr$("Masm32 is simple", 13, 10)
print dstack
dpop eax
print eax, "(that was the popped eax)", 13, 10
dpop ecx ; <<<<<<<<<< THAT WAS ONE TOO MUCH!!!
print str$(ecx), " = ecx popped in line 21- you won't see this message", 13, 10
exit
end start
Output:123 = eax
246 = somedw
Masm32 is simple
Masm32 is simple
(that was the popped eax)
P.S.: The error check makes it a bit slower and adds a few bytes of bloat. But you can easily switch it off with
dpErrCheck=0:
dpErrCheck=1
...
dpop MACRO arg
LOCAL oa
oa = (opattr arg) AND 127
if oa eq 48
mov arg, [ebx]
else
mov edx, [ebx]
ifdifi <edx>, <arg>
mov arg, edx
endif
endif
sub ebx, 4
if dpErrCheck
mov edx, @Line
cmp ebx, offset dataStack-4
jb dpopErr
endif
ENDM
If you enjoy being politically incorrect, store your registers in MMX registers then load them back into the normal registers when you are finished. :biggrin:
Quote from: hutch-- on October 14, 2015, 08:50:33 PM
If you enjoy being politically incorrect, store your registers in MMX registers then load them back into the normal registers when you are finished. :biggrin:
That is extremely politically incorrect, Hutch, and you know it :P
Trashing edx is much better than trashing eax, because you may need eax as returned from a WinApi call. Using MMX trashes the FPU: movd mm0, eax trashes ST0...ST3, and your previous values will be moved to ST4...ST7 (at least on my core i5, not sure if it is documented somewhere).
But movd
xmm0 could be an option:
movd xmm0, eax
movd eax, xmm0
8 bytes of bloat, though. Trashing edx seems more appropriate IMHO.
deleted
Quote
You may also use ESP directly
lea esp,[ebx+offset]
Now you can use PUSH and POP.
Dang ... that's pretty slick! But won't I need to restore ESP at some point? And how would I do that?
Quote
Same as ECX*4
mov [edx+ecx*4],eax
does this get expanded by the compiler into essentially the same thing as I have, or can it execute that directly?
RE moving in the reverse direction, I'm OK with that, so long as using that approach is more efficient. As you probably know, Forth is, at its heart, a stack machine, so managing the stack as efficiently as possible is probably the single largest performance concern.
deleted
Quote from: nidud on October 15, 2015, 03:20:25 AM
push ebp
mov ebp,esp
..
leave
or
mov esp,ebp
pop ebp
Added to testbed as npush/npop (without error check). I chose the leave variant because it is much faster - you can try the other one by setting useleave=0 in line 3. Results (cycles for 200 push/pop pairs):
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
270 cycles for 100 * dpush/dpop with error check
1221 cycles for 100 * rpush/rpop
1214 cycles for 100 * npush/npop
216 cycles for 100 * dpush/dpop without error check
271 cycles for 100 * dpush/dpop with error check
1222 cycles for 100 * rpush/rpop
1191 cycles for 100 * npush/npop
220 cycles for 100 * dpush/dpop without error check
271 cycles for 100 * dpush/dpop with error check
1220 cycles for 100 * rpush/rpop
1224 cycles for 100 * npush/npop
219 cycles for 100 * dpush/dpop without error check
272 cycles for 100 * dpush/dpop with error check
1257 cycles for 100 * rpush/rpop
1200 cycles for 100 * npush/npop
215 cycles for 100 * dpush/dpop without error check
270 cycles for 100 * dpush/dpop with error check
1218 cycles for 100 * rpush/rpop
1205 cycles for 100 * npush/npop
216 cycles for 100 * dpush/dpop without error check
54 bytes for dpush/dpop with error check
25 bytes for rpush/rpop
45 bytes for npush/npop
28 bytes for dpush/dpop without error check
Note also that those versions that push downwards (as in the normal push/pop via esp) cannot simply extend the stack with HeapReAlloc.
deleted
Quote from: nidud on October 15, 2015, 03:20:25 AMAlso keep in mind that you will have to restore the stack if you make a call to an external proc.
I see ::)
So npush & npop would only be applicable in areas of the program that don't use things like print, Msgbox, InString, len(), lstrcpy, val(), ...?
deleted
mov edi, AlgoLoops-1 ; loop e.g. 100x
push ebp
mov ebp,esp
mov esp, offset dataStack+1000h ; end of stack area
align 4
.Repeat
npush 123
npop eax
npush eax
npop ecx
dec edi
.Until Sign?
leave
ret
Shouldn't that be like that (for reasons of comparability)? See Reply #18 for possible reasons.
mov edi, AlgoLoops-1 ; loop e.g. 100x
align 4
.Repeat
push ebp
mov ebp,esp
mov esp, offset dataStack+1000h ; end of stack area
npush 123
npop eax
npush eax
npop ecx
leave
dec edi
.Until Sign?
leave
ret
deleted
Quote from: nidud on October 15, 2015, 05:30:22 AM
:biggrin:
You were thinking of something like this:
dpErrCheck=1 ; do some MasmBasic stuff here..
The code I posted is plain assembler, no MasmBasic in there. But my dpush/dpop allows to use print, InString, len(), lstrcpy, val(), ... and every imaginable WinAPI call in parallel. I doubt that CCurl can implement a FORTH machine without ever using the coder's bread-and-butter functions. Besides, my version is short and fast. But you know what? Assembler is learning by crashing, pardon: doing. Nobody is obliged to use somebody else's code, even if it's shorter and faster. Roll your own, homegrown code is the best :P