Stacks and counted strings

CCurl · October 14, 2015, 07:43:27 AM

So here is my next question ... I am implementing a Forth system (for fun and to learn MASM), so I need a data stack, which is separate from the return stack. In my implementation, it is just an small part of a big chunk of memory which I have allocated using HeapAlloc.

By convention, I have [ebx] pointing to the start of my allocated memory, and [esi] as the offset into that memory where the stack starts, so [ebx][esi] refers to the start of the stack. In my implementation, the stack pointer (depth counter) is the first cell, followed by the data. So to push a value onto the stack, I increment the depth counter and then place the value at [start+depth]. 32-bit stacks need to move 4 addresses at a time, hence the SHL ecx, 2.

Note that counted strings are really just like stacks, with 8-bit cells. The process of appending to the string is the same.

This implementation for PUSH And POP works, but I am sure that it can be done much more efficiently. Please advise ... thanks!

; ---------------------------------------------------------------------------------------------------------
sPush proc               ; val in eax, stack start in ESI
   push edx
   push ecx
   mov ecx, [ebx][esi]      ; get the stack pointer
   inc ecx            ; increment it
   mov [ebx][esi], ecx      ; put it back
   lea edx, [ebx][esi]
   shl ecx, 2            ; this stack is DWORDs
   mov [edx][ecx], eax
   pop ecx
   pop edx
   ret
sPush endp

; ---------------------------------------------------------------------------------------------------------
sPop proc               ; stack start in esi, returns TOS in eax
   push edx
   push ecx
   mov ecx, [ebx][esi]      ; get depth counter (stack pointer)
   cmp ecx, 0         ; empty?
   je isEmpty
   shl ecx, 2            ; stack is DWORDs
   lea edx, [ebx][esi]
   mov eax, [edx][ecx]      ; get the TOS
   shr ecx, 2
   dec ecx            ; decrement stack pointer
   mov [ebx][esi], ecx      ; put SP back
   pop ecx
   pop edx
   ret

isEmpty:
   mov eax, 0            ; for now ... this should throw an exception or something
   pop ecx
   pop edx
   ret
sPop endp

dedndave · October 14, 2015, 08:48:56 AM

why not just use a single pointer
each time you add something, you add 4 to it
and - that pointer doesn't have to be in the allocated block
it's only a long pointer (dword) - put it in global memory to simplify the code

follow the x86 stack as an example
the ESP register always points to the last item added to the stack
when something is PUSH'ed, the stack pointer is adjusted and the new data goes at the address of the new pointer
when someing is POP'ed, the data is pointed to by ESP, and after the POP, ESP is adjusted

jj2007 · October 14, 2015, 10:44:56 AM

Here is a simple example implementing Dave's idea: use dpush and dpop like you would use push & pop:

Code Select

include \masm32\include\masm32rt.inc	; plain Masm32 for the fans of pure assembler

dstack equ dword ptr [ebx]
dpush MACRO arg
LOCAL oa
  add ebx, 4
  oa = (opattr arg) AND 127
  if (oa eq 36) or (oa eq 48)
	mov dword ptr [ebx], arg
  else
	ifdifi <eax>, <arg>
		mov eax, arg
	endif
	mov [ebx], eax
  endif
ENDM

dpop MACRO arg
LOCAL oa
  oa = (opattr arg) AND 127
  if oa eq 48
	mov arg, [ebx]
  else
	mov eax, [ebx]
	ifdifi <eax>, <arg>
		mov arg, eax
	endif
  endif
  sub ebx, 4
ENDM

.data?
dataStack	LABEL DWORD
  ORG $(1000h-DWORD)
  dd ?
somedw	dd ?

.code
start:
  mov ebx, offset dataStack
  dpush 123
  dpop eax
  dpush eax
  print str$(eax), " = eax", 13, 10
  shl dstack, 1		; datastack*2
  dpop somedw
  print str$(somedw), " = somedw", 13, 10
  dpush chr$("Masm32 is simple")
  dpop edx
  inkey edx
  exit

end start

Output:

Code Select

123 = eax
246 = somedw
Masm32 is simple

rrr314159 · October 14, 2015, 11:03:37 AM

Simpler & faster way for dpush and dpop, also, doesn't trash eax

Code Select

dpush MACRO arg
    xchg ebx, esp
    push arg
    xchg ebx, esp

ENDM

dpop MACRO arg
    xchg ebx, esp
    pop arg
    xchg ebx, esp
ENDM

jj2007 · October 14, 2015, 01:06:16 PM

Quote from: rrr314159 on October 14, 2015, 11:03:37 AM
Simpler & faster way for dpush and dpop, also, doesn't trash eax

Two valid arguments

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
192     cycles for 100 * dpush/rpop
1181    cycles for 100 * rpush/rpop

191     cycles for 100 * dpush/rpop
1182    cycles for 100 * rpush/rpop

186     cycles for 100 * dpush/rpop
1180    cycles for 100 * rpush/rpop

191     cycles for 100 * dpush/rpop
1293    cycles for 100 * rpush/rpop

192     cycles for 100 * dpush/rpop
1189    cycles for 100 * rpush/rpop

28      bytes for dpush/rpop
25      bytes for rpush/rpop

hutch-- · October 14, 2015, 01:55:29 PM

XCHG is an instruction you usually try and avoid as it is slow. It wins size wise but at the price of performance if that is what matters.

rrr314159 · October 14, 2015, 01:58:51 PM

@jj,

True should have checked the timing (instead of just glancing at cycle counts). pop (in particular) takes longer than I noticed: according to masm32 help, pop takes 6 cycles to mem. A mov is only 1 cycle. Of course 2 of them are needed to equal "pop" using eax, also an "add esp, 4", but still pop appears to be twice as slow as it ought to be. And, xchg takes 3. Even so, measured timing difference is even bigger than that, by a factor of 2! BTW I notice your test has no memory accesses, don't think it would change results.

Of course, it would make a big difference if you had to push/pop eax to preserve it - that might make the two methods about equal in speed.

Also, notice that lining them up (as in the timing test) makes the xchg's wait for each other, that wouldn't happen in normal usage; might make quite a difference ...

But it does make me think twice about casually throwing in push / pop pairs when convenient; apparently it's a lot faster to save regs to a memory location instead! Strange, someday I might investigate further; 2 lazy now.

[edit] re hutch comment, xchg is "slow" but still, seems pop (and push) are surprisingly slow. Also xchg is very slow when going to memory (because it locks it) but reg - reg only 3 cycles (in 486)

dedndave · October 14, 2015, 02:34:38 PM

i think you'll find that XCHG reg32-reg32 is fairly fast (probably a 1-count)
if you XCHG reg-mem, then it's quite sluggish - it has an implied bus LOCK, and it's a read + write, too

size-wise, it's generally the same as MOV
with the exception - if it's XCHG EAX,reg32 - that's a one-byte
(NOP is actually XCHG EAX,EAX)

hutch-- · October 14, 2015, 02:55:27 PM

If you are worried about stack costs, if you have 3 or less arguments to pass to a procedure, put them in registers EAX, ECX & EDX, then forget the stack if you don't need other registers.

CCurl · October 14, 2015, 04:09:48 PM

Thanks folks. I think I was being too "all-purpose" and trying to handle both my data stack and other stacks at the same time.

But i have decided to keep it simple and direct. Now it looks like this:

; ---------------------------------------------------------------------------------------------------------
fPush proc               ; val in eax, esi ALWAYS points to TOS
   add esi, 4            ; values on the stack are 32 bits
   mov [esi], eax
   ret
fPush endp

; ---------------------------------------------------------------------------------------------------------
fPop proc               ; esi ALWAYS points to TOS, returns TOS in eax
   cmp pDataStack, esi
   jge isEmpty
   mov eax, [esi]
   sub esi, 4            ; values on the stack are 32 bits
   ret

isEmpty:               ; make sure it is setup correctly
   xor eax, eax         ; I should probably throw an exception or something down the road
   mov esi, pDataStack
   mov [esi], eax
   ret
fPop endp

jj2007 · October 14, 2015, 07:02:14 PM

Quote from: rrr314159 on October 14, 2015, 01:58:51 PMBut it does make me think twice about casually throwing in push / pop pairs when convenient; apparently it's a lot faster to save regs to a memory location instead! Strange, someday I might investigate further; 2 lazy now.

Datz de problem with you lazy guys, always neglecting the beauties of speed contests in favour of long philosophical essays on the End of the World and why Wittgenstein was even wronger than Einstein :(

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

263     cycles for 100 * dpush/dpop with error check
1199    cycles for 100 * rpush/rpop
422     cycles for 100 * 5*push & pop, serial
370     cycles for 100 * 5*push & pop, interlaced

262     cycles for 100 * dpush/dpop with error check
1202    cycles for 100 * rpush/rpop
431     cycles for 100 * 5*push & pop, serial
380     cycles for 100 * 5*push & pop, interlaced

301     cycles for 100 * dpush/dpop with error check
1529    cycles for 100 * rpush/rpop
541     cycles for 100 * 5*push & pop, serial
378     cycles for 100 * 5*push & pop, interlaced

263     cycles for 100 * dpush/dpop with error check
1202    cycles for 100 * rpush/rpop
439     cycles for 100 * 5*push & pop, serial
376     cycles for 100 * 5*push & pop, interlaced

262     cycles for 100 * dpush/dpop with error check
1202    cycles for 100 * rpush/rpop
430     cycles for 100 * 5*push & pop, serial
375     cycles for 100 * 5*push & pop, interlaced

As you can clearly see, pushpops are fast

@OP: Sorry, we are in spamming mode here

Back on topic: Your solution is OK. Mine added a check for the data type passed, so you could really dpush and dpop whatever you needed. This does NOT add additional code, because the check is done at assembly time. As rrr rightly pointed out, eax would get trashed; here is a version that trashes edx instead (which is a real trash register, unlike valuable eax ;)):

Code Select

dpush MACRO arg
LOCAL oa
  add ebx, 4
  oa = (opattr arg) AND 127
  if (oa eq 36) or (oa eq 48)
	mov dword ptr [ebx], arg
  else
	ifdifi <edx>, <arg>
		mov edx, arg
	endif
	mov [ebx], edx
  endif
ENDM

dpop MACRO arg
LOCAL oa
  oa = (opattr arg) AND 127
  if oa eq 48
	mov arg, [ebx]
  else
	mov edx, [ebx]
	ifdifi <edx>, <arg>
		mov arg, edx
	endif
  endif
  sub ebx, 4
  mov edx, @Line
  cmp ebx, offset dataStack-4
  jb dpopErr
ENDM

Full code attached, with error check. Here is a simple usage example:

Code Select

include \masm32\include\masm32rt.inc
include dPushPop.asm		; see attachment

.data?
somedw	dd ?

.code
start:
  mov ebx, offset dataStack-4
  dpush 123
  dpop eax
  dpush eax
  print str$(eax), " = eax", 13, 10
  shl dstack, 1		; datastack*2
  dpop somedw
  print str$(somedw), " = somedw", 13, 10
  dpush chr$("Masm32 is simple", 13, 10)
  print dstack
  dpop eax
  print eax, "(that was the popped eax)", 13, 10
  dpop ecx		; <<<<<<<<<< THAT WAS ONE TOO MUCH!!!
  print str$(ecx), " = ecx popped in line 21- you won't see this message", 13, 10
  exit

end start

Output:

Code Select

123 = eax
246 = somedw
Masm32 is simple
Masm32 is simple
(that was the popped eax)

P.S.: The error check makes it a bit slower and adds a few bytes of bloat. But you can easily switch it off with dpErrCheck=0:

Code Select

dpErrCheck=1
...
dpop MACRO arg
LOCAL oa
  oa = (opattr arg) AND 127
  if oa eq 48
	mov arg, [ebx]
  else
	mov edx, [ebx]
	ifdifi <edx>, <arg>
		mov arg, edx
	endif
  endif
  sub ebx, 4
  if dpErrCheck
	mov edx, @Line
	cmp ebx, offset dataStack-4
	jb dpopErr
  endif
ENDM

hutch-- · October 14, 2015, 08:50:33 PM

If you enjoy being politically incorrect, store your registers in MMX registers then load them back into the normal registers when you are finished.

jj2007 · October 14, 2015, 09:03:24 PM

Quote from: hutch-- on October 14, 2015, 08:50:33 PM
If you enjoy being politically incorrect, store your registers in MMX registers then load them back into the normal registers when you are finished.

That is extremely politically incorrect, Hutch, and you know it :P

Trashing edx is much better than trashing eax, because you may need eax as returned from a WinApi call. Using MMX trashes the FPU: movd mm0, eax trashes ST0...ST3, and your previous values will be moved to ST4...ST7 (at least on my core i5, not sure if it is documented somewhere).

But movd xmm0 could be an option:

Code Select

	movd xmm0, eax
	movd eax, xmm0

8 bytes of bloat, though. Trashing edx seems more appropriate IMHO.

nidud · October 14, 2015, 10:26:03 PM

deleted

CCurl · October 15, 2015, 02:42:37 AM

Quote
You may also use ESP directly
Code Select Expand
lea esp,[ebx+offset]

Now you can use PUSH and POP.

Dang ... that's pretty slick! But won't I need to restore ESP at some point? And how would I do that?

Quote
Same as ECX*4
Code Select Expand
mov [edx+ecx*4],eax

does this get expanded by the compiler into essentially the same thing as I have, or can it execute that directly?

RE moving in the reverse direction, I'm OK with that, so long as using that approach is more efficient. As you probably know, Forth is, at its heart, a stack machine, so managing the stack as efficiently as possible is probably the single largest performance concern.

The MASM Forum

News:

Stacks and counted strings

CCurl

dedndave

jj2007

rrr314159

jj2007

hutch--

rrr314159

dedndave

hutch--

CCurl

jj2007

hutch--

jj2007

nidud

CCurl