News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Stacks and counted strings

Started by CCurl, October 14, 2015, 07:43:27 AM

Previous topic - Next topic

CCurl

So here is my next question ... I am implementing a Forth system (for fun and to learn MASM), so I need a data stack, which is separate from the return stack. In my implementation, it is just an small part of a big chunk of memory which I have allocated using HeapAlloc.

By convention, I have [ebx] pointing to the start of my allocated memory, and [esi] as the offset into that memory where the stack starts, so [ebx][esi] refers to the start of the stack. In my implementation, the stack pointer (depth counter) is the first cell, followed by the data. So to push a value onto the stack, I increment the depth counter and then place the value at [start+depth]. 32-bit stacks need to move 4 addresses at a time, hence the SHL ecx, 2.

Note that counted strings are really just like stacks, with 8-bit cells. The process of appending to the string is the same.

This implementation for PUSH And POP works, but I am sure that it can be done much more efficiently. Please advise ... thanks!

; ---------------------------------------------------------------------------------------------------------
sPush proc               ; val in eax, stack start in ESI
   push edx
   push ecx
   mov  ecx, [ebx][esi]      ; get the stack pointer
   inc  ecx            ; increment it
   mov  [ebx][esi], ecx      ; put it back
   lea  edx, [ebx][esi]
   shl  ecx, 2            ; this stack is DWORDs
   mov  [edx][ecx], eax
   pop  ecx
   pop  edx
   ret
sPush endp

; ---------------------------------------------------------------------------------------------------------
sPop proc               ; stack start in esi, returns TOS in eax
   push edx
   push ecx
   mov  ecx, [ebx][esi]      ; get depth counter (stack pointer)
   cmp  ecx, 0         ; empty?
   je   isEmpty
   shl  ecx, 2            ; stack is DWORDs
   lea  edx, [ebx][esi]
   mov  eax, [edx][ecx]      ; get the TOS
   shr  ecx, 2
   dec  ecx            ; decrement stack pointer
   mov  [ebx][esi], ecx      ; put SP back
   pop  ecx
   pop  edx
   ret

isEmpty:
   mov  eax, 0            ; for now ... this should throw an exception or something
   pop  ecx
   pop  edx
   ret
sPop endp

dedndave

why not just use a single pointer
each time you add something, you add 4 to it
and - that pointer doesn't have to be in the allocated block
it's only a long pointer (dword) - put it in global memory to simplify the code

follow the x86 stack as an example
the ESP register always points to the last item added to the stack
when something is PUSH'ed, the stack pointer is adjusted and the new data goes at the address of the new pointer
when someing is POP'ed, the data is pointed to by ESP, and after the POP, ESP is adjusted

jj2007

Here is a simple example implementing Dave's idea: use dpush and dpop like you would use push & pop:

include \masm32\include\masm32rt.inc ; plain Masm32 for the fans of pure assembler

dstack equ dword ptr [ebx]
dpush MACRO arg
LOCAL oa
  add ebx, 4
  oa = (opattr arg) AND 127
  if (oa eq 36) or (oa eq 48)
mov dword ptr [ebx], arg
  else
ifdifi <eax>, <arg>
mov eax, arg
endif
mov [ebx], eax
  endif
ENDM

dpop MACRO arg
LOCAL oa
  oa = (opattr arg) AND 127
  if oa eq 48
mov arg, [ebx]
  else
mov eax, [ebx]
ifdifi <eax>, <arg>
mov arg, eax
endif
  endif
  sub ebx, 4
ENDM

.data?
dataStack LABEL DWORD
  ORG $(1000h-DWORD)
  dd ?
somedw dd ?

.code
start:
  mov ebx, offset dataStack
  dpush 123
  dpop eax
  dpush eax
  print str$(eax), " = eax", 13, 10
  shl dstack, 1 ; datastack*2
  dpop somedw
  print str$(somedw), " = somedw", 13, 10
  dpush chr$("Masm32 is simple")
  dpop edx
  inkey edx
  exit

end start


Output:
123 = eax
246 = somedw
Masm32 is simple

rrr314159

Simpler & faster way for dpush and dpop, also, doesn't trash eax

dpush MACRO arg
    xchg ebx, esp
    push arg
    xchg ebx, esp

ENDM

dpop MACRO arg
    xchg ebx, esp
    pop arg
    xchg ebx, esp
ENDM
I am NaN ;)

jj2007

Quote from: rrr314159 on October 14, 2015, 11:03:37 AM
Simpler & faster way for dpush and dpop, also, doesn't trash eax

Two valid arguments :biggrin:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
192     cycles for 100 * dpush/rpop
1181    cycles for 100 * rpush/rpop

191     cycles for 100 * dpush/rpop
1182    cycles for 100 * rpush/rpop

186     cycles for 100 * dpush/rpop
1180    cycles for 100 * rpush/rpop

191     cycles for 100 * dpush/rpop
1293    cycles for 100 * rpush/rpop

192     cycles for 100 * dpush/rpop
1189    cycles for 100 * rpush/rpop

28      bytes for dpush/rpop
25      bytes for rpush/rpop

hutch--

XCHG is an instruction you usually try and avoid as it is slow. It wins size wise but at the price of performance if that is what matters.

rrr314159

@jj,

True should have checked the timing (instead of just glancing at cycle counts). pop (in particular) takes longer than I noticed: according to masm32 help, pop takes 6 cycles to mem. A mov is only 1 cycle. Of course 2 of them are needed to equal "pop" using eax, also an "add esp, 4", but still pop appears to be twice as slow as it ought to be. And, xchg takes 3. Even so, measured timing difference is even bigger than that, by a factor of 2! BTW I notice your test has no memory accesses, don't think it would change results.

Of course, it would make a big difference if you had to push/pop eax to preserve it - that might make the two methods about equal in speed.

Also, notice that lining them up (as in the timing test) makes the xchg's wait for each other, that wouldn't happen in normal usage; might make quite a difference ...

But it does make me think twice about casually throwing in push / pop pairs when convenient; apparently it's a lot faster to save regs to a memory location instead! Strange, someday I might investigate further; 2 lazy now.

[edit] re hutch comment, xchg is "slow" but still, seems pop (and push) are surprisingly slow. Also xchg is very slow when going to memory (because it locks it) but reg - reg only 3 cycles (in 486)
I am NaN ;)

dedndave

i think you'll find that XCHG reg32-reg32 is fairly fast (probably a 1-count)
if you XCHG reg-mem, then it's quite sluggish - it has an implied bus LOCK, and it's a read + write, too

size-wise, it's generally the same as MOV
with the exception - if it's XCHG EAX,reg32 - that's a one-byte
(NOP is actually XCHG EAX,EAX)

hutch--

If you are worried about stack costs, if you have 3 or less arguments to pass to a procedure, put them in registers EAX, ECX & EDX, then forget the stack if you don't need other registers.

CCurl

Thanks folks. I think I was being too "all-purpose" and trying to handle both my data stack and other stacks at the same time.

But i have decided to keep it simple and direct. Now it looks like this:

; ---------------------------------------------------------------------------------------------------------
fPush proc               ; val in eax, esi ALWAYS points to TOS
   add  esi, 4            ; values on the stack are 32 bits
   mov  [esi], eax
   ret
fPush endp

; ---------------------------------------------------------------------------------------------------------
fPop proc               ; esi ALWAYS points to TOS, returns TOS in eax
   cmp  pDataStack, esi
   jge  isEmpty
   mov eax, [esi]
   sub  esi, 4            ; values on the stack are 32 bits
   ret

isEmpty:               ; make sure it is setup correctly
   xor  eax, eax         ; I should probably throw an exception or something down the road
   mov  esi, pDataStack
   mov  [esi], eax
   ret
fPop endp

jj2007

Quote from: rrr314159 on October 14, 2015, 01:58:51 PMBut it does make me think twice about casually throwing in push / pop pairs when convenient; apparently it's a lot faster to save regs to a memory location instead! Strange, someday I might investigate further; 2 lazy now.

Datz de problem with you lazy guys, always neglecting the beauties of speed contests in favour of long philosophical essays on the End of the World and why Wittgenstein was even wronger than Einstein :(

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

263     cycles for 100 * dpush/dpop with error check
1199    cycles for 100 * rpush/rpop
422     cycles for 100 * 5*push & pop, serial
370     cycles for 100 * 5*push & pop, interlaced

262     cycles for 100 * dpush/dpop with error check
1202    cycles for 100 * rpush/rpop
431     cycles for 100 * 5*push & pop, serial
380     cycles for 100 * 5*push & pop, interlaced

301     cycles for 100 * dpush/dpop with error check
1529    cycles for 100 * rpush/rpop
541     cycles for 100 * 5*push & pop, serial
378     cycles for 100 * 5*push & pop, interlaced

263     cycles for 100 * dpush/dpop with error check
1202    cycles for 100 * rpush/rpop
439     cycles for 100 * 5*push & pop, serial
376     cycles for 100 * 5*push & pop, interlaced

262     cycles for 100 * dpush/dpop with error check
1202    cycles for 100 * rpush/rpop
430     cycles for 100 * 5*push & pop, serial
375     cycles for 100 * 5*push & pop, interlaced


As you can clearly see, pushpops are fast :greensml:

@OP: Sorry, we are in spamming mode here :biggrin:

Back on topic: Your solution is OK. Mine added a check for the data type passed, so you could really dpush and dpop whatever you needed. This does NOT add additional code, because the check is done at assembly time. As rrr rightly pointed out, eax would get trashed; here is a version that trashes edx instead (which is a real trash register, unlike valuable eax ;)):
dpush MACRO arg
LOCAL oa
  add ebx, 4
  oa = (opattr arg) AND 127
  if (oa eq 36) or (oa eq 48)
mov dword ptr [ebx], arg
  else
ifdifi <edx>, <arg>
mov edx, arg
endif
mov [ebx], edx
  endif
ENDM

dpop MACRO arg
LOCAL oa
  oa = (opattr arg) AND 127
  if oa eq 48
mov arg, [ebx]
  else
mov edx, [ebx]
ifdifi <edx>, <arg>
mov arg, edx
endif
  endif
  sub ebx, 4
  mov edx, @Line
  cmp ebx, offset dataStack-4
  jb dpopErr
ENDM


Full code attached, with error check. Here is a simple usage example:
include \masm32\include\masm32rt.inc
include dPushPop.asm ; see attachment

.data?
somedw dd ?

.code
start:
  mov ebx, offset dataStack-4
  dpush 123
  dpop eax
  dpush eax
  print str$(eax), " = eax", 13, 10
  shl dstack, 1 ; datastack*2
  dpop somedw
  print str$(somedw), " = somedw", 13, 10
  dpush chr$("Masm32 is simple", 13, 10)
  print dstack
  dpop eax
  print eax, "(that was the popped eax)", 13, 10
  dpop ecx ; <<<<<<<<<< THAT WAS ONE TOO MUCH!!!
  print str$(ecx), " = ecx popped in line 21- you won't see this message", 13, 10
  exit

end start


Output:
123 = eax
246 = somedw
Masm32 is simple
Masm32 is simple
(that was the popped eax)


P.S.: The error check makes it a bit slower and adds a few bytes of bloat. But you can easily switch it off with dpErrCheck=0:
dpErrCheck=1
...
dpop MACRO arg
LOCAL oa
  oa = (opattr arg) AND 127
  if oa eq 48
mov arg, [ebx]
  else
mov edx, [ebx]
ifdifi <edx>, <arg>
mov arg, edx
endif
  endif
  sub ebx, 4
  if dpErrCheck
mov edx, @Line
cmp ebx, offset dataStack-4
jb dpopErr
  endif
ENDM

hutch--

If you enjoy being politically incorrect, store your registers in MMX registers then load them back into the normal registers when you are finished.  :biggrin:

jj2007

Quote from: hutch-- on October 14, 2015, 08:50:33 PM
If you enjoy being politically incorrect, store your registers in MMX registers then load them back into the normal registers when you are finished.  :biggrin:

That is extremely politically incorrect, Hutch, and you know it :P

Trashing edx is much better than trashing eax, because you may need eax as returned from a WinApi call. Using MMX trashes the FPU: movd mm0, eax trashes ST0...ST3, and your previous values will be moved to ST4...ST7 (at least on my core i5, not sure if it is documented somewhere).

But movd xmm0 could be an option:
movd xmm0, eax
movd eax, xmm0


8 bytes of bloat, though. Trashing edx seems more appropriate IMHO.

nidud

#13
deleted

CCurl

Quote
You may also use ESP directly

lea esp,[ebx+offset]


Now you can use PUSH and POP.
Dang ... that's pretty slick! But won't I need to restore ESP at some point? And how would I do that?
Quote
Same as ECX*4

mov [edx+ecx*4],eax

does this get expanded by the  compiler into essentially the same thing as I have, or can it execute that directly?

RE moving in the reverse direction, I'm OK with that, so long as using that approach is more efficient. As you probably know, Forth is, at its heart, a stack machine, so managing the stack as efficiently as possible is probably the single largest performance concern.