Parsing Text file in Assembly Language

Started by NoCforMe, July 07, 2023, 12:27:01 PM


NoCforMe

No, it would be kind of a big deal. Here's the thing: the tokenizer in the demo (not the parser) looks for all non-numeric chunks of text as "identifiers". This includes

  • variable names
  • register names
  • the tokens "INVOKE" and "ADDR"
none of which contain spaces.

To allow constructs like [RAX + RDX].Table + 2, the tokenizer would have to be expanded to cover expressions within square brackets and arithmetic expressions. Plus the parser would have to be able to follow these sequences. You can see that this is definitely a non-trivial process.

The tokens for that expression would be

  • left square bracket ('[')
  • register (RAX)
  • plus sign ('+')
  • register (RDX)
  • right square bracket (']')
  • period ('.')
  • ID ('Table')
  • plus sign ('+')
  • number ('2')
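
To make that token list concrete, here is a minimal sketch of such a tokenizer. It is written in Python rather than MASM purely for brevity, and it is not the demo's actual tokenizer: the token names, the REGISTERS subset and the tokenize() function are all assumptions made up for illustration.

```python
# Hypothetical token tables; the demo's real tables are not shown in this thread.
REGISTERS = {"RAX", "RBX", "RCX", "RDX"}   # assumed subset, for illustration only
KEYWORDS = {"INVOKE", "ADDR"}
PUNCTUATION = {"[": "lbracket", "]": "rbracket", "+": "plus",
               ".": "period", ",": "comma"}

def tokenize(text):
    """Split text into (kind, text) pairs, skipping whitespace."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch.isdigit():
            # A run of digits is one number token.
            j = i
            while j < len(text) and text[j].isdigit():
                j += 1
            tokens.append(("number", text[i:j]))
            i = j
        elif ch.isalpha() or ch == "_":
            # A run of letters, digits and underscores is one word.
            j = i
            while j < len(text) and (text[j].isalnum() or text[j] == "_"):
                j += 1
            word = text[i:j]
            if word.upper() in REGISTERS:
                kind = "register"
            elif word.upper() in KEYWORDS:
                kind = word.upper()
            else:
                kind = "ID"
            tokens.append((kind, word))
            i = j
        elif ch in PUNCTUATION:
            tokens.append((PUNCTUATION[ch], ch))
            i += 1
        else:
            raise ValueError("unexpected character " + repr(ch))
    return tokens

for kind, text in tokenize("[RAX + RDX].Table + 2"):
    print(kind, text)
```

Even this much only gets the expression split into pieces; the parser would still need extra nodes to accept bracket, plus and period tokens in the right order.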
Assembly language programming should be fun. That's why I do it.

lingo

Before: buffera db "Invoke someproc, [RDX ].Table + 12, [RAX + RBX + Table], [RAX+ RBX +Table ]",0,0,0,0

Result: buffera db "Invoke someproc,[RDX].Table+12,[RAX+RBX+Table],[RAX+RBX+Table]",0,0,0,0  :tongue:


.data
TableDividers db ",", ".", "+", "-", "*","/","(",")","[","]",0
buffera db "Invoke someproc, [RDX ].Table + 12, [RAX + RBX + Table], [RAX+ RBX +Table ]",0,0,0,0,0,0


.code
Parser proc

               LOCAL SaveRCX:QWORD, SaveRDX:QWORD, SaveRBX:QWORD
               mov   SaveRCX,rcx
               mov   SaveRDX,rdx
               mov   SaveRBX,rbx
               lea   rax,buffera           ; source
               mov   rbx,rax               ; dest
@@:
               mov   cl,byte ptr [rax]
               add   rax,1
               test  cl,cl
               je    ende
               cmp   cl,20h
               jne   @laba3
               cmp   byte ptr [rax],20h
               lea   rax,[rax+1]
               je    @b                    ; Skip more spaces
               mov   ch,byte ptr [rax-3]
               sub   rax,1
               lea   rdx,TableDividers
@laba1:
               cmp   ch,byte ptr [rdx]
               je    @b
               cmp   byte ptr [rdx],0      ; not found
               lea   rdx,[rdx+1]
               jne   @laba1
               mov   ch,byte ptr [rax]
               lea   rdx,TableDividers
@laba2:
               cmp   ch,byte ptr [rdx]
               je    @b
               cmp   byte ptr [rdx],0      ; not found
               lea   rdx,[rdx+1]
               jne   @laba2
@laba3:
               mov   byte ptr [rbx],cl
               add   rbx,1
               jmp   @b
ende:
               mov   dword ptr [rbx],0
               lea   rax,buffera

               mov   rcx,SaveRCX
               mov   rdx,SaveRDX
               mov   rbx,SaveRBX
               ret
Parser         endp
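
Judging by the Before/Result pair above, the intended effect of the routine is to drop any space that sits next to one of the TableDividers characters. A minimal Python sketch of that effect follows; it is an assumption about intent, not a line-by-line translation of the assembly (runs of several spaces, for instance, may be handled differently), and the squeeze() name is made up here.

```python
import re

DIVIDERS = ",.+-*/()[]"   # the same characters as TableDividers

def squeeze(text):
    """Remove whitespace on either side of a divider character."""
    pattern = r"\s*([" + re.escape(DIVIDERS) + r"])\s*"
    text = re.sub(pattern, r"\1", text)
    # Collapse whatever runs of spaces remain (between plain words) to one.
    return re.sub(r" {2,}", " ", text)

before = "Invoke someproc, [RDX ].Table + 12, [RAX + RBX + Table], [RAX+ RBX +Table ]"
print(squeeze(before))
```

The space between Invoke and someproc survives because neither neighbour is a divider.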


:tongue:
Quid sit futurum cras fuge quaerere.

NoCforMe

OK, well, maybe ... I can see it does work, but it's basically cheating, and isn't that only designed for that particular address expression?

Also, suggestion: some comments would help. I can't figure out from glancing at it just what the hell your code does.

But interesting. A+ for cleverness. (Oh, and I really like your animated avatar. Cuuute.)

NoCforMe

Getting back to my demo, for anyone who's interested in the guts of the thing, this is the data structure that drives the whole parsing process (after tokenization):


;***** The Parsing Sequence  *****
ParsingSequence LABEL $Pnode

_pn0 $Pnode <_pn1, $T_INVOKE, NULL>
DD -1

_pn1 $Pnode <_pn2, $T_ID, StoreFname>
DD -1

_pn2 $Pnode <_pn3, $T_comma, NULL>
DD -1

_pn3 $Pnode <_pn5, $T_ID, StoreArg>
$Pnode <_pn5, $T_number, StoreArg>
$Pnode <_pn5, $T_register, StoreArg>
$Pnode <_pn4, $T_ADDR, TagArgAsADDR>
DD -1

_pn4 $Pnode <_pn5, $T_ID, StoreArg>
DD -1

_pn5 $Pnode <_pn3, $T_comma, NULL>
$Pnode <NULL, $T_EOL, NULL>
DD -1


That's all, a linked list of $Pnode structures. You can follow it through:

1. At node _pn0, see the token INVOKE, go to _pn1.
2. At node _pn1, see the token ID (function name), call StoreFname(), go to _pn2.
3. At node _pn2, see a comma, go to _pn3.
4. At node _pn3, see the token ID, call StoreArg(), go to _pn5;
   or see the token number, call StoreArg(), go to _pn5;
   or see the token register, call StoreArg(), go to _pn5;
   or see the token ADDR, call TagArgAsADDR(), go to _pn4.
5. At node _pn4, see the token ID, call StoreArg(), go to _pn5.
6. At node _pn5, see a comma, go back to _pn3;
   or see the token EOL (end o' line), STOP (parser sees $T_EOL and ends processing).
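
The walk-through above can be sketched as a small table-driven loop. This is a Python illustration only, not the demo's MASM code: the node table mirrors the $Pnode entries (next node, token, action), but the bodies of the three stub routines and the shape of the result are assumptions, since the real stubs are not shown in this thread.

```python
def parse(tokens):
    """tokens: list of (kind, text) pairs, ending with ("EOL", "")."""
    call = {"fname": None, "args": []}
    state = {"addr": False}

    # Hypothetical stand-ins for StoreFname, StoreArg and TagArgAsADDR.
    def store_fname(text):
        call["fname"] = text

    def store_arg(text):
        call["args"].append(("ADDR " if state["addr"] else "") + text)
        state["addr"] = False

    def tag_arg_as_addr(text):
        state["addr"] = True

    # Same field order as the $Pnode entries: (next node, token, action).
    nodes = {
        "pn0": [("pn1", "INVOKE", None)],
        "pn1": [("pn2", "ID", store_fname)],
        "pn2": [("pn3", "comma", None)],
        "pn3": [("pn5", "ID", store_arg),
                ("pn5", "number", store_arg),
                ("pn5", "register", store_arg),
                ("pn4", "ADDR", tag_arg_as_addr)],
        "pn4": [("pn5", "ID", store_arg)],
        "pn5": [("pn3", "comma", None),
                (None, "EOL", None)],
    }

    node = "pn0"
    for kind, text in tokens:
        for nxt, want, action in nodes[node]:
            if kind == want:
                if action:
                    action(text)
                node = nxt
                break
        else:
            # No entry at this node accepts the token.
            raise SyntaxError("unexpected %s %r at %s" % (kind, text, node))
        if node is None:          # the EOL entry has no next node: stop
            return call
    raise SyntaxError("ran out of tokens before EOL")

tokens = [("INVOKE", "INVOKE"), ("ID", "MessageBox"), ("comma", ","),
          ("number", "0"), ("comma", ","), ("ADDR", "ADDR"), ("ID", "msg"),
          ("comma", ","), ("register", "RAX"), ("EOL", "")]
print(parse(tokens))
```

Adding a new construct then means adding entries to the table plus a stub or two, without touching the loop itself.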

So assuming your tokenizer gives you all the tokens your text contains, you only need to expand on this structure to do all kinds of parsing tasks (with some small stub subroutines to go along with it). That's the beauty of this method (if I don't say so myself).

HSE

Quote from: HSE on July 12, 2023, 02:02:53 PM
but I'm still in example 2  :biggrin:

Ok! Second essay example is working, in a not very impressive way  :biggrin:

$parseSuccess

Press any key to continue ...


Just the skeleton, to see how the parser works in a debugger.

Because the code is already in the text, I added a little challenge: to make it Neutral Bitness Code (Friedrich et al. syntax).

Then you can build the same code with ML or ML64, using the MASM32 SDK or MASM64 SDK, resulting, obviously, in a 32-bit or 64-bit binary file.

Probably Hutch would have found it funny that macros developed for 64-bits are also used in 32-bits   :smiley:
Equations in Assembly: SmplMath

NoCforMe

Wow. I'm impressed. Also honored that you took the time to actually work through that example. I take that as some kind of compliment.

So after writing this, what do you think? Was it worthwhile? Do you think you might ever actually use this for a parsing task?

It'd be cool if you did, and to see what modifications you make (besides making the code 64-bit friendly).


jj2007

Quote from: HSE on July 16, 2023, 08:58:45 AM
Probably Hutch would have found it funny that macros developed for 64-bits are also used in 32-bits   :smiley:

Yeah, funny, isn't it :mrgreen: