Parsing Text file in Assembly Language

NoCforMe · July 12, 2023, 03:09:18 PM

No, it would be kind of a big deal. Here's the thing: the tokenizer in the demo (not the parser) treats every non-numeric chunk of text as an "identifier". This includes

variable names
register names
the tokens "INVOKE" and "ADDR"

none of which contain spaces.

To allow constructs like [RAX + RDX].Table + 2, the tokenizer would have to be expanded to cover expressions within square brackets and arithmetic expressions. Plus the parser would have to be able to follow these sequences. You can see that this is definitely a non-trivial process.

The tokens for that expression would be

left square bracket ('[')
register (RAX)
plus sign ('+')
register (RDX)
right square bracket (']')
period ('.')
ID ('Table')
plus sign ('+')
number ('2')
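
To make the token list above concrete, here is a rough sketch of such an expanded tokenizer. It is written in Python rather than assembly purely for brevity, and it is not the demo's code: the register set, the token-kind names and the `tokenize` function are all assumptions made up for illustration.

```python
import re

# Hypothetical subset of register names and keywords, for illustration only.
REGISTERS = {"RAX", "RBX", "RCX", "RDX"}
KEYWORDS = {"INVOKE", "ADDR"}

# Illustrative names for the single-character tokens.
PUNCT_NAMES = {
    "[": "lbracket", "]": "rbracket", ".": "period", ",": "comma",
    "+": "plus", "-": "minus", "*": "star", "/": "slash",
    "(": "lparen", ")": "rparen",
}

# One alternative per token class; identifiers contain no spaces.
TOKEN_RE = re.compile(r"""
    (?P<number>\d+)
  | (?P<id>[A-Za-z_][A-Za-z0-9_]*)
  | (?P<punct>[\[\].,+\-*/()])
  | (?P<space>\s+)
""", re.VERBOSE)


def tokenize(text):
    """Return a list of (kind, value) pairs for one line of source text."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        pos = m.end()
        kind, value = m.lastgroup, m.group()
        if kind == "space":
            continue                      # whitespace only separates tokens
        if kind == "id":
            upper = value.upper()
            if upper in REGISTERS:
                kind = "register"
            elif upper in KEYWORDS:
                kind = upper.lower()      # "invoke" or "addr"
        elif kind == "punct":
            kind = PUNCT_NAMES[value]
        tokens.append((kind, value))
    return tokens
```

Calling `tokenize("[RAX + RDX].Table + 2")` gives the bracket, register, plus, register, bracket, period, ID, plus and number tokens in that order. The parser would still need new nodes to follow such a sequence, which is the non-trivial part.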

lingo · July 12, 2023, 03:18:53 PM

Before: buffera db "Invoke someproc, [RDX ].Table + 12, [RAX + RBX + Table], [RAX+ RBX +Table ]",0,0,0,0

Result: buffera db "Invoke someproc,[RDX].Table+12,[RAX+RBX+Table],[RAX+RBX+Table]",0,0,0,0

.data
TableDividers db ",", ".", "+", "-", "*","/","(",")","[","]",0
buffera db "Invoke someproc, [RDX ].Table + 12, [RAX + RBX + Table], [RAX+ RBX +Table ]",0,0,0,0,0,0


.code
Parser proc

LOCAL SaveRCX:QWORD, SaveRDX:QWORD, SaveRBX:QWORD
mov    SaveRCX,rcx
mov    SaveRDX,rdx
mov    SaveRBX,rbx
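               ; Strip, in place, every space in buffera that touches a divider
               ; and collapse any remaining run of spaces to a single space.
               ; rax = read pointer, rbx = write pointer (rbx never passes rax).
               lea   rax,buffera           ; source
               mov   rbx,rax               ; dest
@@:
               mov   cl,byte ptr [rax]     ; fetch next source character
               add   rax,1
               test  cl,cl
               je    ende                  ; NUL terminator: done
               cmp   cl,20h
               jne   @laba3                ; not a space: copy it
               cmp   byte ptr [rax],20h
               je    @b                    ; next is also a space: drop this one
               mov   ch,byte ptr [rbx-1]   ; last character kept so far
                                           ; (byte before buffera for a leading space)
               lea   rdx,TableDividers
@laba1:
               cmp   ch,byte ptr [rdx]
               je    @b                    ; space follows a divider: drop it
               cmp   byte ptr [rdx],0      ; end of divider table?
               lea   rdx,[rdx+1]           ; lea leaves the flags intact
               jne   @laba1
               mov   ch,byte ptr [rax]     ; character after the space
               lea   rdx,TableDividers
@laba2:
               cmp   ch,byte ptr [rdx]
               je    @b                    ; space precedes a divider: drop it
               cmp   byte ptr [rdx],0      ; end of divider table?
               lea   rdx,[rdx+1]
               jne   @laba2
@laba3:
               mov   byte ptr [rbx],cl     ; keep this character
               add   rbx,1
               jmp   @b
ende:
               mov   dword ptr [rbx],0     ; terminate the shortened string
               lea   rax,buffera           ; return pointer to the result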

mov rcx, SaveRCX
mov rdx, SaveRDX
mov rbx, SaveRBX
               ret
Parser         endp
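
For anyone who would rather read the idea than the instructions, the effect of the Before/Result pair above can be sketched in a few lines of Python. This is only an approximation of the routine's intent, not a translation of it: the function name `squeeze` and the choice to drop leading and trailing spaces are assumptions.

```python
# Same divider set as TableDividers in the assembly listing.
DIVIDERS = ",.+-*/()[]"


def squeeze(text):
    """Drop spaces that touch a divider; collapse other runs to one space."""
    out = []
    i, n = 0, len(text)
    while i < n:
        c = text[i]
        if c != " ":
            out.append(c)
            i += 1
            continue
        # Find the end of this run of spaces.
        j = i
        while j < n and text[j] == " ":
            j += 1
        prev = out[-1] if out else ""     # last character kept so far
        nxt = text[j] if j < n else ""    # first character after the run
        # Keep a single space only between two non-divider characters.
        if prev and nxt and prev not in DIVIDERS and nxt not in DIVIDERS:
            out.append(" ")
        i = j
    return "".join(out)
```

Note that this only normalizes whitespace. It does not classify anything, so a tokenizer and parser are still needed to understand the expressions.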

NoCforMe · July 12, 2023, 03:24:56 PM

OK, well, maybe ... I can see it does work, but it's basically cheating, and isn't that only designed for that particular address expression?

Also, suggestion: some comments would help. I can't figure out from glancing at it just what the hell your code does.

But interesting. A+ for cleverness. (Oh, and I really like your animated avatar. Cuuute.)

NoCforMe · July 12, 2023, 04:06:05 PM

Getting back to my demo, for anyone who's interested in the guts of the thing, this is the data structure that drives the whole parsing process (after tokenization):

;*****	The Parsing Sequence  *****
ParsingSequence	LABEL $Pnode

_pn0	$Pnode	<_pn1, $T_INVOKE, NULL>
	DD -1

_pn1	$Pnode	<_pn2, $T_ID, StoreFname>
	DD -1

_pn2	$Pnode	<_pn3, $T_comma, NULL>
	DD -1

_pn3	$Pnode	<_pn5, $T_ID, StoreArg>
	$Pnode	<_pn5, $T_number, StoreArg>
	$Pnode	<_pn5, $T_register, StoreArg>
	$Pnode	<_pn4, $T_ADDR, TagArgAsADDR>
	DD -1

_pn4	$Pnode	<_pn5, $T_ID, StoreArg>
	DD -1

_pn5	$Pnode	<_pn3, $T_comma, NULL>
	$Pnode	<NULL, $T_EOL, NULL>
	DD -1

That's all, a linked list of $Pnode structures. You can follow it through:

1. At node _pn0, see the token INVOKE, go to _pn1.
2. At node _pn1, see the token ID (function name), call StoreFname(), go to _pn2.
3. At node _pn2, see a comma, go to _pn3.
4. At node _pn3:
   see the token ID, call StoreArg(), go to _pn5;
   see the token number, call StoreArg(), go to _pn5;
   see the token register, call StoreArg(), go to _pn5;
   see the token ADDR, call TagArgAsADDR(), go to _pn4.
5. At node _pn4, see the token ID, call StoreArg(), go to _pn5.
6. At node _pn5:
   see a comma, go back to _pn3;
   see the token EOL (end o'line), STOP (parser sees $T_EOL and ends processing).
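
The walk-through above maps directly onto a small table-driven loop. Here is a sketch of that loop in Python, used only because it is compact. The node names mirror _pn0 to _pn5, but the token-kind strings, the `state` dictionary, the shape of the stored arguments and the error handling are all assumptions, not what the demo actually does.

```python
def store_fname(state, value):
    state["fname"] = value


def store_arg(state, value):
    # An argument is tagged if an ADDR token came just before it.
    state["args"].append({"value": value, "addr": state.pop("addr", False)})


def tag_arg_as_addr(state, value):
    state["addr"] = True


# node -> list of (expected token kind, action or None, next node or None)
SEQUENCE = {
    "pn0": [("invoke", None, "pn1")],
    "pn1": [("id", store_fname, "pn2")],
    "pn2": [("comma", None, "pn3")],
    "pn3": [("id", store_arg, "pn5"),
            ("number", store_arg, "pn5"),
            ("register", store_arg, "pn5"),
            ("addr", tag_arg_as_addr, "pn4")],
    "pn4": [("id", store_arg, "pn5")],
    "pn5": [("comma", None, "pn3"),
            ("eol", None, None)],          # next node None means STOP
}


def parse(tokens):
    """Walk SEQUENCE over (kind, value) tokens; return the collected call."""
    state = {"fname": None, "args": []}
    node = "pn0"
    for kind, value in tokens:
        for expected, action, nxt in SEQUENCE[node]:
            if kind == expected:
                if action is not None:
                    action(state, value)
                node = nxt
                break
        else:
            # No alternative at this node accepts the token (assumed behavior).
            raise SyntaxError(f"unexpected {kind} token {value!r} at {node}")
        if node is None:
            return state
    raise SyntaxError("input ended before the EOL token")
```

Feeding it the tokens for `INVOKE someproc, ADDR buf, RAX, 12` followed by an EOL token yields the function name plus three arguments, the first one tagged as ADDR. Supporting a new construct means adding rows to `SEQUENCE` and, where needed, one more small action function.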

So assuming your tokenizer gives you all the tokens your text contains, you only need to expand on this structure to do all kinds of parsing tasks (with some small stub subroutines to go along with it). That's the beauty of this method (if I do say so myself).

HSE · July 16, 2023, 08:58:45 AM

Quote from: HSE on July 12, 2023, 02:02:53 PM
but I'm still in example 2

Ok! The second essay example is working, in a not very impressive way:

$parseSuccess

Press any key to continue ...

Just the skeleton, to see how the parser works in a debugger.

Because the code is already in the text, I added a little challenge: making it Neutral Bitness Code (Friedrich et al. syntax).

You can then build the same code with ML or ML64, using the MASM32 SDK or the MASM64 SDK, obviously resulting in a 32-bit or 64-bit binary file.

Probably Hutch would have found it funny that macros developed for 64-bits are also used in 32-bits.

NoCforMe · July 16, 2023, 09:19:18 AM

Wow. I'm impressed. Also honored that you took the time to actually work through that example. I take that as some kind of compliment.

So after writing this, what do you think? Was it worthwhile? Do you think you might ever actually use this for a parsing task?

It'd be cool if you did, and to see what modifications you make (besides making the code 64-bit friendly).

HSE · July 16, 2023, 09:40:27 AM

Quote from: NoCforMe on July 16, 2023, 09:19:18 AM
So after writing this

It's just a first step to see how that works.

jj2007 · July 16, 2023, 10:57:03 AM

Quote from: HSE on July 16, 2023, 08:58:45 AM
Probably Hutch would have found it funny that macros developed for 64-bits are also used in 32-bits

Yeah, funny, isn't it

The MASM Forum
