The MASM Forum

General => The Workshop => Topic started by: NoCforMe on July 07, 2023, 12:27:01 PM

Title: Parsing Text file in Assembly Language
Post by: NoCforMe on July 07, 2023, 12:27:01 PM
I've written a paper that covers this topic, attached below.

This may or may not be what you're looking for. This is what I consider an excellent technique for parsing just about any kind of text. I've used it many times. It's based on assembly language but could be adapted for any other language. While it may look a bit complex, it's actually pretty simple and straightforward, once you get the basic concept.

There are actually two distinct phases here: the first is "tokenization" (also called "lexical analysis"), which analyzes the text stream and separates its elements into "tokens"; the second is the actual parsing process (syntax analysis), where the stream of tokens is analyzed and the parser takes actions depending on which tokens it sees.

Anyhow, you might give it a look-see, find out if this might work for you. Like I say, it's a very flexible tool. The really nice thing about it is that it isn't a mess of conditional statements, all nested and snarled up like a box of snakes: it's a table-driven process. Once you've diagrammed your parsing task and created the tokenization table, the code practically writes itself.
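To make the two phases concrete, here is a minimal sketch in Python rather than assembly (the paper's own code is MASM). Everything in it is an assumption made for illustration: the token kinds, the `GRAMMAR` table and the sample statement `name = number, number` are not taken from the paper. It shows only the general shape: a tokenizer that produces tokens, and a parser driven by a state table instead of nested conditionals.

```python
import re

# Phase 1 (tokenization): split raw text into (kind, text) tokens.
TOKEN_RE = re.compile(r"\s*(?:(?P<NUM>\d+)|(?P<ID>[A-Za-z_]\w*)|(?P<PUNCT>[=,]))")

def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise ValueError(f"cannot tokenize near column {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens

# Phase 2 (parsing): a state table keyed by (state, token) -> next state.
# Any pair that is missing from the table is a syntax error.
GRAMMAR = {
    (0, "ID"): 1,
    (1, "="): 2,
    (2, "NUM"): 3,
    (3, ","): 4,
    (4, "NUM"): 5,
}
ACCEPT = 5

def parse(tokens):
    state, numbers = 0, []
    for kind, text in tokens:
        key = (state, text if kind == "PUNCT" else kind)
        if key not in GRAMMAR:
            raise SyntaxError(f"unexpected {text!r}")
        state = GRAMMAR[key]
        if kind == "NUM":
            numbers.append(int(text))
    if state != ACCEPT:
        raise SyntaxError("statement ended too early")
    return numbers
```

Changing the accepted statement means editing `GRAMMAR`, not the loop, which is the appeal of the table-driven approach.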

Note: Since the attachment (PDF) was larger than 512 KB, I had to break it into 2 pieces. The 2nd part is attached to the reply to this post.
Title: Re: Parsing Text file in Assembly Language
Post by: NoCforMe on July 07, 2023, 12:28:20 PM
Here's the 2nd part of the PDF attachment. If this looks like something you'd want to use, or if you have questions, LMK.
Title: Re: Parsing Text file in Assembly Language
Post by: HSE on July 07, 2023, 11:59:04 PM
Quote from: NoCforMe on July 07, 2023, 12:27:01 PM
I've written a paper that covers this topic, attached below.

Looks like impressive work  :thumbsup:
Title: Re: Parsing Text file in Assembly Language
Post by: jj2007 on July 08, 2023, 01:06:29 AM
Quote from: NoCforMe on July 07, 2023, 12:27:01 PM
I've written a paper that covers this topic, attached below.

You put a lot of work into that :thumbsup:

However, it seems that sepult has lost interest :rolleyes:
Title: Re: Parsing Text file in Assembly Language
Post by: NoCforMe on July 08, 2023, 04:48:24 AM
Quote from: jj2007 on July 08, 2023, 01:06:29 AM
Quote from: NoCforMe on July 07, 2023, 12:27:01 PM
I've written a paper that covers this topic, attached below.

You put a lot of work into that :thumbsup:

However, it seems that sepult has lost interest :rolleyes:

That's OK; I realize this isn't for everybody. Eventually someone will come along here who finds this at least somewhat intriguing. We'll only have to wait, say, a couple years ...

Seriously, I'd be really stoked to see someone use this, since it has worked so well for me over the decades.
Title: Re: Parsing Text file in Assembly Language
Post by: mineiro on July 08, 2023, 06:00:49 AM
Interesting. It's the first step toward writing translators, preprocessors, converters, scripts, ...

Basically you provide the characters allowed in the text, those not allowed, the character that marks the end of a line, and the character(s) used by the tokenizer. From there, you register the words that will be returned as identifiers.
Next comes logical precedence: the priority of symbols over other symbols. Extra functions can handle conversions, such as turning a hexadecimal string into a number.
The GLib library has a lexical scanner; I used it a lot when I migrated to another OS.

Good job; it reminded me of the red dragon book.

(https://m.media-amazon.com/images/I/51FWXX9KWVL._SX384_BO1,204,203,200_.jpg)
Title: Re: Parsing Text file in Assembly Language
Post by: NoCforMe on July 08, 2023, 06:09:33 AM
I got the core idea for my parsing scheme from a computer science book I read back in the 1980s, forget exactly which one (it wasn't Knuth, which I also took a look at), which described the workings of a finite-state automaton (FSA). It was one of the few things in the book that wasn't completely over my head at the time.
Title: Re: Parsing Text file in Assembly Language
Post by: HSE on July 08, 2023, 06:15:12 AM
Art of Assembly? It has a nice chapter about finite state machines.
Title: Re: Parsing Text file in Assembly Language
Post by: NoCforMe on July 08, 2023, 06:20:09 AM
Quote from: HSE on July 08, 2023, 06:15:12 AM
Art of Assembly? It has a nice chapter about finite state machines.

No, the book had nothing to do with any language; it was a general computer science text.

Dang, wish I had it now; I might be able to understand more of it. It covered stuff like hashing, sparse-text tables, compiler construction, etc. Plus the usual sort algorithms, etc.
Title: Re: Parsing Text file in Assembly Language
Post by: mineiro on July 08, 2023, 07:08:55 AM
Quote from: NoCforMe on July 08, 2023, 06:20:09 AM
Dang, wish I had it now; I might be able to understand more of it. It covered stuff like hashing, sparse-text tables, compiler construction, etc. Plus the usual sort algorithms, etc.
I think it could be Algorithms by Robert Sedgewick, Brown University, 1983-1984.
Great book.

Quote from: HSE on July 08, 2023, 06:15:12 AM
Art of Assembly? It has a nice chapter about finite state machines.
AoA also has a nice chapter dealing with boolean operators.
Title: Re: Parsing Text file in Assembly Language
Post by: NoCforMe on July 08, 2023, 08:56:00 AM
Quote from: mineiro on July 08, 2023, 07:08:55 AM
Quote from: NoCforMe on July 08, 2023, 06:20:09 AM
Dang, wish I had it now; I might be able to understand more of it. It covered stuff like hashing, sparse-text tables, compiler construction, etc. Plus the usual sort algorithms, etc.
I think it could be Algorithms by Robert Sedgewick, Brown University, 1983-1984.
Great book.

Could be.

If you really want the book on computer programming, that would be Donald Knuth's multivolume The Art of Computer Programming. I only wish I could understand more than about 10% of it.

Quote from: HSE on July 08, 2023, 06:15:12 AM
Art of Assembly? It has a nice chapter about finite state machines.

I wonder how close their technique for using them is to mine. Wouldn't be surprised if it was similar; only so many ways to skin a cat. (Sorry, kitty!)
Title: Re: Parsing Text file in Assembly Language
Post by: NoCforMe on July 08, 2023, 02:02:57 PM
Hey, I'll make this offer: if anyone (including the OP) posts the spec for something they want parsed, I'll create a parser for it using my scheme and post the code here. (Assuming it's not too complicated!) Be sure to specify exactly the text you need to be interpreted.
Title: Re: Parsing Text file in Assembly Language
Post by: HSE on July 08, 2023, 02:16:14 PM
I don't know what OP is. Here "opa" means a big person with some kind of mental deficiency but a good heart (not used much these days).

First we have to test your examples  :thumbsup:
Title: Re: Parsing Text file in Assembly Language
Post by: NoCforMe on July 08, 2023, 02:33:54 PM
OP = original poster (they who started the thread)
Title: Re: Parsing Text file in Assembly Language
Post by: jj2007 on July 08, 2023, 07:22:12 PM
Quote from: NoCforMe on July 08, 2023, 02:02:57 PM
Hey, I'll make this offer: if anyone (including the OP) posts the spec for something they want parsed, I'll create a parser for it using my scheme and post the code here. (Assuming it's not too complicated!) Be sure to specify exactly the text you need to be interpreted.

I can offer a haystack, i.e. a fat text to be parsed: http://www.jj2007.eu/Bible.zip

Quote from: jj2007 on April 28, 2023, 04:35:21 PM
UnzipFile (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1234):

include \masm32\MasmBasic\MasmBasic.inc
  Init
  UnzipInit "http://www.jj2007.eu/Bible.zip" ; file or URL
  UnzipFile(0, "C:\Masm32") ; extract C:\Masm32\Bible.txt
EndOfCode

Title: Re: Parsing Text file in Assembly Language
Post by: HSE on July 09, 2023, 03:24:52 AM
Hi David,

Quote from: NoCforMe on July 08, 2023, 02:02:57 PM
Hey, I'll make this offer: if anyone (including the OP) posts the spec for something they want parsed, I'll create a parser for it using my scheme and post the code here.

I have a request  :biggrin:

Can you post the complete first example? From what I'm reading, the flow gets stuck forever in the last column:
gt20:
    jmp qword ptr [rbx + TokenParseTbl + $tokenAnyOffset]
Title: Re: Parsing Text file in Assembly Language
Post by: NoCforMe on July 09, 2023, 05:22:29 AM
Quote from: jj2007 on July 08, 2023, 07:22:12 PM
Quote from: NoCforMe on July 08, 2023, 02:02:57 PM
Hey, I'll make this offer: if anyone (including the OP) posts the spec for something they want parsed, I'll create a parser for it using my scheme and post the code here. (Assuming it's not too complicated!) Be sure to specify exactly the text you need to be interpreted.
I can offer a haystack, i.e. a fat text to be parsed: http://www.jj2007.eu/Bible.zip

Quote from: jj2007 on April 28, 2023, 04:35:21 PM
UnzipFile (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1234):

include \masm32\MasmBasic\MasmBasic.inc
  Init
  UnzipInit "http://www.jj2007.eu/Bible.zip" ; file or URL
  UnzipFile(0, "C:\Masm32") ; extract C:\Masm32\Bible.txt
EndOfCode


OK, that's fine, but what's the particular needle you're searching for in that haystack?
Title: Re: Parsing Text file in Assembly Language
Post by: jj2007 on July 09, 2023, 05:33:40 AM
Quote from: NoCforMe on July 09, 2023, 05:22:29 AM
OK, that's fine, but what's the particular needle you're searching for in that haystack?

For example, all phrases that start with "Satan"?

The problem here is that the OP has lost interest, and we are just playing around.
Title: Re: Parsing Text file in Assembly Language
Post by: NoCforMe on July 09, 2023, 06:17:43 AM
Quote from: jj2007 on July 09, 2023, 05:33:40 AM
Quote from: NoCforMe on July 09, 2023, 05:22:29 AM
OK, that's fine, but what's the particular needle you're searching for in that haystack?

For example, all phrases that start with "Satan"?

OK, I may tackle that after I post the example that Hector requested.

Quote
The problem here is that the OP has lost interest, and we are just playing around.

Hmm, not really a problem. We're allowed to play around if we like.
After all, didn't you say that assembly should be fun?
Title: Re: Parsing Text file in Assembly Language
Post by: NoCforMe on July 09, 2023, 06:27:30 AM
Quote from: HSE on July 09, 2023, 03:24:52 AM
Hi David,

Quote from: NoCforMe on July 08, 2023, 02:02:57 PM
Hey, I'll make this offer: if anyone (including the OP) posts the spec for something they want parsed, I'll create a parser for it using my scheme and post the code here.

I have a request  :biggrin:

Can you post the complete first example? From what I'm reading, the flow gets stuck forever in the last column:
gt20:
    jmp qword ptr [rbx + TokenParseTbl + $tokenAnyOffset]


Here's the first example fleshed out. I left out some code that's needed (string matching and numeric conversion), but this should give the general idea. (You also need to provide a GNC(), but this is trivial, just get the next byte from the text buffer.)


$exitCodeSuccess EQU 0
$exitCodeError EQU -1

.data

TextBufferPtr DD ?

;====================================
; The Tokenization Table
;
; This is what drives the whole process
;====================================

TokenParseTbl LABEL DWORD

;     [#]    =     ,    EOL  [any]
; ------------------------------------------
DD _pnX, _pnX, _pnX, _pnX, _pnA ;0
DD _pnX, _pnB, _pnX, _pnX, _pn1 ;1
DD _pnC, _pnX, _pnX, _pnX, _pnX ;2
DD _pn3, _pnX, _pnD, _pnX, _pnX ;3
DD _pnE, _pnX, _pnX, _pnX, _pnX ;4
DD _pn5, _pnX, _pnX, _pnF, _pnX ;5

TokenParseChars DB '=', ',', $EOL
$numParseChars EQU $ - TokenParseChars

$tokenRow1 EQU ($numParseChars + 2) * 4 ;DWORD offset
$tokenRow2 EQU $tokenRow1 * 2
$tokenRow3 EQU $tokenRow1 * 3
$tokenRow4 EQU $tokenRow1 * 4
$tokenRow5 EQU $tokenRow1 * 5

$tokenAnyOffset EQU ($numParseChars + 1) * 4

TextBuffer DB 256 DUP(?)


.code

;====================================
; Tokenizer()
;
; Reads the text stream, breaks it into tokens.
; On return from a successful tokenization,
; the values "x" and "y" are stored.
;
; Returns:
; $exitCodeSuccess or
; $exitCodeError
;====================================

Tokenizer PROC

XOR EBX, EBX ;Start @ row 0.

parse0: CALL GNC ;Get next char. from buffer.
CMP AL, $CR ;Throw away carriage returns.
JE parse0
CMP AL, $tab ;Magically turn tabs into spaces.
JNE gt3
MOV AL, ' '

; 1. Weed out numeric digits:
gt3: CMP AL, '0'
JB gt10
CMP AL, '9'
JA gt10
JMP DWORD PTR [EBX + TokenParseTbl] ;[#] is 1st col. in  table row.

; 2. Try to match any parsing characters:
gt10: LEA EDI, TokenParseChars
MOV EDX, EDI ;Save pointer to list of chars.
MOV ECX, $numParseChars
REPNE SCASB
JNE gt20 ;No match, go to "any" col.

SUB EDI, EDX ;Get offset from start of list.
SHL EDI, 2 ;Convert to DWORD offset.
JMP DWORD PTR [EBX + EDI + TokenParseTbl]

; No match, so go to "any char." column:
gt20: JMP DWORD PTR [EBX + TokenParseTbl + $tokenAnyOffset]


;====================================
; Tokenizing Nodes
;====================================

;*****  Set up for ID storage:  *****
_pnA LABEL NEAR
MOV TextBufferPtr, OFFSET TextBuffer
MOV EBX, $tokenRow1

; ... fall through to ...

;***** Store ID char.:  *****
_pn1 LABEL NEAR
MOV EDX, TextBufferPtr
MOV [EDX], AL
INC TextBufferPtr
JMP parse0

_pn2 LABEL NEAR ;"Do-nothing" node.
JMP parse0

;***** Match ID against string ("hotspot"):  *****
_pnB LABEL NEAR

; (insert code to do string matching here)
; Will also have to return an error if the string doesn't match "hotspot".
MOV EBX, $tokenRow2
JMP parse0

;*****  Set up for # storage:  *****
; (we use the same buffer for text and # storage)
_pnC LABEL NEAR
MOV TextBufferPtr, OFFSET TextBuffer
MOV EBX, $tokenRow3

; ... fall through to ...

;***** Store # char.:  *****
; (notice this is the same as _pn1; we could re-use that node
; and save some space, but let's keep this for clarity)
_pn3 LABEL NEAR
MOV EDX, TextBufferPtr
MOV [EDX], AL
INC TextBufferPtr
JMP parse0

;***** Save "x" value:  *****
_pnD LABEL NEAR

; (insert code to convert ASCII digits to binary and store it
;  as the value for "x")
MOV EBX, $tokenRow4
JMP parse0

;*****  Set up for # storage:  *****
_pnE LABEL NEAR
MOV TextBufferPtr, OFFSET TextBuffer
MOV EBX, $tokenRow5

; ... fall through to ...

;***** Store # char.:  *****
_pn5 LABEL NEAR
MOV EDX, TextBufferPtr
MOV [EDX], AL
INC TextBufferPtr
JMP parse0

;***** Save "y" value, exit w/success:  *****
_pnF LABEL NEAR

; (insert code to convert ASCII digits to binary and store it
;  as the value for "y")
MOV EAX, $exitCodeSuccess
RET

;***** Return error:  *****
_pnX LABEL NEAR
MOV EAX, $exitCodeError
RET

Tokenizer ENDP


One thing to realize is how easily this code flows from the diagram. Look at the diagram for that 1st example, and you should be able to see how the code for the tokenization "nodes" follows it. That's what I really like about this method, it kind of writes itself.
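For readers who don't read MASM, the same table can be sketched in Python. This is an illustrative port under assumptions, not the author's code: the nodes become small functions, the `JMP DWORD PTR [EBX + ...]` dispatch becomes a list lookup, and the parts the listing leaves as "(insert code ...)" are filled in with the simplest thing that works (an exact match on "hotspot" and `int()` for the numbers). The names `parse_hotspot`, `column`, `ParseError` and `DONE` are hypothetical.

```python
class ParseError(Exception):
    pass

PARSE_CHARS = "=,\n"             # same order as TokenParseChars; EOL assumed to be LF
ANY_COL = len(PARSE_CHARS) + 1   # last column: [any]
DONE = -1                        # hypothetical marker: tokenizer finished

def column(ch):
    """Pick the table column for a character: [#], =, comma, EOL, [any]."""
    if "0" <= ch <= "9":
        return 0
    i = PARSE_CHARS.find(ch)
    return i + 1 if i >= 0 else ANY_COL

def parse_hotspot(text):
    buf, result = [], {}

    def begin(next_row):          # nodes A, C, E: reset buffer, store char
        def node(ch):
            buf[:] = [ch]
            return next_row
        return node

    def keep(row):                # nodes 1, 3, 5: store char, stay in row
        def node(ch):
            buf.append(ch)
            return row
        return node

    def match_id(ch):             # node B: assumed exact match on "hotspot"
        if "".join(buf) != "hotspot":
            raise ParseError("expected 'hotspot'")
        return 2

    def save(name, next_row):     # nodes D and F: convert digits, store value
        def node(ch):
            result[name] = int("".join(buf))
            return next_row
        return node

    def X(ch):                    # node X: error exit
        raise ParseError(f"unexpected {ch!r}")

    A, n1, B = begin(1), keep(1), match_id
    C, n3, D = begin(3), keep(3), save("x", 4)
    E, n5, F = begin(5), keep(5), save("y", DONE)
    table = [
        # [#]  =   ,  EOL [any]
        [X,   X,  X,  X,  A],    # 0
        [X,   B,  X,  X,  n1],   # 1
        [C,   X,  X,  X,  X],    # 2
        [n3,  X,  D,  X,  X],    # 3
        [E,   X,  X,  X,  X],    # 4
        [n5,  X,  X,  F,  X],    # 5
    ]
    row = 0
    for ch in text:
        if ch == "\r":            # throw away carriage returns
            continue
        if ch == "\t":            # turn tabs into spaces
            ch = " "
        row = table[row][column(ch)](ch)
        if row == DONE:
            return result["x"], result["y"]
    raise ParseError("unexpected end of input")
```

With this sketch, `parse_hotspot("hotspot=10,20\n")` returns `(10, 20)`, and a line that stops after the first number raises `ParseError` because row 3 has node X in its EOL column.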
Title: Re: Parsing Text file in Assembly Language
Post by: jj2007 on July 09, 2023, 07:16:57 AM
Ok, there are no phrases that start with Satan in Bible.txt (http://www.jj2007.eu/Bible.zip), so I used The Lord instead :biggrin:

Source:
include \masm32\MasmBasic\MasmBasic.inc
  Init
  Let esi=FileRead$("bible.txt")
  xor ecx, ecx
  .While 1
    inc ecx ; increase start position
    .Break .if !Instr_(ecx, esi, "The Lord", 4) ; full word, case-sensitive
    mov edi, eax ; save start
    lea ecx, [edx+8] ; advance index
    .Repeat
        dec eax
        mov dl, [eax]
    .Until dl==33 || dl=="?" || dl=="." || dl==10 || dl>"@" ; end.!? Satan
    .if Zero?
        mov eax, edi ; start
        .Repeat
            inc eax
            mov dl, [eax]
        .Until dl==33 || dl=="?" || dl=="." || dl==13 || !dl ; end of phrase
        sub eax, edi
        inc eax ; get the dot, too
        .if eax>80
            m2m eax, 80 ; too long for display
        .endif
        PrintLine Left$(edi, eax)
    .endif
  .Endw
  Inkey "done"
EndOfCode


Output:
The Lord work a care and conscience in us to know Him and serve Him, that we may
The Lord of heaven and earth bless Your Majesty with many and happy days, that,
The Lord shall laugh at him: for he seeth that his day is coming.
The Lord gave the word: great [was] the company of those that published [it].
The Lord said, I will bring again from Bashan, I will bring [my people] again fr
The Lord at thy right hand shall strike through kings in the day of his wrath.
The Lord sent a word into Jacob, and it hath lighted upon Israel.
The Lord GOD hath given me the tongue of the learned, that I should know how to
The Lord GOD hath opened mine ear, and I was not rebellious, neither turned away
The Lord GOD which gathereth the outcasts of Israel saith, Yet will I gather [ot
The Lord hath trodden under foot all my mighty [men] in the midst of me: he hath
The Lord hath swallowed up all the habitations of Jacob, and hath not pitied: he
The Lord was as an enemy: he hath swallowed up Israel, he hath swallowed up all
The Lord hath cast off his altar, he hath abhorred his sanctuary, he hath given
The Lord GOD hath sworn by his holiness, that, lo, the days shall come upon you,
The Lord GOD hath sworn by himself, saith the LORD the God of hosts, I abhor the
The Lord then answered him, and said, [Thou] hypocrite, doth not each one of you
The Lord [is] at hand.
The Lord [be] with you all.
The Lord give mercy unto the house of Onesiphorus; for he oft refreshed me, and
The Lord grant unto him that he may find mercy of the Lord in that day: and in h
The Lord Jesus Christ [be] with thy spirit.
The Lord knoweth how to deliver the godly out of temptations, and to reserve the
The Lord is not slack concerning his promise, as some men count slackness; but i
done
Title: Re: Parsing Text file in Assembly Language
Post by: HSE on July 09, 2023, 07:30:41 AM
Quote from: NoCforMe on July 09, 2023, 06:27:30 AM
it kind of writes itself.

:biggrin: I'm very used to "a mess of conditional statements"

Thanks, that will help  :thumbsup:

HSE
Title: Re: Parsing Text file in Assembly Language
Post by: NoCforMe on July 09, 2023, 07:30:53 AM
JJ, try to think of a better example. What you showed is basically a text search; how about something where you're extracting values from a text construct, along the lines of my "hotspot=<x>,<y>", but a little less trivial. Something I can get my parsing teeth into.
Title: Re: Parsing Text file in Assembly Language
Post by: jj2007 on July 09, 2023, 07:33:16 AM
Sorry, I don't understand what you mean - maybe you should give it a try?

Honestly, I am curious to see a concrete example of your approach.
Title: Re: Parsing Text file in Assembly Language
Post by: HSE on July 09, 2023, 09:07:12 AM
 :biggrin: That looks better in code:
;*****  Set up for ID storage:  *****
_pnA LABEL NEAR
    lea rcx, TextBuffer       
    mov TextBufferPtr, rcx
    mov rbx, $tokenRow1

; ... fall through to ...


because the article says:
***** Set up for ID storage: *****
_pnA LABEL NEAR
MOV TextBufferPtr, OFFSET TextBuffer
MOV EBX, $tokenRow1
JMP parse0

which made me lose the path  :rolleyes:
Title: Re: Parsing Text file in Assembly Language
Post by: NoCforMe on July 09, 2023, 09:18:54 AM
Yes, that is a little misleading. I should probably provide the entire example in the article to make that more clear.
Title: Re: Parsing Text file in Assembly Language
Post by: mineiro on July 09, 2023, 10:16:01 AM
I have a challenge for you, sir NoCforMe. I understand how powerful what you posted is. This could be a great project for people who want to create plugins for text editors.

Translate the line below:
invoke function, addr something, 1, 2, 3

to strings:
mov r9,3
mov r8,2
mov rdx,1
lea rcx, addr something
call function

Rules:
An argument marked with addr inside an invoke statement should be translated as "lea" instead of "mov".
Title: Re: Parsing Text file in Assembly Language
Post by: mineiro on July 09, 2023, 10:24:56 AM
Really sorry, I can't edit my post to fix my error.
So, change:
lea rcx, addr something

to:
lea rcx, something

Well, 7 beer effect.
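As a reference point for the challenge (with the correction above applied), here is a minimal Python sketch of the requested translation. It is not the table-driven approach from the paper; it just splits on commas, which is enough for the single example line. The function name `translate_invoke` and the four-argument limit are assumptions of this sketch.

```python
ARG_REGS = ["rcx", "rdx", "r8", "r9"]   # registers used in the challenge output

def translate_invoke(line):
    words = line.split(None, 1)
    if len(words) != 2 or words[0].lower() != "invoke":
        raise ValueError("not an invoke statement")
    func, *args = [part.strip() for part in words[1].split(",")]
    if len(args) > len(ARG_REGS):
        raise ValueError("this sketch handles at most four arguments")
    out = []
    # The challenge lists the last argument first, so walk the pairs backwards.
    for reg, arg in reversed(list(zip(ARG_REGS, args))):
        pieces = arg.split()
        if len(pieces) == 2 and pieces[0].lower() == "addr":
            out.append(f"lea {reg}, {pieces[1]}")   # rule: addr -> lea
        else:
            out.append(f"mov {reg}, {arg}")
    out.append(f"call {func}")
    return out
```

Calling `translate_invoke("invoke function, addr something, 1, 2, 3")` yields the five strings from the challenge, ending in `lea rcx, something` and `call function`.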
Title: Re: Parsing Text file in Assembly Language
Post by: NoCforMe on July 09, 2023, 10:26:43 AM
Excellent challenge! Just what the doctor ordered.

I'm on it.

First task: clearly identify the problem.
To generalize, we have as input:

INVOKE f, s, a1, a2, a3

where
OK? So far so good. We want to transform this to the form you gave in your challenge.

Since the first task of parsing is tokenization, we need to define the universe of tokens here:
That's it. We don't need to include all the possible register names because those will simply be treated as "unknown IDs", just like any other variable.

I'll work on the tokenizer and get back when it's done.
Title: Re: Parsing Text file in Assembly Language
Post by: mineiro on July 09, 2023, 11:21:05 AM
Sir, you went further than I asked. Keep things simple so people can understand how powerful a lexical scanner is.
If you want to create what you proposed, then release two versions.
Leave the rest to modular programming, people's cognitive parallax.
Title: Re: Parsing Text file in Assembly Language
Post by: NoCforMe on July 09, 2023, 11:37:38 AM
Actually it's no more complex than your original spec, except for handling hex #s, which is only slightly more complicated.

Here's my FSA diagram for the tokenizer. This is probably the most important part of the process--drawing the tokenizer. It needs to be done on paper. All the code you'll write will come from this picture.

Should be self-explanatory. This will become a subroutine called GetNextToken(). The top part (nodes A, 1 and B) tokenizes identifiers. Nodes G and H return EOL and comma respectively. The rest of it is for numbers (decimal or hex).
Title: Re: Parsing Text file in Assembly Language
Post by: mineiro on July 09, 2023, 12:18:53 PM
Yes, like that.
Tabs (09h) and spaces (20h) inside this invoke logic (parser) should be treated as ignorable characters (just advance the pointer). The symbol "," will be the token separator.
Inside an invoke context, if the last valid character before the EOL is "\", continue on the next line (ignore the EOL) until a line does not end in "\" (the real EOL).
Some people write:
invoke function,\               ;some comments, there's tabs and spaces separating this
        addr something,\        ;another
        1,\
        2,\
        3

That's all. If a user writes:
db "invoke something"
that is outside this "scope": another scope will treat invoke as a literal because of the quotes (symbol precedence), so there is nothing to do here.
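The continuation and comment rules described above could be handled in a pre-pass before tokenizing. The following Python sketch is one possible reading of those rules, not code from the thread; `strip_comment` and `logical_lines` are hypothetical helper names. A ";" only starts a comment outside quotes, and a line whose last non-comment character is "\" is joined with the next one.

```python
def strip_comment(line):
    """Cut the line at the first ';' that is not inside a quoted string."""
    quote = None
    for i, ch in enumerate(line):
        if quote:
            if ch == quote:
                quote = None
        elif ch in "\"'":
            quote = ch
        elif ch == ";":
            return line[:i]
    return line

def logical_lines(source):
    """Yield statements with comments removed and '\\' continuations joined."""
    pending = ""
    for raw in source.splitlines():
        code = strip_comment(raw).rstrip()
        if code.endswith("\\"):
            pending += code[:-1].strip() + " "   # keep collecting
            continue
        joined = (pending + code.strip()).strip()
        pending = ""
        if joined:
            yield joined
    if pending.strip():
        yield pending.strip()
```

Fed the five-line invoke above, this yields the single statement `invoke function, addr something, 1, 2, 3`, while `db "invoke something"` passes through unchanged because it does not start with the invoke keyword and its text sits inside quotes.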
Title: Re: Parsing Text file in Assembly Language
Post by: NoCforMe on July 09, 2023, 05:40:16 PM
Here's a demo tokenizer, attached below. It's a console program that prompts you for a statement, then tokenizes it (or at least attempts to), and displays the results. Was kinda fun to make and to use. Try it out.

You'll notice that it does a great job of recognizing tokens, but it's dumb as a stump when it comes to making any kind of sense out of what you type. All of the following will successfully pass through it:
That's because it's just a dumb tokenizer that doesn't know anything about the "grammar" of the statement it's working on, meaning what needs to go where for the statement to make sense. That'll happen during the next task. All it knows is how to break apart the statement into its "atomic" parts. But we've accomplished a lot so far. Stay tuned.

The two important things to look at here in the source are the tokenization table (TokenParseTbl) and associated data, and the GetNextToken() subroutine which does the tokenization. The rest of the stuff is support functions: getting characters out of the buffer, matching IDs, converting #s, uppercasing characters, etc.
Title: Re: Parsing Text file in Assembly Language
Post by: NoCforMe on July 09, 2023, 06:17:09 PM
OK, last thing for tonight: here's the parsing sequence for all those tokens we can now extract. This defines the "grammar" of the statement. I've followed Hector's "challenge" here, but expanded a bit on the last 3 arguments, which can be any of a variable name, the ADDR of a variable, a number (decimal or hex), or the name of a register. It's very easy to accommodate these without a ton of spaghetti code, since this thing is data-driven as you'll see. But first I have to get some sleep ...

Again, it's very important that you put this down on paper first before typing one single character into your code.
Title: Re: Parsing Text file in Assembly Language
Post by: jj2007 on July 09, 2023, 08:22:26 PM
My take on the tokeniser (full source of 76 lines attached). There is a text file with samples attached, edit at your convenience; it contains lines like this:

InVoke someApi123, 111h, eax, ADDR mytext, 123,456,   addr MyVar,addr    MyVar2  ; comment

Here is the tokeniser:

GetToken proc ; expects string in esi
  .Repeat
    lodsb
  .Until al>="@" || (al>="0" && al<="9") || al==59
  dec esi ; esi on someApi
  mov eax, esi
  .Repeat
    inc eax
    mov dl, [eax]
  .Until dl=="," || !dl || dl==59 || dl==13 ; 59=comment
  push edx ; last byte (could be zero or carriage return)
  .Repeat
    dec eax
    mov dl, [eax]
  .Until dl>="@" || (dl>="0" && dl<="9") ; eax points to trailing bytes
  inc eax
  push eax ; addr last valid byte
  sub eax, esi
  xchg eax, ecx
  Let t$(tCt)=Left$(esi, ecx)
  inc tCt
  pop esi
  pop eax
  and eax, 11110010b ; stop for Cr=1101b and nullbyte but not for comma or comment
  ret
GetToken endp


Sample output (yes, the assembler would throw two errors):
-------- Sample text: --------
invoke MyAlgo, eax, 123, 456

mov rcx, eax
mov rdx, 123
mov r8, 456
call MyAlgo

-------- Sample text: --------
InVoke someApi123, 111h, eax, ADDR mytext, 123,456,   addr MyVar,addr    MyVar2  ; comment
line two

mov rcx, 111h
mov rdx, eax
lea r8, mytext
mov r9, 123
push 456
lea rax, MyVar
push rax
lea rax, MyVar2
push rax
call someApi123
Title: Re: Parsing Text file in Assembly Language
Post by: HSE on July 10, 2023, 01:40:36 AM
Quote from: NoCforMe on July 09, 2023, 06:27:30 AM
One thing to realize is how easily this code flows from the diagram.

Mmm  :biggrin:

How does it move from B to C? (first example)
Title: Re: Parsing Text file in Assembly Language
Post by: mineiro on July 10, 2023, 02:31:24 AM
Quote from: jj2007 on July 09, 2023, 08:22:26 PM
Sample output (yes, the assembler would throw two errors):
Or maybe three:

mov rcx, eax

line two

push 456
push rax
push rax
3 items pushed on the stack (dword, qword, qword); the rsp register will be unaligned before the call instruction, even supposing expansion to 3 qwords.
Title: Re: Parsing Text file in Assembly Language
Post by: jj2007 on July 10, 2023, 03:36:15 AM
Quote from: mineiro on July 10, 2023, 02:31:24 AM
Quote from: jj2007 on July 09, 2023, 08:22:26 PM
Sample output (yes, the assembler would throw two errors):
Or maybe three:

mov rcx, eax

line two

push 456
push rax
push rax
3 items pushed on the stack (dword, qword, qword); the rsp register will be unaligned before the call instruction, even supposing expansion to 3 qwords.

- The "line two" was just a test if the tokeniser recognises a carriage return correctly.
- A push 123 is not DWORD, it's qword.
- Re stack alignment, in real code, a CreateWindowEx looks like this:

invoke CreateWindowEx, WS_EX_CLIENTEDGE, Chr$("RichEdit20A"), NULL, reStyle, 0, 0, 1, 1, hWnd, ID_EDIT, wcx.hInstance, NULL

000000014000117C | 48:836424 58 00            | and [rsp+58],0                  |
0000000140001182 | 4C:8B15 C31E0000           | mov r10,[14000304C]             |
0000000140001189 | 4C:895424 50               | mov [rsp+50],r10                |
000000014000118E | 48:C74424 48 6F000000      | mov [rsp+48],6F                 | 6F:'o'
0000000140001197 | 44:8B55 10                 | mov r10d,[rbp+10]               |
000000014000119B | 4C:895424 40               | mov [rsp+40],r10                |
00000001400011A0 | 48:C74424 38 01000000      | mov [rsp+38],1                  |
00000001400011A9 | 48:C74424 30 01000000      | mov [rsp+30],1                  |
00000001400011B2 | 48:836424 28 00            | and [rsp+28],0                  |
00000001400011B8 | 48:836424 20 00            | and [rsp+20],0                  |
00000001400011BE | 41:B9 C4013050             | mov r9d,503001C4                |
00000001400011C4 | 45:33C0                    | xor r8d,r8d                     | r8d:"flags=3"
00000001400011C7 | 48:8D15 E11E0000           | lea rdx,[1400030AF]             | 00000001400030AF:"RichEdit20A"
00000001400011CE | B9 00020000                | mov ecx,200                     |
00000001400011D3 | FF15 A3330000              | call [<&CreateWindowExA>]       |
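The listing stores arguments five through twelve at [rsp+20] through [rsp+58] with mov/and instead of pushing them, so rsp does not move between the stores and the call. That matches the Microsoft x64 calling convention: the first four integer arguments go in rcx, rdx, r8 and r9, and later ones go above the 32-byte shadow space. As a rough Python sketch of where each argument lands (the helper name `arg_locations` is made up for this illustration):

```python
def arg_locations(count):
    """Return where each of `count` integer arguments is passed on Win64."""
    regs = ["rcx", "rdx", "r8", "r9"]
    locations = []
    for i in range(count):
        if i < len(regs):
            locations.append(regs[i])
        else:
            # 20h bytes of shadow space come first, then 8 bytes per argument.
            locations.append(f"[rsp+{0x20 + 8 * (i - 4):X}h]")
    return locations
```

For the 12-argument CreateWindowEx call, `arg_locations(12)` ends in `[rsp+58h]`, the same slot the first instruction of the listing clears for the final NULL.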
Title: Re: Parsing Text file in Assembly Language
Post by: NoCforMe on July 10, 2023, 05:38:38 AM
OK, I see I have some competition here. I'll persist in my project nonetheless.

This is a learning experience for me as well; going over my design, I can see some flaws that should be corrected (in the next version):
Probably there are other issues I haven't discovered yet.
Hopefully later today I'll have a full parser coded and posted here. We'll see.
Title: Re: Parsing Text file in Assembly Language
Post by: jj2007 on July 10, 2023, 09:24:22 AM
Quote from: NoCforMe on July 10, 2023, 05:38:38 AM
it's possible to have nonsensical constructs like ADDR rsi.

Right, but that's going one step further: a syntax check, plus error messages...
Title: Re: Parsing Text file in Assembly Language
Post by: NoCforMe on July 10, 2023, 09:48:28 AM
Here's the parsing demo, attached below.

It works, and I believe it meets Hector's challenge (he'll have to be the judge of that). As I said, there are some problems with it, and certainly room for refinement. I'll put out another version soon that addresses at least some of these issues.

Notice that when entering any of the last 3 variables, they can be:
Notice that it correctly handles the difference between a variable value and its address:

varname--> MOV [register], varname
ADDR var--> LEA [register], varname


Also, with Hector's permission, I would allow the first parameter to be any of these things. I don't see any reason why it should be limited to ADDR varname. (But then I don't do 64-bit programming.)

I don't like the way it handles numbers. If you enter a # in hex it spits it out in decimal. The correct way to handle this will be to preserve the original text the user entered, including negative numbers, which will require changes to the tokenizer. That will also simplify the code by eliminating the numeric conversion routines.

* I learned something from this, which is that MASM accepts negative hex numbers. I didn't know this, as I've never used them.

One problem is that you can still enter nonsensical constructs, like ADDR RSI, because it doesn't know a register name from a hole in the ground. So I'll add all the register names to the list of known IDs and tag them as a register type to avoid this problem.
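The "known IDs tagged by type" idea can be sketched as a lookup table. This is a Python illustration under assumptions, not the demo's actual data structure: the tag names, the register list (64-bit general-purpose registers only) and the helper names `classify` and `check_addr_operand` are all invented here.

```python
# Hypothetical known-ID table: every recognized word carries a type tag.
KNOWN_IDS = {"INVOKE": "keyword", "ADDR": "keyword"}
KNOWN_IDS.update({name: "register" for name in (
    "RAX", "RBX", "RCX", "RDX", "RSI", "RDI", "RBP", "RSP",
    "R8", "R9", "R10", "R11", "R12", "R13", "R14", "R15")})

def classify(word):
    """Return the tag for a word; anything not in the table is an unknown ID."""
    return KNOWN_IDS.get(word.upper(), "unknown")

def check_addr_operand(word):
    """ADDR only makes sense in front of a variable name (an unknown ID)."""
    tag = classify(word)
    if tag != "unknown":
        raise ValueError(f"ADDR cannot be applied to {tag} {word!r}")
    return word
```

With the tags in place, `check_addr_operand("RSI")` is rejected while `check_addr_operand("MyVar")` passes, and the check itself stays a table lookup rather than a chain of comparisons.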

Let me know whatcha think.
Title: Re: Parsing Text file in Assembly Language
Post by: jj2007 on July 10, 2023, 09:52:48 AM
Enter statement to test: >InVoke someApi123, 111h, eax, ADDR mytext, 123,456,   addr MyVar

Tokenization error.

Enter statement to test: >invoke someApi, 111h, eax

Tokenization error.


What's wrong?

The previous version (INVOKEtokenizer) works fine:
Enter statement to test: >invoke MyAlgo, eax, 123, 456

TOKEN: INVOKE
TOKEN: Unknown ID: "MyAlgo"
TOKEN: comma
TOKEN: Unknown ID: "eax"
TOKEN: comma
TOKEN: number (123)
TOKEN: comma
TOKEN: number (456)
TOKEN: EOL. Tokenization completed successfully!
Title: Re: Parsing Text file in Assembly Language
Post by: NoCforMe on July 10, 2023, 09:55:43 AM
Quote from: HSE on July 10, 2023, 01:40:36 AM
Quote from: NoCforMe on July 09, 2023, 06:27:30 AM
One thing to realize is how easily this code flows from the diagram.

Mmm  :biggrin:

How does it move from B to C? (first example)

Sorry I missed your question before. You're talking about that trivial 1st example, right? If you're on node B, seeing a numeric character (#) will move you to node C.

Be sure to check out the parsing demo I just posted.
Title: Re: Parsing Text file in Assembly Language
Post by: NoCforMe on July 10, 2023, 09:59:07 AM
Quote from: jj2007 on July 10, 2023, 09:52:48 AM
Enter statement to test: >InVoke someApi123, 111h, eax, ADDR mytext, 123,456,   addr MyVar

Tokenization error.

Enter statement to test: >invoke someApi, 111h, eax

Tokenization error.


What's wrong?

Sorry, should have been more explicit.

I'm following Hector's challenge pretty closely, where the first argument must be ADDR var. Try that. You also have to give 4 arguments, the last 3 of which can be any of what I described above.

Here's the template:

INVOKE function, ADDR var1, var2, var3, var4

There's no reason it has to work that way; I'm just following the format of the challenge. I'll modify it to accept any number of arguments.

Additional enhancement: For the sake of completeness, it should handle binary #s too (10100011B). Version after the next, maybe.
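
A possible way to recognize the three number forms while keeping the user's text untouched, sketched in Python rather than MASM. It assumes MASM's default radix of 10, where hex literals start with a digit and end in `h` and binary literals end in `b`; the function name and the returned labels are made up for this example.

```python
import re

# One pattern per literal form; each allows an optional leading minus sign.
NUMBER_FORMS = [
    ("hex", re.compile(r"-?[0-9][0-9A-Fa-f]*[hH]")),
    ("binary", re.compile(r"-?[01]+[bB]")),
    ("decimal", re.compile(r"-?[0-9]+")),
]


def classify_number(text):
    """Return (kind, original_text) for a numeric literal, or None if it is not one.

    The text is never converted, so what the user typed is what gets emitted.
    """
    for kind, pattern in NUMBER_FORMS:
        if pattern.fullmatch(text):
            return (kind, text)
    return None
```

Because nothing is converted, `111h` comes back out as `111h` instead of being printed in decimal.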
Title: Re: Parsing Text file in Assembly Language
Post by: jj2007 on July 10, 2023, 10:29:37 AM
Never mind, I also just discovered a little glitch in my version :tongue:

invoke CreateWindowEx, WS_EX_CLIENTEDGE, Chr$("RichEdit20A"), NULL, reStyle, 0, 0, 1, 1, hWnd, ID_EDIT, wcx.hInstance, NULL

mov rcx, WS_EX_CLIENTEDGE
mov rdx, Chr$("RichEdit20A
mov r8, NULL
mov r9, reStyle
push 0
push 0
push 1
push 1
push hWnd
push ID_EDIT
push wcx.hInstance
push NULL
call CreateWindowEx


Correct version attached.
Title: Re: Parsing Text file in Assembly Language
Post by: HSE on July 10, 2023, 11:53:56 AM
Just in case: it's mineiro's challenge  :thumbsup:

I'm still on the trivial B-to-C thing  :biggrin:
(solved now!!)
Title: Re: Parsing Text file in Assembly Language
Post by: jj2007 on July 10, 2023, 07:00:28 PM
A few lines for testing your algos:

invoke MyAlgo, eax, "a string", FP4(456.789)
invoke MyAlgo, eax, 123, 456
InVoke someApi123, 111h, eax, ADDR mytext, 123,456,   addr MyVar,addr    MyVar2  ; comment
invoke CreateWindowEx, WS_EX_CLIENTEDGE, Chr$("RichEdit20A"), NULL, reStyle, 0, 0, 1, 1, hWnd, ID_EDIT, wcx.hInstance, NULL


Post yours, too, please :thup:
Title: Re: Parsing Text file in Assembly Language
Post by: NoCforMe on July 11, 2023, 05:55:15 AM
This version
I guess that's all for this version. I'd like to point out that these changes actually reduced the size of the code overall. (No more numeric conversion required, for one thing: numbers are handled as text.)
Some of the recent changes didn't require any change to the code. Instead, all the changes were to data structures, which to me is the beauty of this technique: it's data driven. Yes, it's a bit more complicated than a nested mess of conditional code statements, but once you get it running, it's so easy to expand it or make changes to the "grammar" of what you're parsing.

I hope someone tests this out and reports back to us.

Next version: remove the fixed requirement for 4 arguments, let it handle n arguments (for some reasonable limit of n, say 8 or 10).
Title: Re: Parsing Text file in Assembly Language
Post by: jj2007 on July 11, 2023, 06:05:12 AM
Are there still special requirements for the invokes?

INVOKE--> code parser demo, version 2
Allows dec/hex/binary #, registers, var or ADDR var for all 4 args.

Enter statement to test: >invoke MyAlgo, eax, "a string", FP4(456.789)

Tokenization error.

Enter statement to test: >invoke MyAlgo, eax, 123, 456

Tokenization error.

Enter statement to test: >InVoke someApi123, 111h, eax, ADDR mytext, 123,456,   addr MyVar,addr    MyVar2  ; comment

Tokenization error.

Enter statement to test: >invoke CreateWindowEx, WS_EX_CLIENTEDGE, Chr$("RichEdit20A"), NULL, reStyle, 0, 0, 1, 1, hWnd, ID_EDIT, wcx.hInstance, NULL

Tokenization error.
Title: Re: Parsing Text file in Assembly Language
Post by: NoCforMe on July 11, 2023, 06:12:09 AM
JJ: Aaaaargh. You're pushing the boundaries to the breaking point.

Here's the rulez:

Format:  INVOKE function, arg1, arg2, arg3, arg4
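where all args can be any of the following:

  • The name of a variable
  • The address of a variable (ADDR varname)
  • A register name (only 64-bit names allowed)
  • A decimal, hex or binary number (incl. negatives)
All case-insensitive. 4 arguments, no more, no less. No floating-point stuff. NO STRINGS ALLOWED! This is just a demo (besides, how would a string make sense here?)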

Curiously, it should accept Chr$("RichEdit20A") as a valid identifier.

Now try it again.
Title: Re: Parsing Text file in Assembly Language
Post by: HSE on July 11, 2023, 06:18:13 AM
 :biggrin: a little obvious test.

Enter statement to test: >INVOKE function, arg1, arg2, arg3, arg4

        MOV     R9, arg1
        MOV     R8, arg2
        MOV     RDX, arg3
        MOV     RCX, arg4
        CALL    function
Title: Re: Parsing Text file in Assembly Language
Post by: jj2007 on July 11, 2023, 06:21:49 AM
That one worked, thanks Hector :thup:

Enter statement to test: >INVOKE function, arg1, arg2, arg3, arg4

        MOV     R9, arg1
        MOV     R8, arg2
        MOV     RDX, arg3
        MOV     RCX, arg4
        CALL    function

Enter statement to test: >INVOKE whatever, asdasd, 123, ecx

Tokenization error.


I just tested my version with invoke strings extracted from your source:
INVOKE  WinMain, EAX
INVOKE  ExitProcess, EAX
INVOKE  StdOut, OFFSET ProgramHeading
INVOKE  StdIn, OFFSET InputBuffer, SIZEOF InputBuffer
INVOKE  StdOut, OFFSET CRLFstr
INVOKE  wsprintf, ADDR buffer, OFFSET CALLfmt,
INVOKE  StdOut, ADDR buffer
INVOKE  strcmpi, OFFSET TextBuffer, [EBX].$T_entry.T_IDptr
INVOKE  strcpy, ECX, OFFSET FnameStorage
INVOKE  strcpy, ECX, OFFSET Var1Storage + 1
INVOKE  strcpy, ECX, OFFSET Var2Storage + 1
INVOKE  strcpy, ECX, OFFSET Var3Storage + 1
INVOKE  strcpy, ECX, OFFSET Var4Storage + 1


And I found a little bug :sad:

It's fixed, see attached version 3.
Title: Re: Parsing Text file in Assembly Language
Post by: HSE on July 11, 2023, 06:24:01 AM
Enter statement to test: >INVOKE function, RDX, RCX, RAX, RBX

Tokenization error.


Also, the x64 ABI register order is inverted in the result.
Title: Re: Parsing Text file in Assembly Language
Post by: NoCforMe on July 11, 2023, 06:25:49 AM
Quote from: jj2007 on July 11, 2023, 06:21:49 AM
Enter statement to test: >INVOKE whatever, asdasd, 123, ecx

Tokenization error.


Y'see, that one failed for a reason: it violated the grammar defined for the statement. (Which I gave above.) That's exactly what a parser is spozed to do.
Title: Re: Parsing Text file in Assembly Language
Post by: NoCforMe on July 11, 2023, 06:27:07 AM
Quote from: HSE on July 11, 2023, 06:24:01 AM
Enter statement to test: >INVOKE function, RDX, RCX, RAX, RBX

Tokenization error.


Also, the x64 ABI register order is inverted in the result.
Aaaaargh; that one shoulda worked. Back to the lab.
Fixed. Updated code attached to previous reply up there.

Regarding the correct order of args and registers: I'll fix that in the next, generalized version.
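
For reference, the Microsoft x64 calling convention puts the first four integer arguments in RCX, RDX, R8 and R9, in that order. A small Python sketch (not MASM, and not the demo's actual code generator; `emit_call` is a made-up name) of emitting the MOVs in the right order:

```python
# First four integer argument registers of the Microsoft x64 calling convention.
ARG_REGISTERS = ("RCX", "RDX", "R8", "R9")


def emit_call(function, args):
    """Return the output lines for a call with at most four register arguments."""
    if len(args) > len(ARG_REGISTERS):
        raise ValueError("this sketch only handles register arguments")
    # zip pairs arg1 with RCX, arg2 with RDX, and so on.
    lines = [f"MOV     {reg}, {arg}" for reg, arg in zip(ARG_REGISTERS, args)]
    lines.append(f"CALL    {function}")
    return lines
```

Calling `emit_call("function", ["arg1", "arg2", "arg3", "arg4"])` gives arg1 in RCX and arg4 in R9, the reverse of the output reported above.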
Title: Re: Parsing Text file in Assembly Language
Post by: jj2007 on July 11, 2023, 06:36:27 AM
Quote from: NoCforMe on July 11, 2023, 06:25:49 AM
Quote from: jj2007 on July 11, 2023, 06:21:49 AM
Enter statement to test: >INVOKE whatever, asdasd, 123, ecx

Tokenization error.


Y'see, that one failed for a reason: it violated the grammar defined for the statement. (Which I gave above.) That's exactly what a parser is spozed to do.

Quote from: NoCforMe on July 11, 2023, 06:12:09 AM
Here's the rulez:

Format:  INVOKE function, arg1, arg2, arg3, arg4
where all args can be any of the following:

  • The name of a variable
  • The address of a variable (ADDR varname)
  • A register name (only 64-bit names allowed)
  • A decimal, hex or binary number (incl. negatives)
All case-insensitive. 4 arguments, no more, no less. No floating-point stuff. NO STRINGS ALLOWED!

INVOKE whatever, asdasd, 123, ecx failed, but
INVOKE whatever, asdasd, 123, rcx, 456 worked, congrats :thumbsup:

P.S.: Here is my output for the invoke strings extracted from your source, see version 3 above:

-------- Sample text: --------
INVOKE  WinMain, EAX

mov rcx, EAX
call WinMain

-------- Sample text: --------
INVOKE  ExitProcess, EAX

mov rcx, EAX
call ExitProcess

-------- Sample text: --------
INVOKE  StdOut, OFFSET ProgramHeading

mov rcx, OFFSET ProgramHeading
call StdOut

-------- Sample text: --------
INVOKE  StdIn, OFFSET InputBuffer, SIZEOF InputBuffer

mov rcx, OFFSET InputBuffer
mov rdx, SIZEOF InputBuffer
call StdIn

-------- Sample text: --------
INVOKE  StdOut, OFFSET CRLFstr

mov rcx, OFFSET CRLFstr
call StdOut

-------- Sample text: --------
INVOKE  wsprintf, ADDR buffer, OFFSET CALLfmt,

lea rcx, buffer
mov rdx, OFFSET CALLfmt
mov r8, ñm³v
call wsprintf

-------- Sample text: --------
INVOKE  StdOut, ADDR buffer

lea rcx, buffer
call StdOut

-------- Sample text: --------
INVOKE  strcmpi, OFFSET TextBuffer, [EBX].$T_entry.T_IDptr

mov rcx, OFFSET TextBuffer
mov rdx, [EBX].$T_entry.T_IDptr
call strcmpi

-------- Sample text: --------
INVOKE  strcpy, ECX, OFFSET FnameStorage

mov rcx, ECX
mov rdx, OFFSET FnameStorage
call strcpy

-------- Sample text: --------
INVOKE  strcpy, ECX, OFFSET Var1Storage + 1

mov rcx, ECX
mov rdx, OFFSET Var1Storage + 1
call strcpy

-------- Sample text: --------
INVOKE  strcpy, ECX, OFFSET Var2Storage + 1

mov rcx, ECX
mov rdx, OFFSET Var2Storage + 1
call strcpy

-------- Sample text: --------
INVOKE  strcpy, ECX, OFFSET Var3Storage + 1

mov rcx, ECX
mov rdx, OFFSET Var3Storage + 1
call strcpy

-------- Sample text: --------
INVOKE  strcpy, ECX, OFFSET Var4Storage + 1

mov rcx, ECX
mov rdx, OFFSET Var4Storage + 1
call strcpy


I should fix the mov rcx, EAX stuff :rolleyes:
Title: Re: Parsing Text file in Assembly Language
Post by: NoCforMe on July 11, 2023, 06:59:36 AM
So, a couple general questions here:

1. What's the purpose of this conversion, anyhow? I mean apart from being the basis of a parsing demo, which I'm all in on.
What's wrong with using the INVOKE macro as-is? Why would someone want to unroll the code this way?

2. My next version of the demo will generalize it, so it won't be limited to strictly 4 arguments as it is now. But after looking at the x64 calling convention, I think I'll stop here:
I think that's reasonable; after all, this is a parsing demo, not an exhaustive example of x64 usage. (Hell, I don't even use any 64-bit stuff myself!)

Anyhow, next (and probably last) version coming soon ...
Title: Re: Parsing Text file in Assembly Language
Post by: HSE on July 11, 2023, 07:24:01 AM
Quote from: NoCforMe on July 11, 2023, 06:59:36 AM
basis of a parsing demo, which I'm all in on.

:thumbsup: Too much complexity is hard to follow. Perhaps the demo should be just difficult enough to be the third example in your essay, and no more.


Quote from: NoCforMe on July 11, 2023, 06:59:36 AM
I don't even use any 64-bit stuff myself

Perhaps you haven't seen the masm64 SDK invoke macro. It can be written a little more beautifully:
    invoke MACRO fname:REQ,args:VARARG
      procedure_call fname,args
    ENDM

    procedure_call MACRO fname:REQ,a1:VARARG

      LOCAL lead,wrd2,ssize,sreg,svar
       
      arg1_n = 0 
      FOR arg2, <a1>

        ;; **************************
        ;; first 4 register arguments
        ;; **************************
          IF arg1_n eq 0
            REGISTER arg2,cl,cx,ecx,rcx,xmm0
          ENDIF       
          IF arg1_n eq 1
            REGISTER arg2,dl,dx,edx,rdx,xmm1
          ENDIF
          IF arg1_n eq 2
            REGISTER arg2,r8b,r8w,r8d,r8,xmm2
          ENDIF
          IF arg1_n eq 3
             REGISTER arg2,r9b,r9w,r9d,r9,xmm3
          ENDIF
        ;; **************************
        ;; following stack arguments
        ;; **************************
          IF arg1_n gt 3
            STACKARG arg2,arg1_n*8
          ENDIF

          arg1_n = arg1_n + 1
      ENDM

      call fname

    ENDM


Not so ugly spaghetti :biggrin:
Title: Re: Parsing Text file in Assembly Language
Post by: NoCforMe on July 11, 2023, 07:31:45 AM
Quote from: HSE on July 11, 2023, 07:24:01 AM
Quote from: NoCforMe on July 11, 2023, 06:59:36 AM
basis of a parsing demo, which I'm all in on.

:thumbsup: Too much complexity is hard to follow. Perhaps the demo should be just difficult enough to be the third example in your essay, and no more.
Well, I apologize for that. But the complexity here follows from the requirements of the demo, which are not trivial. My hope is that the underlying concepts--using an FSA for tokenization and a linked list for parsing--will somehow reveal themselves to the curious here despite the complexity. And hey, it's not that complicated!
Quote
Not so ugly spaghetti :biggrin:

Yes, but that isn't a parser. Mine is.
Title: Re: Parsing Text file in Assembly Language
Post by: HSE on July 11, 2023, 07:44:39 AM
Quote from: NoCforMe on July 11, 2023, 07:31:45 AM
Yes, but that isn't a parser. Mine is.

  :thumbsup: Just a kind of "lexical analysis" (MASM has the macro tokenizer)
Title: Re: Parsing Text file in Assembly Language
Post by: jj2007 on July 11, 2023, 07:51:14 AM
Quote from: HSE on July 11, 2023, 07:24:01 AM
Perhaps you haven't seen the masm64 SDK invoke macro.

Perhaps you haven't seen the jinvoke macro - it's a factor 12 bigger but works with MASM, UAsm and AsmC, checks argument counts and types of arguments, just like the original 32-bit MASM invoke macro :biggrin:

jinvoke MACRO apiarg, args:VARARG
Local tmp$, tmpA$, api$, apx$, apinum, dllnum, info$, inf1$, c1$, is, isCrt, isXmm, oa
Local isR, curSlot, curApi, rev$, isO, isOl, ctArgs, rspExtra, cVarArg, pushReg
; tmp$ CATSTR <jinv &apiarg&, line >, %@Line,  < with _jbInit=>, %jbInit, <, _jbPBI=>, %jbPBI, <, _jbPrologRun=>, %jbPrologRun
; % echo tmp$
ifdef needsnop
mov rbp, rbp
nops 4
endif
  ife jbPrologRun
if @64
; nop
endif
  endif
  api$ CATSTR <apiarg>
  isCrt INSTR api$, <crt_>
  if isCrt
api$ SUBSTR api$, 5
  endif
  apx$ CATSTR <j@>, api$
  ; % echo ____ api$ -> apx$
  is INSTR 1, apx$, </>
  ife is
tmp$ CATSTR <LABEL >, <apiarg>
% echo ____ LABEL apiarg uses invoke apx$ ____
invoke apiarg, args
  else
;   % echo -------- Hello.... DefPrc$ in [api$] or [apiarg]
  isR INSTR DefPrc$, api$ ; j#myalgo#s1441144
  if isR eq 3
    % echo -- Hello DP: [DefPrc$] and info: [info$]
    ; .err ; info$
apinum equ <-1>
isR INSTR isR+1, DefPrc$, <#>
dllnum equ <15000>
info$ SUBSTR DefPrc$, isR+1
info$ CATSTR info$, <xxxxxxxxxxxxxxxxxxxx>
  else
apinum SUBSTR apx$, 1, is-1
isR INSTR is+1, apx$, <:>
dllnum SUBSTR apx$, is+1, isR-is-1
info$ SUBSTR apx$, isR+1
; % echo info: [info$] ; s1441...
info$ CATSTR info$, <xxxxxxxxxxxxxxxxxxxx>
  endif
; tmp$ CATSTR <api >, <apiarg>, < has ID >, %apinum, < and DLL >, %dllnum, <, info=>, info$
; % echo tmp$
; define a new variable jdll@123
tmp$ CATSTR <jdll@>, dllnum
is = opattr tmp$
if is eq 36 ; immediate
curSlot=tmp$ ; already defined
else
curSlot=jaCtDll
if dllnum ge 0 ; <0 is own proc
tmp$ CATSTR tmp$, <=>, %jaCtDll
tmp$
tmp$ CATSTR <jd@>, %dllnum ; jd@3 equ advapi32
; % echo ## DLL: tmp$
tmp$ CATSTR <txDll>, %curSlot, < equ db ">, tmp$, <", 0>
; % echo tmp$
tmp$
jaCtDll=jaCtDll+1
endif
endif
; define a new variable j@123
tmp$ CATSTR <j@>, %apinum
is = opattr tmp$
if is eq 36
; echo ####### tmp$ already defined ###### ; immediate
curApi equ tmp$
else
is INSTR tmp$, <j@> ; jTypeChk follows below
if is
tmp$ CATSTR tmp$, <=>, %jaCtApi
% tmp$ ; the % is for ML64 (erratic errors)
if apinum gt 50000
if jbVerbose
    % echo own api$
endif
curApi equ <-1>
else
tmp$ CATSTR <txApi>, %jaCtApi, < equ ap>, %jaCtApi, <$ db curSlot+1, ">, api$, <", 0>
if jbVerbose
    % echo Win api$
endif
% tmp$ ; the % is for ML64 (erratic errors)
curApi=jaCtApi
jaCtApi=jaCtApi+1 ; this is total, not current
endif
else
curApi equ tmp$
endif
endif
ifidn <args>, <@>
call iaApi[SIZE_P*curApi]
EXITM
elseifidn <args>, <@def@>
EXITM
elseifidn <args>, <@address>
mov rax, iaApi[SIZE_P*curApi]
EXITM
endif
isR=0
cVarArg INSTR info$, <c6x>
rev$ equ <# >
for arg, <args> ; REVERSE
  isR=isR+1
  tmp$ CATSTR <arg>, <    >
  tmp$ SUBSTR tmp$, 1, 4
  rev$ CATSTR <arg>, <#>, rev$
  inf1$ SUBSTR info$, isR+1, 1
  ifidn inf1$, <x>
  ife cVarArg
  ; % echo info$/inf1$
tmp$ CATSTR <## line >, %@Line, <: too many arguments for &apiarg& ##>
% echo tmp$
.err
  endif
  endif
  ifdifi tmp$, <addr>
  if @InStr(1, <arg>, <&>) ne 1
  if @InStr(1, <arg>, <*>) ne 1
  if type arg eq REAL8
  ifdif inf1$, <3> ; :s131
  ;.err <## REAL8 not expected ##>
  endif
  elseif type arg eq QWORD
  ifidn inf1$, <3>
  ; % echo info$/inf1$
  .err <## REAL8 expected ##>
  endif
  endif
  endif
  endif
  endif
endm
is INSTR info$, <x>
if is gt isR+2
  ife cVarArg
  ; tmp$ CATSTR <i=>, %is, <, r=>, %isR
; % echo tmp$
  ; % echo info$/inf1$
  tmp$ CATSTR <## line >, %@Line, <: not enough arguments for &apiarg& ##>
  % echo tmp$
  .err
endif
endif
ctArgs=isR
rspExtra=0
if @64
if ctArgs GT 4
is=ctArgs mod 4
rspExtra=ctArgs/4
ife jbCompStyle
REPEAT 4-is
push rbx ; r8 is a 2-byter, so we take rbx
ENDM
endif
elseif ctArgs LT 4 ; can be merged
ife jbCompStyle
repeat 4-ctArgs ; 1...3 dummy pushes, rest pushed below
    push rbx
endm
endif
endif
  if isR GT jbArgsUsed+20 ; ?????
  jbArgsUsed=isR+20
  .err <isr gt argsused>
  endif
endif
; tmp$ CATSTR <rev=>, rev$
; % echo tmp$
; if usedeb
; mov rsi, rsi ; for debugging
; int 3
; endif
; % echo API: api$, INFO: info$
  ; mov rsp, rsp ; ----- start moving args into stack ---------
While isR ; push in right order: rcx rdx r8 r9 pushed5 pushed6 etc
  isR=isR-1
  is INSTR rev$, <#>
  tmp$ SUBSTR rev$, 1, is-1
  c1$ SUBSTR rev$, 1, 1 ; only for & and *
  tmpA$ CATSTR tmp$, <    >
  tmpA$ SUBSTR tmpA$, 1, 4
isOl=0 ; 0=no addr, offset, * or &
  ifidni tmpA$, <offs>
isOl=7 ; substr must compensate offset characters
  elseifidni tmpA$, <addr>
  isOl=5 ; substr must compensate addr characters
  elseifidn c1$, <&>
  isOl=1
  elseifidn c1$, <*>
  isOl=1
  endif
  if @64
  pushReg equ <r10>
  pushRegD equ <r10d>
  if isR eq 0
  pushReg equ <rcx>
  pushRegD equ <ecx>
  elseif isR eq 1
  pushReg equ <rdx>
  pushRegD equ <edx>
  elseif isR eq 2
  pushReg equ <r8>
  pushRegD equ <r8d>
  elseif isR eq 3
  pushReg equ <r9>
  pushRegD equ <r9d>
  endif
  csDest equ [rsp+8*isR] ; [rbp+x] is same size in X64
  if jbCompStyle ; always, it's the default now
  c1$ SUBSTR info$, isR+2, 1
; if apinum gt 50000 and usedeb
; oa INSTR info$, <x>
; if oa
; tmpx$ SUBSTR info$, 1, oa
; else
; tmpx$ CATSTR info$
; endif
; oa = (opattr tmp$) AND 127
;   tmpx$ CATSTR <Count=>, %isR, <: arg=[>, tmp$, <], c=>, c1$, < in >, tmpx$, <, o=>, %oa
; % echo tmpx$
; endif
noMem=4 ; and (useCB eq 0)
  if isOl
  tmp$ SUBSTR tmp$, 1+isOl
  ; if jbVerbose
  ; % echo off tmp$ ; not very useful
  ; endif
  lea pushReg, tmp$ ; addr or offset
  if isR ge noMem
  mov csDest, pushReg
  endif
  elseif type(tmp$) eq REAL8 ; REAL8 to xmm
  jTypeChk cVarArg, isR, c1$, <4REAL8>, api$
  if isR ge noMem
  movlps xmm0, tmp$ ; real8 to xmm0 (no conversion)
  movlps qword ptr csDest, xmm0
  else ; first 4 in xmm? and rcx rdx r8 r9
  ; % echo DEST: csDest
  tmp2$ CATSTR <movlps xmm>, %isR, <,>, tmp$
  ; % echo **** Passing a REAL8: tmp2$
  % tmp2$
  if cVarArg ; Parameter passing: Floating-point values are only placed in the integer registers RCX, RDX, R8, and R9 when there are varargs arguments
  tmp2$ CATSTR <movd >, pushReg, <, xmm>, %isR
  ; % echo **** Passing a REAL8 both to xmm? and reg64: tmp2$
  % tmp2$
  endif
  endif
  elseif type(tmp$) eq REAL4 ; REAL4 to xmm
  jTypeChk cVarArg, isR, c1$, <3REAL4>, api$
  if isR ge noMem
  movd xmm0, tmp$ ; real4 to xmm0 (no conversion)
  movd dword ptr csDest, xmm0
  else ; first 4 in xmm? and rcx rdx r8 r9
  ; % echo DEST: csDest
  tmp2$ CATSTR <movd xmm>, %isR, <,>, tmp$
  ; % echo **** Passing a REAL8: tmp2$
  % tmp2$
  if cVarArg ; Parameter passing: Floating-point values are only placed in the integer registers RCX, RDX, R8, and R9 when there are varargs arguments
  tmp2$ CATSTR <movd >, pushReg, <, xmm>, %isR
  ; % echo **** Passing a REAL8 both to xmm? and reg64: tmp2$
  % tmp2$
  endif
  endif
  elseif type(tmp$) LT SIZE_P ; zero-extend
  ; % echo xx  tmp$  xx less than size_p
jTypeChk cVarArg, isR, c1$, <1DWORD>, api$ ; let's check if the callee wants something else
  oa = opattr tmp$
  if oa eq atImmediate
  if isR lt noMem ; use registers
  ife tmp$
  xor pushRegD, pushRegD ; shortest option?
  else
  if tmp$ eq -1
  xor pushRegD, pushRegD
  dec pushReg
  elseif tmp$ LT 0
  mov pushReg, tmp$
  else
  mov pushRegD, tmp$
  endif
  endif
  ; no mov csDest, pushReg
  else ; move immediate into stack
  ife tmp$
  and qword ptr csDest, 0 ; shortest option; dword is ok for regs but not mem
  else
  if tmp$ eq -1
  or qword ptr csDest, -1
  else
  mov qword ptr csDest, tmp$
  endif
  endif  
  endif
  elseif oa eq atRegister
  mov csDest, tmp$
  else
  if type tmp$ LT DWORD
  movsx pushRegD, tmp$
  else
  mov pushRegD, tmp$
  endif
  if isR ge noMem
  mov csDest, pushReg
  endif
  endif
  else ; SIZE_P (s-code 1)
  isXmm INSTR tmp$, <xmm>
  if isXmm eq 1
  jTypeChk cVarArg, isR, c1$, <4REAL8>, api$ ; TypeCheck: 4, 3REAL4, c1=[3]
  ; % echo A: movlps csDest, tmp$
  movlps QWORD ptr csDest, tmp$
  if isR lt noMem
  if cVarArg
  ; % echo B: mov pushReg, tmp$
  movd pushReg, tmp$
  endif
  endif
  else
  jTypeChk cVarArg, isR, c1$, <1DWORD>, api$ ; TypeCheck: 4, 3REAL4, c1=[3]
  ifdifi pushReg, tmp$
  if isR ge noMem
  oa = (opattr tmp$) AND 127
if oa eq atRegister
  mov csDest, tmp$
else
  mov pushReg, tmp$
  mov csDest, pushReg
endif
  else
  mov pushReg, tmp$
  endif
  else
  if isR ge noMem ; otherwise fastcall
  mov csDest, tmp$ ; same as pushReg
  endif
  endif
  endif
  endif
  else ; vvv 32-bit code vvv
  if isOl
  tmp$ SUBSTR tmp$, 1+isOl
  lea pushReg, tmp$ ; addr or offset
  mPush pushReg ; 32-bit code
  elseif type tmp$ eq REAL4
  mov pushRegD, tmp$ ; use 32-bit instruction
  mPush pushReg
  tmp$ CATSTR <** Warning, line >, %@Line, <: passing a REAL4 may not work **>
% echo tmp$
  elseif type tmp$ LT SIZE_P ; zero-extend
  oa = opattr tmp$
  if oa eq atImmediate
  ife tmp$
  xor pushRegD, pushRegD
  push pushReg
  else
  push tmp$
  if isR LT 4
  mov pushReg, [rsp]
  endif
  endif
  else
  if type tmp$ LT DWORD
  movsx pushRegD, tmp$
  else
  mov pushRegD, tmp$
  endif
  mPush pushReg
  endif
  else
  ifdifi pushReg, tmp$
  mov pushReg, tmp$
  endif
  mPush pushReg
  endif
  endif
  else ; v v 32-bit code v v
  if isOl ; <addr>
  if isOl eq 7
  mPush tmp$
else
  tmp$ SUBSTR tmp$, 1+isOl
  oa = (opattr tmp$) AND 127
  if oa eq atGlobal
  push offset tmp$
  else
  lea edx, tmp$
  mPush edx
endif
  endif
  elseif (type tmp$ eq REAL8) or (type tmp$ eq QWORD) ; see add rsp
  mPush dword ptr tmp$[4]
  mPush dword ptr tmp$
  rspExtra=rspExtra+4
  else
  isXmm INSTR tmp$, <xmm>
  ife isXmm
  mPush tmp$
else
push eax
endif
  endif
  endif
  ; % echo ---- rev$ -------
  rev$ SUBSTR rev$, is+1
ENDM
  ; mov rbp, rbp ; ----- end moving args ---------
if @64
; tmp$ CATSTR <apiarg>, < has >, %ctArgs, < paras>
; % echo tmp$
if 0
mPush arg6 ; sixth parameter ; mazegen
mPush arg5 ; fifth parameter
sub rsp, 4*8 ; allocate space for 'Register Parameter Stack Area'
mov r9, arg4
mov r8, arg3
mov rdx, arg2
mov rcx, arg1
call function ; inactive
add rsp, 4*8 + 2*8 ; release all parameters from stack
endif
endif
ifidn api$, <ExitProcess>
j@ExDone=1
endif
if jbStrings
tmp$ CATSTR <CALL >, api$, < as >, %(curApi), </>, %jaCtApi
% echo tmp$
endif
if usedeb and apinum lt 50000 ; --- solved for x64 with syms and VS14, see bax ---
ife @64
tmp2$ CATSTR <mov edx, Chr$(">, api$, <")>
; % echo api$: tmp2$
% tmp2$
endif
endif
if @64X
; sub rsp, 4*8
rspExtra=rspExtra+1
endif
if apinum gt 50000
call api$ ;; user proc
else
call iaApi[SIZE_P*curApi]
endif
is INSTR info$, <c>
if (is eq 1) and (@64 eq 0)
add rsp, 4*ctArgs+rspExtra ; 32-bit C stack correction; rspex for QWORD
endif
if rspExtra and jbCompStyle eq 0
add rsp, rspExtra*32
endif
  endif
ENDM
Title: Re: Parsing Text file in Assembly Language
Post by: HSE on July 11, 2023, 07:54:35 AM
Quote from: jj2007 on July 11, 2023, 07:51:14 AM
Perhaps you haven't seen the jinvoke macro

Yes. That is an ugly spaghetti  :biggrin: :biggrin:
Title: Re: Parsing Text file in Assembly Language
Post by: jj2007 on July 11, 2023, 09:15:07 AM
Quote from: HSE on July 11, 2023, 07:54:35 AM
Quote from: jj2007 on July 11, 2023, 07:51:14 AM
Perhaps you haven't seen the jinvoke macro

Yes. That is an ugly spaghetti  :biggrin: :biggrin:

I knew you would like it :greensml:

Look how it translates the CreateWindowEx into real, efficient code:
int 3
jinvoke CreateWindowEx, WS_EX_CLIENTEDGE, Chr$("RichEdit20A"), NULL, reStyle, 0, 0, 1, 1, hWnd, ID_EDIT, wcx.hInstance, NULL
nop


int3                            |
and [rsp+58],0                  | NULL
mov r10,[140003054]             | wcx
mov [rsp+50],r10                | wcx
mov [rsp+48],6F                 | ID_EDIT
mov r10d,[rbp+10]               | hWnd
mov [rsp+40],r10                | hWnd
mov [rsp+38],1                  | 1
mov [rsp+30],1                  | 1
and [rsp+28],0                  | 0
and [rsp+20],0                  | 0
mov r9d,503001C4                | reStyle
xor r8d,r8d                     | NULL
lea rdx,[1400030B7]             | Chr$("RichEdit20A")
mov ecx,200                     | WS_EX
call [<&CreateWindowExA>]       |
nop                             |



P.S.: I fixed the mov rcx, EAX bug (version 4 attached):

-------- Sample text: --------
invoke CreateWindowEx, WS_EX_CLIENTEDGE, Chr$("RichEdit20A"), NULL, reStyle, 0, 0, 1, 1, hWnd, ID_EDIT, wcx.hInstance, NULL

mov rcx, WS_EX_CLIENTEDGE
mov rdx, Chr$("RichEdit20A")
mov r8, NULL
mov r9, reStyle
push 0
push 0
push 1
push 1
push hWnd
push ID_EDIT
push wcx.hInstance
push NULL
call CreateWindowEx

-------- Sample text: --------
INVOKE  WinMain, EAX

mov ecx, EAX
call WinMain

-------- Sample text: --------
INVOKE  ExitProcess, EAX

mov ecx, EAX
call ExitProcess


Of course, this is still a push orgy, so it's not real code as shown above.
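
To make the difference concrete, here is a small Python sketch (not MASM) of where each argument lands under the Microsoft x64 convention: the first four go in RCX, RDX, R8 and R9, and argument number n (counting from 0) beyond that goes to the stack slot at [rsp+8*n], which is what the [rsp+20] to [rsp+58] stores in the disassembly above show. The function name is made up, and the sketch assumes the caller has already reserved the stack space; it only computes locations and emits no instructions.

```python
ARG_REGISTERS = ("RCX", "RDX", "R8", "R9")


def placements(args):
    """Pair each argument with its register or its stack slot (offset in hex)."""
    result = []
    for index, arg in enumerate(args):
        if index < len(ARG_REGISTERS):
            where = ARG_REGISTERS[index]
        else:
            # Slots 0..3 are the shadow space for the register arguments,
            # so the fifth argument (index 4) starts at rsp+20h.
            where = f"[rsp+{8 * index:X}h]"
        result.append((arg, where))
    return result
```

For the 12-argument CreateWindowEx line this gives [rsp+20h] for the fifth argument and [rsp+58h] for the last one.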
Title: Re: Parsing Text file in Assembly Language
Post by: HSE on July 11, 2023, 09:54:11 AM
Quote from: NoCforMe on July 11, 2023, 07:31:45 AM
And hey, it's not that complicated!

:biggrin: So far (I'm on the second example), for these simple things, spaghetti is easier.

But better than modifying spaghetti is starting from zero. Then we have the chance to understand how these table-driven FSMs can be built (for more complex cases).  :thumbsup:
Title: Re: Parsing Text file in Assembly Language
Post by: NoCforMe on July 11, 2023, 03:14:15 PM
Here's the latest version. Takes from 1 to 8 arguments, places the first 4 in registers, pushes any others on the stack.
Try it out.

I think this is as far as I go with this demo; it has met (and exceeded) the challenge by mineiro.

It'd be nice to get some feedback on this. I'm thinking of making an evaluation form, with questions like this:

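A. What is your overall opinion of this demo?

  • I think it's great and can't wait to implement it.
  • It's interesting, but maybe some other day.
  • Not sure about this.
  • You'd have to pay me to even think about using this!
  • I'd never use this even if you paid me!

Ha ha, just kidding. But seriously, give me some feedback here. Like I said, this isn't for everyone, but I think it demonstrates an important and very useful technique in text analysis.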

It may seem complex, but believe me, after doing two or three of these, it's very easy to start a new parser from scratch. It's like riding a bicycle; the first few times are hard, but it becomes second nature after that. A lot of stuff can be block-copied to save coding time. And you can build very extensive parsers with this method. Just to show you, here's a command file for a graph-making program I did a long time ago that uses my parsing methods:


;===============================================
;   Sample MAKEGRAF control file (test.gcf)
;===============================================

;text (location=(400, 60) text="Index #16" color=16)
;text (location=(20, 180) text="Index #1" color=1)
;text (location=(20, 140) text="Index #2" color=2)
;text (location=(20, 100) text="Index #3" color=3)
;text (location=(20, 60) text="Index #4" color=4)
;text (location=(100, 180) text="Index #5" color=5)
;text (location=(100, 140) text="Index #6" color=6)
;text (location=(100, 100) text="Index #7" color=7)
;text (location=(100, 60) text="Index #8" color=8)
;text (location=(200, 180) text="Index #9" color=9)
;text (location=(200, 140) text="Index #10" color=10)
;text (location=(200, 100) text="Index #11" color=11)
;text (location=(200, 60) text="Index #12" color=12)
;text (location=(400, 180) text="Index #13" color=13)
;text (location=(400, 140) text="Index #14" color=14)
;text (location=(400, 100) text="Index #15" color=15)
;text (location=(500, 60) text="!" color=16)

palette (load "test.gpf"
13(255,0,0) ;define red as RED!
)

graph (
size=(640,400)
filename="test.bmp"
bgcolor=11
)

grid (llcorner=(80, 60)
gridcolor=16
axisthickness=2
width=500
height=300
bgcolor=7)

line(start=(100,75)end=(250,120))
line(start=(250,120)end=(310,270))
line(start=(310,270)end=(450,120))
line(start=(450,120)end=(460,300))

font="5x9.sff"
font="7x11.sff"

; This statement shows a bug: horizontal rotated text doesn't render properly:
;text(location=(400,200) text="Weird; rotated horizontal text" rotation=TRUE)

text (location=(200, 390) text = "Civilians Killed in Iraq (M)" color=1)
text (location=(40, 380) text="3!#$&%*/(0123456789)" color=6 font="5x9.sff")
text (location=(40, 50)
text="!#$&%*/(0123456789):;@<=>ABCDEFGHIJKLMNOPQRSTUVWXYZ?[\\]^`ab"
color=2 font="7x11.sff")
text (location=(40, 30)
text=".,'\"cdefghijklmnopqrstuvwxyz{|}~"
color=2 font="7x11.sff")
text(location=(50,350)text="0123456789" color=13 font="7x11.sff" direction=vert)
text(location=(50,100)text="ROTATED TEXT" color=14 font="7x11.sff" direction=vert rotation=true)

dot (location=(100,76) color=13)
dot (location=(250,121) color=13)
dot (location=(310,270) color=13 shape=square)
Title: Re: Parsing Text file in Assembly Language
Post by: jj2007 on July 11, 2023, 06:11:32 PM
Quote from: NoCforMe on July 11, 2023, 03:14:15 PM
It'd be nice to get some feedback on this. I'm thinking of making an evaluation form, with questions like this:

A. What is your overall opinion of this demo?

  • I think it's great and can't wait to implement it.
  • It's interesting, but maybe some other day.
  • Not sure about this.
  • You'd have to pay me to even think about using this!
  • I'd never use this even if you paid me!

6. You are almost there :thumbsup:

Using invoke lines from your latest source:
INVOKE--> code parser demo, version 4
Allows dec/hex/binary #, registers, var or ADDR var for
up to 8 arguments (requires at least 1).

Enter statement to test: >INVOKE  WinMain, EAX

        MOV     RCX, EAX    ; <<<<<<<<<<<<<< error
        CALL    WinMain

Enter statement to test: >INVOKE  ExitProcess, EAX

        MOV     RCX, EAX
        CALL    ExitProcess

Enter statement to test: >INVOKE  StdOut, OFFSET ProgramHeading

Tokenization error.

Enter statement to test: >INVOKE  StdIn, OFFSET InputBuffer, SIZEOF InputBuffer

Tokenization error.

Enter statement to test: >INVOKE  StdOut, OFFSET CRLFstr

Tokenization error.

Enter statement to test: >INVOKE  wsprintf, ADDR buffer, OFFSET CALLfmt

Tokenization error.

Enter statement to test: >INVOKE  strcmpi, OFFSET TextBuffer, [EBX].$T_entry.T_IDptr


For testing, it might be easier to use a text file with examples, like the attached one, instead of typing all the time.
Title: Re: Parsing Text file in Assembly Language
Post by: HSE on July 11, 2023, 10:07:32 PM
Quote from: jj2007 on July 11, 2023, 06:11:32 PM
Enter statement to test: >INVOKE  WinMain, EAX

        MOV     RCX, EAX    ; <<<<<<<<<<<<<< error
        CALL    WinMain

No error, that result is correct! For this FSM, EAX is a variable name.  :biggrin:

Title: Re: Parsing Text file in Assembly Language
Post by: NoCforMe on July 12, 2023, 04:41:47 AM
Thanks, Héctor. People keep throwing stuff at my poor li'l parser thinking it knows the entire universe of MASM symbols. It doesn't, just a limited subset of them. Think of how complex it would have to be in order to handle expressions like

[RDX].Table + 12
[RAX + RBX + Table]
[RAX+RBX+Table]

Ironically, my parser can handle that last one, since there are no embedded spaces; it's just another "unknown identifier" to it:

Enter statement to test: >invoke function, [RDX+RAX+Table]

        MOV     RCX, [RDX+RAX+Table]
        CALL    function

But it has no idea what all those particles within it mean.

So have I met the original challenge (mineiro's)?
Title: Re: Parsing Text file in Assembly Language
Post by: mineiro on July 12, 2023, 06:05:28 AM
I am now downloading your program.
I intend to play with your toy during this week; if I make any changes I will post them in this topic, with your permission.
Thank you, sir NoCforMe.
Title: Re: Parsing Text file in Assembly Language
Post by: jj2007 on July 12, 2023, 06:08:24 AM
Quote from: NoCforMe on July 12, 2023, 04:41:47 AM
Thanks, Héctor. People keep throwing stuff at my poor li'l parser

Invoke someproc, [RDX].Table + 12, [RAX + RBX + Table], [RAX+RBX+Table]

mov rcx, [RDX].Table + 12
mov rdx, [RAX + RBX + Table]
mov r8, [RAX+RBX+Table]
call someproc
Title: Re: Parsing Text file in Assembly Language
Post by: NoCforMe on July 12, 2023, 07:34:35 AM
That'll make it through my demo if you remove the spaces:


Enter statement to test: >Invoke someproc, [RDX].Table + 12, [RAX + RBX + Table], [RAX+RBX+Table]

Tokenization error.

Enter statement to test: >Invoke someproc, [RDX].Table+12, [RAX+RBX+Table], [RAX+RBX+Table]

        MOV     RCX, [RDX].Table+12
        MOV     RDX, [RAX+RBX+Table]
        MOV     R8, [RAX+RBX+Table]
        CALL    someproc
Title: Re: Parsing Text file in Assembly Language
Post by: jj2007 on July 12, 2023, 07:37:20 AM
No way to fix that spaces problem? After all, the token delimiter is clearly the comma...
Title: Re: Parsing Text file in Assembly Language
Post by: NoCforMe on July 12, 2023, 07:53:48 AM
Quote from: jj2007 on July 12, 2023, 07:37:20 AM
No way to fix that spaces problem? After all, the token delimiter is clearly the comma...

JJ, you really don't seem to understand what's going on here. Yes, I could "fix the spaces problem" by only recognizing the comma as the delimiter. But first the more trivial problem: that would mean that spaces would be included in any identifier, like, say, "RAX " or "varName " where the user put a space between the ID and the comma (which of course is allowed in MASM syntax). Which would mess up the formatting. (Would probably still produce valid assemble-able code, but still.)

But the more important problem is that the parser still wouldn't understand at all what the component parts of the expression are, and which sequences of them are legal and which are not. Which you can see is a non-trivial problem, one which is waaaaay beyond the scope of what was spozed to be a somewhat simple demo.

Later: I tried what you suggested, which was to allow a space to be part of an identifier--super-easy change, just change one of the jump targets in the tokenization table--but that broke the whole thing. Scratched my head for a bit, why did that happen? Wellll, because a space is a delimiter, between "invoke" and the function name. So that won't work.

The only proper way to do it would be to handle the universe of address expressions, which is enormous. Not gonna happen for this demo.
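
To make the delimiter trade-off concrete, here is a minimal Python sketch. It is a plain character splitter, not the table-driven tokenizer from the demo, and the helper name split_on is made up for illustration. It shows that comma-only splitting glues "invoke" to the function name, while space-plus-comma splitting chops the address expression into pieces.

```python
def split_on(text, delimiters):
    """Split text at any character in `delimiters`, dropping empty chunks.

    Hypothetical helper for illustration only; the demo's tokenizer is
    table-driven and also classifies each chunk.
    """
    chunks, current = [], ""
    for ch in text:
        if ch in delimiters:
            if current:
                chunks.append(current)
            current = ""
        else:
            current += ch
    if current:
        chunks.append(current)
    return chunks


line = "invoke function, [RAX + RBX]"

# Space and comma both delimit: the keyword and the function name separate
# cleanly, but the bracketed expression is broken into three pieces.
space_and_comma = split_on(line, " ,")

# Comma only: the expression survives (with a leading space), but "invoke"
# and the function name are now one chunk.
comma_only = split_on(line, ",")

print(space_and_comma)
print(comma_only)
```

Either way, the splitter still has no idea what the pieces inside the brackets mean, which is the larger problem described above.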
Title: Re: Parsing Text file in Assembly Language
Post by: zedd151 on July 12, 2023, 08:40:17 AM
Quote from: NoCforMe on July 12, 2023, 07:53:48 AM
JJ, you really don't seem to understand what's going on here. Yes, I could "fix the spaces problem" by only recognizing the comma as the delimiter. But first the more trivial problem: that would mean that spaces would be included in any identifier, like, say, "RAX " or "varName " where the user put a space between the ID and the comma
Shouldn't be too hard to do a little 'preprocessing'. I have a qe plugin (fixpunc) that removes any space(s) before a comma and places a single space after the comma & removes any extraneous spaces after the comma (if more than 1). Not that you need a qe plugin, but the algo is very simple...  :icon_idea:


src is the source buffer, dst is the destination buffer...
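fixpunc    proto :dword, :dword


    .code
    fixpunc proc src:dword, dst:dword
        mov ecx, src                ; ecx = read pointer (source)
        mov edx, dst                ; edx = write pointer (destination)
        top:
        mov al, [ecx]
        cmp al, 0                   ; end of source string?
        jz done
        cmp al, ","
        jz comma1
        mov [edx], al               ; ordinary character: copy it
        inc ecx
        inc edx
        jmp top
        comma1:
        cmp edx, dst                ; nothing written yet? then nothing to trim
        je @f
        cmp byte ptr [edx-1], 20h   ; space already written before the comma?
        jnz @f
        dec edx                     ; yes: back up over it
        jmp comma1
        @@:
        mov [edx], al               ; write the comma
        inc edx
        @@:
        inc ecx
        cmp byte ptr [ecx], 20h     ; is the comma followed by a space?
        jnz movcomm
        mov al, [ecx]
        mov [edx], al               ; yes: keep exactly one
        inc edx
        @@:
        inc ecx                     ; skip any further spaces
        cmp byte ptr [ecx], 20h
        jz @B
        jmp top
        movcomm:
        mov byte ptr [edx], 20h     ; no: insert one
        inc edx
        jmp top
        done:
        mov byte ptr [edx], 0       ; terminate the destination string
        ret
    fixpunc endp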
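
For anyone who wants to check the intended behavior without assembling anything, here is a rough Python sketch of the same normalization: drop spaces before a comma and leave exactly one space after it. The function name fix_punc is an assumption, and this is a restatement of the idea, not a line-for-line translation of the routine above.

```python
def fix_punc(src: str) -> str:
    """Remove spaces before each comma and force exactly one space after it."""
    out = []
    i = 0
    while i < len(src):
        ch = src[i]
        if ch != ",":
            out.append(ch)          # ordinary character: copy it
            i += 1
            continue
        while out and out[-1] == " ":
            out.pop()               # trim spaces already copied before the comma
        out.append(",")
        out.append(" ")             # always exactly one space after the comma
        i += 1
        while i < len(src) and src[i] == " ":
            i += 1                  # skip whatever spaces followed the comma
    return "".join(out)


print(fix_punc("INVOKE  WinMain ,EAX,   EBX"))
```

Note that spaces elsewhere (such as the two after INVOKE) are left alone, and a comma at the very end of the line still gets a trailing space, just as in the assembly version.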
Title: Re: Parsing Text file in Assembly Language
Post by: HSE on July 12, 2023, 02:02:53 PM
Quote from: NoCforMe on July 12, 2023, 07:53:48 AM
Not gonna happen for this demo.

:thumbsup:

For a further step, it can't be a big deal. It just requires another state. You can read space and comma in at least 2 different states... but I'm still in example 2  :biggrin:
Title: Re: Parsing Text file in Assembly Language
Post by: NoCforMe on July 12, 2023, 03:09:18 PM
No, it would be kind of a big deal. Here's the thing: the tokenizer in the demo (not the parser) looks for all non-numeric chunks of text as "identifiers". This includes
none of which contain spaces.

To allow constructs like [RAX + RDX].Table + 2, the tokenizer would have to be expanded to cover expressions within square brackets and arithmetic expressions. Plus the parser would have to be able to follow these sequences. You can see that this is definitely a non-trivial process.

The tokens for that expression would be
Title: Re: Parsing Text file in Assembly Language
Post by: lingo on July 12, 2023, 03:18:53 PM
Before: buffera db "Invoke someproc, [RDX ].Table + 12, [RAX + RBX + Table], [RAX+ RBX +Table ]",0,0,0,0

Result: buffera db "Invoke someproc,[RDX].Table+12,[RAX+RBX+Table],[RAX+RBX+Table]",0,0,0,0  :tongue:


.data
TableDividers db ",", ".", "+", "-", "*","/","(",")","[","]",0
buffera db "Invoke someproc, [RDX ].Table + 12, [RAX + RBX + Table], [RAX+ RBX +Table ]",0,0,0,0,0,0


.code
Parser proc

LOCAL SaveRCX :QWORD,SaveRDX: QWORD, SaveRBX: QWORD
mov    SaveRCX,rcx
mov    SaveRDX,rdx
mov    SaveRBX,rbx
               lea   rax,buffera           ; source
               mov   rbx, rax    ; dest
@@:
               mov   cl, byte ptr[rax]
               add   rax,1
               test  cl,cl   
               je    ende   
               cmp   cl,20h
               jne   @laba3
               cmp   byte ptr [rax],20h
             lea   rax,[rax+1]
               je    @b   ; Skip more spaces
       mov   ch,byte ptr [rax-3]
       sub   rax,1
               lea   rdx, TableDividers
@laba1:
               cmp   ch,byte ptr [rdx]
       je    @b
       cmp   byte ptr [rdx],0 ;not found
       lea   rdx,[rdx+1]
               jne   @laba1
       mov   ch, byte ptr [rax]
               lea   rdx, TableDividers
@laba2:
               cmp   ch,byte ptr [rdx]
       je    @b
       cmp   byte ptr [rdx],0 ;not found
       lea   rdx,[rdx+1]
               jne   @laba2
@laba3:
       mov   byte ptr [rbx],cl
               add   rbx,1
               jmp   @b
ende:
               mov   dword ptr [rbx],0
               lea   rax,buffera

mov rcx, SaveRCX
mov rdx, SaveRDX
mov rbx, SaveRBX
               ret
Parser         endp


:tongue:
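
As a reading aid, here is a rough Python sketch of the idea behind the routine above: a space is dropped whenever the character on either side of it is one of the divider characters, and kept otherwise. The names DIVIDERS and squeeze_spaces are assumptions for illustration, and this is not a line-for-line translation. In particular, the sketch collapses a run of spaces between two identifiers to a single space and drops leading and trailing spaces, which may differ from what the assembly does.

```python
# Same characters as TableDividers in the post above.
DIVIDERS = set(",.+-*/()[]")


def squeeze_spaces(text: str) -> str:
    """Drop every run of spaces that touches a divider; keep one space otherwise."""
    out = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch != " ":
            out.append(ch)
            i += 1
            continue
        j = i
        while j < n and text[j] == " ":
            j += 1                              # j = first non-space after the run
        prev_ch = out[-1] if out else ""
        next_ch = text[j] if j < n else ""
        # Keep a single space only between two non-divider characters.
        if prev_ch and next_ch and prev_ch not in DIVIDERS and next_ch not in DIVIDERS:
            out.append(" ")
        i = j
    return "".join(out)


before = "Invoke someproc, [RDX ].Table + 12, [RAX + RBX + Table], [RAX+ RBX +Table ]"
print(squeeze_spaces(before))
```

On the "Before" string from the post this gives the same text as the "Result" line, including the single space kept between Invoke and someproc.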
Title: Re: Parsing Text file in Assembly Language
Post by: NoCforMe on July 12, 2023, 03:24:56 PM
OK, well, maybe ... I can see it does work, but it's basically cheating, and isn't that only designed for that particular address expression?

Also, suggestion: some comments would help. I can't figure out from glancing at it just what the hell your code does.

But interesting. A+ for cleverness. (Oh, and I really like your animated avatar. Cuuute.)
Title: Re: Parsing Text file in Assembly Language
Post by: NoCforMe on July 12, 2023, 04:06:05 PM
Getting back to my demo, for anyone who's interested in the guts of the thing, this is the data structure that drives the whole parsing process (after tokenization):


;***** The Parsing Sequence  *****
ParsingSequence LABEL $Pnode

_pn0 $Pnode <_pn1, $T_INVOKE, NULL>
DD -1

_pn1 $Pnode <_pn2, $T_ID, StoreFname>
DD -1

_pn2 $Pnode <_pn3, $T_comma, NULL>
DD -1

_pn3 $Pnode <_pn5, $T_ID, StoreArg>
$Pnode <_pn5, $T_number, StoreArg>
$Pnode <_pn5, $T_register, StoreArg>
$Pnode <_pn4, $T_ADDR, TagArgAsADDR>
DD -1

_pn4 $Pnode <_pn5, $T_ID, StoreArg>
DD -1

_pn5 $Pnode <_pn3, $T_comma, NULL>
$Pnode <NULL, $T_EOL, NULL>
DD -1


That's all, a linked list of $Pnode structures. You can follow it through:

1. At node _pn0, see the token INVOKE, go to _pn1.
2. At node _pn1, see the token ID (function name), call StoreFname(), go to _pn2.
3. At node _pn2, see a comma, go to _pn3.
4. At node _pn3, see the token ID, call StoreArg(), go to _pn5
   see the token number, call StoreArg(), go to _pn5
   see the token register, call StoreArg(), go to _pn5
   see the token ADDR, call TagArgAsADDR(), go to _pn4.
5. At node _pn4, see the token ID, call StoreArg(), go to _pn5.
6. At node _pn5, see a comma, go back to _pn3.
    see the token EOL (end o'line), STOP (parser sees $T_EOL and ends processing).

So assuming your tokenizer gives you all the tokens your text contains, you only need to expand on this structure to do all kinds of parsing tasks (with some small stub subroutines to go along with it). That's the beauty of this method (if I don't say so myself).
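
The node walk above can be sketched in a few lines of Python for anyone who wants to step through it outside an assembler. Everything here is an assumption made for illustration: the token kinds are plain strings standing in for the $T_... constants, the node names mirror _pn0 to _pn5, and store_fname, store_arg and tag_addr are guesses at what StoreFname, StoreArg and TagArgAsADDR do, since their code isn't shown in this post.

```python
INVOKE, ID, COMMA, NUMBER, REGISTER, ADDR, EOL = (
    "INVOKE", "ID", "comma", "number", "register", "ADDR", "EOL")


def parse(tokens):
    """Walk the node table over (kind, text) tokens; return (fname, args)."""
    state = {"fname": None, "args": [], "addr_next": False}

    def store_fname(text):
        state["fname"] = text

    def store_arg(text):
        prefix = "ADDR " if state["addr_next"] else ""
        state["args"].append(prefix + text)
        state["addr_next"] = False

    def tag_addr(text):
        state["addr_next"] = True

    # node -> list of (expected token, next node, action), like the $Pnode rows
    table = {
        "pn0": [(INVOKE, "pn1", None)],
        "pn1": [(ID, "pn2", store_fname)],
        "pn2": [(COMMA, "pn3", None)],
        "pn3": [(ID, "pn5", store_arg), (NUMBER, "pn5", store_arg),
                (REGISTER, "pn5", store_arg), (ADDR, "pn4", tag_addr)],
        "pn4": [(ID, "pn5", store_arg)],
        "pn5": [(COMMA, "pn3", None), (EOL, None, None)],
    }

    node = "pn0"
    for kind, text in tokens:
        for expected, next_node, action in table[node]:
            if kind == expected:
                if action is not None:
                    action(text)
                node = next_node
                break
        else:
            # No row in the current node matched this token.
            raise SyntaxError(f"unexpected {kind} token {text!r} at node {node}")
        if node is None:            # the EOL row has no next node: stop
            return state["fname"], state["args"]
    raise SyntaxError("ran out of tokens before EOL")


tokens = [(INVOKE, "INVOKE"), (ID, "wsprintf"), (COMMA, ","),
          (ADDR, "ADDR"), (ID, "buffer"), (COMMA, ","),
          (NUMBER, "12"), (EOL, "")]
print(parse(tokens))
```

Adding a new construct means adding rows to the table and perhaps one small callback; the loop itself doesn't change.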
Title: Re: Parsing Text file in Assembly Language
Post by: HSE on July 16, 2023, 08:58:45 AM
Quote from: HSE on July 12, 2023, 02:02:53 PM
but I'm still in example 2  :biggrin:

Ok! The second example from the essay is working, in a not very impressive way  :biggrin:

$parseSuccess

Press any key to continue ...


Just the skeleton, to see how the parser works in a debugger.

Because the code is already in the text, I added a little challenge: to make a Neutral Bitness Code (Friedrich et al. syntax).

Then you can build the same code with ML or ML64, using the MASM32 SDK or the MASM64 SDK, obviously resulting in a 32-bit or 64-bit binary.

Probably Hutch would have found it funny that macros developed for 64-bits are also used in 32-bits   :smiley:
Title: Re: Parsing Text file in Assembly Language
Post by: NoCforMe on July 16, 2023, 09:19:18 AM
Wow. I'm impressed. Also honored that you took the time to actually work through that example. I take that as some kind of compliment.

So after writing this, what do you think? Was it worthwhile? Do you think you might ever actually use this for a parsing task?

It'd be cool if you did, and to see what modifications you make (besides making the code 64-bit friendly).
Title: Re: Parsing Text file in Assembly Language
Post by: HSE on July 16, 2023, 09:40:27 AM
 :thumbsup:

Quote from: NoCforMe on July 16, 2023, 09:19:18 AM
So after writing this

It's just a first step to see how that works  :biggrin:
Title: Re: Parsing Text file in Assembly Language
Post by: jj2007 on July 16, 2023, 10:57:03 AM
Quote from: HSE on July 16, 2023, 08:58:45 AM
Probably Hutch would have found it funny that macros developed for 64-bits are also used in 32-bits   :smiley:

Yeah, funny (https://masm32.com/board/index.php?topic=10958.0), isn't it :mrgreen: