Parsing Text file in Assembly Language

Started by NoCforMe, July 07, 2023, 12:27:01 PM


NoCforMe

Actually it's no more complex than your original spec, except for handling hex #s, which is only slightly more complicated.

Here's my FSA diagram for the tokenizer. This is probably the most important part of the process--drawing the tokenizer. It needs to be done on paper. All the code you'll write will come from this picture.

Should be self-explanatory. This will become a subroutine called GetNextToken(). The top part (nodes A, 1 and B) tokenizes identifiers. Nodes G and H return EOL and comma respectively. The rest of it is for numbers (decimal or hex).
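
For readers without the attachment, here is a rough sketch of what such a tokenizer does, written in Python rather than assembly so it can be run anywhere. The diagram is not reproduced in this thread, so the token names and the exact transitions below are assumptions, not a transcription of the nodes in the picture.

```python
import string

IDENT_START = set(string.ascii_letters + "_")
IDENT_CHARS = IDENT_START | set(string.digits)
HEX_DIGITS = set(string.hexdigits)

def get_next_token(text, pos):
    """Return (kind, value, new_pos) for the token at or after pos."""
    # Skip spaces and tabs between tokens.
    while pos < len(text) and text[pos] in " \t":
        pos += 1
    if pos >= len(text) or text[pos] in "\r\n":
        return ("EOL", None, pos)
    ch = text[pos]
    if ch == ",":
        return ("COMMA", ",", pos + 1)
    if ch in IDENT_START:
        # Identifier: a letter or underscore, then letters, digits, underscores.
        end = pos
        while end < len(text) and text[end] in IDENT_CHARS:
            end += 1
        return ("ID", text[pos:end], end)
    if ch in string.digits:
        # Number: collect hex digits, then decide by the optional "h" suffix.
        end = pos
        while end < len(text) and text[end] in HEX_DIGITS:
            end += 1
        digits = text[pos:end]
        if end < len(text) and text[end] in "hH":
            return ("NUMBER", int(digits, 16), end + 1)
        if digits.isdigit():
            return ("NUMBER", int(digits), end)
        raise ValueError("bad number: " + digits)
    raise ValueError("unexpected character: " + ch)

def tokenize(text):
    """Call get_next_token repeatedly until EOL is returned."""
    tokens, pos = [], 0
    while True:
        kind, value, pos = get_next_token(text, pos)
        tokens.append((kind, value))
        if kind == "EOL":
            return tokens
```

Note that this sketch, like the diagram, knows nothing about which token may follow which; it only splits the text.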
Assembly language programming should be fun. That's why I do it.

mineiro

yes, like that.
Tabs (09h) and spaces (20h) inside this invoke logic (parser) should be treated as characters to ignore (just advance the pointer). The symbol "," will be the token separator.
Inside an invoke context, if the last valid character before the EOL was "\", continue on the next line (ignore that EOL) until a line does not end in "\" (the real EOL).
Some people write:
invoke function,\               ;some comments, there's tabs and spaces separating this
        addr something,\        ;another
        1,\
        2,\
        3

That's all. If a user types:
db "invoke something"
then that is outside this "scope". The other scope will treat invoke as a literal because of the quotes (or symbol precedence), so there is nothing to do here.
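
The rules above can be sketched roughly as follows, in Python for brevity. The function names are made up for this illustration, and the quote handling is an assumption (double quotes only, no escapes); it is not taken from any attachment in this thread.

```python
def strip_comment(line):
    """Cut the line at the first ';' that is not inside double quotes."""
    in_quotes = False
    for i, ch in enumerate(line):
        if ch == '"':
            in_quotes = not in_quotes
        elif ch == ";" and not in_quotes:
            return line[:i]
    return line

def logical_lines(lines):
    """Join physical lines that end in a backslash into one logical line."""
    pending = ""
    for raw in lines:
        text = strip_comment(raw).strip(" \t\r\n")
        if text.endswith("\\"):
            # Continuation: drop the backslash and keep reading.
            pending += text[:-1].rstrip(" \t") + " "
            continue
        yield pending + text
        pending = ""
    if pending:
        # Input ended on a continuation line; return what was collected.
        yield pending.rstrip()

def is_invoke(line):
    """True only when the first word of the line is the invoke keyword."""
    words = line.split(None, 1)
    return bool(words) and words[0].lower() == "invoke"
```

With this, the five-line example collapses to a single `invoke function, addr something, 1, 2, 3`, while `db "invoke something"` is not treated as an invoke statement because its first word is `db`.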
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

NoCforMe

Here's a demo tokenizer, attached below. It's a console program that prompts you for a statement, then tokenizes it (or at least attempts to), and displays the results. Was kinda fun to make and to use. Try it out.

You'll notice that it does a great job of recognizing tokens, but it's dumb as a stump when it comes to making any kind of sense out of what you type. All of the following will successfully pass through it:

  • invoke function, addr sam, edx, 0ffffh, addr var
  • sam invoke eax, , a box of pills
  • eax 2 3 4 invoke,
That's because it's just a dumb tokenizer that doesn't know anything about the "grammar" of the statement it's working on, meaning what needs to go where for the statement to make sense. That'll happen during the next task. All it knows is how to break apart the statement into its "atomic" parts. But we've accomplished a lot so far. Stay tuned.

The two important things to look at here in the source are the tokenization table (TokenParseTbl) and associated data, and the GetNextToken() subroutine which does the tokenization. The rest of the stuff is support functions: getting characters out of the buffer, matching IDs, converting #s, uppercasing characters, etc.

NoCforMe

OK, last thing for tonight: here's the parsing sequence for all those tokens we can now extract. This defines the "grammar" of the statement. I've followed Hector's "challenge" here, but expanded a bit on the last 3 arguments, which can be any of a variable name, the ADDR of a variable, a number (decimal or hex), or the name of a register. It's very easy to accommodate these without a ton of spaghetti code, since this thing is data-driven as you'll see. But first I have to get some sleep ...

Again, it's very important that you put this down on paper first before typing one single character into your code.
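
As a rough illustration of what "data-driven" can mean here, the sketch below (Python, not the attached assembly) walks a list of expected steps instead of hard-coding a chain of comparisons. The step names, the token kinds and the `parse_invoke` function are assumptions made for this example; the sequence follows the template INVOKE function, ADDR var1, arg, arg, arg as described in this thread.

```python
# One entry per position in the statement. "ARG" is a placeholder that
# accepts an identifier, a number, or ADDR followed by an identifier.
STEPS = ["INVOKE", "FUNC", "COMMA", "ADDR_ARG",
         "COMMA", "ARG", "COMMA", "ARG", "COMMA", "ARG", "EOL"]

def parse_invoke(tokens):
    """tokens is a list of (kind, value) pairs; returns (function, args)."""
    pos = 0
    function, args = None, []

    def peek():
        return tokens[pos][0] if pos < len(tokens) else "nothing"

    def take(kind):
        nonlocal pos
        if peek() != kind:
            raise SyntaxError(f"expected {kind}, found {peek()}")
        pos += 1
        return tokens[pos - 1][1]

    for step in STEPS:
        if step == "FUNC":
            function = take("ID")
        elif step == "ADDR_ARG":
            take("ADDR")
            args.append(("addr", take("ID")))
        elif step == "ARG":
            if peek() == "ADDR":
                take("ADDR")
                args.append(("addr", take("ID")))
            elif peek() == "NUMBER":
                args.append(("number", take("NUMBER")))
            else:
                args.append(("value", take("ID")))
        else:
            take(step)  # fixed tokens: INVOKE, COMMA, EOL
    return function, args
```

Changing the accepted statement shape then means editing `STEPS`, not the loop.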

jj2007

My take on the tokeniser (full source of 76 lines attached). There is a text file with samples attached, edit at your convenience; it contains lines like this:

InVoke someApi123, 111h, eax, ADDR mytext, 123,456,   addr MyVar,addr    MyVar2  ; comment

Here is the tokeniser:

GetToken proc ; expects string in esi
  .Repeat
    lodsb
  .Until al>="@" || (al>="0" && al<="9") || al==59
  dec esi ; esi on someApi
  mov eax, esi
  .Repeat
    inc eax
    mov dl, [eax]
  .Until dl=="," || !dl || dl==59 || dl==13 ; 59=comment
  push edx ; last byte (could be zero or carriage return)
  .Repeat
    dec eax
    mov dl, [eax]
  .Until dl>="@" || (dl>="0" && dl<="9") ; eax points to trailing bytes
  inc eax
  push eax ; addr last valid byte
  sub eax, esi
  xchg eax, ecx
  Let t$(tCt)=Left$(esi, ecx)
  inc tCt
  pop esi
  pop eax
  and eax, 11110010b ; stop for Cr=1101b and nullbyte but not for comma or comment
  ret
GetToken endp
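
For anyone who does not read the high-level MASM/MasmBasic syntax easily, here is a rough Python equivalent of what the routine appears to do: cut the text at a comment or carriage return, split at commas, and trim each piece. It assumes the caller has already stepped past the invoke keyword, and the function names are invented for this sketch. The second function shows why the final `and eax, 11110010b` works as a "keep going" test on the terminating byte.

```python
STOP_MASK = 0b11110010

def split_tokens(line):
    """Split an argument list at commas, ignoring a trailing comment."""
    for stop in (";", "\r"):
        line = line.split(stop, 1)[0]
    return [part.strip() for part in line.split(",") if part.strip()]

def keeps_going(terminator):
    """Non-zero after masking means 'more tokens may follow'."""
    return (ord(terminator) & STOP_MASK) != 0

# Comma is 2Ch and semicolon is 3Bh: both keep bits under the mask.
# Carriage return is 0Dh (1101b) and the null byte is 0: both mask to zero.
```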


Sample output (yes, the assembler would throw two errors):
-------- Sample text: --------
invoke MyAlgo, eax, 123, 456

mov rcx, eax
mov rdx, 123
mov r8, 456
call MyAlgo

-------- Sample text: --------
InVoke someApi123, 111h, eax, ADDR mytext, 123,456,   addr MyVar,addr    MyVar2  ; comment
line two

mov rcx, 111h
mov rdx, eax
lea r8, mytext
mov r9, 123
push 456
lea rax, MyVar
push rax
lea rax, MyVar2
push rax
call someApi123
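
The mapping from tokens to the output above can be sketched like this (Python; `emit_call` is a name made up for the illustration). It deliberately mirrors the sample output only: the first four arguments go to rcx, rdx, r8 and r9, the rest are pushed in source order, and ADDR becomes lea. It is not a complete Win64 call sequence, since it reserves no shadow space and does nothing about stack alignment or argument order on the stack.

```python
REG_ARGS = ["rcx", "rdx", "r8", "r9"]

def emit_call(function, args):
    """Return the list of instruction lines for one invoke statement."""
    out = []
    for i, arg in enumerate(args):
        words = arg.split()
        is_addr = len(words) == 2 and words[0].lower() == "addr"
        if i < len(REG_ARGS):
            reg = REG_ARGS[i]
            if is_addr:
                out.append(f"lea {reg}, {words[1]}")
            else:
                out.append(f"mov {reg}, {arg}")
        elif is_addr:
            # No "push addr" form: load the address into rax first.
            out.append(f"lea rax, {words[1]}")
            out.append("push rax")
        else:
            out.append(f"push {arg}")
    out.append(f"call {function}")
    return out
```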

HSE

Quote from: NoCforMe on July 09, 2023, 06:27:30 AM
One thing to realize is how easily this code flows from the diagram.

Mmm  :biggrin:

How it move from B to C? (first example)
Equations in Assembly: SmplMath

mineiro

Quote from: jj2007 on July 09, 2023, 08:22:26 PM
Sample output (yes, the assembler would throw two errors):
Or maybe three:

mov rcx, eax

line two

push 456
push rax
push rax
3 items pushed onto the stack (dword, qword, qword); the rsp register will be unaligned before the call instruction, supposing expansion to 3 qwords.

jj2007

Quote from: mineiro on July 10, 2023, 02:31:24 AM
Quote from: jj2007 on July 09, 2023, 08:22:26 PM
Sample output (yes, the assembler would throw two errors):
Or maybe three:

mov rcx, eax

line two

push 456
push rax
push rax
3 items pushed onto the stack (dword, qword, qword); the rsp register will be unaligned before the call instruction, supposing expansion to 3 qwords.

- The "line two" was just a test if the tokeniser recognises a carriage return correctly.
- A push 123 is not DWORD, it's qword.
- Re stack alignment, in real code, a CreateWindowEx looks like this:

invoke CreateWindowEx, WS_EX_CLIENTEDGE, Chr$("RichEdit20A"), NULL, reStyle, 0, 0, 1, 1, hWnd, ID_EDIT, wcx.hInstance, NULL

000000014000117C | 48:836424 58 00            | and [rsp+58],0                  |
0000000140001182 | 4C:8B15 C31E0000           | mov r10,[14000304C]             |
0000000140001189 | 4C:895424 50               | mov [rsp+50],r10                |
000000014000118E | 48:C74424 48 6F000000      | mov [rsp+48],6F                 | 6F:'o'
0000000140001197 | 44:8B55 10                 | mov r10d,[rbp+10]               |
000000014000119B | 4C:895424 40               | mov [rsp+40],r10                |
00000001400011A0 | 48:C74424 38 01000000      | mov [rsp+38],1                  |
00000001400011A9 | 48:C74424 30 01000000      | mov [rsp+30],1                  |
00000001400011B2 | 48:836424 28 00            | and [rsp+28],0                  |
00000001400011B8 | 48:836424 20 00            | and [rsp+20],0                  |
00000001400011BE | 41:B9 C4013050             | mov r9d,503001C4                |
00000001400011C4 | 45:33C0                    | xor r8d,r8d                     | r8d:"flags=3"
00000001400011C7 | 48:8D15 E11E0000           | lea rdx,[1400030AF]             | 00000001400030AF:"RichEdit20A"
00000001400011CE | B9 00020000                | mov ecx,200                     |
00000001400011D3 | FF15 A3330000              | call [<&CreateWindowExA>]       |
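
The pattern in that listing is the usual Win64 layout: four register arguments, then 8-byte stack slots starting at [rsp+20h], above the 32 bytes of shadow space. A small Python sketch of the arithmetic follows; the function names are invented here, and the rounding in `call_area_size` assumes rsp was 16-byte aligned before the area was reserved.

```python
REG_ARGS = ["rcx", "rdx", "r8", "r9"]

def arg_locations(count):
    """Where each of `count` integer/pointer arguments is placed."""
    locs = []
    for i in range(count):
        if i < 4:
            locs.append(REG_ARGS[i])
        else:
            # Fifth argument sits just above the 20h bytes of shadow space.
            locs.append(f"[rsp+{0x20 + 8 * (i - 4):X}]")
    return locs

def call_area_size(count):
    """Bytes to reserve: one slot per argument (minimum 4), rounded up to 16."""
    size = max(count, 4) * 8
    return (size + 15) // 16 * 16
```

For the 12 arguments of CreateWindowEx this gives stack slots from [rsp+20] to [rsp+58], which matches the stores in the disassembly.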

NoCforMe

OK, I see I have some competition here. I'll persist in my project nonetheless.

This is a learning experience for me as well; going over my design, I can see some flaws that should be corrected (in the next version):

  • Number handling: since I'm converting all #s (decimal or hex) to binary, they're going to look different from what the original coder wrote (plus I can't handle negative #s). In the next version, numbers will simply be treated as text (e.g., -1 or 0FFFFh or whatever), after validation.
  • Since registers are treated as "unknown identifiers", it's possible to have nonsensical constructs like ADDR rsi. Easily fixed by adding all known registers to the ID match list and creating a new token type, say $T_register.
Probably there are other issues I haven't discovered yet.
Hopefully later today I'll have a full parser coded and posted here. We'll see.
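
A rough sketch of those two fixes, in Python for brevity: numbers are validated but kept as text, and register names get their own token class so that ADDR rsi can be rejected. The register list is only a subset, the identifier rule is simplified, and only the number forms mentioned in this thread (decimal, hex with "h", binary with "b", optional minus sign) are handled; all names are assumptions made for the example.

```python
import re

# Hex must start with a decimal digit (0FFFFh, not FFFFh), as in MASM source.
NUMBER_RE = re.compile(r"-?(?:[0-9]+|[0-9][0-9A-Fa-f]*[hH]|[01]+[bB])")
IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

REGISTERS = {
    "eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp",
    "rax", "rbx", "rcx", "rdx", "rsi", "rdi", "rbp", "rsp",
    "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15",
}

def classify(text):
    """Return 'register', 'number', 'identifier' or 'invalid'."""
    if text.lower() in REGISTERS:
        return "register"
    if NUMBER_RE.fullmatch(text):
        return "number"  # the caller keeps `text` unchanged for output
    if IDENT_RE.fullmatch(text):
        return "identifier"
    return "invalid"

def addr_operand_ok(text):
    """ADDR only makes sense in front of an identifier."""
    return classify(text) == "identifier"
```

Because the number is never converted, 0FFFFh comes out exactly as it was typed, and -1 needs no special handling beyond the optional sign in the pattern.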

jj2007

Quote from: NoCforMe on July 10, 2023, 05:38:38 AM
it's possible to have nonsensical constructs like ADDR rsi.

Right, but that's going one step further: a syntax check, plus error messages...

NoCforMe

Here's the parsing demo, attached below.

It works, and I believe it meets Hector's challenge (he'll have to be the judge of that). As I said, there are some problems with it, and certainly room for refinement. I'll put out another version soon that addresses at least some of these issues.

Notice that when entering any of the last 3 arguments, they can be:

  • the name of a variable
  • the address of a variable (ADDR varname)
  • a decimal or hexadecimal number (hex numbers have "h" appended, of course)
Notice that it correctly handles the difference between a variable value and its address:

varname--> MOV [register], varname
ADDR var--> LEA [register], varname


Also, with Hector's permission, I would allow the first parameter to be any of these things. I don't see any reason why it should be limited to ADDR varname. (But then I don't do 64-bit programming.)

I don't like the way it handles numbers. If you enter a # in hex, it spits it out in decimal. The correct way to handle this will be to preserve the original text the user entered, including negative numbers, which will require changes to the tokenizer. That will also simplify the code by eliminating the numeric conversion routines.

* I learned something from this, which is that MASM accepts negative hex numbers. I didn't know this, as I've never used them.

One problem is that you can still enter nonsensical constructs, like ADDR RSI, because it doesn't know a register name from a hole in the ground. So I'll add all the register names to the list of known IDs and tag them as a register type to avoid this problem.

Let me know whatcha think.

jj2007

Enter statement to test: >InVoke someApi123, 111h, eax, ADDR mytext, 123,456,   addr MyVar

Tokenization error.

Enter statement to test: >invoke someApi, 111h, eax

Tokenization error.


What's wrong?

The previous version (INVOKEtokenizer) works fine:
Enter statement to test: >invoke MyAlgo, eax, 123, 456

TOKEN: INVOKE
TOKEN: Unknown ID: "MyAlgo"
TOKEN: comma
TOKEN: Unknown ID: "eax"
TOKEN: comma
TOKEN: number (123)
TOKEN: comma
TOKEN: number (456)
TOKEN: EOL. Tokenization completed successfully!

NoCforMe

Quote from: HSE on July 10, 2023, 01:40:36 AM
Quote from: NoCforMe on July 09, 2023, 06:27:30 AM
One thing to realize is how easily this code flows from the diagram.

Mmm  :biggrin:

How it move from B to C? (first example)

Sorry I missed your question before. You're talking about that trivial 1st example, right? If you're on node B, seeing a numeric character (#) will move you to node C.

Be sure to check out the parsing demo I just posted.

NoCforMe

Quote from: jj2007 on July 10, 2023, 09:52:48 AM
Enter statement to test: >InVoke someApi123, 111h, eax, ADDR mytext, 123,456,   addr MyVar

Tokenization error.

Enter statement to test: >invoke someApi, 111h, eax

Tokenization error.


What's wrong?

Sorry, should have been more explicit.

I'm following Hector's challenge pretty closely, where the first argument must be ADDR var. Try that. You also have to give 4 arguments, the last 3 of which can be any of what I described above.

Here's the template:

INVOKE function, ADDR var1, var2, var3, var4

There's no reason it has to work that way; I'm just following the format of the challenge. I'll modify it to accept any number of arguments.

Additional enhancement: For the sake of completeness, it should handle binary #s too (10100011B). Version after the next, maybe.

jj2007

Never mind, I also just discovered a little glitch in my version :tongue:

invoke CreateWindowEx, WS_EX_CLIENTEDGE, Chr$("RichEdit20A"), NULL, reStyle, 0, 0, 1, 1, hWnd, ID_EDIT, wcx.hInstance, NULL

mov rcx, WS_EX_CLIENTEDGE
mov rdx, Chr$("RichEdit20A
mov r8, NULL
mov r9, reStyle
push 0
push 0
push 1
push 1
push hWnd
push ID_EDIT
push wcx.hInstance
push NULL
call CreateWindowEx


Correct version attached.