Parsing Text file in Assembly Language

Started by NoCforMe, July 07, 2023, 12:27:01 PM


HSE

Hi David,

Quote from: NoCforMe on July 08, 2023, 02:02:57 PM
Hey, I'll make this offer: if anyone (including the OP) posts the spec for something they want parsed, I'll create a parser for it using my scheme and post the code here.

I have a request  :biggrin:

Can you post the complete first example? From what I'm reading, the flow gets stuck forever in the last column:
gt20:
    jmp qword ptr [rbx + TokenParseTbl + $tokenAnyOffset]
Equations in Assembly: SmplMath

NoCforMe

Quote from: jj2007 on July 08, 2023, 07:22:12 PM
Quote from: NoCforMe on July 08, 2023, 02:02:57 PM
Hey, I'll make this offer: if anyone (including the OP) posts the spec for something they want parsed, I'll create a parser for it using my scheme and post the code here. (Assuming it's not too complicated!) Be sure to specify exactly the text you need to be interpreted.
I can offer a haystack, i.e. a fat text to be parsed: http://www.jj2007.eu/Bible.zip

Quote from: jj2007 on April 28, 2023, 04:35:21 PM
UnzipFile:

include \masm32\MasmBasic\MasmBasic.inc
  Init
  UnzipInit "http://www.jj2007.eu/Bible.zip" ; file or URL
  UnzipFile(0, "C:\Masm32") ; extract C:\Masm32\Bible.txt
EndOfCode


OK, that's fine, but what's the particular needle you're searching for in that haystack?
Assembly language programming should be fun. That's why I do it.

jj2007

Quote from: NoCforMe on July 09, 2023, 05:22:29 AM
OK, that's fine, but what's the particular needle you're searching for in that haystack?

For example, all phrases that start with "Satan"?

The problem here is that the OP has lost interest, and we are just playing around.

NoCforMe

Quote from: jj2007 on July 09, 2023, 05:33:40 AM
Quote from: NoCforMe on July 09, 2023, 05:22:29 AM
OK, that's fine, but what's the particular needle you're searching for in that haystack?

For example, all phrases that start with "Satan"?

OK, I may tackle that after I post the example that Hector requested.

Quote from: jj2007
The problem here is that the OP has lost interest, and we are just playing around.

Hmm, not really a problem. We're allowed to play around if we like.
After all, didn't you say that assembly should be fun?

NoCforMe

Quote from: HSE on July 09, 2023, 03:24:52 AM
Hi David,

Quote from: NoCforMe on July 08, 2023, 02:02:57 PM
Hey, I'll make this offer: if anyone (including the OP) posts the spec for something they want parsed, I'll create a parser for it using my scheme and post the code here.

I have a request  :biggrin:

Can you post the complete first example? From what I'm reading, the flow gets stuck forever in the last column:
gt20:
    jmp qword ptr [rbx + TokenParseTbl + $tokenAnyOffset]


Here's the first example fleshed out. I left out some code that's needed (string matching and numeric conversion), but this should give the general idea. (You also need to provide GNC(), but this is trivial: it just gets the next byte from the input text.)


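$exitCodeSuccess EQU 0
$exitCodeError EQU -1

; Character constants used below. These were not defined in the posted code;
; $EOL = line feed is an assumption (the tokenizer discards carriage returns).
$tab EQU 9
$CR EQU 13
$EOL EQU 10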

.data

TextBufferPtr DD ?

;====================================
; The Tokenization Table
;
; This is what drives the whole process
;====================================

TokenParseTbl LABEL DWORD

;  [#]   =     ,     EOL   [any]
; ------------------------------------------
DD _pnX, _pnX, _pnX, _pnX, _pnA ;0
DD _pnX, _pnB, _pnX, _pnX, _pn1 ;1
DD _pnC, _pnX, _pnX, _pnX, _pnX ;2
DD _pn3, _pnX, _pnD, _pnX, _pnX ;3
DD _pnE, _pnX, _pnX, _pnX, _pnX ;4
DD _pn5, _pnX, _pnX, _pnF, _pnX ;5

TokenParseChars DB '=', ',', $EOL
$numParseChars EQU $ - TokenParseChars

$tokenRow1 EQU ($numParseChars + 2) * 4 ;DWORD offset
$tokenRow2 EQU $tokenRow1 * 2
$tokenRow3 EQU $tokenRow1 * 3
$tokenRow4 EQU $tokenRow1 * 4
$tokenRow5 EQU $tokenRow1 * 5

$tokenAnyOffset EQU ($numParseChars + 1) * 4

TextBuffer DB 256 DUP(?)


.code

;====================================
; Tokenizer()
;
; Reads the text stream, breaks it into tokens.
; On return from a successful tokenization,
; the values "x" and "y" are stored.
;
; Returns:
; $exitCodeSuccess or
; $exitCodeError
;====================================

Tokenizer PROC

XOR EBX, EBX ;Start @ row 0.

parse0: CALL GNC ;Get next char. from buffer.
CMP AL, $CR ;Throw away carriage returns.
JE parse0
CMP AL, $tab ;Magically turn tabs into spaces.
JNE gt3
MOV AL, ' '

; 1. Weed out numeric digits:
gt3: CMP AL, '0'
JB gt10
CMP AL, '9'
JA gt10
JMP DWORD PTR [EBX + TokenParseTbl] ;[#] is 1st col. in  table row.

; 2. Try to match any parsing characters:
gt10: LEA EDI, TokenParseChars
MOV EDX, EDI ;Save pointer to list of chars.
MOV ECX, $numParseChars
REPNE SCASB
JNE gt20 ;No match, go to "any" col.

SUB EDI, EDX ;Get offset from start of list.
SHL EDI, 2 ;Convert to DWORD offset.
JMP DWORD PTR [EBX + EDI + TokenParseTbl]

; No match, so go to "any char." column:
gt20: JMP DWORD PTR [EBX + TokenParseTbl + $tokenAnyOffset]


;====================================
; Tokenizing Nodes
;====================================

;*****  Set up for ID storage:  *****
_pnA LABEL NEAR
MOV TextBufferPtr, OFFSET TextBuffer
MOV EBX, $tokenRow1

; ... fall through to ...

;***** Store ID char.:  *****
_pn1 LABEL NEAR
MOV EDX, TextBufferPtr
MOV [EDX], AL
INC TextBufferPtr
JMP parse0

_pn2 LABEL NEAR ;"Do-nothing" node.
JMP parse0

;***** Match ID against string ("hotspot"):  *****
_pnB LABEL NEAR

; (insert code to do string matching here)
; Will also have to return an error if the string doesn't match "hotspot".
MOV EBX, $tokenRow2
JMP parse0

;*****  Set up for # storage:  *****
; (we use the same buffer for text and # storage)
_pnC LABEL NEAR
MOV TextBufferPtr, OFFSET TextBuffer
MOV EBX, $tokenRow3

; ... fall through to ...

;***** Store # char.:  *****
; (notice this is the same as _pn1; we could re-use that node
; and save some space, but let's keep this for clarity)
_pn3 LABEL NEAR
MOV EDX, TextBufferPtr
MOV [EDX], AL
INC TextBufferPtr
JMP parse0

;***** Save "x" value:  *****
_pnD LABEL NEAR

; (insert code to convert ASCII digits to binary and store it
;  as the value for "x")
MOV EBX, $tokenRow4
JMP parse0

;*****  Set up for # storage:  *****
_pnE LABEL NEAR
MOV TextBufferPtr, OFFSET TextBuffer
MOV EBX, $tokenRow5

; ... fall through to ...

;***** Store # char.:  *****
_pn5 LABEL NEAR
MOV EDX, TextBufferPtr
MOV [EDX], AL
INC TextBufferPtr
JMP parse0

;***** Save "y" value, exit w/success:  *****
_pnF LABEL NEAR

; (insert code to convert ASCII digits to binary and store it
;  as the value for "y")
MOV EAX, $exitCodeSuccess
RET

;***** Return error:  *****
_pnX LABEL NEAR
MOV EAX, $exitCodeError
RET

Tokenizer ENDP


One thing to realize is how easily this code flows from the diagram. Look at the diagram for that 1st example, and you should be able to see how the code for the tokenization "nodes" follows it. That's what I really like about this method, it kind of writes itself.
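
The listing above calls GNC() and leaves the digit conversion in _pnD and _pnF as placeholders. As a rough sketch only (this is not code from the original post), the two helpers might look something like this. InputPtr and AsciiToDword are hypothetical names. The sketch assumes the input text is zero-terminated, that $EOL is defined as a single byte such as line feed (10), and that the stored digits fit in 32 bits, since there is no overflow check.

```asm
.data
InputPtr DD ?               ;Hypothetical: address of the next unread input byte.

.code
;====================================
; GNC() sketch (assumes zero-terminated input).
; Returns the next character in AL. At the end of the text
; it keeps returning $EOL without advancing, so every table
; row ends up in _pnF or _pnX instead of looping forever.
; Touches only EAX and EDX, so EBX (the row offset) survives.
;====================================
GNC PROC
    MOV   EDX, InputPtr
    MOVZX EAX, BYTE PTR [EDX]   ;AL = next byte, rest of EAX cleared.
    TEST  AL, AL
    JZ    gncEnd                ;Zero terminator: report end of line.
    INC   InputPtr
    RET
gncEnd:
    MOV   AL, $EOL
    RET
GNC ENDP

;====================================
; AsciiToDword() sketch.
; Converts the decimal digits stored from TextBuffer up to
; (not including) TextBufferPtr. Returns the value in EAX.
; Clobbers ECX and EDX; no overflow check.
;====================================
AsciiToDword PROC
    MOV   ECX, OFFSET TextBuffer
    XOR   EAX, EAX              ;Running value.
cvtLoop:
    CMP   ECX, TextBufferPtr
    JAE   cvtDone               ;Reached the end of the stored digits.
    MOVZX EDX, BYTE PTR [ECX]
    SUB   EDX, '0'              ;ASCII digit -> 0..9.
    LEA   EAX, [EAX + EAX*4]    ;EAX = EAX * 5
    LEA   EAX, [EDX + EAX*2]    ;EAX = EAX * 10 + digit
    INC   ECX
    JMP   cvtLoop
cvtDone:
    RET
AsciiToDword ENDP
```

Before calling Tokenizer, the caller would point InputPtr at the text to parse. Inside _pnD and _pnF, a CALL AsciiToDword followed by a store of EAX to whatever variable holds "x" or "y" would fill the placeholder, because both nodes set EBX again afterwards and the main loop reloads ECX, EDX and EDI on every pass.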

jj2007

Ok, there are no phrases that start with Satan in Bible.txt, so I used The Lord instead :biggrin:

Source:
include \masm32\MasmBasic\MasmBasic.inc
  Init
  Let esi=FileRead$("bible.txt")
  xor ecx, ecx
  .While 1
    inc ecx ; increase start position
    .Break .if !Instr_(ecx, esi, "The Lord", 4) ; full word, case-sensitive
    mov edi, eax ; save start
    lea ecx, [edx+8] ; advance index
    .Repeat
      dec eax
      mov dl, [eax]
    .Until dl==33 || dl=="?" || dl=="." || dl==10 || dl>"@" ; end.!? Satan
    .if Zero?
      mov eax, edi ; start
      .Repeat
        inc eax
        mov dl, [eax]
      .Until dl==33 || dl=="?" || dl=="." || dl==13 || !dl ; end of phrase
      sub eax, edi
      inc eax ; get the dot, too
      .if eax>80
        m2m eax, 80 ; too long for display
      .endif
      PrintLine Left$(edi, eax)
    .endif
  .Endw
  Inkey "done"
EndOfCode


Output:
The Lord work a care and conscience in us to know Him and serve Him, that we may
The Lord of heaven and earth bless Your Majesty with many and happy days, that,
The Lord shall laugh at him: for he seeth that his day is coming.
The Lord gave the word: great [was] the company of those that published [it].
The Lord said, I will bring again from Bashan, I will bring [my people] again fr
The Lord at thy right hand shall strike through kings in the day of his wrath.
The Lord sent a word into Jacob, and it hath lighted upon Israel.
The Lord GOD hath given me the tongue of the learned, that I should know how to
The Lord GOD hath opened mine ear, and I was not rebellious, neither turned away
The Lord GOD which gathereth the outcasts of Israel saith, Yet will I gather [ot
The Lord hath trodden under foot all my mighty [men] in the midst of me: he hath
The Lord hath swallowed up all the habitations of Jacob, and hath not pitied: he
The Lord was as an enemy: he hath swallowed up Israel, he hath swallowed up all
The Lord hath cast off his altar, he hath abhorred his sanctuary, he hath given
The Lord GOD hath sworn by his holiness, that, lo, the days shall come upon you,
The Lord GOD hath sworn by himself, saith the LORD the God of hosts, I abhor the
The Lord then answered him, and said, [Thou] hypocrite, doth not each one of you
The Lord [is] at hand.
The Lord [be] with you all.
The Lord give mercy unto the house of Onesiphorus; for he oft refreshed me, and
The Lord grant unto him that he may find mercy of the Lord in that day: and in h
The Lord Jesus Christ [be] with thy spirit.
The Lord knoweth how to deliver the godly out of temptations, and to reserve the
The Lord is not slack concerning his promise, as some men count slackness; but i
done

HSE

Quote from: NoCforMe on July 09, 2023, 06:27:30 AM
it kind of writes itself.

:biggrin: I'm very used to "a mess of conditional statements"

Thanks, that will help  :thumbsup:

HSE

NoCforMe

JJ, try to think of a better example. What you showed is basically a text search; how about something where you're extracting values from a text construct, along the lines of my "hotspot=<x>,<y>", but a little less trivial. Something I can get my parsing teeth into.

jj2007

Sorry, I don't understand what you mean - maybe you should give it a try?

Honestly, I am curious to see a concrete example of your approach.

HSE

 :biggrin: That looks better in code:
;*****  Set up for ID storage:  *****
_pnA LABEL NEAR
    lea rcx, TextBuffer       
    mov TextBufferPtr, rcx
    mov rbx, $tokenRow1

; ... fall through to ...


because the article says:
***** Set up for ID storage: *****
_pnA LABEL NEAR
MOV TextBufferPtr, OFFSET TextBuffer
MOV EBX, $tokenRow1
JMP parse0

which made me lose the path  :rolleyes:

NoCforMe

Yes, that is a little misleading. I should probably provide the entire example in the article to make that more clear.

mineiro

I have a challenge for you, sir NoCforMe. I understand how powerful what you posted is. This could be a great project for people who want to create plugins for text editors.

Translate the line below:
invoke function, addr something, 1, 2, 3

to strings:
mov r9,3
mov r8,2
mov rdx,1
lea rcx, addr something
call function

Rules:
An argument prefixed with addr inside the invoke should be translated with "lea" instead of "mov".
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

mineiro

Really sorry, I can't edit my post to fix my error.
So, change:
lea rcx, addr something

to:
lea rcx, something

Well, 7 beer effect.
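
The expected output matches the Microsoft x64 register order for the first four arguments (rcx, rdx, r8, r9), emitted from last to first. One possible way to hold that mapping is a small table of register-name strings plus a helper that picks "lea" or "mov" according to the addr rule. All names here are hypothetical, and this is only a sketch of the data a translator might use, not a full translator. It assumes a 32-bit MASM32 build with .model flat, stdcall in effect.

```asm
.data
; Register names for arguments 1..4, in argument order.
szRCX DB "rcx", 0
szRDX DB "rdx", 0
szR8  DB "r8", 0
szR9  DB "r9", 0

; ArgRegTbl[n] = pointer to the register name for argument n (0-based).
; A generator would walk this table from the last argument down to 0.
ArgRegTbl DD OFFSET szRCX, OFFSET szRDX, OFFSET szR8, OFFSET szR9

szMov DB "mov ", 0
szLea DB "lea ", 0

.code
;====================================
; GetArgMnemonic() sketch.
; isAddr is nonzero if the argument was written with ADDR.
; Returns in EAX a pointer to "lea " or "mov ".
;====================================
GetArgMnemonic PROC isAddr:DWORD
    MOV   EAX, OFFSET szMov     ;Default: plain value, so "mov".
    CMP   isAddr, 0
    JE    gamDone
    MOV   EAX, OFFSET szLea     ;ADDR was present, so "lea".
gamDone:
    RET
GetArgMnemonic ENDP
```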

NoCforMe

Excellent challenge! Just what the doctor ordered.

I'm on it.

First task: clearly identify the problem.
To generalize, we have as input:

INVOKE f, s, a1, a2, a3

where

  • f can be:

    • the name of a function
    • a register holding the function's address
    • a variable holding the function's address
  • s can be:

    • the address of a variable (ADDR var)
    • the value of a variable (var)
    • a register holding a value
  • a1, a2 and a3 can be:

    • the address of a variable (ADDR var)
    • the value of a variable (var)
    • a register holding a value
OK? So far so good. We want to transform this to the form you gave in your challenge.

Since the first task of parsing is tokenization, we need to define the universe of tokens here:

  • unknown IDs
  • numbers (decimal or hex)
  • known IDs:

    • INVOKE
    • ADDR
  • comma
  • space (space or tab)
  • EOL (end-of-line marker)
That's it. We don't need to include all the possible register names because those will simply be treated as "unknown IDs", just like any other variable.

I'll work on the tokenizer and get back when it's done.
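
As a hedged illustration of the token list above, each token class could be given a numeric code, with a small routine that separates the two known IDs from unknown ones. The $tok names and ClassifyID are hypothetical and are not part of NoCforMe's posted scheme. The sketch assumes a standard MASM32 installation and uses the Win32 lstrcmpi function for a case-insensitive compare, since MASM accepts INVOKE and ADDR in any letter case.

```asm
.386
.model flat, stdcall
option casemap:none
include \masm32\include\windows.inc
include \masm32\include\kernel32.inc    ;Declares lstrcmpi.
includelib \masm32\lib\kernel32.lib

; Hypothetical token codes, one per entry in the token list.
$tokUnknownID EQU 0
$tokNumber    EQU 1
$tokInvoke    EQU 2
$tokAddr      EQU 3
$tokComma     EQU 4
$tokSpace     EQU 5
$tokEOL       EQU 6

.data
szInvoke DB "INVOKE", 0
szAddr   DB "ADDR", 0

.code
;====================================
; ClassifyID() sketch.
; pID points to a zero-terminated identifier.
; Returns $tokInvoke, $tokAddr or $tokUnknownID in EAX.
; Register names and variables all come back as unknown IDs.
;====================================
ClassifyID PROC pID:DWORD
    INVOKE lstrcmpi, pID, OFFSET szInvoke
    TEST  EAX, EAX              ;0 means the strings are equal.
    JNZ   notInvoke
    MOV   EAX, $tokInvoke
    RET
notInvoke:
    INVOKE lstrcmpi, pID, OFFSET szAddr
    TEST  EAX, EAX
    JNZ   notAddr
    MOV   EAX, $tokAddr
    RET
notAddr:
    MOV   EAX, $tokUnknownID
    RET
ClassifyID ENDP
```

A complete program would still need its own entry point and END directive; this fragment only shows the classification step.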

mineiro

Sir, you went further than I asked. Keep things simple, so people can understand how powerful a lexical scanner is.
If you want to build what you proposed, then release two versions.
Leave the rest to modular programming, people's cognitive parallax.