News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Parsing, part 3 (coding the tokenizer)

Started by NoCforMe, September 18, 2022, 06:17:46 AM

Previous topic - Next topic

NoCforMe

OK, so since you're anxious to get on to actually seeing the tokenizer code, we'll get into it here. As I wrote earlier, that code will all come pretty much directly from all that paperwork we've done. The sources of the code are 1) the tokenizer table and 2) the state/node descriptions.

The Tokenizer Table

As I showed earlier, that table gets coded almost directly from the one on paper. And this generates no code, only data. Referring to the image below, here's what we put in our data section:


CfgTokenTable LABEL DWORD
;     sp      tab    CR     LF     =      "      ;     EOF   [any]
; -------------------------------------------------------------------
DD _cfg0, _cfg0, _cfg0, _cfg0, _cfgF, _cfgD, _cfg4, _cfgG, _cfgA ;0
DD _cfgC, _cfgC, _cfgC, _cfgC, _cfgB, _cfgX, _cfgB, _cfgB, _cfg1 ;1
DD _cfg2, _cfg2, _cfg2, _cfg2, _cfg2, _cfg3, _cfg2, _cfgX, _cfg2 ;2
DD _cfgE, _cfgE, _cfgE, _cfgE, _cfgE, _cfg2, _cfgE, _cfgE, _cfgE ;3
DD _cfg4, _cfg4, _cfgH, _cfgH, _cfg4, _cfg4, _cfg4, _cfgG, _cfg4 ;4


CfgTokenChars DB ' ', $tab, $CR, $LF, '=', '"', ';', $EOF
$numCfgTokenChars EQU $ - CfgTokenChars

$cfgTokenRow1 EQU ($numCfgTokenChars + 1) * 4 ;DWORD offset, extra bump for "[any]".
$cfgTokenRow2 EQU $cfgTokenRow1 * 2
$cfgTokenRow3 EQU $cfgTokenRow1 * 3
$cfgTokenRow4 EQU $cfgTokenRow1 * 4

$cfgTokenAnyOffset EQU $numCfgTokenChars * 4


There are two data items here: the table, and the list of tokenizing characters (CfgTokenChars). In addition, there are some essential symbolic constants (EQUates) that we'll need in the tokenizer. These include the number of tokenizing characters, the width of a row of the table in bytes, offsets to the other rows and an offset to the "any" column of any row. Notice how we don't need to encode any numbers: the assembler does all the computations for us. We don't even need to count on our fingers!

For now we'll just make up names for these entries (_cfg0, _cfgA) that correspond to the state/node names on our diagram. Later we'll actually create these nodes in code.

OK, that's all we need in our .data section. On to the code. Here's the heart of the tokenizer itself:


GetNextCFGtoken PROC
LOCAL textPtr:DWORD

PUSH EBX
PUSH EDI
XOR EBX, EBX ;Start @ parse table row 0.

_cfg0 LABEL NEAR
CALL GNC ;Get next char.

; Try to match any parsing characters:
LEA EDI, CfgTokenChars
MOV EDX, EDI ;Save pointer to list of chars.
MOV ECX, $numCfgTokenChars
REPNE SCASB
JNE cfg20 ;No match, go to "any" col.

SUB EDI, EDX ;Get offset from start of list.
DEC EDI
SHL EDI, 2 ;Convert to DWORD offset.
JMP [EBX + EDI + CfgTokenTable]

; No match, so go to "any char." column:
cfg20: JMP [EBX + CfgTokenTable + $cfgTokenAnyOffset]


That's it for the dispatching code. Again, simple. Here's what's going on:

EBX is used as the state variable, as it's a non-volatile register. It's actually a pointer to the current row in the token table. We start it at 0, the top row.

GNC() is called to get the next character from the file buffer. The next code tries to match that character, literally, against the list of tokenizing characters using REPNE SCASB (the ever-useful string instructions!). If it finds a match, we find the matching character's offset from the start of the list, convert it to a DWORD offset, and jump indirectly through that entry in the table. If there's no match, then it's an "any" character, so we jump through the last entry in the row.

The Tokenizing Nodes

Now we have a nice dispatcher, but where's it gonna send us to through that jump table? The rest of the tokenizer is just a bunch of small code stubs, each of which does what we indicated on our paper diagram.

Here are all the states/nodes for this tokenizer:


;*****  CFG tokenizing nodes:  *****

;***** Prepare ID storage:  *****
_cfgA LABEL NEAR
MOV EDX, LargeTextBuffer
MOV textPtr, EDX
MOV EBX, $cfgTokenRow1

; ... fall through to ...

;***** Store ID character:  *****
_cfg1 LABEL NEAR
MOV EDX, textPtr
MOV [EDX], AL
INC textPtr
JMP _cfg0


;***** Push back char., return ID:  *****
_cfgB LABEL NEAR
CALL PBC

; ... fall through to ...

;***** Return ID:  *****
_cfgC LABEL NEAR
MOV EDX, textPtr
MOV BYTE PTR [EDX], 0 ;Terminate text.
MOV EAX, $T_ID

;***** Exit here:  *****
exit99: POP EDI
POP EBX
RET


;***** Prepare string storage:  *****
_cfgD LABEL NEAR
MOV EAX, LargeTextBuffer
MOV textPtr, EAX
MOV EBX, $cfgTokenRow2
JMP _cfg0


;***** Store "text" character:  *****
_cfg2 LABEL NEAR
MOV EDX, textPtr
MOV [EDX], AL
INC textPtr
MOV EBX, $cfgTokenRow2
JMP _cfg0


;***** Scan-only node:  *****
_cfg3 LABEL NEAR
MOV EBX, $cfgTokenRow3
JMP _cfg0


;***** Scan off comment:  *****
_cfg4 LABEL NEAR
MOV EBX, $cfgTokenRow4
JMP _cfg0


;***** Push back char., return "text":  *****
_cfgE LABEL NEAR
CALL PBC
MOV EDX, textPtr
MOV BYTE PTR [EDX], 0 ;Terminate text.
MOV EAX, $T_string
JMP exit99


;***** Return '=':  *****
_cfgF LABEL NEAR
MOV EAX, $T_equals
JMP exit99


;***** Return EOF:  *****
_cfgG LABEL NEAR
MOV EAX, $T_EOF
JMP exit99


;***** Reset row pointer to start:  *****
_cfgH LABEL NEAR
XOR EBX, EBX
JMP _cfg0


;***** Return error: *****
_cfgX LABEL NEAR
MOV EAX, $T_error
JMP exit99


Notice how each node is defined as a "LABEL NEAR". We can't use just regular jump-destination labels here, because those are only local to a PROC. The LABELs are global, so we can plug them into our jump table.

Notice how small each node is! The longest one here is 5 statements. Some of them, like the one for node 4, only set EBX to the state and then jump back to the top of the tokenizer to get another character. The nodes that return a result put the token value (like $T_string) into EAX and return.

A bit more code for our next-character-getter (GNC() ) and the character pusher-back (PBC () ) complete our tokenizer:


GNC PROC

CMP PushedBackChar, 0
JNE getPB

; Get character from file buffer:
MOV EAX, FileBufferPtr
INC FileBufferPtr
MOV AL, [EAX]
RET

getPB: MOV AL, PushedBackChar
MOV PushedBackChar, 0
RET

GNC ENDP

PBC PROC

MOV PushedBackChar, AL
RET

PBC ENDP


There are a couple of things I forgot to mention about our tokenizer. One is that it can scan comments, and the other is how it allows for graceful parsing.

Since it's so cheap and easy to do, I built in the ability to put comments in the text here. They work just like in assembler: anything after a ';' character is considered a comment, and is simply "scanned off" to the end of the line. The nice thing about this is that the parser never even sees the comment; it gets "eaten" by the tokenizer.

Regarding graceful parsing, notice the occurrence of space characters in the tokenizing diagram. What this means is that the tokenizer is flexible in how the user formats their statements. Specifically, it allows white-space characters (spaces and tabs) anywhere in a statement. The lines below are all tokenized identically:


displayText="string"
   displayText = "string"
   displayText <tab> = <tab> "string"


Since it's so trivially easy to accomodate this, it makes you wonder why all parsers don't do this. Don't you hate those web sites that ding you with an error of you don't type your phone # or credit card # or other item in exactly their prescribed format? C'mon,this is what computers are for!

OK, so that pretty much completes the tokenization process. Next time we'll look at the overall parsing procedure. More paperwork ahead! and then some coding.
Assembly language programming should be fun. That's why I do it.

HSE

Hi NoC!

Very interesting  :thumbsup:

To put GNC code in a proc waste 4 ops time for each byte.

To replace CfgTokenChars with a 256 bytes array can save close to 40 ops time for each byte.

HSE
Equations in Assembly: SmplMath

NoCforMe

Quote from: HSE on September 19, 2022, 07:14:53 AM
Hi NoC!

Very interesting  :thumbsup:

To put GNC code in a proc waste 4 ops time for each byte.

Thanks, but I'm really not interested in that level of performance here. I'm not writing a game. 4 clocks/byte is swamped by the time the user of my program sits staring dumbly at the blank screen, trying to figure out what the hell to do next ...

Quote
To replace CfgTokenChars with a 256 bytes array can save close to 40 ops time for each byte.

I am curious about this. How do you search the array--with string instructions (like REP SCASB), which is what I'm already doing? How would a 256-byte array be faster?

Oh, I think I know what you're getting at: XLAT. Right? Point EBX at the table and go? Wow, I haven't used that since my DOS programming days, and hadn't even thought of it until now. But again, don't need that kind of speed. Thanks anyhow.

BTW, for anyone interested who doesn't know about XLAT, it's killer for things like converting binary to hexadecimal. In a flash, just using a small lookup table.
Assembly language programming should be fun. That's why I do it.