Parsing, part 3 (coding the tokenizer)

NoCforMe · September 18, 2022, 06:17:46 AM

OK, so since you're anxious to get on to actually seeing the tokenizer code, we'll get into it here. As I wrote earlier, that code will all come pretty much directly from all that paperwork we've done. The sources of the code are 1) the tokenizer table and 2) the state/node descriptions.

The Tokenizer Table

As I showed earlier, that table gets coded almost directly from the one on paper. And this generates no code, only data. Referring to the image below, here's what we put in our data section:

Code Select


CfgTokenTable	LABEL DWORD
;	    sp      tab    CR     LF     =      "      ;     EOF   [any]
;	-------------------------------------------------------------------
	DD _cfg0, _cfg0, _cfg0, _cfg0, _cfgF, _cfgD, _cfg4, _cfgG, _cfgA	;0
	DD _cfgC, _cfgC, _cfgC, _cfgC, _cfgB, _cfgX, _cfgB, _cfgB, _cfg1	;1
	DD _cfg2, _cfg2, _cfg2, _cfg2, _cfg2, _cfg3, _cfg2, _cfgX, _cfg2	;2
	DD _cfgE, _cfgE, _cfgE, _cfgE, _cfgE, _cfg2, _cfgE, _cfgE, _cfgE	;3
	DD _cfg4, _cfg4, _cfgH, _cfgH, _cfg4, _cfg4, _cfg4, _cfgG, _cfg4	;4


CfgTokenChars		DB ' ', $tab, $CR, $LF, '=', '"', ';', $EOF
 $numCfgTokenChars	EQU $ - CfgTokenChars

$cfgTokenRow1		EQU ($numCfgTokenChars + 1) * 4	;DWORD offset, extra bump for "[any]".
$cfgTokenRow2		EQU $cfgTokenRow1 * 2
$cfgTokenRow3		EQU $cfgTokenRow1 * 3
$cfgTokenRow4		EQU $cfgTokenRow1 * 4

$cfgTokenAnyOffset	EQU $numCfgTokenChars * 4

There are two data items here: the table, and the list of tokenizing characters (CfgTokenChars). In addition, there are some essential symbolic constants (EQUates) that we'll need in the tokenizer. These include the number of tokenizing characters, the width of a row of the table in bytes, offsets to the other rows and an offset to the "any" column of any row. Notice how we don't need to encode any numbers: the assembler does all the computations for us. We don't even need to count on our fingers!

For now we'll just make up names for these entries (_cfg0, _cfgA) that correspond to the state/node names on our diagram. Later we'll actually create these nodes in code.

OK, that's all we need in our .data section. On to the code. Here's the heart of the tokenizer itself:

Code Select


GetNextCFGtoken	PROC
	LOCAL	textPtr:DWORD

	PUSH	EBX
	PUSH	EDI
	XOR	EBX, EBX		;Start @ parse table row 0.

_cfg0	LABEL NEAR
	CALL	GNC			;Get next char.

; Try to match any parsing characters:
	LEA	EDI, CfgTokenChars
	MOV	EDX, EDI		;Save pointer to list of chars.
	MOV	ECX, $numCfgTokenChars
	REPNE	SCASB
	JNE	cfg20			;No match, go to "any" col.

	SUB	EDI, EDX		;Get offset from start of list.
	DEC	EDI
	SHL	EDI, 2			;Convert to DWORD offset.
	JMP	[EBX + EDI + CfgTokenTable]

; No match, so go to "any char." column:
cfg20:	JMP	[EBX + CfgTokenTable + $cfgTokenAnyOffset]

That's it for the dispatching code. Again, simple. Here's what's going on:

EBX is used as the state variable, as it's a non-volatile register. It's actually a pointer to the current row in the token table. We start it at 0, the top row.

GNC() is called to get the next character from the file buffer. The next code tries to match that character, literally, against the list of tokenizing characters using REPNE SCASB (the ever-useful string instructions!). If it finds a match, we find the matching character's offset from the start of the list, convert it to a DWORD offset, and jump indirectly through that entry in the table. If there's no match, then it's an "any" character, so we jump through the last entry in the row.

The Tokenizing Nodes

Now we have a nice dispatcher, but where's it gonna send us to through that jump table? The rest of the tokenizer is just a bunch of small code stubs, each of which does what we indicated on our paper diagram.

Here are all the states/nodes for this tokenizer:

Code Select


;*****  CFG tokenizing nodes:  *****

;*****	Prepare ID storage:  *****
_cfgA	LABEL NEAR
	MOV	EDX, LargeTextBuffer
	MOV	textPtr, EDX
	MOV	EBX, $cfgTokenRow1

; ...	fall through to	...

;*****	Store ID character:  *****
_cfg1	LABEL NEAR
	MOV	EDX, textPtr
	MOV	[EDX], AL
	INC	textPtr
	JMP	_cfg0


;*****	Push back char., return ID:  *****
_cfgB	LABEL NEAR
	CALL	PBC

; ...	fall through to ...

;*****	Return ID:  *****
_cfgC	LABEL NEAR
	MOV	EDX, textPtr
	MOV	BYTE PTR [EDX], 0	;Terminate text.
	MOV	EAX, $T_ID

;*****	Exit here:  *****
exit99:	POP	EDI
	POP	EBX
	RET


;*****	Prepare string storage:  *****
_cfgD	LABEL NEAR
	MOV	EAX, LargeTextBuffer
	MOV	textPtr, EAX
	MOV	EBX, $cfgTokenRow2
	JMP	_cfg0


;*****	Store "text" character:  *****
_cfg2	LABEL NEAR
	MOV	EDX, textPtr
	MOV	[EDX], AL
	INC	textPtr
	MOV	EBX, $cfgTokenRow2
	JMP	_cfg0


;*****	Scan-only node:  *****
_cfg3	LABEL NEAR
	MOV	EBX, $cfgTokenRow3
	JMP	_cfg0


;*****	Scan off comment:  *****
_cfg4	LABEL NEAR
	MOV	EBX, $cfgTokenRow4
	JMP	_cfg0


;*****	Push back char., return "text":  *****
_cfgE	LABEL NEAR
	CALL	PBC
	MOV	EDX, textPtr
	MOV	BYTE PTR [EDX], 0	;Terminate text.
	MOV	EAX, $T_string
	JMP	exit99


;*****	Return '=':  *****
_cfgF	LABEL NEAR
	MOV	EAX, $T_equals
	JMP	exit99


;*****	Return EOF:  *****
_cfgG	LABEL NEAR
	MOV	EAX, $T_EOF
	JMP	exit99


;*****	Reset row pointer to start:  *****
_cfgH	LABEL NEAR
	XOR	EBX, EBX
	JMP	_cfg0


;*****	Return error: *****
_cfgX	LABEL NEAR
	MOV	EAX, $T_error
	JMP	exit99

Notice how each node is defined as a "LABEL NEAR". We can't use just regular jump-destination labels here, because those are only local to a PROC. The LABELs are global, so we can plug them into our jump table.

Notice how small each node is! The longest one here is 5 statements. Some of them, like the one for node 4, only set EBX to the state and then jump back to the top of the tokenizer to get another character. The nodes that return a result put the token value (like $T_string) into EAX and return.

A bit more code for our next-character-getter (GNC() ) and the character pusher-back (PBC () ) complete our tokenizer:

Code Select


GNC		PROC

	CMP	PushedBackChar, 0
	JNE	getPB

; Get character from file buffer:
	MOV	EAX, FileBufferPtr
	INC	FileBufferPtr
	MOV	AL, [EAX]
	RET

getPB:	MOV	AL, PushedBackChar
	MOV	PushedBackChar, 0
	RET

GNC		ENDP

PBC		PROC

	MOV	PushedBackChar, AL
	RET

PBC		ENDP

There are a couple of things I forgot to mention about our tokenizer. One is that it can scan comments, and the other is how it allows for graceful parsing.

Since it's so cheap and easy to do, I built in the ability to put comments in the text here. They work just like in assembler: anything after a ';' character is considered a comment, and is simply "scanned off" to the end of the line. The nice thing about this is that the parser never even sees the comment; it gets "eaten" by the tokenizer.

Regarding graceful parsing, notice the occurrence of space characters in the tokenizing diagram. What this means is that the tokenizer is flexible in how the user formats their statements. Specifically, it allows white-space characters (spaces and tabs) anywhere in a statement. The lines below are all tokenized identically:

Code Select


displayText="string"
   displayText = "string"
   displayText <tab> = <tab> "string"

Since it's so trivially easy to accomodate this, it makes you wonder why all parsers don't do this. Don't you hate those web sites that ding you with an error of you don't type your phone # or credit card # or other item in exactly their prescribed format? C'mon,this is what computers are for!

OK, so that pretty much completes the tokenization process. Next time we'll look at the overall parsing procedure. More paperwork ahead! and then some coding.

HSE · September 19, 2022, 07:14:53 AM

Hi NoC!

Very interesting

To put GNC code in a proc waste 4 ops time for each byte.

To replace CfgTokenChars with a 256 bytes array can save close to 40 ops time for each byte.

HSE

NoCforMe · September 19, 2022, 07:25:54 AM

Quote from: HSE on September 19, 2022, 07:14:53 AM
Hi NoC!

Very interesting

To put GNC code in a proc waste 4 ops time for each byte.

Thanks, but I'm really not interested in that level of performance here. I'm not writing a game. 4 clocks/byte is swamped by the time the user of my program sits staring dumbly at the blank screen, trying to figure out what the hell to do next ...

Quote
To replace CfgTokenChars with a 256 bytes array can save close to 40 ops time for each byte.

I am curious about this. How do you search the array--with string instructions (like REP SCASB), which is what I'm already doing? How would a 256-byte array be faster?

Oh, I think I know what you're getting at: XLAT. Right? Point EBX at the table and go? Wow, I haven't used that since my DOS programming days, and hadn't even thought of it until now. But again, don't need that kind of speed. Thanks anyhow.

BTW, for anyone interested who doesn't know about XLAT, it's killer for things like converting binary to hexadecimal. In a flash, just using a small lookup table.

jj2007 · September 19, 2022, 07:54:56 AM

xlat is pretty fast

The MASM Forum

News:

Parsing, part 3 (coding the tokenizer)

NoCforMe

HSE

NoCforMe

jj2007