News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Parsing, part 1

Started by NoCforMe, September 17, 2022, 03:34:34 PM

Previous topic - Next topic

NoCforMe

This is the first of a several-part series on a computer-programming subject that's near and dear to my heart: parsing. This is a process that most of us who deal with written text in any way will find useful, so I want to go in-depth and show the techniques I use to parse text.

So what is parsing, anyhow? As a first cut, it's a way to get a computer to understand the intentions of a user by interpreting text. For example, figuring out the meaning of a simple initialization file, like this:


fontFace="Baskerville"
fontSize=24
fontWeight=bold


This is a pretty simple example. A much more complex example of parsing would be the entire contents of an assembler source file, from which the assembler must extract the exact intentions of the programmer who wrote it and turn it into object code.

I'm going to describe a parser I recently built for my EdAsm editor, which parses a configuration file that loads items into a listbox from which the user can select things to insert into their text.

Let's refine the definition of parsing a little bit more. Technically there are two distinct phases of computer parsing: tokenization and lexical analysis. (I'm going to fudge a little bit and call the second phase "parsing", which isn't technically kosher.) The first process basically breaks the text into the smallest units that the parser needs to operate on, called "tokens". A token can be a single character, a number, a string of characters or what I call an "identifier", like the names we use in our ASM programs, a contiguous string of characters with no spaces.

After tokenization, the parser combs through the sequence of tokens and tries to make sense out of them. You'll see how this works shortly.

My Tokenization Technique

Unlike most code I've seen to do this, I don't rely on a mess of conditionals and jumps all over the place to extract tokens. I've settled on what I think is a much better technique, one that's easier to implement and much, much easier to expand or modify as needed. What I use is, to use a fancy term, a finite state automaton, or FSA (also known as a "state machine"). This is something I picked up years ago skimming through a computer science text, not Knuth but some other similar book. With an FSA, you start at a beginning state, and each character you read makes the machine move to another state until the end of the text. So instead of being a tangle of IF ... ELSEIF ... statements, it's ... well, it's different. Let me illustrate.

My editor's config file consists of the following types of statements:


[inserts]
[insert]
displayText="text"
insertText="text"

[insert]

(... etc)

[goodies]
[inserts]
[insert]
displayText="text"
insertText="text"

[insert]

(... etc)


kind of like a standard Windows private profile file. So our first problem is, how should the tokenizer work? What tokens do we look for?

Well, looking at the sample text, we can see that we have:

  • identifiers ("[inserts]", "[insert]", "displayText")
  • strings ("text")
  • the equal sign ('=')
And it turns out there's another important token we need, which is EOF, meaning an indicator that we've hit the end-of-file.

OK, so how do we extract those tokens? At this point, we do not sit down at our computer and start typing. No, no, no. We sit down at a table with a blank sheet of paper and a pen or pencil. Then we draw out our FSA, our state machine. This is shown in the image below here.

So what does this diagram mean? It shows graphically the progression of tokenization, where we examine the text stream one character at a time and advance to the next state based on what that character is. The states are the circles with numbers (0 through 4). The squares are intermediate nodes that perform some function, like setting up a character buffer or returning a token value. To create this diagram we need to think through the sequence of steps that it will take to interpret a particular token, on paper, following from node to node.

Let's say that the next token in our stream is going to be an "identifier". Remember that this is a contiguous string of characters, with no spaces embedded. Starting at state 0, there's a little box marked "any"; this means "any character other than other characters listed", which in this case would be any alphabetic character or numeric digit, or any punctuation character not listed in the state table top line. In our case, we're looking for something like "[goodies]" or "displayText". So that "any" box leads us to node A, where we see that its function is to "prepare ID storage". This node leads directly to state 2. At state 2, "any" character leads back to the same state, and we see that this state's function is to store that character. (Other states may have no function at all, except to lead to other states.) Then there are a bunch of other boxes leading out of state 2 to other nodes, which I'll explain later. Let's just look at one, the space character ("sp"). This leads to node C, whose function is to "return ID". This means that the return value of this code (it's a subroutine) is "ID". Since an identifier can't contain spaces, seeing a space in the text stream must mean that we've just hit the end of the identifier, so it's been tokenized. The actual identifier text is stored in some memory location where the parser can examine it.

Confused? It's a bit of a nonstandard way of coding, so it may take a while to wrap your head around it. But believe me, it works like a charm! (I've verified this diagram by coding it into my program.)

OK, so that's the state machine diagram. But what good does that do us? Well, the beautiful thing about this is that after scratching our head and coming up with this diagram, the code to drive it practically writes itself. The core part of this is the state table; this is something that will be written into our code almost exactly as it is here on paper. Or rather written into data; as I said, this is a data-driven technique.

The state table is just a 2-dimensional chart that shows what node to go to depending on what state (row) you're on and what the next character is. So for example, if our state=2 (we're on the third row) and we see a double quote character, we go to state 3. This table is filled in just by looking at the FSA diagram and seeing what goes where. If there's no path for a certain character at that place (for example, we're at state 2 and we see EOF, which isn't in our diagram at that state), we mark in an "X" to indicate an error for that character. (That's because if we're tokenizing a quoted string, hitting EOF before finding the ending quote would be a syntax error, since we want our strings to end with quotes.)

So look at the state table in the image below, then check out this actual data item derived from it:


CfgTokenTable LABEL DWORD
;     sp      tab    CR     LF     =      "      ;     EOF   [any]
; -------------------------------------------------------------------
DD _cfg0, _cfg0, _cfg0, _cfg0, _cfgF, _cfgD, _cfg4, _cfgG, _cfgA ;0
DD _cfgC, _cfgC, _cfgC, _cfgC, _cfgB, _cfgX, _cfgB, _cfgB, _cfg1 ;1
DD _cfg2, _cfg2, _cfg2, _cfg2, _cfg2, _cfg3, _cfg2, _cfgX, _cfg2 ;2
DD _cfgE, _cfgE, _cfgE, _cfgE, _cfgE, _cfg2, _cfgE, _cfgE, _cfgE ;3
DD _cfg4, _cfg4, _cfgH, _cfgH, _cfg4, _cfg4, _cfg4, _cfgG, _cfg4 ;4


Look familiar? Like I said, it practically writes itself. The comment labels across the top identify the characters seen. The items in the table, _cfg0, _cfgA, etc., are code labels, the addresses of the nodes. It's a jump table. All you need to do is copy it off the paper table.

Now, this tokenizer is a bit more complex than my usual one because I needed to preserve the formatting of text strings, including spaces, tabs, carriage returns and line feeds, which is why all those characters appear in the list. Usually I don't care about such niceties and can treat tabs and spaces identically (just "white space"), and I can throw away carriage returns and just use line feeds to indicate end-of-line. But it didn't cost much more to put them into the table--it's just a bunch of DWORDs, not a single line of code--so why not?

OK, that's part one. Next we'll look at the actual tokenization code and see how it works, and how to construct those nodes. Believe me, that part is also simple. So much simpler than endless IF ... ELSEIF ... spaghetti code, and like I said, easily expanded or modified if needed.
Assembly language programming should be fun. That's why I do it.

learn64bit

Holy... Is that a hand drawing picture? (Sorry, this is my first time seening a this style of design Blueprint)

hutch--

Hi David,

Looks like you have been busy lately. I have not fully digested the whole idea but its looking like a tree structure. You may have to find a different label for it but its looking like an interesting idea.

daydreamer

David,interesting idea to use FSM to other things than some simple enemy AI :thumbsup:
its also possible to draw it on a tablet which comes with pen
my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

NoCforMe

Quote from: hutch-- on September 18, 2022, 03:53:59 AM
Hi David,

Looks like you have been busy lately. I have not fully digested the whole idea but its looking like a tree structure. You may have to find a different label for it but its looking like an interesting idea.

Hey, Steve: No, not a tree structure. Not sure exactly what topological-type critter it is. All I know is that it works like a charm!
Assembly language programming should be fun. That's why I do it.

TimoVJL

Seems like it could be presented in treeview component, as it is multilevel file.
May the source be with you

NoCforMe

@Timo: Sorry, no, not that kind of structure. Not sure what you're getting at here.
Assembly language programming should be fun. That's why I do it.

zedd151

Quote from: NoCforMe on September 18, 2022, 04:56:40 AMNot sure exactly what topological-type critter it is.
A hydra with a few extra appendages?  :greensml:
Quote from: NoCforMeAll I know is that it works like a charm!
That's really the only thing that matters really, no matter what the diagram looks like or anyone else thinks.  :thumbsup:

NoCforMe

OK, that does it: until someone comes up with a better name, this structure will officially be known as a "hydra".
Assembly language programming should be fun. That's why I do it.

mikeburr

well
With an FSA, you start at a beginning state, and each character you read makes the machine move to another state until the end of the text.
is pretty much the same thing .. you are using a state table [read jump table] which is just another way of encoding "a load of if statements" .. its just a bit more modular
regards mike b

hutch--

I picked up the name "finite state machine" from Randy Hyde years ago which is a fancy name for a static tree of characters. You feed a string into it and get whatever type of return you have designed for the output, either not found as a zero or the content if the string is found.

This is the input.

one 1
two 2
three 3
four 4
five 5
six 6
seven 7
eight 8
nine 9
ten 10

Here is the output of a tree of this type.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

numbers proc

    cmp BYTE PTR [rcx+0], "e"
    jne lbl0
    cmp BYTE PTR [rcx+1], "i"
    jne notfound
    cmp BYTE PTR [rcx+2], "g"
    jne notfound
    cmp BYTE PTR [rcx+3], "h"
    jne notfound
    cmp BYTE PTR [rcx+4], "t"
    jne notfound
    cmp BYTE PTR [rcx+5], 0
    jne notfound
  ; -------------------
    mov rax, 8   ; eight
    ret
  ; -------------------
  lbl0:
    cmp BYTE PTR [rcx+0], "f"
    jne lbl1
    cmp BYTE PTR [rcx+1], "i"
    jne lbl2
    cmp BYTE PTR [rcx+2], "v"
    jne notfound
    cmp BYTE PTR [rcx+3], "e"
    jne notfound
    cmp BYTE PTR [rcx+4], 0
    jne notfound
  ; -------------------
    mov rax, 5   ; five
    ret
  ; -------------------
  lbl2:
    cmp BYTE PTR [rcx+1], "o"
    jne notfound
    cmp BYTE PTR [rcx+2], "u"
    jne notfound
    cmp BYTE PTR [rcx+3], "r"
    jne notfound
    cmp BYTE PTR [rcx+4], 0
    jne notfound
  ; -------------------
    mov rax, 4   ; four
    ret
  ; -------------------
  lbl1:
    cmp BYTE PTR [rcx+0], "n"
    jne lbl3
    cmp BYTE PTR [rcx+1], "i"
    jne notfound
    cmp BYTE PTR [rcx+2], "n"
    jne notfound
    cmp BYTE PTR [rcx+3], "e"
    jne notfound
    cmp BYTE PTR [rcx+4], 0
    jne notfound
  ; -------------------
    mov rax, 9   ; nine
    ret
  ; -------------------
  lbl3:
    cmp BYTE PTR [rcx+0], "o"
    jne lbl4
    cmp BYTE PTR [rcx+1], "n"
    jne notfound
    cmp BYTE PTR [rcx+2], "e"
    jne notfound
    cmp BYTE PTR [rcx+3], 0
    jne notfound
  ; -------------------
    mov rax, 1   ; one
    ret
  ; -------------------
  lbl4:
    cmp BYTE PTR [rcx+0], "s"
    jne lbl5
    cmp BYTE PTR [rcx+1], "e"
    jne lbl6
    cmp BYTE PTR [rcx+2], "v"
    jne notfound
    cmp BYTE PTR [rcx+3], "e"
    jne notfound
    cmp BYTE PTR [rcx+4], "n"
    jne notfound
    cmp BYTE PTR [rcx+5], 0
    jne notfound
  ; -------------------
    mov rax, 7   ; seven
    ret
  ; -------------------
  lbl6:
    cmp BYTE PTR [rcx+1], "i"
    jne notfound
    cmp BYTE PTR [rcx+2], "x"
    jne notfound
    cmp BYTE PTR [rcx+3], 0
    jne notfound
  ; -------------------
    mov rax, 6   ; six
    ret
  ; -------------------
  lbl5:
    cmp BYTE PTR [rcx+0], "t"
    jne notfound
    cmp BYTE PTR [rcx+1], "e"
    jne lbl7
    cmp BYTE PTR [rcx+2], "n"
    jne notfound
    cmp BYTE PTR [rcx+3], 0
    jne notfound
  ; -------------------
    mov rax, 10   ; ten
    ret
  ; -------------------
  lbl7:
    cmp BYTE PTR [rcx+1], "h"
    jne lbl8
    cmp BYTE PTR [rcx+2], "r"
    jne notfound
    cmp BYTE PTR [rcx+3], "e"
    jne notfound
    cmp BYTE PTR [rcx+4], "e"
    jne notfound
    cmp BYTE PTR [rcx+5], 0
    jne notfound
  ; -------------------
    mov rax, 3   ; three
    ret
  ; -------------------
  lbl8:
    cmp BYTE PTR [rcx+1], "w"
    jne notfound
    cmp BYTE PTR [rcx+2], "o"
    jne notfound
    cmp BYTE PTR [rcx+3], 0
    jne notfound
  ; -------------------
    mov rax, 2   ; two
    ret
  ; -------------------
  notfound:
    xor rax, rax
    ret

numbers endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤