Parsing part 4: the parser

Started by NoCforMe, September 18, 2022, 10:01:14 AM


NoCforMe

Up to now we've only dealt with the front part of the parsing process, tokenization. Now we have a reliable engine that will split a text stream up into its molecular components. From a file containing the following text


[inserts]
[insert]
displayText="some text"
insertText="some other text"

[insert]
insertText="yet more text"
displayText = "yet more insert text"


we get the following tokens:

  • [inserts]
  • [insert]
  • displayText
  • insertText
  • =
  • "some text"
  • "some more text"
and so on. Fine; now what do we do with these tokens?

Well, it's now time to turn off the computer and get out another piece of paper. This time we're going to define the parsing sequences we need to make sense of this text file. These sequences will be our guide to following a chain of tokens to parse the file, just as we followed a chain of characters to tokenize the text. We're going to define what the legal statements are in our file.

You can diagram this any way you like; I've settled on the format shown in the image below. This is similar to the state machine we used for tokenization, except that it operates on a higher level. Starting at the top, we see all the tokens that can be found at that level: [inserts], [goodies] and EOF. (What this means is that it would be legal to parse a null file, that is, one that simply gives EOF on parsing, in which case nothing would be emitted by the parser.)
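
In rough text form, the top level of that diagram looks something like this (a reconstruction from the node table further down, not the image itself):

Start
  |--[inserts]--> expect [insert]    (the [inserts] section)
  |--[goodies]--> expect [insert]    (the [goodies] section)
  |--EOF--------> done (emit nothing)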

For example, let's look at the first set of statements in the example text above: we have, in order

  • [inserts]
  • [insert]
  • displayText
  • =
  • "some text"
  • insertText
  • =
  • "some other text

which constitutes a perfectly legal construct, understandable to the parser, that creates one entry in our listbox.

So we create all these pathways in our diagram. And just as our tokenizer practically wrote itself from its diagram, we'll be able to write code (again, actually data) directly from this diagram.

Before we go to the code, a couple of things about this diagram: first of all, you may notice that there are two pathways for each [insert] item, one of which has "displayText" followed by "insertText", the other in the opposite order. Why do this? This is another example of "graceful parsing", where we don't force the user to do something arbitrary just because we're basically lazy. Look at the example text above and notice that the order of these statements is reversed in one case. Since it doesn't matter to our program which order these two statements come in, why force the user to enter them in some fixed order just because we didn't take the extra 5 minutes to accommodate them? The extra overhead in data (not code) in this case is exactly 144 bytes (the twelve extra 12-byte PNODE entries), not a bad bargain for very little work.

So how do we implement this? Well, since it's also data-driven, we'll need to define a data structure that will create each of these parsing nodes. The structure I've been using for literally decades now (copied & pasted many times) looks like this:


; Parsing node structure:
PNODE STRUC
  PN_next DD ? ;Node to go to if token seen.
  PN_tokenID DD ? ;Type of token to match.
  PN_function DD ? ;(Optional) function to call.
PNODE ENDS


The three members of this structure are:

  • PN_next is a pointer to the next node to go to if the token (below) is found by the tokenizer when we're on this node
  • PN_tokenID is that token value to match
  • PN_function is a pointer to a function to call when this node is hit (or NULL if there's no function)

The parsing sequence then becomes a linked list of these structures, each one pointing to another one. As we work our way through the text and travel the path through this sequence, the functions called by those nodes that have such a pointer will do the work needed: copying text to buffers, setting up other data structures, creating windows, whatever needs to be done as a result of the parsing.
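
To make this concrete, here's a minimal sketch of what one of those node functions might look like. (The actual Cfg_AddDispTxt isn't shown in this post; the function and buffer names here are invented for illustration. As you'll see in the dispatcher code below, each function receives a pointer to the current token's text in ECX.)

.data?
DispTextHold	DB 256 DUP(?)		;Holding buffer for display text.

.code
MyAddDispTxt PROC
	PUSH ESI
	PUSH EDI
	MOV ESI, ECX			;Source: token text from dispatcher.
	LEA EDI, DispTextHold		;Destination: our holding buffer.
cploop:	MOV AL, [ESI]			;Copy the string,
	MOV [EDI], AL
	INC ESI
	INC EDI
	TEST AL, AL			;  including the terminating NULL.
	JNZ cploop
	POP EDI
	POP ESI
	RET
MyAddDispTxt ENDP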

The Parsing Code

Just as we did with the tokenizer, the parser consists of two basic parts: the dispatcher, which leads us through the maze of parsing nodes, and the nodes themselves, together with any functions those nodes call (again, only for the nodes that need them: most nodes do nothing except advance to the next node, so their function pointers are NULL). First of all, let's look at the actual parsing sequence, the chain of PNODE structures--again, no code here, just data:


CfgParsingSequence LABEL DWORD

; Start: "[inserts]" / "[goodies]" / EOF:
CPnode0 PNODE <CPnode1, $T_Inserts, Cfg_AddInsertsHdg>
PNODE <CPnode14, $T_Goodies, Cfg_AddGoodiesHdg>
PNODE <CPnode99, $T_EOF, Cfg_CreateToolList>
DD -1

; "[insert]:
CPnode1 PNODE <CPnode2, $T_Insert, Cfg_ClearCommit>
DD -1

; "displayText" / "insertText":
CPnode2 PNODE <CPnode3, $T_displayText, NULL>
PNODE <CPnode8, $T_insertText, NULL>
DD -1

; '=':
CPnode3 PNODE <CPnode4, $T_equals, NULL>
DD -1

; "text":
CPnode4 PNODE <CPnode5, $T_string, Cfg_AddDispTxt>
DD -1

; "insertText":
CPnode5 PNODE <CPnode6, $T_insertText, NULL>
DD -1

; '=':
CPnode6 PNODE <CPnode7, $T_equals, Cfg_SetInsrtType_Commit>
DD -1

; "text":
CPnode7 PNODE <CPnode13, $T_string, Cfg_AddInsrtTxt>
DD -1

; '=':
CPnode8 PNODE <CPnode9, $T_equals, NULL>
DD -1

; "text":
CPnode9 PNODE <CPnode10, $T_string, Cfg_AddInsrtTxt>
DD -1

; "displayText":
CPnode10 PNODE <CPnode11, $T_displayText, NULL>
DD -1

; '=':
CPnode11 PNODE <CPnode12, $T_equals, Cfg_SetInsrtType_Commit>
DD -1

; "text":
CPnode12 PNODE <CPnode13, $T_string, Cfg_AddDispTxt>
DD -1

; "[goodies]" / "[insert]" / EOF:
CPnode13 PNODE <CPnode14, $T_Goodies, Cfg_AddGoodiesHdg>
PNODE <CPnode2, $T_Insert, Cfg_ClearCommit>
PNODE <CPnode99, $T_EOF, Cfg_CreateToolList>
DD -1

; "[insert]":
CPnode14 PNODE <CPnode15, $T_Insert, Cfg_ClearCommit>
DD -1

; "displayText" / "insertText":
CPnode15 PNODE <CPnode16, $T_displayText, NULL>
PNODE <CPnode21, $T_insertText, NULL>
DD -1

; '=':
CPnode16 PNODE <CPnode17, $T_equals, NULL>
DD -1

; "text":
CPnode17 PNODE <CPnode18, $T_string, Cfg_AddDispTxt>
DD -1

; "insertText":
CPnode18 PNODE <CPnode19, $T_insertText, NULL>
DD -1

; '=':
CPnode19 PNODE <CPnode20, $T_equals, Cfg_SetGoodyType_Commit>
DD -1

; "text":
CPnode20 PNODE <CPnode26, $T_string, Cfg_AddInsrtTxt>
DD -1

; '=':
CPnode21 PNODE <CPnode22, $T_equals, NULL>
DD -1

; "text":
CPnode22 PNODE <CPnode23, $T_string, Cfg_AddInsrtTxt>
DD -1

; "displayText":
CPnode23 PNODE <CPnode24, $T_displayText, NULL>
DD -1

; '=':
CPnode24 PNODE <CPnode25, $T_equals, Cfg_SetGoodyType_Commit>
DD -1

; "text":
CPnode25 PNODE <CPnode26, $T_string, Cfg_AddDispTxt>
DD -1

; "[inserts]" / "[insert]" / EOF:
CPnode26 PNODE <CPnode1, $T_Inserts, Cfg_AddInsertsHdg>
PNODE <CPnode15, $T_Insert, Cfg_ClearCommit>
PNODE <CPnode99, $T_EOF, Cfg_CreateToolList>
DD -1

; EOF seen--parse complete:
CPnode99 PNODE <CPnode0, 0, NULL>
DD -1


One implementation detail: notice how each series of nodes (some nodes have multiple PNODE structures, meaning that there's more than one possible branch that can be taken at that node depending on the token seen) is followed by a DD -1. This is very important: the -1 is a terminator for that node, so the dispatcher knows when it's reached the end of that node's list of entries.

You can follow the path through the list by looking at the "next" element for each node, then going to that node and following it in turn. For instance, the first few tokens of the sample file trace CPnode0 -> CPnode1 ([inserts]) -> CPnode2 ([insert]) -> CPnode3 (displayText) -> CPnode4 (=) -> CPnode5 ("some text", which calls Cfg_AddDispTxt), and so on until EOF finally lands us on CPnode99. Takes some getting used to, but you can see graphically how this scheme works.

Now we need a dispatcher to guide us through the sequence. Here it is:


	PUSH EBX
	LEA EBX, CfgParsingSequence
nxtokn:	CALL GetNextCFGtoken
	MOV token, EAX
	CMP token, $T_error
	JE tokenError

; If token is identifier, we need to get the specific token:
	CMP token, $T_ID
	JNE @F
	CALL MatchID
	MOV token, EAX

; Advance to next parsing node based on token seen:
@@:	CMP EAX, [EBX].PNODE.PN_tokenID	;Match token to node entry.
	JE pf20
	ADD EBX, SIZE PNODE
	CMP DWORD PTR [EBX], -1		;End of this node's entries?
	JNE @B				;  Nope, check next entry.

; Parsing error: no match for current token.
parseErr:
	MOV result, $parseError_parse

exit99:	POP EBX
rcf99:	MOV EAX, result
	RET

tokenError:
	MOV result, $parseError_token
	JMP exit99

; Match found--if there's a function to call, call it:
pf20:	MOV EDX, [EBX].PNODE.PN_function
	TEST EDX, EDX
	JZ skipfn
	MOV ECX, LargeTextBuffer	;Set ptr. to item text.
	CALL EDX			;Call it.
skipfn:	MOV EBX, [EBX].PNODE.PN_next	;Go to next node.
	CMP token, $T_EOF
	JNE nxtokn			;Keep going until end-o'file.

;*************************************
;********   Parse Successful  ********
;*************************************

	MOV result, $T_success
	JMP exit99


Once again, we use EBX as our pointer to the node list, starting at the head of the parsing sequence and advancing as we move to the next node. GetNextCFGtoken() gives us our tokens. The key statement here is this one:


CMP EAX, [EBX].PNODE.PN_tokenID ;Match token to node entry.


which is where we compare the token we got from GetNextCFGtoken() in EAX to a token that's legal at that node. If they match, all is well: we call the node's function (if any) and advance to the next node. Otherwise, we have a syntax error and have to call the whole thing off.

Matching IDs

Well, there's one last little detail to take care of here, and that's the matter of matching IDs. Look at the second "paragraph" in the dispatcher code above: notice how, if the token returned is an ID ($T_ID), we call the MatchID() function. That's because we know we have an ID (remember, a contiguous string of characters with no embedded spaces), but we don't know which identifier it is. Is it [goodies] or displayText? MatchID() resolves this for us by matching the identifier against a list of known IDs, then returning the specific identifier's token to the parser. I won't show the actual code here because there's nothing special about it; it's just a case-insensitive string comparator.
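
Still, a rough sketch of the idea might look like this (the table layout and names are invented for illustration, with just two identifiers shown; as in the dispatcher, the identifier's text is assumed to be at LargeTextBuffer):

.data
; Known ID strings and their token values:
IDtable	DD idstr1, $T_displayText
	DD idstr2, $T_insertText
	DD -1				;End of table.
idstr1	DB "displayText", 0
idstr2	DB "insertText", 0

.code
MyMatchID PROC
	PUSH EBX
	LEA EBX, IDtable
idloop:	CMP DWORD PTR [EBX], -1		;End of table? No match, then.
	JE nomatch
	MOV ECX, LargeTextBuffer	;Ptr. to the identifier's text.
	MOV EDX, [EBX]			;Ptr. to a known ID string.
cmplp:	MOV AL, [ECX]
	MOV AH, [EDX]
	OR AX, 2020h			;Force both chars to lowercase
	CMP AL, AH			;  (fine for alphabetic IDs).
	JNE nextid			;Mismatch--try next table entry.
	CMP BYTE PTR [ECX], 0		;Matched clear through the NULL?
	JE gotit
	INC ECX
	INC EDX
	JMP cmplp
nextid:	ADD EBX, 8			;On to the next (string, token) pair.
	JMP idloop
gotit:	MOV EAX, [EBX+4]		;Return the specific token ID.
	POP EBX
	RET
nomatch: MOV EAX, $T_error		;Unknown identifier.
	POP EBX
	RET
MyMatchID ENDP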

And there we have our parser. This is a fairly simple case of parsing, but the same technique can be used for much more complex applications; one could even construct a programming language using this type of parser. If you have a program that needs to interpret any kind of text, I urge you to consider using this method. You could adapt it to your own purposes, and who knows? You might even come up with some improvements over my way.
Assembly language programming should be fun. That's why I do it.

hutch--

It sounds like you are building a tree, balanced perhaps, and have a set of rules for how to access that tree. What I have not grasped yet is how you parse the text source into units to place in the tree. I have a couple of very fast linear scanners of source text, the fastest being an in-place tokeniser that loads an array of pointers on the single scan through the source text.
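
(A rough sketch of that idea, not the attached demo itself; the names here are invented. One pass replaces each line ending with a NULL and stores a pointer to each line start, so no text is copied:)

ScanLines PROC src:DWORD, pArray:DWORD
	; src = ptr. to NULL-terminated source text (modified in place).
	; pArray = ptr. to a DWORD array (assumed big enough) that
	;   receives a pointer to each line. Returns line count in EAX.
	PUSH ESI
	PUSH EDI
	MOV ESI, src
	MOV EDI, pArray
	XOR EAX, EAX			;Line count.
nxline:	CMP BYTE PTR [ESI], 0		;End of text?
	JE done
	MOV [EDI], ESI			;Store ptr. to this line.
	ADD EDI, 4
	INC EAX
skipch:	MOV DL, [ESI]			;Scan to the end of the line.
	TEST DL, DL
	JZ done
	CMP DL, 13			;CR?
	JE termin
	CMP DL, 10			;LF?
	JE termin
	INC ESI
	JMP skipch
termin:	MOV BYTE PTR [ESI], 0		;Terminate the line in place.
	INC ESI
	CMP DL, 13			;Was it a CR?
	JNE nxline			;  No--lone LF; next line starts here.
	CMP BYTE PTR [ESI], 10		;Swallow the LF of a CRLF pair.
	JNE nxline
	MOV BYTE PTR [ESI], 0
	INC ESI
	JMP nxline
done:	POP EDI
	POP ESI
	RET
ScanLines ENDP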

I think the data structure you are designing is a lot more powerful than a true parser, which simply separates out whatever tokens you have in mind (words, lines of text, punctuation exclusions, etc.); it will be interesting to see what you end up doing with it.

NoCforMe

Hutch: it is not a tree. Dunno why everyone wants it to be that.

So far as being a true parser goes, I believe it is one. As I explained (in part 1, I think), there are actually two distinct phases here: tokenizing (the lexical analysis) and the parsing proper. The tokenizer I described does break the text up into basically bite-sized chunks that the parser can chew on, if that helps.

I thought I had explained things clearly, but I could be wrong about that. If anyone has any questions (or doubts) about this, please chime in here. Assuming that you've done a good job of explaining something is a good way to get into trouble.
Assembly language programming should be fun. That's why I do it.

hutch--

 :biggrin:

Hi David,

You are using a different definition of a parser; traditionally, a parser does the chopping up of text into whatever the design settings require. What you do with it after it has been parsed is another matter, as it can vary by a significant amount.

I have attached a fast parser. The demo tokenises win64.inc, then scans through the array to find the longest line, then allocates a fixed array and loads the lines in the file into it. It then loops through the array, displaying each line to the console.

To run the demo you will need win64.inc in the same directory; when you run it, the only slow part is the console display.

TimoVJL

Quote from: NoCforMe on September 18, 2022, 11:04:02 AM
it is not a tree. Dunno why everyone wants it to be that.
Because someone can see your examples like this:

[inserts]
    [insert]
        displayText="some text"
        insertText="some other text"

    [insert]
        insertText="yet more text"
        displayText = "yet more insert text"
May the source be with you

NoCforMe

Well, I guess that's true, and I never noticed it before. But it really has nothing to do with my method of tokenization and parsing, at least; it's not really helpful here. (I suppose one could come up with some kind of tree-based parsing system, but I can't think how that would work.)
Assembly language programming should be fun. That's why I do it.

mikeburr

I found this interesting.
If you are going to apply this to language, you are going to make a very, very basic structure. Companies such as Google have invested vast amounts of resources into their language translator, for example: firstly to decompose the language, then to try and form an idea of the "meaning" of the content and its "context". They've probably got stock phrases as references too; as you will doubtless have noted from teaching your children to speak, each language has some very weird sayings that we use regularly that wouldn't make much sense under rigorous analysis.
If you are going to parse, say, mathematical symbols, then unlike the grammar of a computing language, the symbology can be very different within the same field; you only have to look at a maths book that's about 100 years old and the symbols used will probably be quite different.
I don't think you should in any way be deterred by this, as it's how people start making translators/compilers/calculators etc.
Regards, Mike B

NoCforMe

Thanks, Mike. However, my goals in using this parsing method are much more modest and mundane than, say, doing any kind of lexical analysis on actual (human) language, or pure mathematics. Though I'm sure my method could be applied to these fields. I simply use it as a way to parse files with a fairly simple syntax for the purposes of configuration or initialization.

I have one application where I actually parse a limited subset of assembly-language statements (containing DB and DW statements including symbolic constants like WM_xxxx and OR) which works very well. If I worked hard enough I suppose I could come up with a full-blown assembler someday. But that would be carrying coals to Newcastle, as they say.

Ackshooly, after thinking about it: if someone wanted to play with my method to parse something, I'd be quite willing to whip up a little testbed program that they could modify themselves. I've been using this method since the 1980s, so it's pretty well established in my bag of programming tricks.
Assembly language programming should be fun. That's why I do it.