Parsing, part 2

NoCforMe · September 18, 2022, 05:21:54 AM

In this episode I'm going to flesh out the tokenizer a little bit more and dive into some of its details. At this point, still no computer: we're just using paper and pencil. Don't worry, the code will come soon enough.

The image below shows just one branch of the tokenizer, the part that handles quoted text. It's actually one of the simpler parts, just two states and two extra nodes. Following the sequence through from the start node (0), when we see an opening double quote character in the text stream, we move to node D. Like most paths through the tokenizer, this initial node does whatever setup is needed to start storing characters; we'll see later that all this needs to do is set a pointer to a character buffer. This node then immediately goes to state 1, where the character is stored. As more "any" characters are seen, they get stored as well. (Remember that "any" here means "any characters which don't match any of the other characters in the tokenizer"; these are the characters that drop through the "sieve" in the tokenizer code. And in this case, the "ALL" at the end means "anything except a double quote character".)

There's a really nice little feature of this string tokenizer: by adding just a single state, state 3, we can allow embedded double quotes in text, just like in assembler strings:

SomeString	DB "This string has ""quotes"" inside it.", 0

You can prove this to yourself by following the sequence of such a string through this tokenizer fragment. (On paper, no computer allowed!) The state bounces back and forth between 2 and 3 whenever two double quotes in a row are seen, with a single quote being embedded in the string each time. See how simple that is? No IF ... ELSEIF ..., no setting a dozen flags, just a simple path through our little machine.
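For anyone who wants to check their pencil-and-paper trace afterwards, here is a minimal sketch of the quoted-text branch. It is written in C purely for illustration and is not the assembler code this series will get to. The function name scan_quoted, the state names, and the choice to treat a missing closing quote as an error are all assumptions of this sketch; the diagram's own state numbers may not map onto these names one-to-one.

```c
#include <stddef.h>

/* Illustrative state names (assumed); the diagram uses numbered states. */
enum qstate { QS_START, QS_IN_STRING, QS_QUOTE_SEEN, QS_DONE, QS_ERROR };

/*
 * Scan one quoted string beginning at text[*pos].
 * The contents, without the enclosing quotes, are copied to out.
 * Returns 1 on success and leaves *pos on the character that follows
 * the closing quote (that character is not consumed, which is the
 * effect "push back character" has). Returns 0 on failure.
 * A '\0' byte stands in for the end-of-file marker here (assumption).
 */
int scan_quoted(const char *text, size_t *pos, char *out, size_t outsize)
{
    enum qstate st = QS_START;
    size_t i = *pos;   /* read position in the text stream */
    size_t n = 0;      /* number of characters stored so far */

    if (outsize == 0)
        return 0;

    while (st != QS_DONE && st != QS_ERROR) {
        char c = text[i];
        switch (st) {
        case QS_START:
            /* Opening quote: set up the buffer, then start storing. */
            if (c == '"') { n = 0; i++; st = QS_IN_STRING; }
            else st = QS_ERROR;
            break;
        case QS_IN_STRING:
            if (c == '"') { i++; st = QS_QUOTE_SEEN; }
            else if (c == '\0') st = QS_ERROR;   /* no closing quote */
            else if (n + 1 < outsize) { out[n++] = c; i++; }
            else st = QS_ERROR;                  /* buffer full */
            break;
        case QS_QUOTE_SEEN:
            if (c == '"') {
                /* Second quote in a row: embed one quote, keep going. */
                if (n + 1 < outsize) { out[n++] = '"'; i++; st = QS_IN_STRING; }
                else st = QS_ERROR;
            } else {
                /* Anything else ends the string; leave c unread. */
                st = QS_DONE;
            }
            break;
        default:
            st = QS_ERROR;
            break;
        }
    }

    if (st != QS_DONE)
        return 0;
    out[n] = '\0';
    *pos = i;
    return 1;
}
```

Fed the operand of the SomeString line above, starting at the opening quote, this stores This string has "quotes" inside it. and stops with the position on the comma that follows the closing quote.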

You've probably noticed the note that says "push back character" (PBC) and wondered just what that is. Here's the problem: let's say we're tokenizing a quoted string with the above sequence, and right after the string is the end of the file (meaning there's no carriage return/line feed pair after the closing quote). This is an unusual situation, but not an illegal one so far as our tokenizer is concerned. But if we just return the string at this point, what happens to that EOF character? (The end-of-file indicator is actually a character that I put at the end of the file text buffer.) It'll get thrown away, with the result that we'll never know that we hit the end of the file.

So what we do is very simple: we "push back" the character. This is done by the tiny little subroutine that delivers characters from the file text buffer to the tokenizer, called GNC() (Get Next Character). It places that character, EOF in this case, in a 1-byte buffer, and the next time GNC() is called it delivers this pushed-back character instead of the next one from the buffer. Again, works like a charm. That way, the next time through the tokenizer, it'll deliver the EOF character, and the parser will know that it's done with the tokenization phase.
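The push-back mechanism can be sketched in a few lines. Again this is C for illustration only, not the real GNC(). The struct reader layout, the separate pbc() helper, and the use of a zero byte as the end-of-file marker (CH_EOF) are assumptions made for this sketch.

```c
#include <stddef.h>

#define CH_EOF 0   /* assumed EOF marker placed at the end of the text buffer */

struct reader {
    const unsigned char *text;  /* file text buffer, ending in CH_EOF */
    size_t pos;                 /* index of the next unread character */
    int has_pushback;           /* nonzero while the 1-byte buffer is full */
    unsigned char pushback;     /* the 1-byte push-back buffer */
};

void reader_init(struct reader *r, const unsigned char *text)
{
    r->text = text;
    r->pos = 0;
    r->has_pushback = 0;
    r->pushback = 0;
}

/*
 * Get Next Character: deliver the pushed-back character if there is
 * one, otherwise the next character from the text buffer.
 * The caller must stop reading once CH_EOF has been delivered and not
 * pushed back, since nothing valid lies beyond it in the buffer.
 */
unsigned char gnc(struct reader *r)
{
    if (r->has_pushback) {
        r->has_pushback = 0;
        return r->pushback;
    }
    return r->text[r->pos++];
}

/* Push Back Character: hold c so the next gnc() call returns it. */
void pbc(struct reader *r, unsigned char c)
{
    r->pushback = c;
    r->has_pushback = 1;
}
```

With a buffer holding just "a" in quotes followed by the EOF marker, the fourth gnc() call returns CH_EOF. If the tokenizer hands that to pbc() before returning the string token, the very next gnc() call returns CH_EOF again, so the end of the file is not lost.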

We use PBC anytime we need to preserve the character just seen before returning a token, so that character will be seen first thing when the tokenizer is next called. Keep in mind that the tokenizer is repeatedly called by the parser to deliver a token, so we always start from state 0, but from a different place in the text stream.

OK, that's enough paperwork for us; we're now ready to start writing some code. That'll be for the next installment.

The MASM Forum
