Question about an Assembler Parser

Started by satpro, January 24, 2013, 01:39:17 PM


satpro

I am working on an x86 command-line cross-assembler for the 65816 that will output files for Vice, the c64 emulator.  They have emulation for the SuperCPU now, but the current crop of assemblers is buggy, has strange non-standard syntax, or is just plain hard to work with.  So, why not?  I have a rough "storyboard" written for most of what it should do and a basic idea of how to do it, but I won't actually know until I give it a go.  I suppose it sounds like a very ambitious project, but I am determined to do this.  There are some things I haven't quite worked out yet, and before taking the leap I thought I'd ask around a little first.

A nagging question is how parsing source is generally approached (if there is a general approach).  I'm guessing the program should read the entire main source file into memory and parse it there, as opposed to file-parsing it line by line, for speed reasons -- it feels much faster to parse source from RAM than from a file.  Does that make sense?  If that sounds right, would the assembler just read the first file with ReadFile, and then would any additional source files be parsed in RAM as well, or line by line from disk?  Is there some better way someone here knows about?

Any general suggestions will certainly be helpful.  Thanks...

Tedd

If you know the files won't be large then you can get away with reading them straight into memory; another possibility is memory-mapping the file.
From there, you can process it line-by-line, lex and parse, and then output instructions.
Included files could easily be done recursively with the same function (you call assemble_it for the main file, which calls itself for an included file.)
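
For the mapping side, roughly this shape (a sketch from memory, with the names mine and the error handling mostly trimmed -- test before trusting it):

include \masm32\include\masm32rt.inc

.data
  fname  db "source.asm",0     ; whichever file you feed it

.data?
  hFile  dd ?
  hMap   dd ?
  srcLen dd ?

.code
; MapSource - map a file read-only and return a pointer to its bytes
; in eax (zero on failure).  NOTE: the view is NOT zero terminated,
; so keep srcLen around and bound your parsing with it.
MapSource proc lpName:DWORD
    invoke CreateFile, lpName, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL
    .if eax == INVALID_HANDLE_VALUE
        xor eax, eax
        ret
    .endif
    mov hFile, eax
    invoke GetFileSize, hFile, NULL
    mov srcLen, eax
    invoke CreateFileMapping, hFile, NULL, PAGE_READONLY, 0, 0, NULL
    mov hMap, eax
    invoke MapViewOfFile, hMap, FILE_MAP_READ, 0, 0, 0
    ret                        ; eax -> first byte of the source
MapSource endp

start:
    invoke MapSource, ADDR fname
    ; parse from eax here; an assemble_it proc built around this can
    ; simply invoke itself when it meets an include directive
    inkey
    exit
end start

If your tokeniser wants to write into the buffer (zero-terminating lines, say), map with FILE_MAP_COPY instead of FILE_MAP_READ, or copy the view into an allocated buffer first.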

satpro

Thanks for the reply, Tedd.  What you said (file mapping) is pretty much the way I'm leaning right now.  Never coded an assembler before.  At least I'm having fun with it... :biggrin:

jj2007

Quote from: satpro on January 24, 2013, 01:39:17 PM
A nagging question is about how parsing source is generally approached (if there is a general approach).  I'm guessing the program should read the entire main source file into memory and parse it there as opposed to file-parsing it line by line for speed reasons, feeling it's much faster to parse source from RAM than from a file.

Speedwise, it's difficult to beat Recall, and parsing is a lot easier if you can do it line by line as follows:

include \masm32\MasmBasic\MasmBasic.inc        ; download
  Init

  Recall "\Masm32\include\Windows.inc", Src$()        ; shuffle 22,000 strings into an array
  For_ ebx=0 To eax-1
        .if Instr_(Src$(ebx), "add", 5)        ; 1=case-insensitive, 4=whole word; try 1 instead of 5
                Print Str$("\nLine %i: ", ebx), Src$(ebx)
        .endif
  Next

  Inkey CrLf$, "ok"
  Exit
end start

The Masm32 tokeniser, ltok, does roughly the same and looks more like "real" assembler, if you prefer.
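
For example, something like this (a sketch -- I'm going from the help file's description, source buffer first, then a DWORD array that receives the line pointers, with the line count returned in eax; double-check the prototype in \masm32\help before leaning on it):

include \masm32\include\masm32rt.inc

.data
  src  db "LDA #10",13,10,"STA $2000",13,10,"RTS",13,10,0
  ptrs dd 16 dup(0)                   ; room for up to 16 line pointers

.code
start:
    invoke ltok, ADDR src, ADDR ptrs  ; in place: line ends become zeros
                                      ; (see \masm32\help for the exact prototype)
    mov ebx, eax                      ; eax = line count (3 here, presumably)
    xor esi, esi
  @@:
    print ptrs[esi*4]                 ; each slot points at a zero terminated line
    print chr$(13,10)
    inc esi
    cmp esi, ebx
    jb @B
    inkey
    exit
end start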

In any case, google for lexer parser to get some ideas.

hutch--

satpro,

I would be inclined to have a good read of the available info on compiler design, as it may save you a lot of experimental work in determining which design to eventually use.  Parsers in assembler work fine, if a bit fussy to write, but you need to have a very good idea of how you want to parse the source.  One of our members has posted info from time to time about a "recursive descent parser", and that is a capacity you may need if you wish to parse higher-level nested functions.  Then you need to decide whether you will support code split across multiple lines, as most assemblers/compilers do.
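
To show the shape of it, here is recursive descent boiled down to single-digit numbers with +, * and parentheses (my own throwaway sketch, untested, no error checking at all): each grammar rule becomes a proc, and the nesting takes care of itself because the procs call each other.

include \masm32\include\masm32rt.inc

.data
  src    db "2+3*(4+1)",0    ; test expression
  cursor dd src              ; parse position

.code
; factor := digit | "(" expr ")"
factor proc
    mov edx, cursor
    movzx eax, byte ptr [edx]
    .if al == "("
        inc cursor           ; consume "("
        call expr            ; value comes back in eax
        inc cursor           ; consume ")" - no error checks here
    .else
        sub eax, "0"         ; single digit literal
        inc cursor
    .endif
    ret
factor endp

; term := factor ("*" factor)*
term proc
    call factor
    mov ecx, eax
  @@:
    mov edx, cursor
    .if byte ptr [edx] == "*"
        inc cursor
        push ecx
        call factor
        pop ecx
        imul ecx, eax
        jmp @B
    .endif
    mov eax, ecx
    ret
term endp

; expr := term ("+" term)*
expr proc
    call term
    mov ecx, eax
  @@:
    mov edx, cursor
    .if byte ptr [edx] == "+"
        inc cursor
        push ecx
        call term
        pop ecx
        add ecx, eax
        jmp @B
    .endif
    mov eax, ecx
    ret
expr endp

start:
    call expr
    print str$(eax)          ; 2+3*(4+1) gives 17
    inkey
    exit
end start

Because expr calls term and term calls factor, "*" binds tighter than "+" with no extra work, and the parenthesis case in factor recursing back into expr is what handles arbitrary nesting.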

satpro

hutch (and jj2007),

Thank you both for the replies.  As soon as I read jj's reply I hit the 'net and looked up lexer/parser.  At first it seemed exciting--tons of sites with info on the subject.  Trouble is, I'm not a C/Java/etc. programmer.  For me, it's been literally assembler since the late 70s.  So I immediately felt overwhelmed.  Granted, I had to pick up some C to understand most of Windows, but I certainly cannot program much past "Hello World" in those languages.  I'm still going to do this, but it might mean writing the parser with lookup tables (or something along those lines that I'm comfortable with).  Before getting excited about Google, that was the direction I was headed anyway.  It will be a month of Sundays before I understand the HLL code that's out there for those parsing engines.  It could be longer than that.

I guess I was hoping there might be someone reading who had done this sort of thing in assembly before and who might be willing to throw a tip my way as to how they did it.  Lots of talent around this place and I'm wide-open for ideas from anyone.

So, for now I'll keep reading source code from some of the assemblers that are out there, maybe adapt some 6502/816 native assembler source, and work on the code I have now (which reads line by line from a source file loaded into RAM and, so far, roughly seems to work), and see what happens.  Eventually it will fall into place.  Regardless, it will never need to be as complicated as something like MASM--it's destined for the c64.  Macros, nested include files, structures, and conditional assembly are about all the fancy it needs for now.  I am reading about compiler design, though, and that is already helping with the organizational aspect.
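
The lookup-table direction might start out something like this for the mnemonics (just a sketch to show the shape of it, untested -- it packs each three-letter mnemonic plus its terminating zero into a DWORD, so the whole lookup is one compare per table slot):

include \masm32\include\masm32rt.inc

.data
  ; a handful of 65816 mnemonics - 3 chars + zero, so one DWORD per entry
  mtable db "ADC",0,"AND",0,"ASL",0,"LDA",0,"STA",0,"JMP",0
  MCOUNT equ 6
  test1  db "LDA",0

.code
; FindMnemonic - edx points at a 3-letter mnemonic (zero terminated).
; Returns the table index in eax, or -1 if it isn't there.
FindMnemonic proc
    mov ecx, [edx]                    ; pack the candidate into one DWORD
    xor eax, eax
  @@:
    cmp ecx, dword ptr mtable[eax*4]  ; one compare per slot
    je found
    inc eax
    cmp eax, MCOUNT
    jb @B
    mov eax, -1                       ; not a known mnemonic
  found:
    ret
FindMnemonic endp

start:
    mov edx, offset test1
    call FindMnemonic
    print str$(eax)                   ; prints 3 - LDA is the fourth entry
    print chr$(13,10)
    inkey
    exit
end start

Once the table covers the full opcode set, sorting it and binary-searching on the packed DWORD (or hashing it) would keep the lookup quick, and a parallel table can hold the base opcode and addressing-mode flags for each entry.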

Thanks again to both of you.

hutch--

In the masm32 library there are two procedures that do part of what you want.  The "ltok" proc is an in-place memory line tokeniser; there is another proc that tokenises words, but it is there that you start to get into the complicated stuff, as you need a lot more parsing power than space-delimited words give you.  In the past I have played with running the line tokeniser first, scanning the line ends to see if a line is split, and joining any split lines while keeping track of the start line number so that you can track errors.
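
The join pass is simple enough in outline -- something like this (a quick sketch, using a trailing comma as the continuation marker; lstrcpy/lstrcat stand in for faster in-place copying, and the two data lines stand in for the array ltok builds):

include \masm32\include\masm32rt.inc

.data
  line1  db "LDA #<message,",0        ; statement split with a trailing comma
  line2  db "    #>message",0         ; ...continued on the next line
  lines  dd line1, line2              ; stand-in for the array ltok builds
  LCOUNT equ 2

.data?
  joined  db 256 dup(?)               ; work buffer for the rebuilt statement
  startln dd ?                        ; line the statement began on, for error reports

.code
start:
    xor ebx, ebx                      ; ebx = current line index
    mov startln, ebx                  ; remember where the statement starts
    invoke lstrcpy, ADDR joined, lines[ebx*4]
  more:
    invoke lstrlen, ADDR joined
    cmp byte ptr joined[eax-1], ","   ; trailing comma = statement continues
    jne done
    inc ebx
    cmp ebx, LCOUNT
    jae done
    invoke lstrcat, ADDR joined, lines[ebx*4]   ; splice the next line on
    jmp more
  done:
    print ADDR joined                 ; the statement, back on one line
    print chr$(13,10)
    inkey
    exit
end start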

Then, once you have joined any split lines, you multipass each line to unravel any nested functions, writing them out as separate lines.  Recursion is the normal technique, but I have also used an iterative approach, passing over each line until it has nothing else to unravel.

If you build a simple MASM program and run a full listing, you will see in part how they lay out the data for the memory image they use.

Adamanteus

I would remark that a fully functional parser (meaning a tokeniser, conceptually) is quite a difficult scientific task, so the best way is to follow good practice in this area.  Line-by-line and recursive methods solve only part of the problem ...