I wrote a simple, integer-only interpreter in C for a tiny subset of C.
It generates code for a typical bytecode virtual stack machine. There are 20 or so instructions in the VM:
ifetch - load the value of a variable onto the stack
istore - store a value on the stack into a variable
ipush - load a constant onto the stack
iadd, isub, imul, idiv, imod
- pop the top two values, perform the indicated operation, and store the result back on the stack
ilt, igt, ile, ige, ieq, ine
- pop the top two values, perform the indicated comparison, and store the boolean result back on the stack
ineg - negate the top of stack value
_jz - pop the stack, if the value is 0, jump
_jmp - jump
iprt - pop the top of stack and print the value
halt - stop
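For example, with a variable i living in stack slot 0, a statement like i = i + 1 compiles to something like:
ifetch 0 ; push the value of i
ipush 1 ; push the constant 1
iadd ; pop both, push the sum
istore 0 ; pop the sum back into i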
I thought it would be fun to learn x64 assembly language by converting this to assembly. I also thought it would be 20 to 30% faster than the C version, but it was only 10% faster in my benchmark.
So I guess my assembly language isn't so good. Any thoughts on how I could improve the following code speed-wise?
Windows x64 VM byte code interpreter
Integers are 32-bit
There is a virtual 32-bit stack. Variables are stored at the base of the stack. The working stack starts at the end of the declared variables.
globally:
rbx = base of stack
rdi = current stack pointer
rsi = program counter
variables are stored at the base of the stack
the working stack starts at spptr + data_size * INTWIDTH
INTWIDTH = 4
PTRWIDTH = 8
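So the memory layout is:
spptr: [ var 0 ][ var 1 ] ... [ var data_size-1 ][ work 0 ][ work 1 ] ...
       ^ rbx                                      ^ rdi at startup
(each slot is INTWIDTH = 4 bytes)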
; dispatch table
_dispatch_:
.quad ifetch
.quad istore
.quad ipush
.quad iadd
...
; init
mov rdi, qword ptr spptr ; stack pointer
mov rbx, rdi ; save pointer to base of stack
movsxd rax, dword ptr data_size
lea rdi, [rdi + rax * INTWIDTH]
mov rsi, qword ptr pcptr ; program counter
movsx eax, byte ptr [rsi] ; get the first instruction byte
jmp [_dispatch_ + eax * PTRWIDTH] ; jump to first instruction
; load the value of an integer variable onto the stack
; byte followed by an int that is an index into the stack for the variable to load
; CASE ifetch: *sp++ = stack[*(int *)pc]; pc += sizeof(int); NEXT;
ifetch:
inc rsi ; skip over the instruction
movsxd rax, dword ptr [rsi] ; eax = *(int *)pc
lea rax, [rbx + rax * INTWIDTH] ; ptr to stack[eax]
mov eax, [rax] ; value at ptr
mov [rdi], eax ; *sp = eax
add rsi, INTWIDTH ; skip over data
add rdi, INTWIDTH ; increment the stack
movsx eax, byte ptr [rsi] ; get the next instruction
jmp [_dispatch_ + eax * PTRWIDTH] ; and go there
; store a value into an integer variable
; byte followed by an int that is an index into the stack for the variable to store into, from the top of the stack
; CASE istore: stack[*(int *)pc] = *--sp; pc += sizeof(int); NEXT;
istore:
inc rsi ; skip over the instruction
movsxd rax, dword ptr [rsi] ; eax = *(int *)pc
lea rcx, [rbx + rax * INTWIDTH] ; ptr to stack[eax]
sub rdi, INTWIDTH
mov eax, [rdi]
mov [rcx], eax
add rsi, INTWIDTH
movsx eax, byte ptr [rsi] ; get the next instruction
jmp [_dispatch_ + eax * PTRWIDTH] ; and go there
; load an integer onto the stack
; byte followed by a constant integer, that is loaded onto the stack
; CASE ipush: *sp++ = *(int *)pc; pc += sizeof(int); NEXT;
ipush:
inc rsi ; skip over the instruction
mov eax, dword ptr [rsi]
mov dword ptr [rdi], eax
add rsi, INTWIDTH
add rdi, INTWIDTH
movsx eax, byte ptr [rsi] ; get the next instruction
jmp [_dispatch_ + eax * PTRWIDTH] ; and go there
; add two values on the stack, replacing one of them
; byte
; CASE iadd: sp[-2] += sp[-1]; --sp; NEXT;
iadd:
inc rsi ; skip over the instruction
sub rdi, INTWIDTH
mov eax, dword ptr [rdi] ; B
sub rdi, INTWIDTH
add dword ptr [rdi], eax ; A += B
add rdi, INTWIDTH
movsx eax, byte ptr [rsi] ; get the next instruction
jmp [_dispatch_ + eax * PTRWIDTH] ; and go there
....
Thanks for any tips!
Hello Ed, welcome!
Re. speed, here are some thoughts that occur to me. They may not help much, but FWIW:
You want to align all data to its "natural" boundary; i.e. dwords on 4-byte boundaries, qwords on 8-byte boundaries, etc. So, for one thing, write "align 16" before the dispatch table (more than it strictly needs, but it keeps the table within as few cache lines as possible).
You mix instructions (1 byte) and integers (4 bytes) in your byte code stream, guaranteeing the dword operands are often unaligned. Instead, perhaps make instructions also dwords (wastes 3 bytes per opcode, of course, but that's probably OK) and make sure the stream starts dword-aligned. Then all dword mov's will be aligned.
You'll have to change your "inc rsi"s to "add rsi, 4", but that's OK; add is generally faster than inc anyway. Even if you leave opcodes at 1 byte, "add rsi, 1" should be faster.
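Untested, but with dword opcodes a handler would look something like this (using ipush as the example) - every field in the stream is 4 bytes, so every load is aligned:
ipush:
add rsi, 4 ; skip the (now dword) opcode
mov eax, dword ptr [rsi] ; operand, 4-aligned
mov dword ptr [rdi], eax
add rsi, 4
add rdi, 4
mov eax, dword ptr [rsi] ; next opcode: aligned dword load, no movsx needed
jmp [_dispatch_ + rax*8] ; the 32-bit mov already zero-extended into rax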
Instead of
sub rdi, INTWIDTH
add dword ptr [rdi], eax
add rdi, INTWIDTH
Why not just
add dword ptr [rdi-INTWIDTH], eax
or, actually,
add [rdi-INTWIDTH], eax
Instead of
lea rax, [rbx + rax * INTWIDTH]
mov eax, [rax]
you can
mov eax, [rbx + rax * INTWIDTH]
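So (untested) your whole ifetch body shrinks to:
ifetch:
inc rsi ; skip over the instruction
movsxd rax, dword ptr [rsi] ; variable index
mov eax, [rbx + rax * INTWIDTH] ; load stack[index] in one step
mov [rdi], eax ; *sp = value
add rsi, INTWIDTH
add rdi, INTWIDTH
movsx eax, byte ptr [rsi]
jmp [_dispatch_ + eax * PTRWIDTH]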
Getting down to minor considerations, instead of
sub rdi, INTWIDTH
mov eax, [rdi]
mov [rcx], eax
A little better for out-of-order execution is
mov eax, [rdi-INTWIDTH]
sub rdi, INTWIDTH
mov [rcx], eax
(There's no speed penalty for the address arithmetic.) This way the first two instructions are independent of each other (register renaming takes care of the write-after-read on rdi). Also the third, which depends on the first, has a slightly better chance of starting early, since there's an instruction between it and the load that determines eax. Probably doesn't matter at all here, but such principles matter more as the code grows; and I assume you're showing only a small subset of your whole program.
Another minor point (but then, these are all minor points): a normal hardware stack grows downward, but since this is only a virtual stack, yours grows upward - which is actually the better direction for reducing cache misses, so keep it that way.
I haven't tested any of these recommendations, so if I've misunderstood your code or made other mistakes, sorry - it won't be the first time! Hope you can speed it up a little. As for your overall program logic, I see no way to improve that.
I don't have time for a proper reply - EST here, and 5:30 am comes way too soon :( - but I just had to say - Thanks so much!
I applied just a few of your suggestions, and my benchmark time went from 27.59 seconds to 21.42 seconds. Cool!
I hope to share some details tomorrow.
Thanks again!
Currently, only 32-bit quantities are pushed on the virtual stack. But the byte code does mix sizes: single bytes (the opcodes) and 32-bit integers. I don't really want to change the byte code format for now, as I'd like to keep it somewhat compact.
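For example, ipush 123 takes five bytes in the stream, so the 4-byte operands land on unaligned offsets most of the time:
offset: +0    +1 +2 +3 +4
        ipush 7B 00 00 00   (one opcode byte, then the int32, little-endian)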
From your suggestions, here are the things that helped to get the speedup on my benchmarks:
I aligned the data.
In ifetch, I changed:
lea rax, [rbx + rax * 4] ; ptr to stack[eax]
mov eax, [rax] ; value at ptr
to:
mov eax, [rbx + rax * 4]
In iadd and isub, I changed (of course, add becomes sub for isub):
sub rdi, 4
add dword ptr [rdi], eax ; A += B
add rdi, 4
to:
add dword ptr [rdi - 4], eax
As noted, both of the above worked well.
Some things that didn't work as well:
I changed all 'skip over the opcode' from:
inc rsi
to:
add rsi, 1
But this slowed it down by about 2%.
In imul, I changed
sub rdi, 4
mov eax, [rdi] ; A
imul edx ; A * B
mov [rdi], eax
add rdi, 4
to:
mov eax, [rdi - 4]
imul edx
mov [rdi - 4], eax
But this was also slower, by 15%.
In ilt, I changed:
sub rdi, 4
cmp [rdi], ecx ; A < B
setl al
mov [rdi], eax
add rdi, 4
to:
cmp [rdi - 4], ecx ; A < B
setl al
mov [rdi - 4], eax
This one was like the imul change, a good bit slower.
I don't quite understand these last two (imul and ilt). Two fewer instructions, so it seemed like it should be faster.
Perhaps you see some other improvements I could make in my comparison operator code (just showing ilt; of course there are all the standard comparison operators, but they are pretty similar) and in my conditional jump code?
ilt:
; compare the top two stack values, replace then with boolean result
; CASE ilt: sp[-2] = sp[-2] < sp[-1]; --sp; NEXT;
inc rsi
xor eax, eax
sub rdi, 4
mov ecx, [rdi] ; B
sub rdi, 4
cmp [rdi], ecx ; A < B
setl al
mov [rdi], eax
add rdi, 4
movsx eax, byte ptr [rsi]
jmp [_dispatch_ + eax*8]
And here is my conditional jump code:
_jz:
; jump if top of stack is 0; pop the stack.
; CASE jz: pc += (*--sp == 0) ? *(int *)pc : (int)sizeof(int); NEXT;
inc rsi
sub rdi, 4
mov ecx, [rdi]
jecxz jz_l1 ; if 0, take the jmp
add rsi, 4
jmp jz_l2
jz_l1:
movsxd rax, dword ptr [rsi]
add rsi, rax
jz_l2:
movsx eax, byte ptr [rsi]
jmp [_dispatch_ + eax*8]
Thanks again for the suggestions!
Hello Ed, given some of your surprising results, I did some testing myself ...
First, let me mention Agner Fog, the reigning expert in this area. Check out his "Optimizing Assembly" manual, and the others: "Optimizing C++", "Instruction Tables", etc. He says (and everybody knows):
Quote from "Optimizing Assembly", 16.2 INC and DEC:
"Use ADD and SUB when optimizing for speed. Use INC and DEC when optimizing for size ..."
Depending on circumstances, he says add could be equal, or faster. I tested them and found no difference. You found add was slower: goes to show, your mileage may vary. This, however, is trivial.
This one is more interesting: "sub rdi, 4 / mov eax, [rdi], etc." vs. "mov eax, [rdi-4], etc.". I found that the second is slower (as you did), but by less than 1%. That doesn't make sense, so I played with it a bit and found that this is also faster: "sub r11, 4 / mov eax, [rdi-4]"! Whether it's [rdi] or [rdi-4] made no difference at all; the tiny speed-up happens when I subtract first from a register - any register! I checked it various ways, with consistent results. Don't know why; maybe someone else on the forum does.
So, if you can think of some useful instruction to put in place of that "sub rdi, 4", do that instead and use "mov eax, [rdi-4]": you get the useful instruction for free! There's got to be a way, by re-arranging things, to turn that 15% slowdown into a speed-up.
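For example (completely untested), your ilt could sandwich the sub between the setl and the store, instead of paying for it up front:
ilt:
inc rsi
xor eax, eax
mov ecx, [rdi-4] ; B
cmp [rdi-8], ecx ; A < B
setl al
sub rdi, 4 ; the pop, tucked into the dependency gap
mov [rdi-4], eax ; same slot as the old [rdi-8]: boolean replaces A
movsx eax, byte ptr [rsi]
jmp [_dispatch_ + eax*8]
Same net effect on rdi and the stack, but the sub now fills the slot where you'd otherwise be waiting on the setl.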
I noticed your use of "jecxz". Well - everybody knows that's slower than "or ecx, ecx / jz". According to MASM32 help, it's 8/5 vs 4/2 (cycles for jump taken / not taken - on a 486). However this time I went to the trouble of verifying it, and found, actually, jecxz is a bit faster. Of course it depends what processor you're using, and other subtle details of your configuration ...
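In case you want to try it yourself anyway, the test version is a drop-in for your _jz ("test" is the modern spelling of "or ecx, ecx" - same flags, but it doesn't rewrite ecx):
_jz:
inc rsi
sub rdi, 4
mov ecx, [rdi]
test ecx, ecx
jz jz_l1 ; if 0, take the jmp
add rsi, 4
jmp jz_l2
jz_l1:
movsxd rax, dword ptr [rsi]
add rsi, rax
jz_l2:
movsx eax, byte ptr [rsi]
jmp [_dispatch_ + eax*8]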
Let me reiterate: you should try to align your dwords to 4-byte boundaries. Unaligned mov's can be around 33% slower, and you're doing quite a few. If you don't want to waste those 3 bytes per opcode, just keep it in mind - also the use of [rdi-4] - and, as you develop the program, look for opportunities, via some re-arrangement, to take advantage of them. Perhaps you'll find a use for those extra 3 bytes?
I'm glad my other recommendations were fruitful. Alignment always helps. Changing to "add dword ptr [rdi - 4], eax" (eliminating the sub and add statements) obviously should be better, but, as we've seen, the obvious is not always true. You're doing the right thing - performance testing. As Ronald Reagan said (re. the USSR SALT treaty), "Trust, but Verify." By which he meant: DON'T trust, verify. :P
[edit] Figured out why this code was behaving strangely for me:
sub rdi, 4
mov eax, [rdi] ; A
imul edx ; A * B
mov [rdi], eax
add rdi, 4
I'm testing it in a tight loop and, evidently, moving to/from [rdi] too much - the memory reads/writes "stall". All I have to do is put a few instructions between accesses of [rdi] and it behaves sensibly. But in your case it's not such a tight loop; you already have quite a few instructions between those mem accesses, so, dunno.
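Schematically (this isn't my actual harness, just the shape of the problem), the test was the equivalent of:
mov r10, 100000000 ; iteration count
stall_loop:
mov eax, [rdi-4] ; this load hits the store from the previous iteration
imul edx
mov [rdi-4], eax ; store right back to the same dword...
dec r10
jnz stall_loop ; ...and immediately reload it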
Also, I don't know why no one else has chimed in; there are plenty of folks here with more experience than me. Wait a while - perhaps someone will answer.