News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

How to speed up a Tiny VM

Started by Ed Davis, February 13, 2015, 12:40:07 PM

Previous topic - Next topic

Ed Davis

I wrote a simple, integer only tiny C subset interpreter in C.

It generates code for a typical bytecode virtual stack machine. 20 or so instructions in the VM:

ifetch  - load the value of a variable onto the stack
istore  - store a value on the stack into a variable
ipush   - load a constant onto the stack
iadd, isub, imul, idiv, imod
        - pop the top two values, perform the indicated operation, and store the result back on the stack
ilt, igt, ile, ige, ieq, ine
        - pop the top two values, perform the indicated comparison, and store the boolean result back on the stack
ineg    - negate the top of stack value
_jz     - pop the stack, if the value is 0, jump
_jmp    - jump
iprt    - pop the top of stack and print the value
halt    - stop

I thought it would be fun to learn x64 assembly language by converting this to assembly.  I also thought it would be 20 to 30% faster, but it was only 10% faster in my benchmark.

So I guess my assembly language isn't so good.  Any thoughts on how I could improve the following code speed-wise?

Windows x64 VM byte code interpreter

Integers are 32-bit

There is a virtual 32-bit stack.  Variables are stored at the base of the stack.  The working stack starts at the end of the declared variables.

globally:

    rbx = base of stack
    rdi = current stack pointer
    rsi = program counter
    variables are stored at the base of the stack
    actual stack starts at spptr + data_size
    INTWIDTH = 4
    PTRWIDTH = 8


; dispatch table
        _dispatch_:
        .quad ifetch
        .quad istore
        .quad ipush
        .quad iadd

        ...

        ; init

        mov rdi, qword ptr spptr        ; stack pointer
        mov rbx, rdi                    ; save pointer to base of stack

        movsxd rax, dword ptr data_size
        lea rdi, [rdi + rax * INTWIDTH]

        mov rsi, qword ptr pcptr        ; program counter

        movsx eax, byte ptr [rsi]       ; get the first instruction byte
        jmp [_dispatch_ + eax * PTRWIDTH] ; jump to first instruction

; load the value of an integer variable onto the stack
; byte followed by int that is index into the stack for the variable to load
; CASE ifetch: *sp++ = stack[*(int *)pc]; pc += sizeof(int); NEXT;
        ifetch:
            inc rsi                     ; skip over the instruction

            movsxd rax, dword ptr [rsi] ; eax = *(int *)pc
            lea rax, [rbx + rax * INTWIDTH] ; ptr to stack[eax]
            mov eax, [rax]              ; value at ptr
            mov [rdi], eax              ; *sp = eax

            add rsi, INTWIDTH           ; skip over data
            add rdi, INTWIDTH           ; increment the stack

            movsx eax, byte ptr [rsi]   ; get the next instruction
            jmp [_dispatch_ + eax * PTRWIDTH]  ; and go there

; store an value into an integer variable
; byte followed by int that is ian ndex into the stack for the variable to store into from top of stack
; CASE istore: stack[*(int *)pc] = *--sp; pc += sizeof(int); NEXT;
        istore:
            inc rsi                     ; skip over the instruction

            movsxd rax, dword ptr [rsi] ; eax = *(int *)pc
            lea rcx, [rbx + rax * INTWIDTH] ; ptr to stack[eax]

            sub rdi, INTWIDTH
            mov eax, [rdi]
            mov [rcx], eax

            add rsi, INTWIDTH

            movsx eax, byte ptr [rsi]   ; get the next instruction
            jmp [_dispatch_ + eax * PTRWIDTH] ; and go there

; load an integer onto the stack
; byte followed by a constant integer, that is loaded onto the stack
; CASE ipush:  *sp++ = *(int *)pc; pc += sizeof(int); NEXT;
        ipush:
            inc rsi                     ; skip over the instruction

            mov eax, dword ptr [rsi]
            mov dword ptr [rdi], eax

            add rsi, INTWIDTH
            add rdi, INTWIDTH

            movsx eax, byte ptr [rsi]   ; get the next instruction
            jmp [_dispatch_ + eax * PTRWIDTH]    ; and go there

; add two values on the stack, replacing one of them
; byte
; CASE iadd:   sp[-2] += sp[-1]; --sp; NEXT;
        iadd:
            inc rsi                     ; skip over the instruction

            sub rdi, INTWIDTH
            mov eax, dword ptr [rdi]    ; B
            sub rdi, INTWIDTH

            add dword ptr [rdi], eax    ; A += B

            add rdi, INTWIDTH

            movsx eax, byte ptr [rsi]   ; get the next instruction
            jmp [_dispatch_ + eax * PTRWIDTH] ; and go there

            ....

Thanks for any tips!

rrr314159

Hello Ed, Welcome,

Re. speed, these are some thoughts that occur to me. May not help much; but FWIW,

You want to align all data to its "natural" boundary; i.e. dwords by 8 bytes, qwords 16 bytes, etc. So, for one thing, write "align 16" b4 the dispatch table.

You mix instructions (1 byte) and integers (4 bytes) on your stack, guaranteeing dwords are often unaligned. Instead, perhaps make instructions also dwords (wastes 3 bytes, of course, but that's probably OK) and make sure the stack starts aligned to 8. Then all dword mov's will be aligned.

You'll have to change your "inc rsi"'s to "add rsi, 4" but that's OK, add is faster than inc anyway. Even if you leave them at 1 byte "add rsi, 1" is faster.

Instead of
        sub rdi, INTWIDTH
        add dword ptr [rdi], eax
        add rdi, INTWIDTH
Why not just
        add dword ptr [rdi-INTWIDTH], eax
or, actually,
        add [rdi-INTWIDTH], eax

Instead of
        lea rax, [rbx + rax * INTWIDTH]
        mov eax, [rax]
you can
        mov eax, [rbx + rax * INTWIDTH]

Getting down to minor considerations, instead of
        sub rdi, INTWIDTH
        mov eax, [rdi]
        mov [rcx], eax
A little better for out-of-order execution is
        mov eax, [rdi-INTWIDTH]
        sub rdi, INTWIDTH
        mov [rcx], eax

(There's no speed penalty for address arithmetic.) This way the 1st two instructions are independent of each other (considering register aliasing). Also the 3rd, which depends on the 1st, has a slightly better chance of being pre-processed since there's one instruction between it and the determination of eax. Probably doesn't matter at all here, but such principles are important with more code; and I assume you're showing only a small subset of your whole program.

Another minor point (but then, these are all minor points): your rdi stack is increasing down (like any normal stack). But since this is only a virtual stack it could just as well increase upwards. That's better for reducing cache misses.

I haven't tested any of these recommendations, so if I've misunderstood your code or made other mistakes, sorry - it won't be the first time! Hope you can speed it up a little. As for your overall program logic, I see no way to improve that.
I am NaN ;)

Ed Davis

I don't have time for a proper reply - EST here, and 5:30 am comes way too soon  :( - but I just had to say - Thanks so much!

I applied just a few of your suggestions, and my benchmark time went from 27.59 seconds to 21.42 seconds.  Cool!

I hope to share some details tomorrow.

Thanks again!

Ed Davis

Currently, only 32-bit quantities are pushed on the virtual stack.  But the byte code does have mixed size bytes (the opcode) and 32-bit integers. I don't really want to change the byte code format for now, as I'd like to keep it somewhat compact.

From your suggestions, here are the things that helped to get the speedup on my benchmarks:

I aligned the data.

In ifetch, I changed:


            lea rax, [rbx + rax * 4]    ; ptr to stack[eax]
            mov eax, [rax]              ; value at ptr


to:


            mov eax, [rbx + rax * 4]

In iadd and isub, I changed (of course, add becomes sub for
isub):

[code]
            sub rdi, 4
            add dword ptr [rdi], eax  ; A += B
            add rdi, 4


to:


            add dword ptr [rdi - 4], eax


As noted, both of the above worked well.

Some things that didn't work as well:

I changed all 'skip over the opcode' from:


            inc rsi


to:

            add rsi, 1


But this slowed it down by about 2%.

In imul, I changed


            sub rdi, 4
            mov eax, [rdi]  ; A
            imul edx        ; A * B
            mov [rdi], eax
            add rdi, 4


to:


            mov eax, [rdi - 4]
            imul edx
            mov [rdi - 4], eax


But this was also slower, by 15%.

In ilt, I changed:


            sub rdi, 4
            cmp [rdi], ecx  ; A < B
            setl al
            mov [rdi], eax
            add rdi, 4


to:


            cmp [rdi - 4], ecx  ; A < B
            setl al
            mov [rdi - 4], eax


This one was like the imul change, a good bit slower.

I don't quite understand these last two (imul and ilt).  Two fewer instructions, so it seemed like it should be faster.

Perhaps you see some other improvements I could make in my comparison operator code (just showing ilt, of course there are all the standard comparison operators, but they are pretty similar) and my conditional jump code?


        ilt:
; compare the top two stack values, replace then with boolean result
; CASE ilt:    sp[-2] = sp[-2] <  sp[-1]; --sp; NEXT;
            inc rsi

            xor eax, eax

            sub rdi, 4
            mov ecx, [rdi]  ; B

            sub rdi, 4
            cmp [rdi], ecx  ; A < B
            setl al
            mov [rdi], eax
            add rdi, 4

            movsx eax, byte ptr [rsi]
            jmp [_dispatch_ + eax*8]


And here is my conditional jump code:


        _jz:
; jump if top of stack is 0; pop the stack.
; CASE jz:     pc += (*--sp == 0) ? *(int *)pc : (int)sizeof(int);  NEXT;
            inc rsi

            sub rdi, 4
            mov ecx, [rdi]
            jecxz jz_l1          ; if 0, take the jmp

            add rsi, 4
            jmp jz_l2

        jz_l1:
            movsxd rax, dword ptr [rsi]
            add rsi, rax

        jz_l2:

            movsx eax, byte ptr [rsi]
            jmp [_dispatch_ + eax*8]


Thanks again for the suggestions! 

rrr314159

#4
Hello Ed, given some of your surprising results, I did some testing myself ...

First, let me mention Agner Fog, the reigning expert in this area. Check out his "optimizing assembly" manual, and the others, "optimizing C++", "instruction tables", etc. He says (and everybody knows):

Quote"Optimizing Assembly": 16.2 INC and DEC
Use ADD and SUB when optimizing for speed. Use INC and DEC when optimizing for size ...

Depending on circumstances, he says add could be equal, or faster. I tested them and found no difference. You found add was slower: goes to show, your mileage may vary. This, however, is trivial.

This one is more interesting: "sub rdi, 4 / mov eax, [rdi], etc" vs. "mov eax, [rdi-4], etc". I found that the second is slower (as you did), but by less than 1%. Doesn't make sense, so I played with it a bit, found the following is also faster: "sub r11, 4 / mov eax, [rdi-4]"! Whether it's [rdi] or [rdi-4] made no difference at all; the tiny speed-up happens when I subtract first from a register - any register! I checked it various ways - consistent results. Don't know why; maybe someone else on the forum does.

So, if you can think of some useful instruction to put in place of "sub rdi, 4", do that instead and use "mov eax, [rdi-4]": you get the useful instruction for free! There's got to be a way, by re-arranging things, to make that a 15% increase instead of decrease.

I noticed your use of "jecxz". Well - everybody knows that's slower than "or ecx, ecx / jz". According to MASM32 help, it's 8/5 vs 4/2 (cycles for jump taken / not taken - on a 486). However this time I went to the trouble of verifying it, and found, actually, jecxz is a bit faster. Of course it depends what processor you're using, and other subtle details of your configuration ...

Let me reiterate: you should try to align your dwords to 4-bytes. Unaligned mov's are 33% slower, and you're doing quite a few. If you don't want to waste those 3 bytes per opcode, just keep it in mind - also, the use of [rdi-4] - and, as you develop the program, look for opportunities, via some re-arrangement, to take advantage of them. Perhaps you'll find a use for those extra 3 bytes?

I'm glad my other recommendations were fruitful. Alignment always helps. Changing to "add dword ptr [rdi - 4], eax" (eliminating the sub and add statements) obviously should be better, but, as we've seen, the obvious is not always true. You're doing the right thing - performance testing. As Ronald Reagan said, (re. USSR SALT treaty) "Trust, but Verify." By which he meant: DON'T trust, verify. :P

[edit] Figured out why this code was behaving strangely for me:
   
        sub rdi, 4
        mov eax, [rdi]  ; A
        imul edx        ; A * B
        mov [rdi], eax
        add rdi, 4

I'm testing it in a tight loop, and, evidently, moving to/from [rdi] too much - the memory read/writes "stall". All I have to do is put a few instructions between accesses of [rdi] and it behaves sensibly. But in your case it's not such a tight loop, u already have quite a few instructions between those mem accesses, so, dunno.

Also don't know why no one else chimes in; plenty of folks here with more experience than myself. Wait a while, perhaps some one will answer
I am NaN ;)