Author Topic: Switch macro timings  (Read 2445 times)

jj2007

  • Member
  • *****
  • Posts: 7557
  • Assembler is fun ;-)
    • MasmBasic
Switch macro timings
« on: April 24, 2016, 07:34:02 PM »
Testing Nidud's new .Switch macro and MasmBasic's version (of today, download here). Some timings on other processors would be nice. No inkey - run from a DOS prompt.

Note the tradeoff between size and speed for the MasmBasic "ret" and "jmp" versions; the latter produces code that is almost identical to AsmC's .Switch macro (I still have to find out where the 8 or so extra bytes come from  :( ). What is remarkable is that there is already a speed gain for more than 4 cases, compared to the Masm32 switch macro.

Note also (quote from \Masm32\MasmBasic\MbGuide.rtf but valid also for AsmC .Switch)
Code: [Select]
- avoid using Switch_ with few cases that are far apart, e.g. case -1000, case 1, case 1000, as this
  creates a huge jump table with many default entries; the Masm32 switch macro will be more size-efficient

Code: [Select]
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
Assembled with AsmC
11 ms   case 260, MB Switch_ ret
8 ms    case 260, MB Switch_ jmp
370 ms  case 260, Masm32 switch
5 ms    case 260, AsmC .Switch

7 ms    case 196, MB Switch_ ret
5 ms    case 196, MB Switch_ jmp
286 ms  case 196, Masm32 switch
5 ms    case 196, AsmC .Switch

7 ms    case 132, MB Switch_ ret
5 ms    case 132, MB Switch_ jmp
189 ms  case 132, Masm32 switch
5 ms    case 132, AsmC .Switch

7 ms    case 68, MB Switch_ ret
5 ms    case 68, MB Switch_ jmp
98 ms   case 68, Masm32 switch
5 ms    case 68, AsmC .Switch

8 ms    case 4, MB Switch_ ret
5 ms    case 4, MB Switch_ jmp
10 ms   case 4, Masm32 switch
5 ms    case 4, AsmC .Switch

2999    bytes for MBSret
4211    bytes for MBSjmp
4799    bytes for Masm32
4200    bytes for asmc

MB Switch_ in jmp mode:
ct=-2   Case -2
ct=-1   Default
ct=0    Default
ct=1    Case 1 .. 4
ct=2    ecx=2
ct=3    Case 1 .. 4
ct=4    edx=4
ct=5    Default
ct=6    var1=6
ct=7    Case 7

MB Switch_ in ret mode:
ct=-2   Case -2
ct=-1   Default
ct=0    Default
ct=1    Case 1 .. 4
ct=2    ecx=2
ct=3    Case 1 .. 4
ct=4    edx=4
ct=5    Default
ct=6    var1=6
ct=7    Case 7

Source and exe attached; for building your own version, RichMasm works best, as it creates the AsmUsed$() macro (see above, "Assembled with AsmC"), which allows selective inclusion of Nidud's .Switch based on the assembler used. But the *.asc source opens in Wordpad, too.

P.S.: The macro assembles fine with everything from ML 6.15 .. 10, JWasm, HJWasm and AsmC; the ML versions build about half as fast, though, due to the excessive number of cases tested (300+).

Note also that there is cases=... in line 21; ML chokes above ca. 920 cases, AsmC seems not so impressed, but prepare for fat executables, e.g. for 10,000 cases:
Code: [Select]
109663  bytes for MBSret
149663  bytes for MBSjmp
179345  bytes for Masm32
149658  bytes for asmc
8)

sinsi

  • Member
  • ****
  • Posts: 996
Re: Switch macro timings
« Reply #1 on: April 24, 2016, 08:04:20 PM »
Code: [Select]
Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
Assembled with AsmC
5 ms    case 260, MB Switch_ ret
4 ms    case 260, MB Switch_ jmp
294 ms  case 260, Masm32 switch
4 ms    case 260, AsmC .Switch

5 ms    case 196, MB Switch_ ret
4 ms    case 196, MB Switch_ jmp
226 ms  case 196, Masm32 switch
3 ms    case 196, AsmC .Switch

5 ms    case 132, MB Switch_ ret
4 ms    case 132, MB Switch_ jmp
149 ms  case 132, Masm32 switch
4 ms    case 132, AsmC .Switch

6 ms    case 68, MB Switch_ ret
4 ms    case 68, MB Switch_ jmp
81 ms   case 68, Masm32 switch
3 ms    case 68, AsmC .Switch

6 ms    case 4, MB Switch_ ret
4 ms    case 4, MB Switch_ jmp
4 ms    case 4, Masm32 switch
3 ms    case 4, AsmC .Switch

2999    bytes for MBSret
4211    bytes for MBSjmp
4799    bytes for Masm32
4200    bytes for asmc

MB Switch_ in jmp mode:
ct=-2   Case -2
ct=-1   Default
ct=0    Default
ct=1    Case 1 .. 4
ct=2    ecx=2
ct=3    Case 1 .. 4
ct=4    edx=4
ct=5    Default
ct=6    var1=6
ct=7    Case 7

MB Switch_ in ret mode:
ct=-2   Case -2
ct=-1   Default
ct=0    Default
ct=1    Case 1 .. 4
ct=2    ecx=2
ct=3    Case 1 .. 4
ct=4    edx=4
ct=5    Default
ct=6    var1=6
ct=7    Case 7
I can walk on water but stagger on beer.

sinsi

  • Member
  • ****
  • Posts: 996
Re: Switch macro timings
« Reply #2 on: April 24, 2016, 08:07:59 PM »
Code: [Select]
AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G
Assembled with AsmC
4 ms    case 260, MB Switch_ ret
4 ms    case 260, MB Switch_ jmp
313 ms  case 260, Masm32 switch
4 ms    case 260, AsmC .Switch

4 ms    case 196, MB Switch_ ret
3 ms    case 196, MB Switch_ jmp
218 ms  case 196, Masm32 switch
6 ms    case 196, AsmC .Switch

4 ms    case 132, MB Switch_ ret
3 ms    case 132, MB Switch_ jmp
147 ms  case 132, Masm32 switch
7 ms    case 132, AsmC .Switch

5 ms    case 68, MB Switch_ ret
4 ms    case 68, MB Switch_ jmp
77 ms   case 68, Masm32 switch
4 ms    case 68, AsmC .Switch

5 ms    case 4, MB Switch_ ret
6 ms    case 4, MB Switch_ jmp
6 ms    case 4, Masm32 switch
4 ms    case 4, AsmC .Switch

2999    bytes for MBSret
4211    bytes for MBSjmp
4799    bytes for Masm32
4200    bytes for asmc

MB Switch_ in jmp mode:
ct=-2   Case -2
ct=-1   Default
ct=0    Default
ct=1    Case 1 .. 4
ct=2    ecx=2
ct=3    Case 1 .. 4
ct=4    edx=4
ct=5    Default
ct=6    var1=6
ct=7    Case 7

MB Switch_ in ret mode:
ct=-2   Case -2
ct=-1   Default
ct=0    Default
ct=1    Case 1 .. 4
ct=2    ecx=2
ct=3    Case 1 .. 4
ct=4    edx=4
ct=5    Default
ct=6    var1=6
ct=7    Case 7
I can walk on water but stagger on beer.

nidud

  • Member
  • *****
  • Posts: 1371
    • https://github.com/nidud/asmc
Re: Switch macro timings
« Reply #3 on: April 24, 2016, 08:38:17 PM »
Quote
Note the tradeoff between size and speed for the MasmBasic "ret" and "jmp" versions;

It’s a slick solution but impossible to implement given the stack is uneven at the case entry: .case 1: jmp toend or around the bend.
 
Quote
the latter produces code that is almost identical to AsmC's .Switch macro (I still have to find out where the 8 or so extra bytes come from  :( ).

The minimum and maximum values are handled internally by the assembler. The macro-switch seems to generate (8 byte of) code for this.

Quote
What is remarkable is that there is already a speed gain for more than 4 cases, compared to the Masm32 switch macro.

The optimised version has less overhang than before so that may be the reason. This will also explain why habrans version is faster: he uses a threshold of 4.

Quote
Note also (quote from \Masm32\MasmBasic\MbGuide.rtf but valid also for AsmC .Switch)
Quote
- avoid using Switch_ with few cases that are far apart, e.g. case -1000, case 1, case 1000, as this creates a huge jump table with many default entries; the Masm32 switch macro will be more size-efficient

The current implementation in ASMC will create a maximum table of 4 times the count of cases. A switch with 64 cases may then extend to a 256-size table. Anything above will be rendered as regular cmp statements.

The optimisation will then be to sort the values and skip min/max values that do not fit. You may have an array of 1 to 100 plus an odd value of 1G, so by skipping from min/max in the array you may find a usable table in-between.

nidud

  • Member
  • *****
  • Posts: 1371
    • https://github.com/nidud/asmc
Re: Switch macro timings
« Reply #4 on: April 24, 2016, 08:45:03 PM »
Code: [Select]
AMD Athlon(tm) II X2 245 Processor
Assembled with AsmC
7 ms case 260, MB Switch_ ret
6 ms case 260, MB Switch_ jmp
448 ms case 260, Masm32 switch
7 ms case 260, AsmC .Switch

7 ms case 196, MB Switch_ ret
6 ms case 196, MB Switch_ jmp
323 ms case 196, Masm32 switch
7 ms case 196, AsmC .Switch

7 ms case 132, MB Switch_ ret
6 ms case 132, MB Switch_ jmp
206 ms case 132, Masm32 switch
7 ms case 132, AsmC .Switch

8 ms case 68, MB Switch_ ret
7 ms case 68, MB Switch_ jmp
108 ms case 68, Masm32 switch
7 ms case 68, AsmC .Switch

8 ms case 4, MB Switch_ ret
7 ms case 4, MB Switch_ jmp
10 ms case 4, Masm32 switch
7 ms case 4, AsmC .Switch

jj2007

  • Member
  • *****
  • Posts: 7557
  • Assembler is fun ;-)
    • MasmBasic
Re: Switch macro timings
« Reply #5 on: April 24, 2016, 10:26:21 PM »
It’s a slick solution but impossible to implement given the stack is uneven at the case entry

That could indeed be difficult in C, but the Switch_ macro works just fine, and size-wise it's up to 27% or so less, due to the retn instead of the jmp. I'll put a warning in the guide, though, that inside the Case the stack is 4 bytes off. Normally, it should be an issue only for exotic code that picks args manually from the stack.

nidud

  • Member
  • *****
  • Posts: 1371
    • https://github.com/nidud/asmc
Re: Switch macro timings
« Reply #6 on: April 24, 2016, 11:01:25 PM »
Code: [Select]
Switch_ eax
  Case_ 1
jmp toend
...
Endsw_
toend:
ret

Siekmanski

  • Member
  • *****
  • Posts: 1094
Re: Switch macro timings
« Reply #7 on: April 25, 2016, 03:58:00 AM »
Code: [Select]
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz
Assembled with AsmC
6 ms    case 260, MB Switch_ ret
4 ms    case 260, MB Switch_ jmp
334 ms  case 260, Masm32 switch
4 ms    case 260, AsmC .Switch

6 ms    case 196, MB Switch_ ret
4 ms    case 196, MB Switch_ jmp
252 ms  case 196, Masm32 switch
5 ms    case 196, AsmC .Switch

6 ms    case 132, MB Switch_ ret
4 ms    case 132, MB Switch_ jmp
169 ms  case 132, Masm32 switch
4 ms    case 132, AsmC .Switch

6 ms    case 68, MB Switch_ ret
5 ms    case 68, MB Switch_ jmp
89 ms   case 68, Masm32 switch
4 ms    case 68, AsmC .Switch

6 ms    case 4, MB Switch_ ret
5 ms    case 4, MB Switch_ jmp
4 ms    case 4, Masm32 switch
4 ms    case 4, AsmC .Switch

2999    bytes for MBSret
4211    bytes for MBSjmp
4799    bytes for Masm32
4200    bytes for asmc

MB Switch_ in jmp mode:
ct=-2   Case -2
ct=-1   Default
ct=0    Default
ct=1    Case 1 .. 4
ct=2    ecx=2
ct=3    Case 1 .. 4
ct=4    edx=4
ct=5    Default
ct=6    var1=6
ct=7    Case 7

MB Switch_ in ret mode:
ct=-2   Case -2
ct=-1   Default
ct=0    Default
ct=1    Case 1 .. 4
ct=2    ecx=2
ct=3    Case 1 .. 4
ct=4    edx=4
ct=5    Default
ct=6    var1=6
ct=7    Case 7

jj2007

  • Member
  • *****
  • Posts: 7557
  • Assembler is fun ;-)
    • MasmBasic
Re: Switch macro timings
« Reply #8 on: April 25, 2016, 06:04:56 AM »
Code: [Select]
Switch_ eax
  Case_ 1
jmp toend
...
Endsw_
toend:
ret

Not sure what you want to demonstrate here... normally, you should never jmp out of a Switch.

Code: [Select]
Switch_ eax
  Case_ 1
; no jmp toend
...  ; will arrive after Endsw_ anyway
Endsw_

If you really want to "jump out", use this:
Code: [Select]
Switch_ eax
  Case_ 1
.if !somecondition
MsgBox "whatever", "title", MB_OK
.endif
Endsw_

nidud

  • Member
  • *****
  • Posts: 1371
    • https://github.com/nidud/asmc
Re: Switch macro timings
« Reply #9 on: April 25, 2016, 06:58:50 AM »
 :biggrin:

This is assembly: we jump around all the time  :P

A few snippets from Masm32:
Code: [Select]
Case WM_INITDIALOG ; ends with ret

Code: [Select]
          Case IDCANCEL
            jmp quit_dialog
        EndSw

      Case WM_CLOSE
        quit_dialog:
         invoke EndDialog,hWin,0
    EndSw

Code: [Select]
    StartLoop:
      invoke GetMessage,ADDR msg,NULL,0,0
      cmp eax, 0
      je ExitLoop

      Switch msg.message
        Case WM_KEYDOWN
          mov rval, FUNC(GetAsyncKeyState,VK_CONTROL)
          cmp WORD PTR rval[2], 1111111111111111b
          jne @F
            Switch msg.wParam
              Case VK_A
                invoke Select_All,hEdit
                invoke SendMessage,hEdit,WM_COPY,0,0
                jmp StartLoop

Here’s a "jump in" from MasmBasic:
Code: [Select]
Event Key
  cmp VKey, -VK_F1
  je @Open
  ...
  Case 1, TB0+1 ; 2nd toolbar button or 2nd menu entry
  @Open:

However what I actually meant was this:
Code: [Select]
mov ecx,4
Switch_ ecx, ret
  Case_ 1
mov eax,1
  Case_ 2
mov eax,2
  Case_ 3
mov eax,3
  Case_ 4
mov eax,4
Endsw_

The last case jumps to end.

jj2007

  • Member
  • *****
  • Posts: 7557
  • Assembler is fun ;-)
    • MasmBasic
Re: Switch macro timings
« Reply #10 on: April 25, 2016, 08:03:17 AM »
Here’s a "jump in" from MasmBasic:
Code: [Select]
Event Key
  cmp VKey, -VK_F1
  je @Open
  ...
  Case 1, TB0+1 ; 2nd toolbar button or 2nd menu entry
  @Open:

No problem, stack is fine.

Quote
However what I actually meant was this:
Code: [Select]
mov ecx,4
Switch_ ecx, ret
  Case_ 1
mov eax,1
  Case_ 2
mov eax,2
  Case_ 3
mov eax,3
  Case_ 4
mov eax,4
Endsw_

The last case jumps to end.

GOOD CATCH! Corrected with version 25 April :t

Mikl__

  • Member
  • ****
  • Posts: 542
Re: Switch macro timings
« Reply #11 on: April 25, 2016, 12:01:06 PM »
Hi, jj2007!
Code: [Select]
Intel(R) Pentium(R) CPU G860 @ 3.00GHz
Assembled with AsmC
7 ms case 260, MB Switch_ ret
5 ms case 260, MB Switch_ jmp
410 ms case 260, Masm32 switch
5 ms case 260, AsmC .Switch

7 ms case 196, MB Switch_ ret
5 ms case 196, MB Switch_ jmp
301 ms case 196, Masm32 switch
5 ms case 196, AsmC .Switch

7 ms case 132, MB Switch_ ret
6 ms case 132, MB Switch_ jmp
204 ms case 132, Masm32 switch
5 ms case 132, AsmC .Switch

7 ms case 68, MB Switch_ ret
5 ms case 68, MB Switch_ jmp
109 ms case 68, Masm32 switch
6 ms case 68, AsmC .Switch

8 ms case 4, MB Switch_ ret
6 ms case 4, MB Switch_ jmp
11 ms case 4, Masm32 switch
5 ms case 4, AsmC .Switch

2999 bytes for MBSret
4211 bytes for MBSjmp
4799 bytes for Masm32
4200 bytes for asmc

MB Switch_ in jmp mode:
ct=-2 Case -2
ct=-1 Default
ct=0 Default
ct=1 Case 1 .. 4
ct=2 ecx=2
ct=3 Case 1 .. 4
ct=4 edx=4
ct=5 Default
ct=6 var1=6
ct=7 Case 7

MB Switch_ in ret mode:
ct=-2 Case -2
ct=-1 Default
ct=0 Default
ct=1 Case 1 .. 4
ct=2 ecx=2
ct=3 Case 1 .. 4
ct=4 edx=4
ct=5 Default
ct=6 var1=6
ct=7 Case 7

jj2007

  • Member
  • *****
  • Posts: 7557
  • Assembler is fun ;-)
    • MasmBasic
Re: Switch macro timings
« Reply #12 on: April 25, 2016, 12:31:58 PM »
Thanks to all of you :icon14:

The problem is to find an application for jump table switches. My editor's help menu, for example, has over 30 entries, so that would be a candidate; but the number of entries can vary, so this is not an option.

WndProc?
Code: [Select]
WM_NULL                              equ 0h
WM_CREATE                            equ 1h
WM_DESTROY                           equ 2h
WM_MOVE                              equ 3h
WM_SIZE                              equ 5h
WM_ACTIVATE                          equ 6h

Looks like a great candidate, but...
Code: [Select]
WM_CUT                               equ 300h
WM_COPY                              equ 301h
WM_PASTE                             equ 302h

No good for a jump table ::)

nidud

  • Member
  • *****
  • Posts: 1371
    • https://github.com/nidud/asmc
Re: Switch macro timings
« Reply #13 on: April 26, 2016, 12:45:54 AM »
Quote
The problem is to find an application for jump table switches.

The logic from the test above:
Code: [Select]
5 ms case 260, MB Switch_ jmp
410 ms case 260, Masm32 switch

  5/260 = 0.019
410/260 = 1.577

I added the switch to ASMC in version 2.14 and at that point the elseif-count in the source base was around 658. Now there are 181 and the case-count is 477.
Code: [Select]
658 * 1.577 = 1038 ms
181 * 1.577 =  285 ms +
477 * 0.019 = 9 ms

old: 1038 ms
new:  294 ms

The theoretical calculation of the table-jump:
Code: [Select]
                                 Clocks                 Size
        Operands         808x  286   386   486          Bytes

Jx: jump          16   7+m   7+m    3             2
            no jump        4    3     3     1
        Jx  near-label     -    -    7+m    3             4
            no jump        -    -     3     1

        - It's a good programming practice to organize code so the
          expected case is executed without a jump since the actual
          jump takes longer to execute than falling through the test.
        - see   JCXZ  and  JMP  for their respective timings

Code: [Select]
        Unconditionally transfers control to "label".  Jumps by default
        are within -32768 to 32767 bytes from the instruction following
        the jump.  NEAR and SHORT jumps cause the IP to be updated while FAR
        jumps cause CS and IP to be updated.

                                                        Clocks
                   Operands                     808x  286    386   486

        rel8  (relative)                        15    7+m    7+m    3
        rel16 (relative)                        15    7+m    7+m    3
        rel32 (relative)                         -     -     7+m    3
        reg16 (near, register indirect)         11    7+m    7+m    5
        reg32 (near, register indirect)          -     -     7+m    5
        mem16 (near, mem indirect)             18+EA  11+m  10+m    5
        mem32 (near, mem indirect)             24+EA  15+m  10+m    5
        ptr16:32 (far, 6 byte immed)             -     -    12+m    13
        ptr16:32 (far, PM 6 byte immed)          -     -    27+m    18
        ptr16:32 (call gate, same priv.)         -     -    45+m    31
        ptr16:32 (via TSS)                       -     -     TS   42+TS
        ptr16:32 (via task state)                -     -     TS   43+TS
        m16:32 (far, address at dword)           -     -    43+m    13
        m16:32 (far, address at dword)           -     -    31+m    18
        m16:32 (call gate, same priv.)           -     -    49+m    31
        m16:32 (via TSS)                         -     -    5+TS  41+TS
        m16:32 (via task state)                  -     -    5+TS  42+TS

Let’s be kind and assign 3 to the conditional jump and 5 to the table-jump. The optimal code will then preset the assumed value and skip the default option to avoid this jump.
Code: [Select]
cmp reg,min ; 1
jl exit ; 1 : 3
cmp reg,max ; 1
jg exit ; 1 : 3
jmp [table+reg*4] ; 5++ (9++)
exit: ; 4/8 (min/max)

The "normal" timing will then be 4 to 8 clocks.
Code: [Select]
cmp reg,val_1 ; 1
je case_1 ; 1
cmp reg,val_2 ; 1
je case_2 ; 1 : 4
cmp reg,val_3 ; 1
je case_3 ; 1
cmp reg,val_4 ; 1
je case_4 ; 1 : 8
exit:

So the theoretical threshold for a jump-table is then a case-count above 4. The elseif-chain will create a jump for each test.

Old versus new version and elseif:
Code: [Select]
113655   cycles, rep(100), code(1446): table
352725   cycles, rep(100), code(1920): no table
525146   cycles, rep(100), code(1490): elseif-chain

The next step is to find the hidden table. The step-size is equal to 4 so you may have case 1, 4, 8, or 1, 2, 8 and so on.
Code: [Select]
.case 0
.case 1000 ; -> table
.case 1001
...
.case 1009 ; <-
.case 10000
.case 100000

The sample above currently works. That is, it creates a table from 1000 to 1009 and use cmp on the rest but it needs some fine-tuning and testing.

jj2007

  • Member
  • *****
  • Posts: 7557
  • Assembler is fun ;-)
    • MasmBasic
Re: Switch macro timings
« Reply #14 on: April 29, 2016, 09:37:58 PM »
This is a test with 2,000 cases:
Code: [Select]
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
Result MB:      0, 26 ms
Result Masm32:  0, 1795 ms
Result AsmC:    0, 29 ms

14084   bytes for swMB
27540   bytes for swMasm32
22111   bytes for swAsmC

But if you increase to 10,000 cases, the timings for Masm32 switch jump up dramatically:
Code: [Select]
Result MB:      0, 157 ms
Result Masm32:  0, 202547 ms
Result AsmC:    0, 185 ms

I wonder if AMD cpus behave differently...