Benchmark testing different types of registers.

Started by hutch--, July 21, 2018, 11:40:19 PM


hutch--

My complaint with Agner Fog's method (the last version I saw) was that it only ran in ring 0, which does not suffer the normal interference from task switching and has privileges that normal ring 3 operation lacks. The only testing method I support is real-time testing in ring 3, and the longer the test duration, the more reliable it gets.

aw27

All tests run in ring 3. You don't need the kernel driver unless you are going to collect the sorts of information, such as performance counters or cache hits/misses, that require changing CR4 or an MSR before they can be read from ring 3.

aw27

I have done the tests using only the relevant instructions. I had to modify qWord's code because it did not work: it was saving results to the wrong places (I had never used it before).
The results per iteration are:

Quote
  Warm up
  int regs: 4 cycles
  xmm: 4 cycles
  mmx regs: 4 cycles
  That's all folks ....

There is absolutely no difference over 100 million iterations on my i7 8700K.

I tested in 3 other computers and values are much higher, but again no significant difference between them.

aw27

Now, a different conclusion.   :bgrin:

In the previous code I included the loop instructions in the timed section. This is wrong, but if I were to exclude the time taken by them, the result would be ZERO, or even negative, most of the time. This seems weird, but it is indeed true.
I tried different ways of making things work, without success, even clearing the caches using CLFLUSH, but no joy. The set of instructions I wanted to measure always took ZERO cycles or less; all the time was spent on the loop instructions.

Now I am using a different approach.
I repeat the set of instructions under test a certain number of times in each loop iteration (set to 10 in this test), and I discount the time taken by the looping instructions.
And things are splitting up like a chemical decomposition!


  * 10 repeats per iteration. *
  Warm up
  int regs: 8 cycles
  xmm: 30 cycles
  mmx regs: 30 cycles
  xchg regs: 31 cycles
  push/pop regs: 38 cycles
  stack 37 cycles
  That's all folks ....



hutch--

 :biggrin:

You have just articulated the problem with calculated rather than timed instruction testing. I have seen heaps of results that turn up 0 or negative, both of which are impossible. Real-time testing has far fewer problems: if the duration of the test is too short you can get a zero due to timing granularity, but if you make the test long enough you never get 0 or negative results. If I am trying to time things with very little variation, I use a serialising CPUID instruction to flush the pipeline, use the interrupt-based SleepEx() to synchronise a timeslice exit, and raise the process priority to reduce OS interference.

Siekmanski

Interesting. Seems to be consistent except for the first test run (int regs).

* 10 repeats per iteration. *
  Warm up
  int regs: 3 cycles
  xmm: 14 cycles
  mmx regs: 14 cycles
  xchg regs: 34 cycles
  push/pop regs: 52 cycles
  stack 52 cycles
  That's all folks ....

  * 10 repeats per iteration. *
  Warm up
  int regs: 8 cycles
  xmm: 14 cycles
  mmx regs: 14 cycles
  xchg regs: 35 cycles
  push/pop regs: 53 cycles
  stack 53 cycles
  That's all folks ....

  * 10 repeats per iteration. *
  Warm up
  int regs: 7 cycles
  xmm: 14 cycles
  mmx regs: 14 cycles
  xchg regs: 34 cycles
  push/pop regs: 52 cycles
  stack 52 cycles
  That's all folks ....

  * 10 repeats per iteration. *
  Warm up
  int regs: 4 cycles
  xmm: 14 cycles
  mmx regs: 14 cycles
  xchg regs: 34 cycles
  push/pop regs: 52 cycles
  stack 52 cycles
  That's all folks ....

Creative coders use backward thinking techniques as a strategy.

nidud

#21
deleted

aw27

The explanation may be hidden somewhere in the Intel and AMD microarchitecture manuals; a lot of optimization happens behind the scenes. One thing I believe: it is becoming increasingly difficult to time small pieces of code properly.  :(

@nidud  :biggrin:
We can make 64-byte-aligned code sections without all that artillery.

Something like:
_TEXT64 segment alias (".code") 'CODE'  align(64)
   entry_point proc
      sub rsp, 28h
      align 64

      ; do stuff
      mov rcx,0
      call ExitProcess
   entry_point endp   
_TEXT64 ends

end

hutch--

I did a quick test piece just looping the 3 combinations of register preservation and found that the test conditions changed the results in humorous ways. Increasing the priority improved the XMM version, commenting the cache-clearing CPUID in and out seemed to favour the MMX register version, and if you turned off the priority increase, CPUID and SleepEx, the integer version was faster. All have the same problem of cache saturation, as the tests are too short to be useful even with a very high iteration count.

Since the PIII the action has been instruction scheduling, and narrow instruction testing is not far off useless. Sequencing instructions through multiple pipelines without stalls is far more useful when chasing speed, as long as you stay away from very old instructions that only live in microcode and really slow the works up. The problem with narrow instruction testing is the assumption that processors work like a pre-i486 instruction chomper, and that world is long gone.

For the very little that it's worth, this is the experimental test piece.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include64\masm64rt.inc

    .code

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

entry_point proc

    USING r13, r14, r15

    LOCAL ireg  :QWORD
    LOCAL mreg  :QWORD
    LOCAL xreg  :QWORD

    mov ireg, 0
    mov mreg, 0
    mov xreg, 0

    SaveRegs

    ; HighPriority

    mov r13, 8

  lbl:

  ; ------------------------------------

    mov r15, 1000000000
    ; rcall SleepEx, 100,0
    ; cpuid

    call GetTickCount
    mov r14, rax

  @@:
    call intreg
    sub r15, 1
    jnz @B

    call GetTickCount
    sub rax, r14
    add ireg, rax

    conout str$(rax)," intreg",lf

  ; ------------------------------------

    mov r15, 1000000000
    ; rcall SleepEx, 100,0
    ; cpuid

    call GetTickCount
    mov r14, rax

  @@:
    call mmxreg
    sub r15, 1
    jnz @B

    call GetTickCount
    sub rax, r14
    add mreg, rax

    conout str$(rax)," mmxreg",lf

  ; ------------------------------------

    mov r15, 1000000000
    ; rcall SleepEx, 100,0
    ; cpuid

    call GetTickCount
    mov r14, rax

  @@:
    call xmmreg
    sub r15, 1
    jnz @B

    call GetTickCount
    sub rax, r14
    add xreg, rax

    conout str$(rax)," xmmreg",lf

  ; ------------------------------------

    sub r13, 1
    jnz lbl

    shr ireg, 3
    conout " INT Reg Average ",str$(ireg),lf

    shr mreg, 3
    conout " MMX Reg Average ",str$(mreg),lf

    shr xreg, 3
    conout " XMM Reg Average ",str$(xreg),lf

    NormalPriority

    waitkey

    RestoreRegs

    .exit

entry_point endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

NOSTACKFRAME

intreg proc

    mov r11, rsi
    mov r10, rdi

    mov rsi, r11
    mov rdi, r10

    ret

intreg endp

STACKFRAME

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

NOSTACKFRAME

mmxreg proc

    movq mm0, rsi
    movq mm1, rdi

    movq rsi, mm0
    movq rdi, mm1

    ret

mmxreg endp

STACKFRAME

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

NOSTACKFRAME

xmmreg proc

    movq xmm0, rsi
    movq xmm1, rdi

    movq rsi, xmm0
    movq rdi, xmm1

    ret

xmmreg endp

STACKFRAME

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    end

jj2007

I added an int64 and a "mytest proc uses rsi rdi rbx" test case, and an empty loop:
This code was assembled with ml64 in 64-bit format
Ticks for save_int: 2324
Ticks for save_mmx: 2309
Ticks for save_xmm: 2309
Ticks for save_xmm_local: 2979
Ticks for save_local: 2309
Ticks for save_uses: 2652
Ticks for loop without call: 328


What I find astonishing is that the "memory cases" perform roughly like the three register cases, in spite of the fact that they need more instructions:

save_int:
0000000140001010   | 4C 8B E6                         | mov r12,rsi                             |
0000000140001013   | 4C 8B EF                         | mov r13,rdi                             |
0000000140001016   | 4C 8B F3                         | mov r14,rbx                             |
0000000140001019   | 48 FF CE                         | dec rsi                                 |
000000014000101C   | 48 FF CF                         | dec rdi                                 |
000000014000101F   | 48 FF CB                         | dec rbx                                 |
0000000140001022   | 49 8B F4                         | mov rsi,r12                             |
0000000140001025   | 49 8B FD                         | mov rdi,r13                             |
0000000140001028   | 49 8B DE                         | mov rbx,r14                             |
000000014000102B   | C3                               | ret                                     |


save_local:
0000000140001090   | 55                               | push rbp                                |
0000000140001091   | 48 8B EC                         | mov rbp,rsp                             |
0000000140001094   | 48 81 EC A0 00 00 00             | sub rsp,A0                              |
000000014000109B   | 48 89 75 E8                      | mov qword ptr ss:[rbp-18],rsi           |
000000014000109F   | 48 89 7D E0                      | mov qword ptr ss:[rbp-20],rdi           |
00000001400010A3   | 48 89 5D D8                      | mov qword ptr ss:[rbp-28],rbx           |
00000001400010A7   | 48 FF CE                         | dec rsi                                 |
00000001400010AA   | 48 FF CF                         | dec rdi                                 |
00000001400010AD   | 48 FF CB                         | dec rbx                                 |
00000001400010B0   | 48 8B 75 E8                      | mov rsi,qword ptr ss:[rbp-18]           |
00000001400010B4   | 48 8B 7D E0                      | mov rdi,qword ptr ss:[rbp-20]           |
00000001400010B8   | 48 8B 5D D8                      | mov rbx,qword ptr ss:[rbp-28]           |
00000001400010BC   | C9                               | leave                                   |
00000001400010BD   | C3                               | ret                                     |


The only explanation I have for this is that

a) creating a stack frame has almost zero cost
b) in a tight loop, writing and reading the L1 cache is as fast as using a register

Re b), see the Stack Overflow question "Cache or Registers - which is faster?" - the answer there was "registers should be faster"; but that is over 5 years old 8)

Raistlin

#25
@jj2007 - this reminded me of a conclusion reached in a journal paper, by Majo & Gross regarding NUMA caches

a) creating a stack frame has almost zero cost
b) in a tight loop, writing and reading the L1 cache is as fast as using a register


A) --> almost zero - any ideas as to why?
B) --> which does align (pun intended) with the academic sources stating that access to the L1 cache typically approaches zero wait states

Thanks JJ for the hard work you always put in - much appreciated.
Are you pondering what I'm pondering? It's time to take over the world ! - let's use ASSEMBLY...

hutch--

On this Haswell I am using, registers are clearly faster and the machine has fast memory.

  1953 7 registers
  2328 7 memory operands
  2000 7 registers
  2328 7 memory operands
  2000 7 registers
  2329 7 memory operands
  1953 7 registers
  2343 7 memory operands
  1922 7 registers
  2375 7 memory operands
  2000 7 registers
  2375 7 memory operands
  1953 7 registers
  2313 7 memory operands
  1875 7 registers
  2343 7 memory operands

  Results

  1957 Register average
  2341 memory operand average

Press any key to continue...


The test piece.


; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include64\masm64rt.inc

    .code

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

entry_point proc

    call testproc

    waitkey

    invoke ExitProcess,0

    ret

entry_point endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    iterations equ <1000000000>

testproc proc

    LOCAL mem1  :QWORD
    LOCAL mem2  :QWORD
    LOCAL mem3  :QWORD
    LOCAL mem4  :QWORD
    LOCAL mem5  :QWORD
    LOCAL mem6  :QWORD
    LOCAL mem7  :QWORD
    LOCAL cnt1  :QWORD
    LOCAL cnt2  :QWORD
    LOCAL time  :QWORD

    LOCAL rslt1 :QWORD
    LOCAL rslt2 :QWORD

    USING rsi, rdi, rbx, r12, r13, r14, r15

    SaveRegs

    HighPriority

    mov rslt1, 0
    mov rslt2, 0

    mov cnt2, 8

  loopstart:

  ; ------------------------------------

    cpuid

    call GetTickCount
    mov time, rax

    mov cnt1, iterations
  @@:
    mov rsi, 1
    mov rdi, 2
    mov rbx, 3
    mov r12, 4
    mov r13, 5
    mov r14, 6
    mov r15, 7
    sub cnt1, 1
    jnz @B

    call GetTickCount
    sub rax, time
    add rslt1, rax
    conout "  ",str$(rax)," 7 registers",lf

  ; ------------------------------------

    cpuid

    call GetTickCount
    mov time, rax

    mov cnt1, iterations
  @@:
    mov mem1, 1
    mov mem2, 2
    mov mem3, 3
    mov mem4, 4
    mov mem5, 5
    mov mem6, 6
    mov mem7, 7
    sub cnt1, 1
    jnz @B

    call GetTickCount
    sub rax, time
    add rslt2, rax
    conout "  ",str$(rax)," 7 memory operands",lf

  ; ------------------------------------

    sub cnt2, 1
    jnz loopstart

    shr rslt1, 3
    shr rslt2, 3

    conout lf,"  Results",lf,lf

    conout "  ",str$(rslt1)," Register average",lf
    conout "  ",str$(rslt2)," memory operand average",lf,lf

    NormalPriority

    RestoreRegs

    ret

testproc endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    end

zedd151

  1841 7 registers
  3401 7 memory operands
  1763 7 registers
  3354 7 memory operands
  1762 7 registers
  3401 7 memory operands
  1701 7 registers
  3354 7 memory operands
  1763 7 registers
  3354 7 memory operands
  1762 7 registers
  3354 7 memory operands
  1763 7 registers
  3339 7 memory operands
  1825 7 registers
  3354 7 memory operands

  Results

  1772 Register average
  3363 memory operand average

AMD A6-9220e 1.60 Ghz

JoeBr

  1625 7 registers
  1813 7 memory operands
  1515 7 registers
  1781 7 memory operands
  1625 7 registers
  1797 7 memory operands
  1625 7 registers
  1782 7 memory operands
  1625 7 registers
  1781 7 memory operands
  1625 7 registers
  1781 7 memory operands
  1625 7 registers
  1781 7 memory operands
  1625 7 registers
  1797 7 memory operands

  Results

  1611 Register average
  1789 memory operand average

Press any key to continue...

mineiro

  1127 7 registers
  2168 7 memory operands
  1113 7 registers
  2169 7 memory operands
  1113 7 registers
  2173 7 memory operands
  1113 7 registers
  2168 7 memory operands
  1112 7 registers
  2168 7 memory operands
  1113 7 registers
  2169 7 memory operands
  1113 7 registers
  2168 7 memory operands
  1114 7 registers
  2168 7 memory operands

  Results

  1114 Register average
  2168 memory operand average

Press any key to continue...
I'd rather be this ambulant metamorphosis than to have that old opinion about everything