I needed to test whether there is any measurable difference when preserving and restoring registers using three types of registers: general purpose integer, MMX and XMM. The benchmark uses a very simple "rep movsb" copy algo with identical mnemonics in each version; only the register type used to save and restore RSI and RDI differs.
I am not getting any meaningful differences between the three on this Haswell I work on, and I get the usual wander in results even with high priority settings, a CPUID serialising instruction and a timed yield using SleepEx().
These are typical results and they wander with each run of the test piece.
Warm up lap
copy1 1906 int regs
copy2 1922 xmm regs
copy3 1984 mmx regs
That's all folks ....
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz
Warm up lap
copy1 1953 int regs
copy2 1922 xmm regs
copy3 1922 mmx regs
That's all folks ....
copy1 2043 int regs
copy2 2059 xmm regs
copy3 2059 mmx regs
I wanted to add a test saving them to [rbp+n] but it doesn't build :(
Microsoft (R) Macro Assembler (x64) Version 14.10.24930.0
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: bmcopy.asm
bmcopy.asm(16) : error A2008:syntax error : r14
bmcopy.asm(18) : error A2008:syntax error : SaveRegs
bmcopy.asm(20) : error A2008:syntax error : HighPriority
bmcopy.asm(107) : error A2008:syntax error : NormalPriority
bmcopy.asm(110) : error A2008:syntax error : RestoreRegs
POLINK: fatal error: File not found: 'bmcopy.obj'.
Hi Hutch,
If you comment out the mem mov instructions, namely:
;mov rsi, rcx
;mov rdi, rdx
;mov rcx, r8
;rep movsb
and perform the tests, you will obtain 0 milliseconds or close to it. Almost all the time is taken by the copy. I think you need to perform the test differently.
JJ,
You need the latest macro file which is posted in the 64 bit MASM sub forum.
AW,
I put the identical copy code in all 3 to ensure there was some code activity between the preserve and restore code, as very short tests skew the results due to their length.
:biggrin:
Hutch,
Sorry, I got "it is not possible to execute this app" on my PC. We need to wait for
my new i7... I would like to help.
Hutch,
There is the variance of the whole run, which absorbs the differences you want to detect.
For this case, I would prefer qWord's x64 conversion of Michael Webster's code timing macros -
although neither is fully compliant with Agner Fog's thoughts, namely with respect to alignment - and test only the relevant instructions.
This is not a brown-noser comment, take it or leave it. :t
I get build errors (error A2070:invalid instruction operands) for movq mm0, rsi
But it works with movd mm0, rsi, and it actually copies a QWORD. This is ML64 8)
This code was assembled with ml64 in 64-bit format
Ticks for save_mmx: 2356
Ticks for save_xmm: 2324
Ticks for save_xmm_local: 2324
Ticks for save_local: 2325
The first two are "naked" procs; 3 and 4 have a stack frame; 3 saves rsi, rdi, rbx to xmm regs, 4 saves them to local qwords. This is for my Core i5. Example of loop design (@debug is an empty equate, align 16 for loops and procs):
mov r12, loops ; 1,000,000,000
mov r13, rv(GetTickCount)
align 16
@@: mov ecx, 123
@debug
call save_mmx
dec r12
jns @B
jinvoke GetTickCount
sub rax, r13
Print Str$("Ticks for save_mmx: %i\n", rax)
Example mmx proc:
0000000140001010 | 48 0F 6E C6 | movd mm0,rsi |
0000000140001014 | 48 0F 6E CF | movd mm1,rdi |
0000000140001018 | 48 0F 6E D3 | movd mm2,rbx |
000000014000101C | 48 FF CE | dec rsi |
000000014000101F | 48 FF CF | dec rdi |
0000000140001022 | 48 FF CB | dec rbx |
0000000140001025 | 48 0F 7E C6 | movd rsi,mm0 |
0000000140001029 | 48 0F 7E CF | movd rdi,mm1 |
000000014000102D | 48 0F 7E D3 | movd rbx,mm2 |
0000000140001031 | C3 | ret |
JJ,
We must live on different planets. This is how I build the test piece, and this is the disassembly of the proc that uses the XMM registers to preserve RSI and RDI.
Microsoft (R) Macro Assembler (x64) Version 14.10.24930.0
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: bmcopy.asm
Volume in drive K is disk3_k
Volume Serial Number is 68C7-4DBB
Directory of K:\asm64\copytest\copybm
21/07/2018 11:28 PM 2,958 bmcopy.asm
22/07/2018 10:12 PM 3,072 bmcopy.exe
22/07/2018 10:12 PM 4,234 bmcopy.obj
21/07/2018 11:30 PM 3,949 bmcopy.zip
4 File(s) 14,213 bytes
0 Dir(s) 724,183,744,512 bytes free
Press any key to continue . . .
sub_1400012d6 proc
.text:00000001400012d6 66480F6EC6 movq xmm0, rsi
.text:00000001400012db 66480F6ECF movq xmm1, rdi
.text:00000001400012e0 488BF1 mov rsi, rcx
.text:00000001400012e3 488BFA mov rdi, rdx
.text:00000001400012e6 498BC8 mov rcx, r8
.text:00000001400012e9 F3A4 rep movsb byte ptr [rdi], byte ptr [rsi]
.text:00000001400012eb 66480F7EC6 movq rsi, xmm0
.text:00000001400012f0 66480F7ECF movq rdi, xmm1
.text:00000001400012f5 C3 ret
sub_1400012d6 endp
One of the tests I did was the following: write the regs to the stack. For the little it's worth, it did not time with much variation from the others. I would not recommend it for production code but it works OK.
Warm up lap
copy1 1906 int regs
copy2 1891 xmm regs
copy3 1907 mmx regs
copy4 1907 manual stack
That's all folks ....
The algo.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
NOSTACKFRAME
copy4 proc
sub rsp, 32
mov [rsp+24], rsi ; store inside the 32 bytes just reserved ([rsp-8] would be below rsp, and Win64 has no red zone)
mov [rsp+16], rdi
mov rsi, rcx
mov rdi, rdx
mov rcx, r8
rep movsb
mov rsi, [rsp+24]
mov rdi, [rsp+16]
add rsp, 32
ret
copy4 endp
STACKFRAME
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
Quote from: hutch-- on July 22, 2018, 10:16:39 PMWe must live on different planets
Same planet, different source and library. I have a loop that calls a proc one billion times; the proc is just the bare minimum, no rep movsb etc.
Quote from: hutch-- on July 22, 2018, 10:45:11 PM
One of the tests I did was the following: write the regs to the stack. For the little it's worth, it did not time with much variation from the others.
Thanks for confirming my results above. So using mmx, xmm or just local variables produces roughly the same results.
I have deliberately avoided short tests with no runtime code in them, as that avoids some obvious cache effects from too-small instruction counts. Even though the rep movsb code is short, running it with a reasonable amount of work stretches the duration between the save and restore, and as each test proc has identical rep movsb code, what I was testing was the difference, if any, between methods of preserving RSI and RDI.
From what I have tested, and from the tests done by others with the original test piece, the differences between any of the register methods and the stack version are negligible, which tells me you can use any of the methods as they time much the same.
The short tests are made to run from the L1 cache, so they are appropriate for small pieces of code. This is the reason Agner Fog's code aligns on CACHELINESIZE, usually 64 bytes: to prevent sharing a line with other threads.
MichaelW's and qWord's alignment skip that detail.
My complaint with Agner Fog's method (the last version I saw) was that it only ran in ring0, which, with its privilege over normal ring3 operation, does not suffer the normal interference from task switching. The only testing method I support is real-time testing in ring3, and the longer the test in terms of duration, the more reliable it gets.
All tests run in ring 3; you don't need the kernel driver if you are not going to collect the sorts of information, like performance counters or cache hits/misses, that require changing CR4 or an MSR before they can be collected in ring 3.
I have done the tests using only the relevant instructions. I had to modify qWord's code because it did not work, due to confusion about saving to the wrong places (I had never used it before).
The results per iteration are:
Quote
Warm up
int regs: 4 cycles
xmm: 4 cycles
mmx regs: 4 cycles
That's all folks ....
There is absolutely no difference over 100 million iterations on my i7 8700K.
I tested on 3 other computers and the values are much higher, but again there is no significant difference between them.
Now, a different conclusion. :bgrin:
In the previous code I included the loop instructions in the timing. This is wrong, but if I exclude the time taken by them, the result is ZERO, or even negative, most of the time. This appears weird, but it is indeed true.
I tried, without success, different ways of making things work, even clearing the caches using CLFLUSH, but no joy. The set of instructions I wanted to measure always took ZERO cycles or less - all the time was spent on the loop instructions.
Now, I am using a different approach.
I repeat the set of instructions under test a certain number of times in each loop iteration (established as 10 in the test). I also discount the time taken by the looping instructions.
And, things are splitting up like in a chemical decomposition!
* 10 repeats per iteration. *
Warm up
int regs: 8 cycles
xmm: 30 cycles
mmx regs: 30 cycles
xchg regs: 31 cycles
push/pop regs: 38 cycles
stack 37 cycles
That's all folks ....
:biggrin:
You have just articulated the problem with calculated rather than timed instruction testing. I have seen heaps of results that turn up 0 or negative, both of which are impossible. Real-time testing has far fewer problems: if the duration of the test is too short you can get a zero due to timing granularity, but if you make the test long enough you never get 0 or negative results. If I am trying to time things with very little variation, I use a CPUID-serialised instruction to clear the cache, then use the interrupt-based SleepEx() to synchronise a timeslice exit, as well as raising the process priority to reduce OS interference.
Interesting. Seems to be consistent except for the first test run ( int regs )
* 10 repeats per iteration. *
Warm up
int regs: 3 cycles
xmm: 14 cycles
mmx regs: 14 cycles
xchg regs: 34 cycles
push/pop regs: 52 cycles
stack 52 cycles
That's all folks ....
* 10 repeats per iteration. *
Warm up
int regs: 8 cycles
xmm: 14 cycles
mmx regs: 14 cycles
xchg regs: 35 cycles
push/pop regs: 53 cycles
stack 53 cycles
That's all folks ....
* 10 repeats per iteration. *
Warm up
int regs: 7 cycles
xmm: 14 cycles
mmx regs: 14 cycles
xchg regs: 34 cycles
push/pop regs: 52 cycles
stack 52 cycles
That's all folks ....
* 10 repeats per iteration. *
Warm up
int regs: 4 cycles
xmm: 14 cycles
mmx regs: 14 cycles
xchg regs: 34 cycles
push/pop regs: 52 cycles
stack 52 cycles
That's all folks ....
The explanation may be hidden somewhere in the Intel and AMD microarchitecture manuals; lots of optimization happens behind the scenes. One thing I believe: it is becoming increasingly difficult to time small pieces of code properly. :(
@nidud :biggrin:
We can make 64-byte-aligned code sections without all that artillery.
Something like:
_TEXT64 segment alias (".code") 'CODE' align(64)
entry_point proc
sub rsp, 28h
align 64
; do stuff
mov rcx,0
call ExitProcess
entry_point endp
_TEXT64 ends
end
I did a quick test piece just looping the 3 combinations of register preservation and found that the test conditions changed the results in humorous ways. Increasing the priority improved the XMM version, commenting the cache-clearing CPUID in and out seemed to favour the MMX register version, and if you turned off the priority increase, CPUID and SleepEx, the integer version was faster. All have the same problem of cache saturation, as the tests are too short to be useful even with a very high iteration count.
Since the PIII the action has been instruction scheduling, and narrow instruction testing is not far off useless. Sequencing instructions through multiple pipelines without stalls is far more useful when chasing speed, as long as you stay away from very old instructions that only live in microcode and really slow the works up. The problem with narrow instruction testing is the assumption that processors work like a pre-i486 instruction chomper, and that world is long gone.
For the very little it's worth, this is the experimental test piece.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include64\masm64rt.inc
.code
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
entry_point proc
USING r13, r14, r15
LOCAL ireg :QWORD
LOCAL mreg :QWORD
LOCAL xreg :QWORD
mov ireg, 0
mov mreg, 0
mov xreg, 0
SaveRegs
; HighPriority
mov r13, 8
lbl:
; ------------------------------------
mov r15, 1000000000
; rcall SleepEx, 100,0
; cpuid
call GetTickCount
mov r14, rax
@@:
call intreg
sub r15, 1
jnz @B
call GetTickCount
sub rax, r14
add ireg, rax
conout str$(rax)," intreg",lf
; ------------------------------------
mov r15, 1000000000
; rcall SleepEx, 100,0
; cpuid
call GetTickCount
mov r14, rax
@@:
call mmxreg
sub r15, 1
jnz @B
call GetTickCount
sub rax, r14
add mreg, rax
conout str$(rax)," mmxreg",lf
; ------------------------------------
mov r15, 1000000000
; rcall SleepEx, 100,0
; cpuid
call GetTickCount
mov r14, rax
@@:
call xmmreg
sub r15, 1
jnz @B
call GetTickCount
sub rax, r14
add xreg, rax
conout str$(rax)," xmmreg",lf
; ------------------------------------
sub r13, 1
jnz lbl
shr ireg, 3
conout " INT Reg Average ",str$(ireg),lf
shr mreg, 3
conout " MMX Reg Average ",str$(mreg),lf
shr xreg, 3
conout " XMM Reg Average ",str$(xreg),lf
NormalPriority
waitkey
RestoreRegs
.exit
entry_point endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
NOSTACKFRAME
intreg proc
mov r11, rsi
mov r10, rdi
mov rsi, r11
mov rdi, r10
ret
intreg endp
STACKFRAME
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
NOSTACKFRAME
mmxreg proc
movq mm0, rsi
movq mm1, rdi
movq rsi, mm0
movq rdi, mm1
ret
mmxreg endp
STACKFRAME
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
NOSTACKFRAME
xmmreg proc
movq xmm0, rsi
movq xmm1, rdi
movq rsi, xmm0
movq rdi, xmm1
ret
xmmreg endp
STACKFRAME
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end
I added an int64 and a "mytest proc uses rsi rdi rbx" test case, and an empty loop:
This code was assembled with ml64 in 64-bit format
Ticks for save_int: 2324
Ticks for save_mmx: 2309
Ticks for save_xmm: 2309
Ticks for save_xmm_local: 2979
Ticks for save_local: 2309
Ticks for save_uses: 2652
Ticks for loop without call: 328
What I find astonishing is that the "memory cases" perform roughly like the three register cases, in spite of the fact that they need more instructions:
save_int:
0000000140001010 | 4C 8B E6 | mov r12,rsi |
0000000140001013 | 4C 8B EF | mov r13,rdi |
0000000140001016 | 4C 8B F3 | mov r14,rbx |
0000000140001019 | 48 FF CE | dec rsi |
000000014000101C | 48 FF CF | dec rdi |
000000014000101F | 48 FF CB | dec rbx |
0000000140001022 | 49 8B F4 | mov rsi,r12 |
0000000140001025 | 49 8B FD | mov rdi,r13 |
0000000140001028 | 49 8B DE | mov rbx,r14 |
000000014000102B | C3 | ret |
save_local:
0000000140001090 | 55 | push rbp |
0000000140001091 | 48 8B EC | mov rbp,rsp |
0000000140001094 | 48 81 EC A0 00 00 00 | sub rsp,A0 |
000000014000109B | 48 89 75 E8 | mov qword ptr ss:[rbp-18],rsi |
000000014000109F | 48 89 7D E0 | mov qword ptr ss:[rbp-20],rdi |
00000001400010A3 | 48 89 5D D8 | mov qword ptr ss:[rbp-28],rbx |
00000001400010A7 | 48 FF CE | dec rsi |
00000001400010AA | 48 FF CF | dec rdi |
00000001400010AD | 48 FF CB | dec rbx |
00000001400010B0 | 48 8B 75 E8 | mov rsi,qword ptr ss:[rbp-18] |
00000001400010B4 | 48 8B 7D E0 | mov rdi,qword ptr ss:[rbp-20] |
00000001400010B8 | 48 8B 5D D8 | mov rbx,qword ptr ss:[rbp-28] |
00000001400010BC | C9 | leave |
00000001400010BD | C3 | ret |
The only explanation that I have for this is that
a) creating a stack frame has almost zero cost
b) in a tight loop, writing and reading the L1 cache is as fast as using a register
Re b), see Stack Overflow: Cache or Registers - which is faster? (https://stackoverflow.com/questions/14504734/cache-or-registers-which-is-faster) - the answer was "registers should be faster"; but that is over 5 years old 8)
@jj2007 - this reminded me of a conclusion reached in a journal paper, by Majo & Gross regarding NUMA caches
a) creating a stack frame has almost zero cost
b) in a tight loop, writing and reading the L1 cache is as fast as using a register
A) --> almost zero - any ideas as to why?
B) --> which does align(pun) with the academic sources stating that access to L1 cache typically approaches zero wait-state
Thanks JJ for the hard work you always put in - much appreciated.
On this Haswell I am using, registers are clearly faster and the machine has fast memory.
1953 7 registers
2328 7 memory operands
2000 7 registers
2328 7 memory operands
2000 7 registers
2329 7 memory operands
1953 7 registers
2343 7 memory operands
1922 7 registers
2375 7 memory operands
2000 7 registers
2375 7 memory operands
1953 7 registers
2313 7 memory operands
1875 7 registers
2343 7 memory operands
Results
1957 Register average
2341 memory operand average
Press any key to continue...
The test piece.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include64\masm64rt.inc
.code
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
entry_point proc
call testproc
waitkey
invoke ExitProcess,0
ret
entry_point endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
iterations equ <1000000000>
testproc proc
LOCAL mem1 :QWORD
LOCAL mem2 :QWORD
LOCAL mem3 :QWORD
LOCAL mem4 :QWORD
LOCAL mem5 :QWORD
LOCAL mem6 :QWORD
LOCAL mem7 :QWORD
LOCAL cnt1 :QWORD
LOCAL cnt2 :QWORD
LOCAL time :QWORD
LOCAL rslt1 :QWORD
LOCAL rslt2 :QWORD
USING rsi, rdi, rbx, r12, r13, r14, r15
SaveRegs
HighPriority
mov rslt1, 0
mov rslt2, 0
mov cnt2, 8
loopstart:
; ------------------------------------
cpuid
call GetTickCount
mov time, rax
mov cnt1, iterations
@@:
mov rsi, 1
mov rdi, 2
mov rbx, 3
mov r12, 4
mov r13, 5
mov r14, 6
mov r15, 7
sub cnt1, 1
jnz @B
call GetTickCount
sub rax, time
add rslt1, rax
conout " ",str$(rax)," 7 registers",lf
; ------------------------------------
cpuid
call GetTickCount
mov time, rax
mov cnt1, iterations
@@:
mov mem1, 1
mov mem2, 2
mov mem3, 3
mov mem4, 4
mov mem5, 5
mov mem6, 6
mov mem7, 7
sub cnt1, 1
jnz @B
call GetTickCount
sub rax, time
add rslt2, rax
conout " ",str$(rax)," 7 memory operands",lf
; ------------------------------------
sub cnt2, 1
jnz loopstart
shr rslt1, 3
shr rslt2, 3
conout lf," Results",lf,lf
conout " ",str$(rslt1)," Register average",lf
conout " ",str$(rslt2)," memory operand average",lf,lf
NormalPriority
RestoreRegs
ret
testproc endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end
1841 7 registers
3401 7 memory operands
1763 7 registers
3354 7 memory operands
1762 7 registers
3401 7 memory operands
1701 7 registers
3354 7 memory operands
1763 7 registers
3354 7 memory operands
1762 7 registers
3354 7 memory operands
1763 7 registers
3339 7 memory operands
1825 7 registers
3354 7 memory operands
Results
1772 Register average
3363 memory operand average
AMD A6-9220e 1.60 Ghz
1625 7 registers
1813 7 memory operands
1515 7 registers
1781 7 memory operands
1625 7 registers
1797 7 memory operands
1625 7 registers
1782 7 memory operands
1625 7 registers
1781 7 memory operands
1625 7 registers
1781 7 memory operands
1625 7 registers
1781 7 memory operands
1625 7 registers
1797 7 memory operands
Results
1611 Register average
1789 memory operand average
Press any key to continue...
1127 7 registers
2168 7 memory operands
1113 7 registers
2169 7 memory operands
1113 7 registers
2173 7 memory operands
1113 7 registers
2168 7 memory operands
1112 7 registers
2168 7 memory operands
1113 7 registers
2169 7 memory operands
1113 7 registers
2168 7 memory operands
1114 7 registers
2168 7 memory operands
Results
1114 Register average
2168 memory operand average
Press any key to continue...
Hmmmm... I am getting similar results with NOTHING and a single NOP... for hutch's last posting here.
Back to the drawing board....
2418 nothing
2714 1 single nop
2481 nothing
2589 1 single nop
2528 nothing
2823 1 single nop
2153 nothing
2293 1 single nop
2075 nothing
2231 1 single nop
2168 nothing
2418 1 single nop
2496 nothing
2699 1 single nop
2746 nothing
3074 1 single nop
Results
2383 nothing average
2605 single nop average
Press any key to continue...
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm64\include64\masm64rt.inc
.code
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
entry_point proc
call testproc
waitkey
invoke ExitProcess,0
ret
entry_point endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
iterations equ <1000000000>
testproc proc
LOCAL mem1 :QWORD
LOCAL mem2 :QWORD
LOCAL mem3 :QWORD
LOCAL mem4 :QWORD
LOCAL mem5 :QWORD
LOCAL mem6 :QWORD
LOCAL mem7 :QWORD
LOCAL cnt1 :QWORD
LOCAL cnt2 :QWORD
LOCAL time :QWORD
LOCAL rslt1 :QWORD
LOCAL rslt2 :QWORD
USING rsi, rdi, rbx, r12, r13, r14, r15
SaveRegs
HighPriority
mov rslt1, 0
mov rslt2, 0
mov cnt2, 8
loopstart:
; ------------------------------------
cpuid
call GetTickCount
mov time, rax
mov cnt1, iterations
@@:
sub cnt1, 1
jnz @B
call GetTickCount
sub rax, time
add rslt1, rax
conout " ",str$(rax)," nothing",lf
; ------------------------------------
cpuid
call GetTickCount
mov time, rax
mov cnt1, iterations
@@:
nop
sub cnt1, 1
jnz @B
call GetTickCount
sub rax, time
add rslt2, rax
conout " ",str$(rax)," 1 single nop",lf
; ------------------------------------
sub cnt2, 1
jnz loopstart
shr rslt1, 3
shr rslt2, 3
conout lf," Results",lf,lf
conout " ",str$(rslt1)," nothing average",lf
conout " ",str$(rslt2)," single nop average",lf,lf
NormalPriority
RestoreRegs
ret
testproc endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end
Seems the timing code is taking up most of the time, or the reg/mem moves are as fast as nothing or a single nop.
Later I tried 100 nops, and the 100 nops were faster than nothing. No extra alignment, same code otherwise.