Benchmark testing different types of registers.

hutch-- · July 21, 2018, 11:40:19 PM

I needed to test whether there is any measurable difference when preserving and restoring registers using three types of registers: general purpose integer, MMX and XMM. The benchmark uses a very simple "rep movsb" copy algo with identical mnemonics in each test, and I have used the three different register types to save and restore RSI and RDI.

I am not getting any meaningful differences between the three on the Haswell I work on, and I get the usual wander in results even with high priority settings, a CPUID serialising instruction and a timed yield using SleepEx().

These are typical results and they wander with each run of the test piece.

Warm up lap
copy1 1906 int regs
copy2 1922 xmm regs
copy3 1984 mmx regs

That's all folks ....

Siekmanski · July 22, 2018, 12:22:31 AM

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz

Warm up lap

copy1 1953 int regs
copy2 1922 xmm regs
copy3 1922 mmx regs

That's all folks ....

jj2007 · July 22, 2018, 01:18:17 AM

  copy1 2043 int regs
  copy2 2059 xmm regs
  copy3 2059 mmx regs

I wanted to add a test saving them to [ebp+n] but it doesn't build: :(

Microsoft (R) Macro Assembler (x64) Version 14.10.24930.0
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: bmcopy.asm
bmcopy.asm(16) : error A2008:syntax error : r14
bmcopy.asm(18) : error A2008:syntax error : SaveRegs
bmcopy.asm(20) : error A2008:syntax error : HighPriority
bmcopy.asm(107) : error A2008:syntax error : NormalPriority
bmcopy.asm(110) : error A2008:syntax error : RestoreRegs
POLINK: fatal error: File not found: 'bmcopy.obj'.

aw27 · July 22, 2018, 02:02:41 AM

Hi Hutch,
If you comment out the mem mov instructions, namely:
;mov rsi, rcx
;mov rdi, rdx
;mov rcx, r8
;rep movsb
and run the tests, you will get 0 milliseconds or close to it. Almost all of the time is taken by those instructions. I think you need to perform the test differently.

hutch-- · July 22, 2018, 02:42:38 AM

JJ,

You need the latest macro file which is posted in the 64 bit MASM sub forum.

AW,

I put the identical copy code in all 3 to ensure there was a code action between the preserve and restore code as the very short tests skew the results due to their length.

RuiLoureiro · July 22, 2018, 08:27:39 AM

Hutch,
Sorry, I got "it is not possible to execute this app" on my PC. We need to wait for my new i7... I would like to help.

aw27 · July 22, 2018, 06:36:50 PM

Hutch,
The variance of the whole test absorbs the differences you want to detect.
For this case, I would prefer qWord's x64 conversion of Michael Webster's code timing macros (although neither is fully compliant with Agner Fog's thoughts, namely with respect to alignment) and test only the relevant instructions.
This is not a brown-noser comment, take it or leave it. :t

jj2007 · July 22, 2018, 07:17:30 PM

I get build errors (error A2070:invalid instruction operands) for movq mm0, rsi
But it works for movd mm0, rsi, and copies a QWORD, actually. This is ML64 8)

This code was assembled with ml64 in 64-bit format
Ticks for save_mmx: 2356
Ticks for save_xmm: 2324
Ticks for save_xmm_local: 2324
Ticks for save_local: 2325

The first two are "naked" procs, 3+4 have a stack frame; 3 saves rsi rdi rbx to xmm regs, 4 saves them to local qwords. This is for my Core i5. Example of loop design (@debug is an empty equate, align 16 for loops and procs):

  mov r12, loops  ; 1,000,000,000
  mov r13, rv(GetTickCount)
  align 16
@@:	mov ecx, 123
	@debug
	call save_mmx
	dec r12
	jns @B
  jinvoke GetTickCount
  sub rax, r13
  Print Str$("Ticks for save_mmx: %i\n", rax)

Example mmx proc:

0000000140001010   | 48 0F 6E C6                      | movd mm0,rsi                            |
0000000140001014   | 48 0F 6E CF                      | movd mm1,rdi                            |
0000000140001018   | 48 0F 6E D3                      | movd mm2,rbx                            |
000000014000101C   | 48 FF CE                         | dec rsi                                 |
000000014000101F   | 48 FF CF                         | dec rdi                                 |
0000000140001022   | 48 FF CB                         | dec rbx                                 |
0000000140001025   | 48 0F 7E C6                      | movd rsi,mm0                            |
0000000140001029   | 48 0F 7E CF                      | movd rdi,mm1                            |
000000014000102D   | 48 0F 7E D3                      | movd rbx,mm2                            |
0000000140001031   | C3                               | ret                                     |
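
For readers without the macro library, the loop design above can be approximated in C. This is only a rough sketch under stated assumptions: it swaps GetTickCount for the standard `clock()` function, and `empty_proc`, `time_calls` and `net_ms` are hypothetical names that do not appear in the thread. The procs under test in the thread are assembly, so the C function here is just a stand-in.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Stand-in for the proc under test (hypothetical; the thread's procs
   are written in assembly). */
void empty_proc(void)
{
}

/* Call fn `loops` times and return the elapsed time in milliseconds
   as reported by clock(). */
double time_calls(void (*fn)(void), unsigned long loops)
{
    assert(fn != NULL);

    /* A volatile function pointer forces a real indirect call on every
       iteration, so the compiler cannot drop the loop. */
    void (*volatile call)(void) = fn;

    clock_t start = clock();
    for (unsigned long i = 0; i < loops; i++)
        call();
    clock_t end = clock();

    return 1000.0 * (double)(end - start) / (double)CLOCKS_PER_SEC;
}

/* Time attributable to fn itself: total time minus the cost of calling
   an empty proc in the same loop. Noise can make this slightly negative. */
double net_ms(void (*fn)(void), unsigned long loops)
{
    return time_calls(fn, loops) - time_calls(empty_proc, loops);
}
```

Calling `time_calls(empty_proc, 1000000000UL)` gives the cost of the bare loop and call, which is the baseline that `net_ms` subtracts from any other proc timed the same way.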

hutch-- · July 22, 2018, 10:16:39 PM

JJ,

We must live on different planets, this is how I build the test piece and this is the disassembly of the proc that uses the XMM registers to preserve RSI and RDI.

Microsoft (R) Macro Assembler (x64) Version 14.10.24930.0
Copyright (C) Microsoft Corporation. All rights reserved.

Assembling: bmcopy.asm
Volume in drive K is disk3_k
Volume Serial Number is 68C7-4DBB

Directory of K:\asm64\copytest\copybm

21/07/2018 11:28 PM 2,958 bmcopy.asm
22/07/2018 10:12 PM 3,072 bmcopy.exe
22/07/2018 10:12 PM 4,234 bmcopy.obj
21/07/2018 11:30 PM 3,949 bmcopy.zip
4 File(s) 14,213 bytes
0 Dir(s) 724,183,744,512 bytes free
Press any key to continue . . .

sub_1400012d6 proc
.text:00000001400012d6 66480F6EC6 movq xmm0, rsi
.text:00000001400012db 66480F6ECF movq xmm1, rdi
.text:00000001400012e0 488BF1 mov rsi, rcx
.text:00000001400012e3 488BFA mov rdi, rdx
.text:00000001400012e6 498BC8 mov rcx, r8
.text:00000001400012e9 F3A4 rep movsb byte ptr [rdi], byte ptr [rsi]
.text:00000001400012eb 66480F7EC6 movq rsi, xmm0
.text:00000001400012f0 66480F7ECF movq rdi, xmm1
.text:00000001400012f5 C3 ret
sub_1400012d6 endp

hutch-- · July 22, 2018, 10:45:11 PM

One of the tests I did was the following: write the regs to the stack. For the little it's worth, it did not time with much variation from the others. I would not recommend it for production code but it works OK.

Warm up lap
copy1 1906 int regs
copy2 1891 xmm regs
copy3 1907 mmx regs
copy4 1907 manual stack

That's all folks ....

The algo.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

NOSTACKFRAME

copy4 proc

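sub rsp, 32             ; reserve 32 bytes of stack

; keep the saves inside the reserved area: Win64 has no red zone,
; so memory below RSP is volatile
mov [rsp+8], rsi
mov [rsp+16], rdi

mov rsi, rcx            ; source
mov rdi, rdx            ; destination
mov rcx, r8             ; byte count
rep movsb

mov rsi, [rsp+8]        ; restore RSI and RDI
mov rdi, [rsp+16]

add rsp, 32             ; release the reserved stack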

ret

copy4 endp

STACKFRAME

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

jj2007 · July 22, 2018, 11:10:36 PM

Quote from: hutch-- on July 22, 2018, 10:16:39 PM
We must live on different planets

Same planet, different source and library. I have a loop that calls a proc one billion times; the proc is just the bare minimum, no rep movsb etc.

Quote from: hutch-- on July 22, 2018, 10:45:11 PM
One of the tests I did was the following, write the regs to the stack and for the little it's worth, it did not time with much variation from the others.

Thanks for confirming my results above. So using mmx, xmm or just local variables produces roughly the same results.

nidud · July 22, 2018, 11:15:42 PM

deleted

nidud · July 22, 2018, 11:25:22 PM

deleted

hutch-- · July 23, 2018, 12:26:39 AM

I have deliberately avoided short tests with no runtime code in them, as this avoids some obvious cache effects from instruction counts that are too small. Even though the rep movsb code is short, running it with a reasonable amount of work stretches the duration between the save and restore. As each test proc has the identical rep movsb code, what I was testing was the difference, if any, between the different methods of preserving RSI and RDI.

From what I have tested, and from the tests done by others with the original test piece, the differences between the register methods and the stack version are negligible, which tells me that you can use any of the methods as they time much the same.
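
One rough way to check a "much the same" conclusion against posted timings is to compare the spread of repeated runs per method. The C sketch below is an illustration only, not part of the test piece; `run_stats`, `summarize` and `ranges_overlap` are hypothetical names, and the overlap test is a deliberately crude stand-in for a proper statistical comparison.

```c
#include <assert.h>
#include <stddef.h>

/* Min, max and mean of a set of repeated timings (e.g. milliseconds). */
typedef struct {
    double min;
    double max;
    double mean;
} run_stats;

/* Summarize n timings; n must be at least 1. */
run_stats summarize(const double *t, size_t n)
{
    assert(t != NULL && n > 0);

    run_stats s = { t[0], t[0], 0.0 };
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (t[i] < s.min) s.min = t[i];
        if (t[i] > s.max) s.max = t[i];
        sum += t[i];
    }
    s.mean = sum / (double)n;
    return s;
}

/* Returns 1 when the [min, max] ranges of two methods overlap, meaning
   the run-to-run wander is at least as large as the gap between them. */
int ranges_overlap(run_stats a, run_stats b)
{
    return a.min <= b.max && b.min <= a.max;
}
```

With the two runs hutch posted in this thread (1906 and 1906 for the integer registers, 1922 and 1891 for XMM), the ranges overlap, so on those numbers the two methods cannot be told apart. Two runs per method is a very small sample; more runs would give a more trustworthy spread.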

aw27 · July 23, 2018, 01:14:55 AM

The short tests are made to use the L1 cache, so they are appropriate for small pieces of code. This is the reason Agner Fog's code aligns on CACHELINESIZE, usually 64 bytes, to prevent sharing with other threads.
MichaelW's and qWord's alignment skips that detail.
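
As a general illustration of cache-line alignment, and not a reproduction of Agner Fog's code, the C11 sketch below gives each data slot its own 64-byte line so that two threads writing to neighbouring slots do not share a line. The 64-byte figure is the usual value quoted above and is assumed here; `padded_slot`, `is_line_aligned` and `slots` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINESIZE 64   /* assumed line size, the usual value */

/* One slot per thread. _Alignas (C11) starts each slot on its own cache
   line and rounds the struct size up to a multiple of the alignment. */
typedef struct {
    _Alignas(CACHELINESIZE) uint64_t ticks;
} padded_slot;

/* Returns 1 if p sits on a cache-line boundary. */
int is_line_aligned(const void *p)
{
    assert(p != NULL);
    return ((uintptr_t)p % CACHELINESIZE) == 0;
}

/* Two adjacent slots, e.g. one per timing thread (hypothetical use). */
padded_slot slots[2];
```

Because the struct is padded to a full line, `slots[0]` and `slots[1]` sit exactly CACHELINESIZE bytes apart, so each one occupies a separate line.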

The MASM Forum
