Is there any reason why the MMX registers can't be used as 8 additional general purpose registers?
start:
push ebx
push ecx
push edx
; Check feature bit 23 in EDX for MMX support
mov eax, 1
cpuid
mov eax, edx
shr eax, 23
and eax, 1
; Restore registers
pop edx
pop ecx
pop ebx
; not sure if these are well suited for integers
; xor eax,eax ; clear eax
; movss XMM0,value1
mov eax,10
movd MM0,eax ; move dword into low half of 64-bit MMX register
movd ecx,MM0
xor eax,eax
; pxor MM1,MM1 ; zero out MM1
; EMMS ; really slow, around 50 cycles
; Use if:
; You call any library routines or OS APIs (that might possibly use the FPU).
; You switch tasks in a cooperative fashion
; You execute any FPU instructions.
invoke ExitProcess,ecx
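For comparison, here is a minimal sketch of the same CPUID feature test in C, assuming a GCC/Clang-style compiler that provides &lt;cpuid.h&gt; (the `__get_cpuid` helper belongs to that compiler family and is not part of the thread's code):

```c
#include <cpuid.h>

/* Returns 1 if CPUID.01H:EDX bit 23 (MMX) is set, 0 otherwise. */
int has_mmx(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;               /* CPUID leaf 1 not available */
    return (edx >> 23) & 1;     /* same shr/and test as the asm above */
}
```

On any x86-64 machine this should return 1, since MMX is part of the 64-bit baseline.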
Quote from: Magnum on March 02, 2013, 04:16:22 AM
Is there any reason why the MMX registers can't be used as 8 additional general purpose registers?
You can use the MMX registers any way you like, provided you don't use FPU code.
They are only aliased onto the FPU registers; they are not dedicated hardware
registers of their own, as the GPRs are.
1 reason - old machine without MMX/SSE
the advantage of general registers is speed
swapping data in and out of MMX/FPU registers probably negates that advantage
you might want to run a few timing tests to see if it isn't just as fast to use a memory location
If you can live with 0.3 cycles less speed, or without the FPU, why not? But no flags...
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 188/100 cycles
176 cycles for 100 * mov
167 cycles for 100 * movd mm
188 cycles for 100 * movd xmm
176 cycles for 100 * mov
168 cycles for 100 * movd mm
190 cycles for 100 * movd xmm
178 cycles for 100 * mov
167 cycles for 100 * movd mm
188 cycles for 100 * movd xmm
2 bytes for mov
4 bytes for movd mm
6 bytes for movd xmm
Hmmm...
AMD Phenom(tm) II X6 1100T Processor (SSE3)
loop overhead is approx. 215/100 cycles
60 cycles for 100 * mov
1079 cycles for 100 * movd mm
951 cycles for 100 * movd xmm
59 cycles for 100 * mov
1082 cycles for 100 * movd mm
967 cycles for 100 * movd xmm
62 cycles for 100 * mov
1077 cycles for 100 * movd mm
953 cycles for 100 * movd xmm
It's supposed to be "politically incorrect" to mix MMX and integer registers, but I have never had any problems doing it. If you like living dangerously with all eight 32-bit registers, you can plonk ESP into an MMX register and store other, less frequently used variables in the rest of the MMX registers. Usual proviso: forget using FP code while doing this.
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
loop overhead is approx. 124/100 cycles
211 cycles for 100 * mov
392 cycles for 100 * movd mm
393 cycles for 100 * movd xmm
211 cycles for 100 * mov
391 cycles for 100 * movd mm
394 cycles for 100 * movd xmm
211 cycles for 100 * mov
392 cycles for 100 * movd mm
393 cycles for 100 * movd xmm
2 bytes for mov
4 bytes for movd mm
6 bytes for movd xmm
--- ok ---
P3:
pre-P4 (SSE1)
loop overhead is approx. 209/100 cycles
153 cycles for 100 * mov
159 cycles for 100 * movd mm
139 cycles for 100 * movd xmm
157 cycles for 100 * mov
159 cycles for 100 * movd mm
147 cycles for 100 * movd xmm
153 cycles for 100 * mov
161 cycles for 100 * movd mm
139 cycles for 100 * movd xmm
2 bytes for mov
4 bytes for movd mm
6 bytes for movd xmm
Jochen,
here are my results:
Quote
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
+++++++++++++++++++1 of 20 tests valid, loop overhead is approx. 161/100 cycles
12 cycles for 100 * mov
94 cycles for 100 * movd mm
94 cycles for 100 * movd xmm
27 cycles for 100 * mov
420 cycles for 100 * movd mm
461 cycles for 100 * movd xmm
26 cycles for 100 * mov
384 cycles for 100 * movd mm
98 cycles for 100 * movd xmm
2 bytes for mov
4 bytes for movd mm
6 bytes for movd xmm
--- ok ---
Gunther
this one doesn't really compare with mov reg32,mem or mov mem,reg32
i.e. is it better to use an MMX/XMM register or a memory location?
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 243/100 cycles
104 cycles for 100 * mov
1013 cycles for 100 * movd mm
1306 cycles for 100 * movd xmm
106 cycles for 100 * mov
1010 cycles for 100 * movd mm
1306 cycles for 100 * movd xmm
152 cycles for 100 * mov
1007 cycles for 100 * movd mm
1308 cycles for 100 * movd xmm
It seems to be the case that the later the processor, the slower the MMX registers get, so while they are still useful, if you can use XMM registers you are better off than with MMX. I guess this fits with Intel's direction, where in outright performance terms XMM is favoured and gets the more optimal silicon in the die layout.
Thanks for the testing.
I don't understand what the movd mm is?
The results seem to conflict in the timings.
Michael shows MMX to be faster, but not the rest.
If you have a multiple-core processor, does that mean you have multiple copies of the MMX registers as well?
Andy
No Andy, it means older processors performed differently to later ones. From about the Core2 series onwards, XMM got progressively faster and MMX got progressively slower.
Quote from: Magnum on March 03, 2013, 01:03:06 AM
I don't understand what the movd mm is?
Andy
MOVD means MOV DWORD
movd mm means MOV DWORD to lower part of MMX register.
This is not the best way to use them, however.
They are supposed to be used to move 8 bytes at a time:
MOVQ MM0, MM1, for example. Or you can move 8-byte
variables from/into MMX registers.
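A minimal C sketch of those MOVD/EMMS semantics, using the MMX intrinsics from &lt;mmintrin.h&gt; (assuming a GCC/Clang-style x86 compiler; the function name `mmx_roundtrip` is made up for illustration):

```c
#include <mmintrin.h>   /* MMX intrinsics; on 32-bit targets compile with -mmmx */

/* Round-trips a 32-bit value through an MMX register: the intrinsic
   equivalent of "movd mm0, eax" followed by "movd eax, mm0". */
int mmx_roundtrip(int value)
{
    __m64 m = _mm_cvtsi32_si64(value);  /* movd mm, r32 */
    int out = _mm_cvtsi64_si32(m);      /* movd r32, mm */
    _mm_empty();                        /* emms: release the aliased FPU state */
    return out;
}
```

The `_mm_empty()` call corresponds to the EMMS instruction discussed earlier; leaving it out invites trouble the moment any FPU or library code runs.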
Andy,
Among other things you can use MMX registers to perform this task.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
tstproc proc
; ---------------------------
; preserve required registers
; ---------------------------
movd mm1, ebx
movd mm2, esi
movd mm3, edi
movd mm4, ebp
movd mm5, esp
; -----------------------------------------------
; read the values in ESP into preserved registers
; -----------------------------------------------
mov ebp, [esp+4]
mov ebx, [esp+8]
mov esi, [esp+12]
mov edi, [esp+16]
; ****************************************************************
; write 8 register code here
; ****************************************************************
; ----------------------------------------------
; write value to ESP "AFTER" values read from it
; ----------------------------------------------
mov esp, 1
; ----------------------------------------------------------
; write values to registers that do not need to be preserved
; ----------------------------------------------------------
mov eax, 2
mov ecx, 3
mov edx, 4
; ****************************************************************
; ****************************************************************
; -----------------------------------
; restore register values before exit
; -----------------------------------
movd esp, mm5
movd ebp, mm4
movd edi, mm3
movd esi, mm2
movd ebx, mm1
ret 16
tstproc endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
If I have this written correctly, then XMM is quite a bit slower than MMX.
I am going to run it in safe mode to see if there is more consistency.
Andy
INCLUDE \masm32\include\masm32rt.inc
.686p
.MMX
.XMM
INCLUDE \masm32\macros\timers.asm
LOOP_COUNT = 1000000 ;try to choose a value so each run takes about 0.5 seconds
.DATA
.DATA?
.CODE
_main PROC
; Bind the processor to a single core and delay
INVOKE GetCurrentProcess
INVOKE SetProcessAffinityMask,eax,1
INVOKE Sleep,300
print "XMM instructions in HIGH_PRIORITY_CLASS",13,10
print " ",13,10
mov ecx,10
loop00:
push ecx
counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
;code to be timed goes here
xor eax,eax
xorps XMM0,XMM0 ; Clear xmm0
mov ebx,7
movd XMM0,ebx
movd edx,XMM0
xor eax,eax
xorps XMM0,XMM0 ; Clear xmm0
mov ebx,7
movd XMM0,ebx
movd edx,XMM0
xor eax,eax
xorps XMM0,XMM0 ; Clear xmm0
mov ebx,7
movd XMM0,ebx
movd edx,XMM0
xor eax,eax
xorps XMM0,XMM0 ; Clear xmm0
mov ebx,7
movd XMM0,ebx
movd edx,XMM0
xor eax,eax
xorps XMM0,XMM0 ; Clear xmm0
mov ebx,7
movd XMM0,ebx
movd edx,XMM0
xor eax,eax
xorps XMM0,XMM0 ; Clear xmm0
mov ebx,7
movd XMM0,ebx
movd edx,XMM0
xor eax,eax
xorps XMM0,XMM0 ; Clear xmm0
mov ebx,7
movd XMM0,ebx
movd edx,XMM0
xor eax,eax
xorps XMM0,XMM0 ; Clear xmm0
mov ebx,7
movd XMM0,ebx
movd edx,XMM0
;-------------------------
counter_end
print ustr$(eax),44,32
pop ecx
dec ecx
jnz loop00
print chr$(13,10)
print " ",13,10
print "MM0 Instructions in HIGH_PRIORITY_CLASS",13,10
print " ",13,10
mov ecx,10
loop01:
push ecx
counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
xor eax,eax
pxor MM0,MM0 ; zero out MM0
mov ebx,7
movd MM0,ebx ; move dword into low half of MM0
movd ecx,MM0
xor eax,eax
pxor MM0,MM0 ; zero out MM0
mov ebx,7
movd MM0,ebx ; move dword into low half of MM0
movd ecx,MM0
xor eax,eax
pxor MM0,MM0 ; zero out MM0
mov ebx,7
movd MM0,ebx ; move dword into low half of MM0
movd ecx,MM0
xor eax,eax
pxor MM0,MM0 ; zero out MM0
mov ebx,7
movd MM0,ebx ; move dword into low half of MM0
movd ecx,MM0
xor eax,eax
pxor MM0,MM0 ; zero out MM0
mov ebx,7
movd MM0,ebx ; move dword into low half of MM0
movd ecx,MM0
xor eax,eax
pxor MM0,MM0 ; zero out MM0
mov ebx,7
movd MM0,ebx ; move dword into low half of MM0
movd ecx,MM0
xor eax,eax
pxor MM0,MM0 ; zero out MM0
mov ebx,7
movd MM0,ebx ; move dword into low half of MM0
movd ecx,MM0
xor eax,eax
pxor MM0,MM0 ; zero out MM0
mov ebx,7
movd MM0,ebx ; move dword into low half of MM0
movd ecx,MM0
counter_end
print ustr$(eax),44,32
pop ecx
dec ecx
jnz loop01
print chr$(13,10)
inkey
exit
_main ENDP
END _main
it may run differently on different CPU's
that's why we run things in the laboratory sub-forum
so we can pick "what is best overall", rather than just "what is best on my CPU"
safe mode will probably make them both slower - just a guess :biggrin:
On my CORE Duo the MMX and XMM registers perform more or less
at the same speed:
Quote
XMM instructions in HIGH_PRIORITY_CLASS
9, 8, 8, 8, 8, 7, 8, 8, 8, 8,
MM0 Instructions in HIGH_PRIORITY_CLASS
8, 8, 8, 8, 8, 9, 8, 9, 8, 8,
Press any key to continue ...
Thanks frktons.
Dave,
I can't see how safe mode would have any negative effect.
The XMM and MMX registers are built into the chip.
If anything, I would think they would be faster in safe mode because of minimal drivers and preloaded programs.
But I may be wrong.
Andy
Quote
Intel(R) Core(TM) i5-3570K CPU @ 3.40GHz (SSE4)
loop overhead is approx. 195/100 cycles
?? cycles for 100 * mov
84 cycles for 100 * movd mm
81 cycles for 100 * movd xmm
8 cycles for 100 * mov
81 cycles for 100 * movd mm
81 cycles for 100 * movd xmm
13 cycles for 100 * mov
87 cycles for 100 * movd mm
86 cycles for 100 * movd xmm
2 bytes for mov
4 bytes for movd mm
4 bytes for movd xmm
Just thought I'd pop in to remind everyone about Ivy Bridge and its mov reg32, reg32 renaming optimization.
Hi kode54,
Quote from: kode54 on March 17, 2013, 06:25:22 AM
Just thought I'd pop in to remind everyone about Ivy Bridge and its mov reg32, reg32 renaming optimization.
that's right, but Sandy Bridge and Ivy Bridge are not very widespread at the present time. So a lot of forum members can note that, but not use it.
Gunther