Use of MMX regs

Magnum · March 02, 2013, 04:16:22 AM

Is there any reason why the MMX registers can't be used as 8 additional general-purpose registers?


start:

    push   ebx                  
    push   ecx
    push   edx
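    ; Check feature flag 23 in EDX for MMX support
    mov    eax, 1               ; CPUID leaf 1: feature information
    cpuid                       ; overwrites EAX, EBX, ECX and EDX
    mov    eax, edx
    shr    eax, 23              ; move bit 23 (MMX) down to bit 0
    and    eax, 1               ; EAX = 1 if MMX is supported, else 0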
    ; Restore registers
    pop    edx                  
    pop    ecx
    pop    ebx

; not sure if these are well suited for integers

; xor eax,eax ; clear eax
; movss XMM0,value1

mov eax,10
movd MM0,eax     ; move value into 64 bit register
movd  ecx,MM0
xor  eax,eax

; pxor MM1,MM1 ; zero out MM1

; EMMS ; really slow, around 50 cycles

; Use EMMS after MMX code if:
; You call any library routines or OS APIs (that might possibly use the FPU).
; You switch tasks in a cooperative fashion.
; You execute any FPU instructions afterwards.

invoke ExitProcess,ecx

frktons · March 02, 2013, 11:44:00 AM

Quote from: Magnum on March 02, 2013, 04:16:22 AM
Is there any reason why the MMX registers can't be used as 8 additional general-purpose registers?

You can use the MMX registers however you like, provided you don't mix in FPU code.
Unlike the GPRs, they are not separate hardware registers of their own:
MM0-MM7 are aliased onto the low 64 bits of the x87 FPU data registers.
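
To illustrate the aliasing point, here is a minimal sketch, assuming a 32-bit MASM build with kernel32.lib on the library path. The variable fpval is hypothetical and exists only to exercise the FPU. Any MMX instruction other than EMMS marks the whole x87 tag word as in use, so the sketch executes EMMS before its first FPU instruction.

```asm
; Sketch only. Assumptions: 32-bit MASM, kernel32.lib on the library path,
; fpval is a hypothetical variable used just to exercise the FPU.
.686
.mmx
.model flat, stdcall
option casemap:none

ExitProcess PROTO :DWORD
includelib kernel32.lib

.data
fpval   REAL8 1.5

.code
start:
    mov     eax, 10
    movd    mm0, eax        ; park the value in MM0; EAX is now free
    xor     eax, eax        ; ... integer-only work reusing EAX ...
    movd    ecx, mm0        ; fetch the parked value back

    emms                    ; mark all x87 registers empty again
    fld     fpval           ; FPU code is safe from here on
    fstp    st(0)           ; discard the value, leaving the FPU stack empty

    invoke  ExitProcess, ecx
end start
```

Without the EMMS, the FLD would find every x87 register tagged as in use and signal a stack fault; with the default exception masking it would load a QNaN rather than 1.5.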

bomz · March 02, 2013, 12:35:12 PM

One reason: old machines without MMX/SSE.

dedndave · March 02, 2013, 01:30:09 PM

the advantage of general registers is speed
shuttling values between general and MMX/FPU registers probably throws that advantage away
you might want to run a few timing tests to see if it isn't just as fast to use a memory location

jj2007 · March 02, 2013, 04:37:51 PM

If you can live with 0.3 cycles less speed, or without the FPU, why not? But no flags...

Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 188/100 cycles

176 cycles for 100 * mov
167 cycles for 100 * movd mm
188 cycles for 100 * movd xmm

176 cycles for 100 * mov
168 cycles for 100 * movd mm
190 cycles for 100 * movd xmm

178 cycles for 100 * mov
167 cycles for 100 * movd mm
188 cycles for 100 * movd xmm

2 bytes for mov
4 bytes for movd mm
6 bytes for movd xmm

sinsi · March 02, 2013, 05:05:15 PM

Hmmm...

AMD Phenom(tm) II X6 1100T Processor (SSE3)
loop overhead is approx. 215/100 cycles

60 cycles for 100 * mov
1079 cycles for 100 * movd mm
951 cycles for 100 * movd xmm

59 cycles for 100 * mov
1082 cycles for 100 * movd mm
967 cycles for 100 * movd xmm

62 cycles for 100 * mov
1077 cycles for 100 * movd mm
953 cycles for 100 * movd xmm

hutch-- · March 02, 2013, 05:33:37 PM

It's supposed to be "politically incorrect" to mix MMX and integer registers, but I have never had any problems doing it. If you like living dangerously with all eight 32-bit registers, you can plonk ESP into an MMX register and store other, less frequently used variables in the rest of the MMX registers. Usual proviso: forget using FP code while doing this.

Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
loop overhead is approx. 124/100 cycles

211 cycles for 100 * mov
392 cycles for 100 * movd mm
393 cycles for 100 * movd xmm

211 cycles for 100 * mov
391 cycles for 100 * movd mm
394 cycles for 100 * movd xmm

211 cycles for 100 * mov
392 cycles for 100 * movd mm
393 cycles for 100 * movd xmm

2 bytes for mov
4 bytes for movd mm
6 bytes for movd xmm

--- ok ---

MichaelW · March 02, 2013, 07:10:06 PM

P3:

pre-P4 (SSE1)
loop overhead is approx. 209/100 cycles

153     cycles for 100 * mov
159     cycles for 100 * movd mm
139     cycles for 100 * movd xmm

157     cycles for 100 * mov
159     cycles for 100 * movd mm
147     cycles for 100 * movd xmm

153     cycles for 100 * mov
161     cycles for 100 * movd mm
139     cycles for 100 * movd xmm

2       bytes for mov
4       bytes for movd mm
6       bytes for movd xmm

Gunther · March 02, 2013, 11:07:26 PM

Jochen,

here are my results:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
+++++++++++++++++++1 of 20 tests valid, loop overhead is approx. 161/100 cycles

12   cycles for 100 * mov
94   cycles for 100 * movd mm
94   cycles for 100 * movd xmm

27   cycles for 100 * mov
420   cycles for 100 * movd mm
461   cycles for 100 * movd xmm

26   cycles for 100 * mov
384   cycles for 100 * movd mm
98   cycles for 100 * movd xmm

2   bytes for mov
4   bytes for movd mm
6   bytes for movd xmm

--- ok ---

Gunther

dedndave · March 02, 2013, 11:33:36 PM

this one doesn't really compare with mov reg32,mem or mov mem,reg32
i.e. is it better to use an MMX/XMM register or a memory location ?
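
The two alternatives being compared can be sketched side by side. This is only an illustration under assumptions (32-bit MASM, kernel32.lib on the library path, and a hypothetical scratch variable named spill); it makes no claim about which variant is faster, which is what a timing test would have to show.

```asm
; Sketch only. Assumptions: 32-bit MASM, kernel32.lib on the library path,
; spill is a hypothetical scratch variable.
.686
.mmx
.model flat, stdcall
option casemap:none

ExitProcess PROTO :DWORD
includelib kernel32.lib

.data?
spill   DWORD ?

.code
start:
    mov     ebx, 10

    ; Variant A: park EBX in a memory location
    mov     spill, ebx
    xor     ebx, ebx        ; ... code that needs EBX for something else ...
    mov     ebx, spill

    ; Variant B: park EBX in an MMX register
    movd    mm0, ebx
    xor     ebx, ebx        ; ... code that needs EBX for something else ...
    movd    ebx, mm0
    emms                    ; needed before FPU code or calls that may use the FPU

    invoke  ExitProcess, ebx
end start
```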

prescott w/htt

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 243/100 cycles

104     cycles for 100 * mov
1013    cycles for 100 * movd mm
1306    cycles for 100 * movd xmm

106     cycles for 100 * mov
1010    cycles for 100 * movd mm
1306    cycles for 100 * movd xmm

152     cycles for 100 * mov
1007    cycles for 100 * movd mm
1308    cycles for 100 * movd xmm

hutch-- · March 03, 2013, 12:14:24 AM

It seems that the later the processor, the slower the MMX registers get, so while they are useful, you are better off with XMM registers if you can use them. I guess this fits with Intel's design direction, where XMM is favoured in outright performance terms and gets the more optimised silicon in the die layout.

Magnum · March 03, 2013, 01:03:06 AM

Thanks for the testing.

I don't understand what movd mm is.

The timings seem to conflict.

Michael's results show MMX to be faster, but the rest don't.

If you have a multi-core processor, does that mean you have multiple copies of the MMX registers as well?

Andy

hutch-- · March 03, 2013, 01:27:12 AM

No Andy, it means older processors performed differently to later ones. From about the Core2 series onwards, XMM got progressively faster and MMX got progressively slower.

frktons · March 03, 2013, 02:41:11 AM

Quote from: Magnum on March 03, 2013, 01:03:06 AM

I don't understand what movd mm is.

Andy

MOVD means "move doubleword".
movd mm moves a DWORD into the low half of an MMX register (the upper half is zeroed) or back out of it.
This is not the best way to use them, however.

They are meant to move 8 bytes at a time:
MOVQ MM0, MM1 for example. You can also move 8-byte
variables from/into MMX registers.
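
A short sketch of the difference between the two instructions, assuming a 32-bit MASM build with kernel32.lib on the library path. The variables qsrc and qdst are hypothetical names chosen for this illustration.

```asm
; Sketch only. Assumptions: 32-bit MASM, kernel32.lib on the library path,
; qsrc and qdst are hypothetical 8-byte variables.
.686
.mmx
.model flat, stdcall
option casemap:none

ExitProcess PROTO :DWORD
includelib kernel32.lib

.data
qsrc    QWORD 1122334455667788h
qdst    QWORD 0

.code
start:
    mov     eax, 10
    movd    mm0, eax                ; low 32 bits = 10, upper 32 bits zeroed
    movd    ecx, mm0                ; copy the low 32 bits back out to ECX

    movq    mm1, QWORD PTR qsrc     ; load all 8 bytes in one instruction
    movq    mm2, mm1                ; 8-byte register-to-register copy
    movq    QWORD PTR qdst, mm2     ; store all 8 bytes

    emms                            ; leave the FPU usable for later code
    invoke  ExitProcess, ecx
end start
```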

hutch-- · March 03, 2013, 04:16:44 PM

Andy,

Among other things, you can use the MMX registers to preserve the general registers, like this:

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

tstproc proc

; ---------------------------
; preserve required registers
; ---------------------------
movd mm1, ebx
movd mm2, esi
movd mm3, edi
movd mm4, ebp
movd mm5, esp

; ----------------------------------------------------------------
; read the four stack arguments via ESP into the saved registers
; ----------------------------------------------------------------
mov ebp, [esp+4]
mov ebx, [esp+8]
mov esi, [esp+12]
mov edi, [esp+16]

; ****************************************************************
; write 8 register code here
; ****************************************************************
; ----------------------------------------------
; write value to ESP "AFTER" values read from it
; ----------------------------------------------
mov esp, 1

; ----------------------------------------------------------
; write values to registers that do not need to be preserved
; ----------------------------------------------------------
mov eax, 2
mov ecx, 3
mov edx, 4

; ****************************************************************
; ****************************************************************

; -----------------------------------
; restore register values before exit
; -----------------------------------
movd esp, mm5
movd ebp, mm4
movd edi, mm3
movd esi, mm2
movd ebx, mm1

ret 16

tstproc endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

The MASM Forum
