Author Topic: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords  (Read 46624 times)

habran

« Reply #60 on: November 28, 2012, 08:50:00 AM »
nidud's code:

Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (SSE4)
---------------------------------------------------------
2988    cycles for XMM/pcmpeqd
3004    cycles for XMM/psubd
---------------------------------------------------------
2987    cycles for XMM/pcmpeqd
3012    cycles for XMM/psubd
---------------------------------------------------------
2978    cycles for XMM/pcmpeqd
3001    cycles for XMM/psubd
---------------------------------------------------------

--- ok ---

frktons

« Reply #61 on: November 28, 2012, 09:48:05 AM »
Code:
----------------------------------------------------
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
----------------------------------------------------
9242    cycles for MOV AX - Test OK
8731    cycles for LEA - Test OK
4144    cycles for MMX/PUNPCKLBW - Test OK
3158    cycles for XMM/PSHUFB - I shot - Test OK
2368    cycles for XMM/PSHUFB - II shot - Test OK
12328   cycles for STOSB - Test OK
2070    cycles for CheckDest - Test OK
547     cycles for CheckDestC - Test OK
544     cycles for CheckDestX - Test OK
----------------------------------------------------
9241    cycles for MOV AX - Test OK
8728    cycles for LEA - Test OK
4130    cycles for MMX/PUNPCKLBW - Test OK
3153    cycles for XMM/PSHUFB - I shot - Test OK
2379    cycles for XMM/PSHUFB - II shot - Test OK
12335   cycles for STOSB - Test OK
2069    cycles for CheckDest - Test OK
548     cycles for CheckDestC - Test OK
543     cycles for CheckDestX - Test OK
----------------------------------------------------

CheckDestC is a modified version of nidud's code. To report the CPU and SSE level
I used Alex's routine.

nidud

« Reply #62 on: November 28, 2012, 05:53:33 PM »
I rewrote the test file with a common loop count for all tests to even out the results. I was wondering whether using the xmm0 register might be faster than xmm1, but the test seems to give random results, at least on this machine.

Code:
; SSETEST.ASM--
; http://www.masm32.com/board/index.php?topic=770.0
;
; make:
; jwasm /c /coff ssetest.asm
; link /SUBSYSTEM:CONSOLE ssetest.obj
;
.nolist
include \masm32\include\masm32rt.inc
.686
.xmm
include \masm32\macros\timers.asm

AxCPUid_Print PROTO

MAIN_COUNT = 4
LOOP_COUNT = 4096/16

.data

align 16
mask1 dd  0C080400h,01010101h,01010101h,01010101h     ; for PSHUFB: dest bytes 0..3 = source bytes 0,4,8,12
ptrmask dd  mask1
PtrDest dd  Dest
PtrSource dd  Source
CPU_Count dd  0

align 8
Check db  8  dup(20h),0,0,0,0
PtrCheck dd  Check

align 8
TestOK db  "Test OK ",0,0,0,0
align 8
TestERR db  "Test ERR",0,0,0,0

.data?

align 16
Dest db 4096 dup(?)
Source dd 4096 dup(?)

.code

start:

;-------------------------------------------------------------------------------
; Before starting the test, the Dest buffer is blanked and the source buffer
; is initialized with dwords of 20202032h ("2" plus three spaces) to make it
; possible to check the results at the end of each tested algo.
;-------------------------------------------------------------------------------

call  BlankDest
call  InitSource
print "---------------------------------------------------------", 13, 10

;     PrintCpu
invoke AxCPUid_Print
print "---------------------------------------------------------", 13, 10

mov ecx,MAIN_COUNT
main_loop:
push ecx

test_start macro
invoke Sleep, 100
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
mov ecx,LOOP_COUNT
mov ebx,offset Source
mov eax,offset Dest
endm

test_end macro text
counter_end
mov CPU_Count, eax
call CheckDestX
print str$(CPU_Count), 9, text
print PtrCheck, 13, 10
call BlankDest
endm

;----------------------------------------------
if 1
test_start
mov esi,ebx
mov edi,eax
      @@:
test_stosb macro
lodsd
stosb
lodsd
stosb
lodsd
stosb
lodsd
stosb
endm
test_stosb
test_stosb
test_stosb
test_stosb
dec ecx
jnz @B
test_end "cycles for STOSB - "

;----------------------------------------------

test_start
      @@:
test_lea macro o_des, o_src
      mov dh,[ebx+o_src+12]
mov dl,[ebx+o_src+8]
shl edx,16
mov dh,[ebx+o_src+4]
mov dl,[ebx+o_src]
mov [eax+o_des],edx
endm
test_lea  0,0
test_lea  4,16
test_lea  8,32
test_lea 12,48
lea eax,[eax+16]
lea ebx,[ebx+64]
dec ecx
jnz @B
test_end "cycles for LEA - "

;----------------------------------------------

test_start
      @@:
test_mov_dx macro o_des, o_src
mov dl,[ebx+o_src]
mov dh,[ebx+o_src+4]
mov [eax+o_des],dx
mov dl,[ebx+o_src+8]
mov dh,[ebx+o_src+12]
mov [eax+o_des+2],dx
endm
test_mov_dx  0,0
test_mov_dx  4,16
test_mov_dx  8,32
test_mov_dx 12,48
add ebx,64
add eax,16
dec ecx
jnz @B
test_end "cycles for MOV DX - "

;----------------------------------------------

test_start
mov esi,ebx
mov edi,eax
      @@:
test_mov_ax macro o_des, o_src
mov al,[esi+o_src]
mov ah,[esi+o_src+4]
mov [edi+o_des],ax
mov al,[esi+o_src+8]
mov ah,[esi+o_src+12]
mov [edi+o_des+2],ax
endm
test_mov_ax  0,0
test_mov_ax  4,16
test_mov_ax  8,32
test_mov_ax 12,48
add esi,64
add edi,16
dec ecx
jnz @B
test_end "cycles for MOV AX - "

;----------------------------------------------

test_start
      @@:
test_punpcklbw macro o_des, o_src
movd mm0,dword ptr [ebx+o_src]
movd mm1,dword ptr [ebx+o_src+4]
movd mm2,dword ptr [ebx+o_src+8]
movd mm3,dword ptr [ebx+o_src+12]
punpcklbw mm0,mm2
punpcklbw mm1,mm3
punpcklbw mm0,mm1
movd dword ptr [eax+o_des],mm0
endm
test_punpcklbw  0,0
test_punpcklbw  4,16
test_punpcklbw  8,32
test_punpcklbw 12,48
add ebx,64
add eax,16
dec ecx
jnz @B
test_end "cycles for MMX/PUNPCKLBW - "
endif
;----------------------------------------------

test_start
mov edx,ptrmask
movdqa xmm1,[edx]
      @@:
test_pshufb0 macro o_des, o_src
movdqa xmm0,[ebx+o_src]
pshufb xmm0,xmm1
movd dword ptr [eax+o_des],xmm0
endm
test_pshufb0  0,0
test_pshufb0  4,16
test_pshufb0  8,32
test_pshufb0 12,48
add ebx,64
add eax,16
dec ecx
jnz @B
test_end "cycles for XMM/PSHUFB - xmm0,xmm1 - "

;----------------------------------------------

test_start
mov edx,ptrmask
movdqa xmm2,[edx]
      @@:
test_pshufb macro o_des, o_src
movdqa xmm1,[ebx+o_src]
pshufb xmm1,xmm2
movd dword ptr [eax+o_des],xmm1
endm
test_pshufb  0,0
test_pshufb  4,16
test_pshufb  8,32
test_pshufb 12,48
add ebx,64
add eax,16
dec ecx
jnz @B
test_end "cycles for XMM/PSHUFB - I shot - "

;----------------------------------------------

test_start
mov edx,ptrmask
movdqa xmm4,[edx]
      @@:
movdqa xmm0, [ebx]
pshufb xmm0, xmm4
movdqa xmm1, [ebx + 16]
pshufb xmm1, xmm4
movdqa xmm2, [ebx + 32]
pshufb xmm2, xmm4
movdqa xmm3, [ebx + 48]
pshufb xmm3, xmm4
movd dword ptr [eax], xmm0
movd dword ptr [eax + 4], xmm1
movd dword ptr [eax + 8], xmm2
movd dword ptr [eax + 12], xmm3
add ebx, 64
add eax, 16
dec ecx
jnz @b
test_end "cycles for XMM/PSHUFB - II shot - "

;----------------------------------------------
if 0
invoke Sleep, 100
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS

    call CheckDest

counter_end

    mov CPU_Count, eax

    print str$(CPU_Count), 9, "cycles for CheckDest - "
    print PtrCheck, 13, 10

;----------------------------------------------

invoke Sleep, 100
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS

    call CheckDestC

counter_end

    mov CPU_Count, eax

    print str$(CPU_Count), 9, "cycles for CheckDestC - "
    print PtrCheck, 13, 10


;----------------------------------------------

invoke Sleep, 100
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS

    call CheckDestX

counter_end

    mov CPU_Count, eax
    call CheckDestX
print str$(CPU_Count), 9, "cycles for CheckDestX - "
    print PtrCheck, 13, 10
endif
    print "---------------------------------------------------------", 13, 10

;----------------------------------------------
pop ecx
dec ecx
jz @F
jmp main_loop
      @@:

inkey chr$(13, 10, "--- ok ---", 13)
exit



; -----------------------------------------------------------------------------------------------
BlankDest proc

    lea eax, Dest
    mov ebx, 20202020h

    mov ecx, (4096/16)

    movd xmm0, ebx
    pshufd xmm0, xmm0, 0


 @@:

    movdqa [eax], xmm0
    add eax, 16
    dec ecx
    jnz @B

    ret

BlankDest endp

; -----------------------------------------------------------------------------------------------
InitSource proc

    lea eax, Source
    mov ebx, 20202032h

    mov ecx, (4096/4)

    movd xmm0, ebx
    pshufd xmm0, xmm0, 0


 @@:

    movdqa [eax], xmm0
    add eax, 16
    dec ecx
    jnz @B
 ret

InitSource endp

; -----------------------------------------------------------------------------------------------
CheckDest proc

    lea eax, Dest
    mov ebx, 32323232h

    mov ecx, (4096/4)

 @@:

     mov  edx, [eax]
     cmp  edx, ebx


    jne CheckErr


    add eax, 4
    dec ecx
    jnz @B

 CheckOK:

    lea eax, Check
    movq mm0, qword ptr TestOK
    movq qword ptr [eax], mm0
    jmp  EndCheck

 CheckErr:

    lea eax, Check
    movq mm0, qword ptr TestERR
    movq qword ptr [eax], mm0

 EndCheck:

    ret

CheckDest endp

; -----------------------------------------------------------------------------------------------
CheckDestX proc

    lea eax, Dest
    mov ebx, 32323232h

    mov ecx, (4096/16)

    movd xmm0, ebx
    pshufd xmm0, xmm0, 0

 @@:

    movdqa xmm1, [eax]

    psubd xmm1, xmm0

    pmovmskb edx, xmm1   ; set byte mask in dx
    cmp   dx, 0

    jne CheckErr


    add eax, 16
    dec ecx
    jnz @B

 CheckOK:

    lea eax, Check
    movq mm0, qword ptr TestOK
    movq qword ptr [eax], mm0
    jmp  EndCheck

 CheckErr:

    lea eax, Check
    movq mm0, qword ptr TestERR
    movq qword ptr [eax], mm0

 EndCheck:

    ret

CheckDestX endp

; -----------------------------------------------------------------------------------------------
CheckDestC proc

    lea eax, Dest
    mov ebx, 32323232h

    mov ecx, (4096/16)

    movd xmm0, ebx
    pshufd xmm0, xmm0, 0

 @@:

    movdqa xmm1, [eax]

    pcmpeqd xmm1,xmm0

    pmovmskb edx, xmm1   ; set byte mask in dx
    cmp   dx, 0FFFFh

    jne CheckErr

    add eax, 16
    dec ecx
    jnz @B

 CheckOK:

    lea eax, Check
    movq mm0, qword ptr TestOK
    movq qword ptr [eax], mm0
    jmp  EndCheck

 CheckErr:

    lea eax, Check
    movq mm0, qword ptr TestERR
    movq qword ptr [eax], mm0

 EndCheck:

    ret

CheckDestC endp


;#############################################################################
; Instructions detection code by Alex aka Antariy
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

AxCPUid_Features struct DWORD
Is486capable dd ?
IsP1 dd ?
IsP1MMX dd ?
IsPPro dd ?
IsSSE1 dd ?
IsSSE2 dd ?
IsSSE3 dd ?
IsSSSE3 dd ?
IsSSE41 dd ?
IsSSE42 dd ?
BrandName db 64 dup (?)
AxCPUid_Features ends

AxCPUidCodeSizeStart EQU $


; Return values:
; Not zero - if the entire structure was filled
; 0 - if CPUID is not supported or the structure was not entirely filled.
;     For clarity, check Is486capable - if it is 0 then CPUID is not supported,
;     otherwise the CPU family is older than PIV, but the structure is filled properly.

align 16
AxCPUid_FillStructure proc lpAxCPUid_Features:PTR AxCPUid_Features
.data?
ifrunnedAndSupportCPUid dd ?
.code

push ebp
push ebx

xor eax,eax
mov ecx,sizeof AxCPUid_Features

mov ebp,[esp+4+8]

@@:
mov [ebp+ecx-4].AxCPUid_Features.Is486capable,eax
add ecx,-4
jnz @B

mov ebx,ifrunnedAndSupportCPUid
dec ebx
jz @done
dec ebx
jz @is486capable

inc ifrunnedAndSupportCPUid

pushfd
pop ecx
mov ebx,ecx
xor ebx,200000h ; switch ID flag
push ebx
popfd ; change EFLAGS
pushfd
pop ebx
xor ebx,ecx ; compare the new ID bit with the previous one
test ebx,200000h; if the bit did not change, the ID flag cannot be toggled
jz @done ; and CPUID is not supported; otherwise it is

inc ifrunnedAndSupportCPUid

@is486capable:

or [ebp].AxCPUid_Features.Is486capable,1

xor eax,eax
cpuid
test eax,eax
jz @done

mov eax,1
cpuid

shr edx,5 ; has RDTSC - so, this is PI at least
adc [ebp].AxCPUid_Features.IsP1,0

shr edx,2 ; has PAE - so, PPro at least
adc [ebp].AxCPUid_Features.IsPPro,0

shr edx,17 ; has MMX
adc [ebp].AxCPUid_Features.IsP1MMX,0

shr edx,2 ; has SSE1
adc [ebp].AxCPUid_Features.IsSSE1,0

shr edx,1 ; has SSE2
adc [ebp].AxCPUid_Features.IsSSE2,0



shr ecx,1 ; has SSE3
adc [ebp].AxCPUid_Features.IsSSE3,0

shr ecx,9 ; has SSSE3
adc [ebp].AxCPUid_Features.IsSSSE3,0

shr ecx,10 ; has SSE4.1
adc [ebp].AxCPUid_Features.IsSSE41,0

shr ecx,1 ; has SSE4.2
adc [ebp].AxCPUid_Features.IsSSE42,0


; get CPU brand name, if exist


; fix for PIII, not SSE2 capable CPU cannot have brand name
cmp [ebp].AxCPUid_Features.IsSSE2,0
jz @done

mov eax,80000000h
cpuid
;pushad
;print hex$(eax),9,"Debug message: extended functions count return",13,10,13,10
;popad
add eax,eax ; zero here means there are no extended functions
jz @done

lea ebp,[ebp].AxCPUid_Features.BrandName
cmp eax,8 ; needed at least 80000004h function
mov eax,0
jb @done

push esi
mov esi,80000002h
@@:
mov eax,esi
cpuid
inc esi
mov [ebp],eax
mov [ebp+4],ebx
mov [ebp+8],ecx
mov [ebp+12],edx
add ebp,16
cmp esi,80000005h
jb @B
pop esi

or eax,1 ; in case of terminated brand string

@done:
pop ebx
pop ebp
ret 4
AxCPUid_FillStructure endp


; Return values:
; 0 - need upgrade
; above 0 - then supported:
; 1 - MMX
; 2 - SSE1
; 3 - SSE2
; 4 - SSE3
; 5 - SSSE3
; 6 - SSE4.1
; 7 - SSE4.2
align 16
AxCPUid_Print proc
push ebx
push esi
push edi
add esp,-1028

invoke GetStdHandle,STD_OUTPUT_HANDLE
xchg eax,ebx

invoke AxCPUid_FillStructure,esp
cmp [esp].AxCPUid_Features.Is486capable,0
jnz @F

push eax
mov edx,esp

push 0
push edx
push sizeof @@needupgrade
push offset @@needupgrade
push ebx
call WriteFile
pop ecx
xor eax,eax
jmp @done

@@:

mov esi,esp

mov edi,[esi].AxCPUid_Features.IsP1MMX
add edi,[esi].AxCPUid_Features.IsSSE1
add edi,[esi].AxCPUid_Features.IsSSE2
add edi,[esi].AxCPUid_Features.IsSSE3
add edi,[esi].AxCPUid_Features.IsSSSE3
add edi,[esi].AxCPUid_Features.IsSSE41
add edi,[esi].AxCPUid_Features.IsSSE42

mov eax,[esi].AxCPUid_Features.IsSSE42
lea eax,[eax*2+offset @@sse42]

mov edx,[esi].AxCPUid_Features.IsSSE41
lea edx,[edx*2+offset @@sse41]

mov ecx,[esi].AxCPUid_Features.IsSSSE3
lea ecx,[ecx*2+offset @@ssse3]

push eax
push edx
push ecx

mov eax,[esi].AxCPUid_Features.IsSSE3
lea eax,[eax*2+offset @@sse3]

mov edx,[esi].AxCPUid_Features.IsSSE2
lea edx,[edx*2+offset @@sse2]

mov ecx,[esi].AxCPUid_Features.IsSSE1
lea ecx,[ecx*2+offset @@sse1]

push eax
push edx
push ecx

mov eax,[esi].AxCPUid_Features.IsP1MMX
lea eax,[eax*2+offset @@mmx]

lea edx,[esi].AxCPUid_Features.BrandName
cmp dword ptr [edx],0
jnz @hasbrandname

mov ecx,[esi].AxCPUid_Features.Is486capable
add ecx,[esi].AxCPUid_Features.IsP1
add ecx,[esi].AxCPUid_Features.IsP1MMX
add ecx,[esi].AxCPUid_Features.IsPPro

cmp ecx,4
jb @F
add ecx,[esi].AxCPUid_Features.IsP1MMX ; PII is PPro with MMX
add ecx,[esi].AxCPUid_Features.IsSSE1 ; PIII is PPro with MMX and SSE1
@@:

mov edx,[ecx*4+offset @@earlycpus]

@hasbrandname:
add edx,1
cmp byte ptr [edx-1]," "
jz @hasbrandname

dec edx

push eax
push edx

push offset @@fmtFeatures
push esi
call wsprintf

mov edx,esp

invoke WriteFile,ebx,esi,eax,edx,0

add esp,10*4

xchg eax,edi

@done:
add esp,1028
pop edi
pop esi
pop ebx
ret

even
@@needupgrade db "This is time for upgrade indeed, i386 or early i486 :)"
even
@@i486 db "Old-good 80486",0
even
@@p1 db "Pentium",0
even
@@pmmx db "Pentium with MMX Technology",0
even
@@ppro db "Pentium Pro",0
even
@@p2 db "Pentium II",0
even
@@p3 db "Pentium III",0

align 4
@@earlycpus dd 0
dd offset @@i486
dd offset @@p1
dd offset @@pmmx
dd offset @@ppro
dd offset @@p2
dd offset @@p3

even
@@fmtFeatures db "%s",13,10,13,10
db "Instructions: %s%s%s%s%s%s%s",13,10,0
even
@@mmx db 0,0
db "MMX",0
even
@@sse1 db 0,0
db ", SSE1",0
even
@@sse2 db 0,0
db ", SSE2",0
even
@@sse3 db 0,0
db ", SSE3",0
even
@@ssse3 db 0,0
db ", SSSE3",0
even
@@sse41 db 0,0
db ", SSE4.1",0
even
@@sse42 db 0,0
db ", SSE4.2",0


AxCPUid_Print endp
AxCPUidCodeSize EQU $-AxCPUidCodeSizeStart

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
;#############################################################################

end start

Quote
---------------------------------------------------------
Intel(R) Core(TM) i3-2367M CPU @ 1.40GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
---------------------------------------------------------
14854   cycles for STOSB - Test OK
9466    cycles for LEA - Test OK
7749    cycles for MOV DX - Test OK
7776    cycles for MOV AX - Test OK
4565    cycles for MMX/PUNPCKLBW - Test OK
2074    cycles for XMM/PSHUFB - xmm0,xmm1 - Test OK
2045    cycles for XMM/PSHUFB - I shot - Test OK
2258    cycles for XMM/PSHUFB - II shot - Test OK
---------------------------------------------------------
14823   cycles for STOSB - Test OK
9056    cycles for LEA - Test OK
7850    cycles for MOV DX - Test OK
7787    cycles for MOV AX - Test OK
4672    cycles for MMX/PUNPCKLBW - Test OK
2013    cycles for XMM/PSHUFB - xmm0,xmm1 - Test OK
2028    cycles for XMM/PSHUFB - I shot - Test OK
2014    cycles for XMM/PSHUFB - II shot - Test OK
---------------------------------------------------------
14836   cycles for STOSB - Test OK
8997    cycles for LEA - Test OK
7851    cycles for MOV DX - Test OK
7784    cycles for MOV AX - Test OK
4748    cycles for MMX/PUNPCKLBW - Test OK
1992    cycles for XMM/PSHUFB - xmm0,xmm1 - Test OK
1137    cycles for XMM/PSHUFB - I shot - Test OK
2006    cycles for XMM/PSHUFB - II shot - Test OK
---------------------------------------------------------
14947   cycles for STOSB - Test OK
9250    cycles for LEA - Test OK
7838    cycles for MOV DX - Test OK
7791    cycles for MOV AX - Test OK
4565    cycles for MMX/PUNPCKLBW - Test OK
1985    cycles for XMM/PSHUFB - xmm0,xmm1 - Test OK
1984    cycles for XMM/PSHUFB - I shot - Test OK
2034    cycles for XMM/PSHUFB - II shot - Test OK
---------------------------------------------------------

With regard to using pcmpeqd or psubd, I think the latter would be the better choice since it returns 0.

Edit: renamed test_pshufb to test_pshufb0

habran

« Reply #63 on: November 28, 2012, 08:02:29 PM »
nidud's latest code produces this on my laptop:

Code:
---------------------------------------------------------
Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
---------------------------------------------------------
6675    cycles for STOSB - Test OK
4240    cycles for LEA - Test OK
3353    cycles for MOV DX - Test OK
3276    cycles for MOV AX - Test OK
1924    cycles for MMX/PUNPCKLBW - Test OK
1213    cycles for XMM/PSHUFB - xmm0,xmm1 - Test OK
832     cycles for XMM/PSHUFB - I shot - Test OK
1539    cycles for XMM/PSHUFB - II shot - Test OK
---------------------------------------------------------
6093    cycles for STOSB - Test OK
3806    cycles for LEA - Test OK
3403    cycles for MOV DX - Test OK
3277    cycles for MOV AX - Test OK
1945    cycles for MMX/PUNPCKLBW - Test OK
808     cycles for XMM/PSHUFB - xmm0,xmm1 - Test OK
904     cycles for XMM/PSHUFB - I shot - Test OK
1490    cycles for XMM/PSHUFB - II shot - Test OK
---------------------------------------------------------
6289    cycles for STOSB - Test OK
3805    cycles for LEA - Test OK
3668    cycles for MOV DX - Test OK
3684    cycles for MOV AX - Test OK
3044    cycles for MMX/PUNPCKLBW - Test OK
888     cycles for XMM/PSHUFB - xmm0,xmm1 - Test OK
832     cycles for XMM/PSHUFB - I shot - Test OK
901     cycles for XMM/PSHUFB - II shot - Test OK
---------------------------------------------------------
6289    cycles for STOSB - Test OK
3805    cycles for LEA - Test OK
3240    cycles for MOV DX - Test OK
3255    cycles for MOV AX - Test OK
2527    cycles for MMX/PUNPCKLBW - Test OK
833     cycles for XMM/PSHUFB - xmm0,xmm1 - Test OK
832     cycles for XMM/PSHUFB - I shot - Test OK
858     cycles for XMM/PSHUFB - II shot - Test OK
---------------------------------------------------------

--- ok ---

frktons

« Reply #64 on: November 28, 2012, 10:51:37 PM »
Quote
I rewrote the test file with a common loop count for all tests to even out the results. I was wondering whether using the xmm0 register might be faster than xmm1, but the test seems to give random results, at least on this machine.

With regard to using pcmpeqd or psubd, I think the latter would be the better choice since it returns 0.

Edit: renamed test_pshufb to test_pshufb0

Since you changed the structure of some routines, the results are a little
bit different, I mean quite a lot different.
I still don't understand the logic of comparing two XMM registers with PSUBD.
If they are equal the result is zero, and after PMOVMSKB it is possible to
test the final result register for zero.
But what happens if the source register is 1 greater than the destination one?
Does PMOVMSKB detect the difference or not? From what I've
understood so far, it shouldn't.  ::)
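
To make the question concrete, here is a scalar C model of the two instructions. It is only a sketch under stated assumptions: the type `xmm_t` and the helpers `psubd_model`, `pmovmskb_model` and `splat` are hypothetical names that do not appear in the test file, and an XMM register is modelled as four 32-bit lanes.

```c
#include <stdint.h>

/* Hypothetical model: an XMM register as four 32-bit lanes. */
typedef struct { uint32_t d[4]; } xmm_t;

/* PSUBD: lane-wise 32-bit subtraction with wraparound. */
static xmm_t psubd_model(xmm_t a, xmm_t b)
{
    xmm_t r;
    for (int i = 0; i < 4; i++)
        r.d[i] = a.d[i] - b.d[i];
    return r;
}

/* PMOVMSKB: collect bit 7 of each of the 16 bytes into a 16-bit mask. */
static unsigned pmovmskb_model(xmm_t x)
{
    unsigned mask = 0;
    for (int i = 0; i < 16; i++) {
        uint8_t byte = (uint8_t)(x.d[i / 4] >> (8 * (i % 4)));
        mask |= (unsigned)(byte >> 7) << i;
    }
    return mask;
}

/* Broadcast one dword to all four lanes, like MOVD followed by PSHUFD ..., 0. */
static xmm_t splat(uint32_t v)
{
    xmm_t r = { { v, v, v, v } };
    return r;
}
```

With both inputs set to 32323232h the model returns a mask of 0. With one input raised to 32323233h every dword of the difference is 00000001h, no byte has bit 7 set, and the mask is still 0, so the difference goes unnoticed. A difference of 80h per dword gives 1111h, and a difference of -1 gives FFFFh.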

jj2007

« Reply #65 on: November 28, 2012, 11:05:21 PM »
Quote
Does PMOVMSKB detect the difference or not?

It does. Launch some tests with Olly to see what happens. Anyway, PCM*** does the same job as PSUBD, and they are equally fast (e.g. one cycle on my AMD).

frktons

« Reply #66 on: November 29, 2012, 12:22:22 AM »
I compared two XMM registers, one of them greater
than the other.
According to this test, the PSUBD version doesn't detect the difference ::)
Code:
------------------------------------
Test on PCMPEQD - Test ERR
------------------------------------
Test on PSUBD   - Test OK
------------------------------------

Press any key to continue ...

This is the code I used. Did I make any error?

Code:
; ---------------------------------------------------------------------
; TEST_PSUBD.ASM--
; http://www.masm32.com/board/index.php?topic=770.0
;-------------------------------------------------------------------------------
; Test the difference between PCMPEQD and PSUBD when comparing two XMM
; registers.
; 28/Nov/2012 - MASM FORUM - frktons
;-------------------------------------------------------------------------------



.nolist
include \masm32\include\masm32rt.inc
.686
.xmm


.data

align 8
Check db  8  dup(20h),0,0,0,0
PtrCheck dd  Check

align 8
TestOK db  "Test OK ",0,0,0,0
align 8
TestERR db  "Test ERR",0,0,0,0


.code

start:


print "---------------------------------------------------------", 13, 10
      print "Test on PCMPEQD - "
      call  PCMP_TEST
      print PtrCheck, 13, 10
print "---------------------------------------------------------", 13, 10

      print "Test on PSUBD   - "
      call  PSUB_TEST
      print PtrCheck, 13, 10
print "---------------------------------------------------------", 13, 10, 13, 10
      inkey

      exit
     
; -----------------------------------------------------------------------------------------------
PSUB_TEST proc


    mov ebx, 32323232h
    mov edx, 00000001h

    movd xmm2, edx
    pshufd xmm2, xmm2, 0

    movd xmm0, ebx
    pshufd xmm0, xmm0, 0

    movdqa xmm1, xmm0

    paddd  xmm1, xmm2

    psubd xmm1,xmm0

    pmovmskb edx, xmm1   ; set byte mask in dx
    cmp   dx, 0

    jne CheckErr

 CheckOK:

    lea eax, Check
    movq mm0, qword ptr TestOK
    movq qword ptr [eax], mm0
    jmp  EndCheck

 CheckErr:

    lea eax, Check
    movq mm0, qword ptr TestERR
    movq qword ptr [eax], mm0

 EndCheck:

    ret

PSUB_TEST endp

; -----------------------------------------------------------------------------------------------
PCMP_TEST proc


    mov ebx, 32323232h

    mov edx, 00000001h

    movd xmm2, edx
    pshufd xmm2, xmm2, 0

    movd xmm0, ebx
    pshufd xmm0, xmm0, 0

    movdqa xmm1, xmm0

    paddd  xmm1, xmm2

    pcmpeqd xmm1,xmm0

    pmovmskb edx, xmm1   ; set byte mask in dx
    cmp   dx, 0FFFFh

    jne CheckErr

 CheckOK:

    lea eax, Check
    movq mm0, qword ptr TestOK
    movq qword ptr [eax], mm0
    jmp  EndCheck

 CheckErr:

    lea eax, Check
    movq mm0, qword ptr TestERR
    movq qword ptr [eax], mm0

 EndCheck:

    ret

PCMP_TEST endp


end start

jj2007

« Reply #67 on: November 29, 2012, 01:08:17 AM »
The logic is inverted:
pcmpeqb for xmm1=xmm0: xmm1 becomes ffffffffffffffffh
psubd  for xmm1=xmm0: xmm1 becomes 0h

frktons

« Reply #68 on: November 29, 2012, 01:44:12 AM »
Quote
The logic is inverted:
pcmpeqb for xmm1=xmm0: xmm1 becomes ffffffffffffffffh
psubd  for xmm1=xmm0: xmm1 becomes 0h


So what is my error? I was aware that the logic is inverted
and I tested:
Code:
    cmp    dx, 0

    jne CheckErr
for PSUBD, and

Code:
    cmp   dx, 0FFFFh

    jne CheckErr

for PCMPEQD. ::)
 

jj2007

« Reply #69 on: November 29, 2012, 03:12:22 AM »
It seems pmovmskb always returns zero unless the top bit of the xmm bytes is set (e.g. bytes of FFh)...
Code:
---------------------------------------------------------
Test on PCMPEQD -
pcmpeqd in
xmm1            3617008641903833650
xmm0            3617008641903833650
pcmpeqd out     xmm1            -1

pmovmskb
xmm1            -1
edx             65535
Test OK
---------------------------------------------------------
Test on PSUBD   -
PSubD in
xmm1            3617008641903833650
xmm0            3617008641903833650
PSubD out       xmm1            0

pmovmskb
xmm1            0
dx              0
Test OK
---------------------------------------------------------

---------------------------------------------------------
Test on PCMPEQD -
pcmpeqd in
xmm1            3617008646198800947
xmm0            3617008641903833650
pcmpeqd out     xmm1            0

pmovmskb
xmm1            0
edx             0
Test ERR
---------------------------------------------------------
Test on PSUBD   -
PSubD in
xmm1            3617008646198800947
xmm0            3617008641903833650
PSubD out       xmm1            4294967297

pmovmskb
xmm1            4294967297
dx              0
Test OK
---------------------------------------------------------

nidud

« Reply #70 on: November 29, 2012, 04:17:03 AM »
Quote
Since you changed the structure of some routines, the results are a little
bit different, I mean quite a lot different.

This:
Code:
mov al,[esi]
mov dl,[esi+4]
mov cl,[esi+8]
mov bl,[esi+12]
mov [edi],al
mov [edi+1],dl
mov [edi+2],cl
mov [edi+3],bl
is faster than this:
Code:
mov ecx,4
@@:
mov al,[esi]
mov [edi],al
add esi,4
add edi,1
dec ecx
jnz @B
since the loop itself adds extra time to the test.

To even out the results, the extra loop was removed:
Code:
mov al,[esi]
mov [edi],al
mov al,[esi+4]
mov [edi+1],al
mov al,[esi+8]
mov [edi+2],al
mov al,[esi+12]
mov [edi+3],al

Quote
I still don't understand the logic of comparing two XMM registers with PSUBD.
If they are equal the result is zero, and after PMOVMSKB it is possible to
test the final result register for zero.
But what happens if the source register is 1 greater than the destination one?
Does PMOVMSKB detect the difference or not? From what I've
understood so far, it shouldn't.  ::)

Hmm, there seems to be something wrong with this logic.

What does pxor xmm0, xmm0 do?
The same thing as xor rax, rax?

If so, the result is 00000000000000000000000000000000h in the first, and 0000000000000000h in the latter.

The cmp function (pcmpeqd followed by pmovmskb) returns 16 bits representing the result.
If equal, the result is FFFFh.

Maybe you have to use the cmp function after all.
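
As a rough illustration of why the compare route gives a clean 16-bit answer, the following C sketch models PCMPEQD followed by PMOVMSKB on four dword lanes. The function names `pcmpeqd_pmovmskb_model` and `all_lanes_equal` are hypothetical and are not part of the test file.

```c
#include <stdint.h>

/* Hypothetical model of PCMPEQD followed by PMOVMSKB on four dword lanes. */
static unsigned pcmpeqd_pmovmskb_model(const uint32_t a[4], const uint32_t b[4])
{
    unsigned mask = 0;
    for (int lane = 0; lane < 4; lane++) {
        /* PCMPEQD writes FFFFFFFFh to an equal lane and 00000000h otherwise,
           so all four bytes of a lane contribute the same mask bit. */
        if (a[lane] == b[lane])
            mask |= 0xFu << (4 * lane);
    }
    return mask;
}

/* Equal registers give exactly FFFFh; any differing lane clears four bits. */
static int all_lanes_equal(const uint32_t a[4], const uint32_t b[4])
{
    return pcmpeqd_pmovmskb_model(a, b) == 0xFFFFu;
}
```

In this model two identical registers yield FFFFh, while a mismatch in the second dword only yields FF0Fh, which is why the comparison against 0FFFFh in CheckDestC catches a difference in any bit.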


nidud

« Reply #71 on: November 29, 2012, 04:37:01 AM »
Maybe you could use CMPNEQPS
The result should then be zero if equal

frktons

« Reply #72 on: November 29, 2012, 04:41:35 AM »
When I read the Intel manuals about PMOVMSKB,
I found something that doesn't fit with using it to
compare two XMM registers for equality:
Code:
Creates a mask made up of the most significant bit of each byte of the source
operand (second operand) and stores the result in the low byte or word of the destination
operand (first operand).
If only the MSBits are stored into the destination operand and the difference is in other
bits, it will not be detected.
So my idea is that after PSUBD we have to use a different opcode to
detect whether there are differences other than in the MSBits of the xmm register we are testing.

On the other hand, with PCMPEQD we can test for both equality and difference
between the xmm registers using PMOVMSKB, because every byte of an equal dword becomes FFh.
This is what I've understood so far.
Using PSUBD is a smart solution, but it needs to be followed by something
other than PMOVMSKB, in my opinion.

So far nidud's solution is the one I understand. Waiting for some other solution.
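
One possible "something other" is to test the whole PSUBD result for zero before taking the mask. In real code that could be a byte compare against a zeroed register (PXOR, PCMPEQB, PMOVMSKB) or PTEST on SSE4.1 CPUs. The C sketch below only models the first idea, and the names `zero_byte_mask` and `psubd_says_equal` are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Models PCMPEQB against an all-zero register followed by PMOVMSKB:
   bit i of the result is set when byte i of the input is zero. */
static unsigned zero_byte_mask(const uint8_t x[16])
{
    unsigned mask = 0;
    for (int i = 0; i < 16; i++)
        if (x[i] == 0)
            mask |= 1u << i;
    return mask;
}

/* PSUBD, then a full zero test instead of looking only at the top bits. */
static int psubd_says_equal(const uint32_t a[4], const uint32_t b[4])
{
    uint32_t diff[4];
    uint8_t bytes[16];
    for (int i = 0; i < 4; i++)
        diff[i] = a[i] - b[i];            /* lane-wise subtraction, as PSUBD does */
    memcpy(bytes, diff, sizeof bytes);    /* view the result as 16 bytes */
    return zero_byte_mask(bytes) == 0xFFFFu;
}
```

With this extra step a difference of 1 in any dword leaves one non-zero byte, the mask is no longer FFFFh, and the check reports a mismatch. It costs an extra instruction compared with using PCMPEQD directly.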

frktons

« Reply #73 on: November 29, 2012, 04:43:35 AM »
Quote
Maybe you could use CMPNEQPS
The result should then be zero if equal


Yes, probably this opcode will work as well.

Quote
What does pxor xmm0, xmm0 do?
The same thing as xor rax, rax?

Yes again. So far I think the PCMPEQD variant is the complete one
for testing equality. Something is missing, in my opinion, for PSUBD.

qWord

« Reply #74 on: November 29, 2012, 04:50:32 AM »
What do you want to compare? FP or integer values?
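
The distinction matters because CMPNEQPS compares the lanes as single-precision floats, not as bit patterns. A small C sketch of one lane, assuming `float` is IEEE-754 binary32 and using the hypothetical names `as_float`, `cmpneqps_lane` and `integer_lane_differs`:

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a float (assumes float is IEEE-754 binary32). */
static float as_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* One lane of CMPNEQPS: true when the lanes compare "not equal" as floats.
   A NaN is never equal to anything, including itself. */
static int cmpneqps_lane(uint32_t a, uint32_t b)
{
    return as_float(a) != as_float(b);
}

/* One lane of an integer comparison such as PCMPEQD, inverted. */
static int integer_lane_differs(uint32_t a, uint32_t b)
{
    return a != b;
}
```

For the 32323232h test pattern both views agree. They disagree for 00000000h against 80000000h (+0.0 and -0.0 compare equal as floats) and for a NaN pattern such as 7FC00000h compared with itself (unequal as floats), so an integer buffer check is safer with an integer compare.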