Author Topic: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords  (Read 46543 times)

frktons

  • Member
  • ***
  • Posts: 491
Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
« Reply #45 on: November 27, 2012, 11:25:12 AM »
It seems to be possible to compare the low 8 bytes with COMISD:
Code: [Select]
COMISD dest,source

The destination operand is an XMM register.
The source can be either an XMM register or a memory location.

The flags are set according to the following rules:
Result         ZF  PF  CF
Unordered       1   1   1
Greater than    0   0   0
Less than       0   0   1
Equal           1   0   0

Maybe it's possible to shift (or rotate) the regs and then compare the high 8 bytes?

Probably there are many ways to do it in more than 1 step.
I'm trying to find a single SIMD instruction for the task, something like PTEST
but available at the SSE3 level (PTEST itself is SSE4.1).
Some more checking and I'll see.
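
Since COMISD writes ZF, PF and CF directly, the ordinary conditional jumps can follow it. Below is a minimal sketch of that idea, assuming the low 8 bytes of xmm0 and xmm1 hold valid double-precision values. The proc name CompareLowDouble and its return codes are made up for illustration. Keep in mind that COMISD compares the bits as floating-point numbers, so arbitrary byte patterns (NaN encodings, +0.0 versus -0.0) will not behave like a raw byte compare.

```asm
.686
.xmm
.model flat, stdcall
option casemap:none

.code
; hypothetical helper: eax = 0 equal, 1 greater, 2 less, 3 unordered
CompareLowDouble proc
    comisd xmm0, xmm1       ; low 64 bits compared as doubles; sets ZF, PF, CF
    jp  Unordered           ; PF=1 only in the unordered case (a NaN is involved)
    ja  Greater             ; CF=0 and ZF=0
    jb  Less                ; CF=1
    xor eax, eax            ; fall through: ZF=1, PF=0, CF=0 -> equal
    ret
Greater:
    mov eax, 1
    ret
Less:
    mov eax, 2
    ret
Unordered:
    mov eax, 3
    ret
CompareLowDouble endp
end
```

To look at the high 8 bytes the same way, both registers could first be shifted right with psrldq by 8 bytes, at the cost of extra instructions.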


jj2007

  • Member
  • *****
  • Posts: 10544
  • Assembler is fun ;-)
    • MasmBasic
Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
« Reply #46 on: November 27, 2012, 11:27:06 AM »
Quote
Let's assume I use:

Code: [Select]
   PCMPEQD xmm0,xmm1

considering this and the others don't affect the flags,
how do I jmp somewhere after the test?
If they are equal or not, what tells me that?

Code: [Select]
psubd xmm0, xmm1
pmovmskb eax, xmm0   ; set byte mask in eax
test eax, eax
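
One point worth double-checking with the psubd snippet: PMOVMSKB collects only the top bit of each byte, so a difference such as 00000001h in a dword yields a mask of 0 and would look "equal". Here is a sketch of an alternative that should catch every mismatch, assuming the two values are already in xmm0 and xmm1. The proc name IsEqual128 is hypothetical.

```asm
.686
.xmm
.model flat, stdcall
option casemap:none

.code
; hypothetical helper: eax = 1 if all 16 bytes of xmm0 and xmm1 match, else 0
; note: xmm0 is overwritten with the compare mask
IsEqual128 proc
    pcmpeqb  xmm0, xmm1     ; equal bytes become 0FFh, different bytes 00h
    pmovmskb eax, xmm0      ; one bit per byte; a set bit means "bytes were equal"
    cmp   eax, 0FFFFh       ; all 16 bits set -> all 16 bytes matched
    sete  al                ; al = 1 on a full match
    movzx eax, al
    ret
IsEqual128 endp
end
```

Inside a loop, the cmp can be followed directly by jne to the error label instead of materialising the result in eax.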

frktons

  • Member
  • ***
  • Posts: 491
Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
« Reply #47 on: November 27, 2012, 11:48:54 AM »
Thanks Jochen, I'll arrange a new algo to test with your
suggestion.

frktons

  • Member
  • ***
  • Posts: 491
Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
« Reply #48 on: November 27, 2012, 09:37:20 PM »
I wrote a new CheckDestX PROC to use Jochen's suggestion:
Code: [Select]
; -----------------------------------------------------------------------------------------------
CheckDestX proc

    lea eax, Dest
    mov ebx, 32323232h
   
    mov ecx, (4096/16)

    movd xmm0, ebx
    pshufd xmm0, xmm0, 0

 @@:

    movdqa xmm1, [eax]

    psubd xmm1, xmm0
    pmovmskb edx, xmm1   ; set byte mask in edx
    test edx, edx   

    jne CheckErr
   
       
    add eax, 16
    dec ecx
    jnz @B
 
 CheckOK:

    lea eax, Check
    movq mm0, qword ptr TestOK
    movq qword ptr [eax], mm0
    jmp  EndCheck

 CheckErr:

    lea eax, Check
    movq mm0, qword ptr TestERR
    movq qword ptr [eax], mm0

 EndCheck:

    ret
 
CheckDestX endp

It gives the same results as CheckDest PROC and
is probably quite fast, but I haven't tested its performance yet.

But I'm still not satisfied with the CPUID results:
Code: [Select]
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
---------------------------------------------------------
13876   cycles for MOV AX - Test OK
8740    cycles for LEA - Test OK
4131    cycles for MMX/PUNPCKLBW - Test OK
3153    cycles for XMM/PSHUFB - I shot - Test OK
2376    cycles for XMM/PSHUFB - II shot - Test OK
12336   STOSB - Test OK
---------------------------------------------------------
9242    cycles for MOV AX - Test OK
8731    cycles for LEA - Test OK
4131    cycles for MMX/PUNPCKLBW - Test OK
3153    cycles for XMM/PSHUFB - I shot - Test OK
2376    cycles for XMM/PSHUFB - II shot - Test OK
12330   STOSB - Test OK
---------------------------------------------------------

--- ok ---

This time I've used PrintCpu and the MasmBasic include,
but the results are still not accurate. My PC has SSSE3
capability, not SSE4.

Only Alex's code, which I used a couple of years ago, gives
a more accurate result:
Code: [Select]
┌─────────────────────────────────────────────────────────────[27-Nov-2012 at 10:57 GMT]─┐
│OS  : Microsoft Windows 7 Ultimate Edition, 64-bit Service Pack 1 (build 7601)          │
│CPU : Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz with 2 logical core(s) with SSSE3           │

I've read the thread about the CPUID code, but didn't find anything new.
Should I still use Alex's code, or is there a more accurate routine for modern
CPUs?


dedndave

  • Member
  • *****
  • Posts: 8827
  • Still using Abacus 2.0
    • DednDave
Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
« Reply #49 on: November 27, 2012, 10:16:07 PM »
CPU's may have changed a lot
but, operating systems change at a slower rate   :biggrin:
i have a p4, which supports SSE3, running XP
XP does not support AVX instructions, nor does vista, as far as i know

our CPUID programs don't have to be updated very often, either - lol
while we might detect AVX support on a CPU (pretty easy),
it is another thing to judge the level of support offered by the OS (not so easy)

i would guess 97% of the ibm-compatible pc's in use today probably support SSE2
if you go any higher than SSE2, it might be a good idea to provide a fallback routine
it depends on what range of platforms you want your program to run on
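
For the "CPU supports it, but does the OS?" question, one commonly documented approach is to check the OSXSAVE and AVX bits from CPUID function 1 and then read XCR0 with XGETBV. This is a sketch under those assumptions. The proc name HasUsableAvx is hypothetical, and xgetbv is hand-encoded because older assemblers do not know the mnemonic.

```asm
.686
.model flat, stdcall
option casemap:none

.code
; hypothetical helper: eax = 1 if AVX is usable (CPU and OS), else 0
HasUsableAvx proc uses ebx
    mov  eax, 1
    cpuid                   ; feature flags in ecx/edx (ebx is clobbered too)
    and  ecx, 18000000h     ; bit 27 = OSXSAVE, bit 28 = AVX
    cmp  ecx, 18000000h
    jne  NoAvx              ; CPU lacks AVX, or the OS has not enabled XSAVE
    xor  ecx, ecx           ; select XCR0
    db   0Fh, 01h, 0D0h     ; xgetbv -> edx:eax
    and  eax, 6             ; bit 1 = XMM state, bit 2 = YMM state
    cmp  eax, 6
    jne  NoAvx              ; OS does not save/restore the YMM registers
    mov  eax, 1
    ret
NoAvx:
    xor  eax, eax
    ret
HasUsableAvx endp
end
```

A program would call this once at startup and pick the AVX routine only when it returns 1, keeping an SSE2 routine as the fallback.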

frktons

  • Member
  • ***
  • Posts: 491
Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
« Reply #50 on: November 27, 2012, 10:37:52 PM »
Yes Dave, the reasoning is quite fair.
I'm talking about the incorrect data shown by old routines
while we have newer routines, like Alex's, that are more
accurate, even if they don't go above SSE4.x.
Jochen's library is quite up to date and uses many SSE opcodes [I imagine],
but the PrintCpu macro [I think it is a macro] should be updated to be more
correct; it doesn't matter if it doesn't cover the latest AVX code or the like.

Well, it is just my opinion, of course. Even the CPUID utility that Intel gives us
http://downloadcenter.intel.com/Detail_Desc.aspx?DwnldID=7838
doesn't show that my PC has SSSE3 capabilities, but at least it doesn't say I have
SSE4.

dedndave

  • Member
  • *****
  • Posts: 8827
  • Still using Abacus 2.0
    • DednDave
Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
« Reply #51 on: November 27, 2012, 10:59:01 PM »
oh - i see what you mean
well - there have been a few that report erroneously
but, to programmatically determine if a specific extension is supported is pretty easy
i.e., i wouldn't use "Alex's" or "Jochen's" or even "Dave's" routine
their purpose is to identify the CPU and capabilities, primarily for forum comparisons

that is a different function than identifying extension support for a program to select routines
what you want to do is actually much simpler   :t

dedndave

  • Member
  • *****
  • Posts: 8827
  • Still using Abacus 2.0
    • DednDave
Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
« Reply #52 on: November 27, 2012, 11:33:13 PM »
Code: [Select]
;               0_1 values come from CPUID function 1
;               8_1 values come from CPUID function 80000001h
;
;                Source        Description
;
;                0_1edx:23     MMX
;                8_1edx:22     MMX+    (AMD only)
;                8_1edx:31     3DNow!  (AMD only)
;                8_1edx:30     3DNow!+ (AMD only)
;                0_1edx:25     SSE
;                0_1edx:26     SSE2
;                0_1ecx:00     SSE3
;                0_1ecx:09     SSSE3
;                0_1ecx:19     SSE4.1
;                0_1ecx:20     SSE4.2  (Intel only)
;                8_1ecx:06     SSE4a   (AMD only)
;                8_1ecx:11     SSE5    (AMD only) - this became one of the AVX feature bits

you can get most of what you want to know by examining ECX and EDX after this...
Code: [Select]
        mov     eax,1
        cpuid
for example, ECX bit 0 will be 1 if SSE3 is supported
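
The same pattern works for any row of the table. As a sketch, here is a check for SSSE3 (ECX bit 9, which PSHUFB needs). The proc name HasSsse3 is made up for illustration.

```asm
.686
.model flat, stdcall
option casemap:none

.code
; hypothetical helper: eax = 1 if CPUID reports SSSE3, else 0
HasSsse3 proc uses ebx
    mov  eax, 1
    cpuid                   ; ebx is clobbered, hence "uses ebx"
    xor  eax, eax
    bt   ecx, 9             ; CPUID function 1, ECX bit 9 = SSSE3 -> carry flag
    adc  eax, 0             ; eax = carry
    ret
HasSsse3 endp
end
```

A caller could then do test eax, eax and jump to an MMX or SSE2 routine when the result is zero.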

frktons

  • Member
  • ***
  • Posts: 491
Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
« Reply #53 on: November 27, 2012, 11:40:19 PM »
Thanks Dave.

CPUID is still an unknown land to me; I've never been in that bit area.
Your introduction to the matter looks interesting, I'll give it a try.  :t

dedndave

  • Member
  • *****
  • Posts: 8827
  • Still using Abacus 2.0
    • DednDave
Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
« Reply #54 on: November 27, 2012, 11:41:13 PM »
i updated it a little Frank - you may want to reload the page   :P

oh - and you have to use .586 or higher  to use CPUID   :t

frktons

  • Member
  • ***
  • Posts: 491
Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
« Reply #55 on: November 27, 2012, 11:46:28 PM »
I SEE SSE on the SEASHORE  :icon_eek: 8)
Good to know.

dedndave

  • Member
  • *****
  • Posts: 8827
  • Still using Abacus 2.0
    • DednDave
Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
« Reply #56 on: November 27, 2012, 11:50:06 PM »
say that 5 times real fast   :lol:

frktons

  • Member
  • ***
  • Posts: 491
Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
« Reply #57 on: November 28, 2012, 12:59:33 AM »
Code: [Select]
psubd xmm0, xmm1
pmovmskb eax, xmm0 ; set byte mask in eax
test eax, eax


This code is a little bit faster on my Core 2 duo:
Code: [Select]
    psubd xmm1, xmm0
    pmovmskb edx, xmm1   ; set byte mask in dx
    cmp   dx, 0 


nidud

  • Member
  • *****
  • Posts: 1980
    • https://github.com/nidud/asmc
Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
« Reply #58 on: November 28, 2012, 02:01:18 AM »
Code: [Select]
.nolist
include \masm32\include\masm32rt.inc
.686
.xmm
include \masm32\macros\timers.asm ; get them from the Masm32 Laboratory: http://www.masm32.com/board/index.php?topic=770.0

MAIN_COUNT = 3
LOOP_COUNT = 2000

.data

align 16
v1 dd 0,1,0,1
v2 dd 1,0,1,0

.code
start:
push 1
call ShowCpu ; print brand string and SSE level
print "---------------------------------------------------------", 13, 10

mov ecx,MAIN_COUNT
main_loop:
push ecx

test_start macro
invoke Sleep, 100
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
endm

test_end macro text
counter_end
print str$(eax), 9, text, 13, 10
endm

;----------------------------------------------

test_start
mov edx,offset v1
mov ebx,offset v2
mov ecx,LOOP_COUNT
movdqa xmm1,[ebx]
    @@:
movdqa xmm0,[edx]
pcmpeqd xmm0,xmm1
pmovmskb eax,xmm0
movdqa xmm0,[ebx]
pcmpeqd xmm0,xmm1
pmovmskb eax,xmm0
; cmp ax,0FFFFh
dec ecx
jnz @b
test_end "cycles for XMM/pcmpeqd"

;----------------------------------------------

test_start
mov edx,offset v1
mov ebx,offset v2
mov ecx,LOOP_COUNT
movdqa xmm1,[ebx]
    @@:
movdqa xmm0,[edx]
psubd xmm0, xmm1
pmovmskb eax, xmm0   ; set byte mask in eax
movdqa xmm0,[ebx]
psubd xmm0, xmm1
pmovmskb eax, xmm0   ; set byte mask in eax
; test eax,eax
dec ecx
jnz @b
test_end "cycles for XMM/psubd"

print "---------------------------------------------------------", 13, 10
pop ecx
dec ecx
jz @F
jmp main_loop
      @@:
inkey chr$(13, 10, "--- ok ---", 13)
exit

ShowCpu proc ; mode:DWORD
COMMENT @ Usage:
  push 0, call ShowCpu ; simple, no printing, just returns SSE level
  push 1, call ShowCpu ; prints the brand string and returns SSE level@
  pushad
  sub esp, 80 ; create a buffer for the brand string
  mov edi, esp ; point edi to it
  xor ebp, ebp
  .Repeat
  lea eax, [ebp+80000002h]
db 0Fh, 0A2h ; cpuid 80000002h-80000004h
stosd
mov eax, ebx
stosd
mov eax, ecx
stosd
mov eax, edx
stosd
inc ebp
  .Until ebp>=3
  push 1
  pop eax
  db 0Fh, 0A2h ; cpuid 1
  xor ebx, ebx ; CpuSSE
  xor esi, esi ; add zero plus the carry flag
  bt edx, 25 ; edx bit 25, SSE1
  adc ebx, esi
  bt edx, 26 ; edx bit 26, SSE2
  adc ebx, esi
  bt ecx, esi ; ecx bit 0, SSE3
  adc ebx, esi
  bt ecx, 9 ; ecx bit 9 is SSSE3, so a level of 4 here means SSSE3, not SSE4.x
  adc ebx, esi
  dec dword ptr [esp+4+32+80] ; dec mode in stack
  .if Zero?
mov edi, esp ; restore pointer to brand string
  .Repeat
.Break .if byte ptr [edi]!=32 ; mode was 1, so show a string but skip leading blanks
inc edi
.Until 0
.if byte ptr [edi]<32
print chr$("pre-P4")
.else
print edi ; CpuBrand
.endif
.if ebx
print chr$(32, 40, "SSE") ; info on SSE level, 40=(
print str$(ebx), 41, 13, 10 ; 41=)
.endif
  .endif
  add esp, 80 ; discard brand buffer (after printing!)
  mov [esp+32-4], ebx ; move ebx into eax stack position - returns eax to main for further use
  ifdef MbBufferInit
call MbBufferInit
  endif
  popad
  ret 4
ShowCpu endp

end start

Quote
Intel(R) Core(TM) i3-2367M CPU @ 1.40GHz (SSE4)
---------------------------------------------------------
4010    cycles for XMM/pcmpeqd
4069    cycles for XMM/psubd
---------------------------------------------------------
4012    cycles for XMM/pcmpeqd
4062    cycles for XMM/psubd
---------------------------------------------------------
4010    cycles for XMM/pcmpeqd
4041    cycles for XMM/psubd
---------------------------------------------------------

frktons

  • Member
  • ***
  • Posts: 491
Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
« Reply #59 on: November 28, 2012, 07:26:30 AM »
Well nidud  :t

this seems to work as well as psubd, at the same performance.
So we have a couple of alternatives, at least.