The MASM Forum

General => The Laboratory => Topic started by: frktons on November 25, 2012, 02:48:06 AM

Title: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 25, 2012, 02:48:06 AM
Hi everybody.

It has been a long time since the last MASM FOR FUN adventure.

Time to be back?

I hope so, at least for a while.

Let's start with a question.

I have a buffer of 4096 consecutive dwords, and I'd like to extract the low byte of each dword
and put it in a second buffer.

The operation itself is not that difficult, but I'd like to do it with SIMD instructions, just to
have a bit of fun and discover some SSE2/SSE3 instructions.
With SIMD instructions I can work on 8/16 bytes at a time, which should speed up the process as well.

I had a look at the Intel manuals, but [as n00bs do] I didn't find any suitable SIMD opcode.

Has anybody run into this problem and found an SSE2/SSE3 solution?

In pseudocode:

Code: [Select]
xmm0 = XFDC GFTI DEWA HYTO
pckwhat? eax, xmm0,  n
And we have CIAO in eax.
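
As a reference point, the operation being asked for can be written as a plain scalar loop. This is only a sketch in C, not code from the thread; the function name `extract_low_bytes` and the buffer types are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy the low-order byte of each 32-bit element of src into dst.
   (Hypothetical helper, named here only for illustration.) */
void extract_low_bytes(const uint32_t *src, uint8_t *dst, size_t count)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < count; ++i)
        dst[i] = (uint8_t)(src[i] & 0xFFu);   /* keep bits 0..7 only */
}
```

Fed with the four dwords from the pseudocode above (XFDC, GFTI, DEWA, HYTO), it writes the bytes C, I, A, O.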

CIAO
Title: Re: MASM FOR FUN is back?
Post by: Magnum on November 25, 2012, 03:05:34 AM
Great to see you back.

Title: Re: MASM FOR FUN is back?
Post by: frktons on November 25, 2012, 04:06:39 AM
Thanks, it's good to be back.

With the old instructions I get:

Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE3)
28098   cycles for MOV DL

28009   cycles for MOV DL


--- ok ---

Attached the source and exe.

Frank
Title: Re: MASM FOR FUN is back?
Post by: Magnum on November 25, 2012, 04:18:08 AM
Intel(R) Core(TM)2 Duo CPU     P8600  @ 2.40GHz (SSE4)
8127    cycles for MOV DL

8372    cycles for MOV DL

Does having an extra core really make that much difference ?

Andy
Title: Re: MASM FOR FUN is back?
Post by: frktons on November 25, 2012, 04:29:02 AM
Well, the processor you are using is about 5 years younger, which
means it runs faster than the old one I'm using.
RAM clock, data bus, cache memory... there are many things that speed things up, indeed.

Frank
Title: Re: MASM FOR FUN is back?
Post by: frktons on November 25, 2012, 04:56:19 AM
I'm not sure if a single SIMD OPCODE is able to accomplish the task, but
probably more than one could do it:

If I have four xmm registers loaded with the dwords read from the dword buffer

Code: [Select]

    movd      xmm0,  [ebx]
    movd      xmm1,  [ebx + 4]   
    movd      xmm2,  [ebx + 8] 
    movd      xmm3,  [ebx + 12] 
   


interleaving the bytes of the xmm registers and leaving out the bytes not
used should be possible.

While I wait for some suggestions, I carry on thinking and reading  :icon_rolleyes:

Frank
Title: Re: MASM FOR FUN is back?
Post by: CommonTater on November 25, 2012, 06:20:01 AM
Hi Andy...
Multicore will perform better on an OS like Windows that is multi-tasking as some processes are assigned to each core, permitting much better multitasking than on single core machines.

However... inside your application, 1 core or 512 cores won't make much difference unless you are writing simultaneously executing multithreaded code so that some of your tasks can be spread out across different cores.  In that case simply writing two simultaneous threads can (often) significantly increase the speed of your code.
Title: Re: MASM FOR FUN is back?
Post by: frktons on November 25, 2012, 06:57:46 AM
Maybe I've found a solution. I'll try it and test its performance.
If it works, it is a first step towards translating the code the SSE way.

Code: [Select]
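    mov eax, offset Dest
    mov ebx, offset Source
    mov ecx, 4096/4                  ; assumed: four dwords per pass (not shown in the post)
@@:
    movd mm0, dword ptr [ebx]        ; load four consecutive dwords
    movd mm1, dword ptr [ebx + 4]
    movd mm2, dword ptr [ebx + 8]
    movd mm3, dword ptr [ebx + 12]

    punpcklbw mm0, mm2               ; interleave bytes of dwords 0 and 2
    punpcklbw mm1, mm3               ; interleave bytes of dwords 1 and 3
    punpcklbw mm0, mm1               ; low dword of mm0 = the four low bytes, in order

    movd dword ptr [eax], mm0        ; store the four result bytes

    add ebx, 16                      ; assumed loop tail, not shown in the post
    add eax, 4
    dec ecx
    jnz @B
    emms                             ; leave MMX state before any FPU code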
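
To see why three PUNPCKLBW instructions end up with the four low bytes in the right order, the interleave can be modelled in scalar C. This is a sketch under assumptions, not the code attached to the post: `punpcklbw_model` and `gather_low_bytes_mmx_model` are hypothetical names, and each 8-byte array stands in for one MMX register in memory byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of MMX "punpcklbw dst, src": the low four bytes of dst and
   src are interleaved into dst as d0 s0 d1 s1 d2 s2 d3 s3. */
static void punpcklbw_model(uint8_t dst[8], const uint8_t src[8])
{
    uint8_t out[8];
    for (int i = 0; i < 4; ++i) {
        out[2 * i]     = dst[i];
        out[2 * i + 1] = src[i];
    }
    memcpy(dst, out, sizeof out);
}

/* in:  four dwords as 16 bytes in memory order (on x86, byte 0 of each
        dword is its low byte).
   out: the four low bytes, in source order. */
void gather_low_bytes_mmx_model(const uint8_t in[16], uint8_t out[4])
{
    uint8_t mm0[8] = {0}, mm1[8] = {0}, mm2[8] = {0}, mm3[8] = {0};

    assert(in != NULL && out != NULL);
    memcpy(mm0, in,      4);     /* movd mm0, [ebx]      */
    memcpy(mm1, in + 4,  4);     /* movd mm1, [ebx + 4]  */
    memcpy(mm2, in + 8,  4);     /* movd mm2, [ebx + 8]  */
    memcpy(mm3, in + 12, 4);     /* movd mm3, [ebx + 12] */

    punpcklbw_model(mm0, mm2);   /* a0 c0 a1 c1 a2 c2 a3 c3 */
    punpcklbw_model(mm1, mm3);   /* b0 d0 b1 d1 b2 d2 b3 d3 */
    punpcklbw_model(mm0, mm1);   /* a0 b0 c0 d0 a1 b1 c1 d1 */

    memcpy(out, mm0, 4);         /* movd [eax], mm0 */
}
```

Pairing dword 0 with dword 2 and dword 1 with dword 3 in the first two steps is what makes the final interleave come out as a0 b0 c0 d0.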


Title: Re: MASM FOR FUN is back?
Post by: frktons on November 25, 2012, 07:16:57 AM
Well, it apparently works with some speed improvement:

Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE3)
---------------------------------------------------
28315   cycles for MOV DL

16479   cycles for MOVD DWORD PTR

28394   cycles for MOV DL

16637   cycles for MOVD DWORD PTR


--- ok ---


I'll see if there is a better method, with fewer SSE opcodes, to get
better performance.

Frank
Title: Re: MASM FOR FUN is back?
Post by: jj2007 on November 25, 2012, 07:50:00 AM
Ciao Frank,
Welcome to the Forum, and thanks for the message in the other thread ;-)

What you are looking for is pshufb. There is a testpiece by Hutch here (http://www.masmforum.com/board/index.php?topic=15974.0).

You need to add this:
    include \masm32\include\masm32rt.inc
    .686p
    .xmm


... and you need a modern CPU ;-)
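
For readers who have not met PSHUFB (an SSSE3 instruction), its effect on this problem can be modelled in scalar C. This is only a sketch: `pshufb_model` and `low_byte_shuffle_mask` are hypothetical names, and the 16-byte arrays stand in for XMM registers in memory byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of "pshufb dst, mask": result byte i is zero when bit 7 of
   mask[i] is set, otherwise it is dst[mask[i] & 0x0F]. */
void pshufb_model(uint8_t dst[16], const uint8_t mask[16])
{
    uint8_t out[16];

    assert(dst != NULL && mask != NULL);
    for (int i = 0; i < 16; ++i)
        out[i] = (mask[i] & 0x80) ? 0 : dst[mask[i] & 0x0F];
    memcpy(dst, out, sizeof out);
}

/* Mask that moves bytes 0, 4, 8 and 12 (the low byte of each dword on x86)
   into bytes 0..3 and zeroes the remaining twelve bytes. */
const uint8_t low_byte_shuffle_mask[16] = {
    0, 4, 8, 12,
    0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80
};
```

With this mask, one shuffle gathers the low bytes of four dwords into the low dword of the register, which can then be written with a single 4-byte store.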
Title: Re: MASM FOR FUN - REBORN
Post by: frktons on November 25, 2012, 08:04:02 AM
Ciao jj.

I have to switch to my second PC in order to use SSSE3 opcodes; I'll do it as soon
as I'm ready.
What do you think about the solution I used? It is MMX but not very fast, at
least on my P IV dual core 3.2 GHz.

Frank
Title: Re: MASM FOR FUN - REBORN
Post by: jj2007 on November 25, 2012, 09:44:27 AM
What do you think about the solution I used?

Not fast indeed. PSHUFB must be much better, but I can't test it here...

See http://www.rz.uni-karlsruhe.de/rz/docs/VTune/reference/About_IA-32_Instructions.htm#P-Instructions for some useful info.

Did you try the straightforward solution?

include \masm32\include\masm32rt.inc
.data
src   db "xxxCxxxIxxxAxxxO", 0
dest   db 20 dup(?)
.code
start:
   mov esi, offset src+3
   mov edi, offset dest
   REPEAT 4
      lodsd
      stosb
   ENDM
   inkey offset dest
   exit
end start
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 25, 2012, 11:52:26 AM
I've started some tests on PSHUFB;
in the meantime, here is your proposal alongside
the previous ones. The MMX code still leads.

Code: [Select]
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
---------------------------------------------------------
12412   cycles for MOV DL
6199    cycles for MMX/MOVD DWORD PTR
21114   cycles for STOSB
---------------------------------------------------------
12401   cycles for MOV DL
6235    cycles for MMX/MOVD DWORD PTR
21137   cycles for STOSB
---------------------------------------------------------

--- ok ---

Frank
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: dedndave on November 25, 2012, 11:56:38 AM
hiyas Frank - good to see you   :t

this might have fewer dependencies...

Code: [Select]
@@:     mov     edx,[ebx]
        add     ebx,4
        mov     [eax],dl
        dec     ecx
        lea     eax,[eax+1]
        jnz     @B
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 25, 2012, 12:22:09 PM
Hi Dave. Nice to see you too.
The sequence you used makes me wonder about the JNZ:

Code: [Select]
        dec     ecx
        lea     eax,[eax+1]
        jnz     @B

does the jnz refer to ECX
or to EAX?


By the way, I inserted your code into the program, but I get strange results:


Code: [Select]

invoke Sleep, 100
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS

mov eax, offset Dest
mov ecx, 4096
mov ebx, offset Source

    @@:     mov     edx,[ebx]
            add     ebx,4
            mov     [eax],dl
            dec     ecx           
            lea     eax,[eax+1]           

            jnz     @B
           
            print str$(eax), 9, "cycles for DAVE ", 13, 10, 13, 10


and:

Code: [Select]
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
---------------------------------------------------------
12396   cycles for MOV DL
6193    cycles for MMX/MOVD DWORD PTR
21098   cycles for STOSB
20677248        cycles for DAVE

---------------------------------------------------------
12394   cycles for MOV DL
6207    cycles for MMX/MOVD DWORD PTR
21093   cycles for STOSB
20677248        cycles for DAVE

---------------------------------------------------------

--- ok ---

Did I misspell something, or what?


 
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: dedndave on November 25, 2012, 12:45:29 PM
nope - i just write bad code   :lol:

LEA does not affect the flags, so
Code: [Select]
        lea     eax,[eax+1]
adds one to EAX without altering the flags that were set by DEC ECX
the idea was to put something in between the instruction that sets the flags and the one that examines them
but, LEA is not a great performer on older CPU's

still, it shouldn't be that slow - lol
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: dedndave on November 25, 2012, 12:52:01 PM
i got a slight improvement on my p4 prescott

Code: [Select]
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
    mov eax, offset Dest
    mov ecx, 4096
    mov ebx, offset Source

@@:     mov     edx,[ebx]
        add     ebx,4
        mov     [eax],dl
        dec     ecx
        lea     eax,[eax+1]
        jnz     @B

; @@:
;     mov edx, [ebx]
;     mov byte ptr [eax], dl
;     add eax, 1
;     add ebx, 4
;     dec ecx
;     jnz @B
counter_end
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: sinsi on November 25, 2012, 01:34:26 PM
Maybe try
Code: [Select]
  mov dl,[ebx]
  mov dh,[ebx+4]
  mov [eax],dx
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: hutch-- on November 25, 2012, 01:44:48 PM
Dave,

It was only the PIV that was a poor performer with LEA; PIII and earlier and Core2 onwards are fine.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: dedndave on November 25, 2012, 01:51:21 PM
i knew it was something like that, Hutch - lol

sinsi has the right idea, i think...
Code: [Select]
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
    mov ebx, offset Source
    mov eax, offset Dest
    mov ecx, 4096/4

@@:     mov     dh,[ebx+12]
        mov     dl,[ebx+8]
        shl     edx,16
        mov     dh,[ebx+4]
        mov     dl,[ebx]
        add     ebx,16
        mov     [eax],edx
        dec     ecx
        lea     eax,[eax+4]
        jnz     @B

counter_end

it would help if the destination array is 4-aligned - maybe the source, too
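
The idea in the loop above, building four result bytes in a register and writing them with one 4-byte store, can be sketched in C as follows. The function name `extract_low_bytes_packed` is hypothetical, and the sketch assumes a little-endian host (as on x86) so that the lowest byte of the packed value lands first in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Gather the low bytes of four dwords into one 32-bit value and write it
   with a single 4-byte store. Assumes a little-endian host; count must be
   a multiple of 4. */
void extract_low_bytes_packed(const uint32_t *src, uint8_t *dst, size_t count)
{
    assert(count % 4 == 0);
    for (size_t i = 0; i < count; i += 4) {
        uint32_t packed =  (src[i]     & 0xFFu)
                        | ((src[i + 1] & 0xFFu) << 8)
                        | ((src[i + 2] & 0xFFu) << 16)
                        | ((src[i + 3] & 0xFFu) << 24);
        memcpy(dst + i, &packed, sizeof packed);   /* one 4-byte store */
    }
}
```

This trades four byte stores for one dword store per group, at the cost of a few extra shifts and ORs.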
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on November 25, 2012, 07:52:55 PM
Quote
AMD Athlon(tm) II X2 245 Processor (SSE3)
---------------------------------------------------------
4123   cycles for MOV AX
5127   cycles for LEA
5143   cycles for MMX/MOVD DWORD PTR
17419   cycles for STOSB
---------------------------------------------------------
4121   cycles for MOV AX
5127   cycles for LEA
5238   cycles for MMX/MOVD DWORD PTR
17428   cycles for STOSB
---------------------------------------------------------
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: jj2007 on November 25, 2012, 09:04:52 PM
So on your puter Sinsi's solution is clearly the fastest. That's what I suspected ;-)

Not on mine, however:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
9968    cycles for MOV AX
9622    cycles for LEA
5181    cycles for MMX/MOVD DWORD PTR
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 25, 2012, 10:19:15 PM
On a newer machine there is no contest:

Code: [Select]
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
---------------------------------------------------------
13874   cycles for MOV AX
13124   cycles for LEA
6193    cycles for MMX/MOVD DWORD PTR
18486   cycles for STOSB
---------------------------------------------------------
13856   cycles for MOV AX
13087   cycles for LEA
4129    cycles for MMX/MOVD DWORD PTR
18516   cycles for STOSB
---------------------------------------------------------

--- ok ---


Later, the pshufb solution, which should win the race.

Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 25, 2012, 11:15:09 PM
Here it is, the first quick shot with PSHUFB:

Code: [Select]
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
---------------------------------------------------------
13866   cycles for MOV AX
13084   cycles for LEA
6205    cycles for MMX/MOVD DWORD PTR
4848    cycles for PSHUFB / I shot
18487   cycles for STOSB
---------------------------------------------------------
13852   cycles for MOV AX
13083   cycles for LEA
6194    cycles for MMX/MOVD DWORD PTR
4730    cycles for PSHUFB / I shot
18518   cycles for STOSB
---------------------------------------------------------

--- ok ---

later I'll try to improve it.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on November 26, 2012, 01:13:49 AM
big improvement for STOSB
here is one using loop
Quote
---------------------------------------------------------
6227   cycles for MOV AX
6165   cycles for LEA
7188   cycles for MMX/MOVD DWORD PTR
19466   cycles for STOSB
---------------------------------------------------------
6228   cycles for MOV AX
6163   cycles for LEA
7188   cycles for MMX/MOVD DWORD PTR
19475   cycles for STOSB
---------------------------------------------------------
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 26, 2012, 02:19:29 AM
Some more tests:

Code: [Select]
Intel(R) Core(TM)2 CPU  6600  @ 2.40GHz (SSSE3)
---------------------------------------------------------
13927   cycles for MOV AX
13097   cycles for LEA
6203    cycles for MMX/PUNPCKLBW
4729    cycles for XMM/PSHUFB - I shot
3518    cycles for XMM/PSHUFB - II shot
15364   cycles for XMM/MASKMOVDQU - I shot
18506   cycles for STOSB
---------------------------------------------------------
13868   cycles for MOV AX
13096   cycles for LEA
6198    cycles for MMX/PUNPCKLBW
4732    cycles for XMM/PSHUFB - I shot
3520    cycles for XMM/PSHUFB - II shot
15360   cycles for XMM/MASKMOVDQU - I shot
18503   cycles for STOSB
---------------------------------------------------------

--- ok ---
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: jj2007 on November 26, 2012, 03:25:54 AM
Ciao Frank,
Apart from being slow, maskmovdqu does not do what you want:

source     xxxCxxxIxxxAxxxO
wanted     CIAO
effective     C   I   A   O
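
The "effective" row can be reproduced with a scalar model of MASKMOVDQU. This is a sketch, with `maskmovdqu_model` a hypothetical name: each selected byte is written to the same offset it has in the register, so the selected bytes are never packed together.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of "maskmovdqu src, mask" storing to mem: byte i of src is
   written to mem[i] only when bit 7 of mask[i] is set. Unselected bytes of
   mem are left untouched; nothing is compacted. */
void maskmovdqu_model(uint8_t *mem, const uint8_t src[16], const uint8_t mask[16])
{
    assert(mem != NULL && src != NULL && mask != NULL);
    for (int i = 0; i < 16; ++i)
        if (mask[i] & 0x80)
            mem[i] = src[i];
}
```

With a mask selecting bytes 3, 7, 11 and 15 and a destination pre-filled with spaces, the result has C, I, A and O at offsets 3, 7, 11 and 15, not at offsets 0 to 3.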
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on November 26, 2012, 04:34:28 AM
laptop with Intel Inside W95 16MB RAM
Quote
pre-P4
---------------------------------------------------------
17080   cycles for MOV AX
8935   cycles for LEA
11542   cycles for MMX/MOVD DWORD PTR
32254   cycles for STOSB
---------------------------------------------------------
17057   cycles for MOV AX
9082   cycles for LEA
11753   cycles for MMX/MOVD DWORD PTR
32423   cycles for STOSB
---------------------------------------------------------
using loop
Quote
---------------------------------------------------------
16556   cycles for MOV AX
12435   cycles for LEA
15317   cycles for MMX/MOVD DWORD PTR
32145   cycles for STOSB
---------------------------------------------------------
16556   cycles for MOV AX
12365   cycles for LEA
15312   cycles for MMX/MOVD DWORD PTR
32332   cycles for STOSB
---------------------------------------------------------

pshufbtest.exe

Quote
AMD Athlon(tm) II X2 245 Processor (SSE3)
-----------------------------------------------
4121    cycles for MOV AX
5127    cycles for LEA
5140    cycles for MMX/MOVD DWORD PTR
crash..

Quote
Intel(R) Core(TM) i3-2367M CPU @ 1.40GHz (SSE4)
---------------------------------------------------------
6334    cycles for MOV AX
5171    cycles for LEA
3123    cycles for MMX/MOVD DWORD PTR
2189    cycles for PSHUFB / I shot
10503   cycles for STOSB
---------------------------------------------------------
5243    cycles for MOV AX
5488    cycles for LEA
3150    cycles for MMX/MOVD DWORD PTR
2060    cycles for PSHUFB / I shot
9276    cycles for STOSB
---------------------------------------------------------
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 26, 2012, 06:12:38 AM
If the mask is correctly set, maskmovdqu should do the job :

F0h = byte to move, 00h = byte not moved, according to the Intel docs:

Code: [Select]
The most significant bit in each byte of the mask operand determines whether the
corresponding byte in the source operand is written to the corresponding byte location
in memory: 0 indicates no write and 1 indicates write.


At least my previous test showed it can do the job, but maybe I didn't try it enough. ::)

Edit: it probably only works fine with consecutive bytes; as you said, it is not the one I need,
besides being slow.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 26, 2012, 06:17:17 AM
These SIMD instructions work a lot better on modern hardware.
Try the latest version on the i3 only.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on November 26, 2012, 06:52:59 AM
Quote
---------------------------------------------------------
5158    cycles for MOV AX
5197    cycles for LEA
3117    cycles for MMX/PUNPCKLBW
2086    cycles for XMM/PSHUFB - I shot
1076    cycles for XMM/PSHUFB - II shot
10499   cycles for XMM/MASKMOVDQU - I shot
10267   cycles for STOSB
---------------------------------------------------------
5190    cycles for MOV AX
5175    cycles for LEA
3091    cycles for MMX/PUNPCKLBW
2062    cycles for XMM/PSHUFB - I shot
1089    cycles for XMM/PSHUFB - II shot
10645   cycles for XMM/MASKMOVDQU - I shot
9282    cycles for STOSB
---------------------------------------------------------
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on November 26, 2012, 07:28:02 AM
With adjustment for the loop (mov ecx,4096/16):
Quote
---------------------------------------------------------
1125    cycles for XMM/PSHUFB - I shot
1088    cycles for XMM/PSHUFB - II shot
---------------------------------------------------------
1124    cycles for XMM/PSHUFB - I shot
1140    cycles for XMM/PSHUFB - II shot
---------------------------------------------------------
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 26, 2012, 08:01:46 AM
In the first shot, 4096/4 reflects the four dwords processed in each iteration,
so it cannot be 4096/16.
The second one works on 16 dwords at a time, so it is 4096/16.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on November 26, 2012, 08:19:47 AM
Yes, so you have to repeat this 4 times to make it even.
Code: [Select]
mov eax,offset Dest
mov ecx,4096/16
mov ebx,offset Source
    @@:
    movdqa xmm1, [ebx]
    pshufb xmm1, xmm2
    movd dword ptr [eax], xmm1

    movdqa xmm1, [ebx+16]
    pshufb xmm1, xmm2
    movd dword ptr [eax+4], xmm1

    movdqa xmm1, [ebx+32]
    pshufb xmm1, xmm2
    movd dword ptr [eax+8], xmm1

    movdqa xmm1, [ebx+48]
    pshufb xmm1, xmm2
    movd dword ptr [eax+12], xmm1

    add  ebx, 64
    add  eax, 16
    dec ecx
    jnz @b
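
The loop-count arithmetic of the unrolled loop above can be checked with a scalar model. This is a sketch under assumptions: `shuffle16` is a stand-in for PSHUFB, `extract_low_bytes_unrolled_model` is a hypothetical name, and the shuffle mask (0, 4, 8, 12, then 80h twelve times) is assumed, since the contents of xmm2 are not shown in the post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar stand-in for "pshufb": out[i] is 0 when bit 7 of mask[i] is set,
   otherwise in[mask[i] & 0x0F]. */
static void shuffle16(const uint8_t in[16], const uint8_t mask[16], uint8_t out[16])
{
    for (int i = 0; i < 16; ++i)
        out[i] = (mask[i] & 0x80) ? 0 : in[mask[i] & 0x0F];
}

/* src: the dword buffer viewed as bytes (low byte first, as on x86).
   dwords must be a multiple of 16, matching the 4x unrolled loop.
   Returns the number of loop iterations executed. */
size_t extract_low_bytes_unrolled_model(const uint8_t *src, uint8_t *dst, size_t dwords)
{
    static const uint8_t mask[16] = {
        0, 4, 8, 12,
        0x80, 0x80, 0x80, 0x80, 0x80, 0x80,
        0x80, 0x80, 0x80, 0x80, 0x80, 0x80
    };
    size_t iterations = 0;

    assert(dwords % 16 == 0);
    for (size_t d = 0; d < dwords; d += 16, ++iterations) {
        for (size_t k = 0; k < 4; ++k) {                  /* four unrolled steps */
            uint8_t reg[16];
            shuffle16(src + 4 * d + 16 * k, mask, reg);   /* movdqa + pshufb     */
            memcpy(dst + d + 4 * k, reg, 4);              /* movd [eax+4k], xmm1 */
        }
    }
    return iterations;
}
```

Each shuffle handles 4 dwords and each iteration holds four shuffles, so 4096 dwords take 4096/16 = 256 iterations.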
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 26, 2012, 08:57:04 AM
I'm going to prepare a real test, with some data to make the masks
a little bit more accurate. They are untested for the time being, and
were used just to get an idea of their performance.

After testing on real data and adjusting the masks accordingly, the
test can be considered valid.

Up to now I've worked on uninitialized data, so there is no way to
know if the sequence of bits/bytes in the masks is correct.  ::)
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on November 26, 2012, 09:43:40 AM
I think it does what it's supposed to do:
Code: [Select]
SetBuffer:
mov edi,offset Dest
mov ecx,4096
mov al,1
rep stosb
ret

TestBuffer:
mov esi,offset Dest
mov ecx,4096
.repeat
    lodsb
    .break .if al
.untilcxz
mov esi,offset cp_Ok
.if al
    mov esi,offset cp_Fail
.endif
ret

Quote
---------------------------------------------------------
1473    cycles for XMM/PSHUFB - I shot : Ok..
1024    cycles for XMM/PSHUFB - II shot : Ok..
---------------------------------------------------------
1145    cycles for XMM/PSHUFB - I shot : Ok..
1024    cycles for XMM/PSHUFB - II shot : Ok..
---------------------------------------------------------
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 26, 2012, 10:09:04 AM
Well, a good test should start with 4096 dwords initialized to 00000001h,
then run each routine on them and check that the Dest buffer ends up holding
all 01h bytes; you could use your routine for that check.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: jj2007 on November 26, 2012, 10:29:39 AM
Hi Frank,

Here is a testfile. The exe shows it, *.asc is the source in RTF/RichMasm format.

Hope it helps,
Jochen
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on November 26, 2012, 11:54:27 AM
This was implemented using macros for each test. You need to reset the byte buffer (Dest) for each test, using 0 if source is 1, or 1 if source is 0 (as in this case).
Code: [Select]
test_start macro
call SetBuffer
invoke Sleep, 100
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
endm

test_end macro text
counter_end
print str$(eax), 9, text
call TestBuffer
print esi
endm
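
The test procedure discussed above (known source pattern, destination preset to a different value, then a check of every output byte) can be sketched in C. All names here (`extract_fn`, `extract_scalar`, `extract_skip_last`, `verify_extract`) are hypothetical and are not the macros or procedures from the attached sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void (*extract_fn)(const uint32_t *src, uint8_t *dst, size_t count);

/* Routine under test: the plain scalar version. */
void extract_scalar(const uint32_t *src, uint8_t *dst, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        dst[i] = (uint8_t)src[i];
}

/* Deliberately broken routine: it forgets the last element. */
void extract_skip_last(const uint32_t *src, uint8_t *dst, size_t count)
{
    for (size_t i = 0; i + 1 < count; ++i)
        dst[i] = (uint8_t)src[i];
}

/* Returns 1 if fn writes the low byte of every source dword, else 0.
   Source dwords are 00000001h; dst is preset to 0, which differs from the
   expected 01h, so leftovers from an earlier run cannot pass the check. */
int verify_extract(extract_fn fn, uint32_t *src, uint8_t *dst, size_t count)
{
    assert(fn != NULL && src != NULL && dst != NULL);
    for (size_t i = 0; i < count; ++i)
        src[i] = 0x00000001u;
    memset(dst, 0, count);

    fn(src, dst, count);

    for (size_t i = 0; i < count; ++i)
        if (dst[i] != 0x01)
            return 0;
    return 1;
}
```

Resetting the destination before every run is the important part: without it, a routine that writes nothing would still "pass" on bytes left behind by the previous test.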
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 27, 2012, 05:49:25 AM
Yes nidud, thanks.

Thanks Jochen, your help is always welcome.

I'll have a look at it as soon as I finish a couple of preliminary
things I'm working on.

Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 27, 2012, 09:15:37 AM
Is there an opcode to compare two xmm registers to verify
if they have the same content?

Again, SIMD instructions are a bit tricky for simple operations.

Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: qWord on November 27, 2012, 09:24:01 AM
See the CMPxxx, COMxxx and PCMPxxx instructions: AMD64 Architecture Programmer’s Manual Volume 4: 128-bit and 256 bit media instructions (http://support.amd.com/us/Processor_TechDocs/26568_APM_v4.pdf)
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 27, 2012, 09:37:57 AM
Yes qWord,

Let's assume I use:

Code: [Select]
   PCMPEQD xmm0,xmm1

Considering that this and the others don't affect the flags,
how do I jmp somewhere after the test?
If they are equal or not, what tells me that?

PTEST affects the Zero Flag, but the opcode is out of my league (SSE4.1).

 
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 27, 2012, 10:17:09 AM
The first correct test for SSE instructions with proc to check the results:

Code: [Select]
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
---------------------------------------------------------
13862   cycles for MOV AX - Test OK
13114   cycles for LEA - Test OK
6195    cycles for MMX/PUNPCKLBW - Test OK
3157    cycles for XMM/PSHUFB - I shot - Test OK
2375    cycles for XMM/PSHUFB - II shot - Test OK
12327   STOSB - Test OK
---------------------------------------------------------
9238    cycles for MOV AX - Test OK
8723    cycles for LEA - Test OK
4130    cycles for MMX/PUNPCKLBW - Test OK
3150    cycles for XMM/PSHUFB - I shot - Test OK
2375    cycles for XMM/PSHUFB - II shot - Test OK
16701   STOSB - Test OK
---------------------------------------------------------

--- ok ---

Attached last version.

Enjoy
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on November 27, 2012, 11:12:38 AM
Seems to be possible to compare the low 8 bytes:
Code: [Select]
COMISD dest,source

The destination operand is an XMM register.
The source can be either an XMM register or a memory location.

The flags are set according to the following rules:
Result        ZF PF CF
Unordered     1  1  1
Greater than  0  0  0
Less than     0  0  1
Equal         1  0  0

Maybe it's possible to shift (or rotate) the regs and then compare the high 8 bytes?
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 27, 2012, 11:25:12 AM
Probably there are many ways to do it in more than 1 step.
I'm trying to find a single SIMD instruction for the task, like PTEST,
that is available at the SSE3 level.
Some more checking and I'll see.

Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: jj2007 on November 27, 2012, 11:27:06 AM
Let's assume I use:

Code: [Select]
   PCMPEQD xmm0,xmm1

considering this and the others don't affect the flags,
how do I jmp somewhere after the test?
If they are equal or not, what tells me that?

psubd xmm0, xmm1
pmovmskb eax, xmm0   ; set byte mask in eax
test eax, eax
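
One caveat worth checking with a model: PMOVMSKB collects only the top bit of each byte, so after PSUBD a nonzero difference whose bytes all have bit 7 clear still yields a zero mask. The sketch below compares that sequence with PCMPEQB followed by PMOVMSKB, where a mask of 0FFFFh means all 16 bytes match. The function names are hypothetical and the arrays stand in for XMM registers in memory byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of "pmovmskb": bit i of the result is bit 7 of byte i. */
unsigned movemask_model(const uint8_t v[16])
{
    unsigned m = 0;

    assert(v != NULL);
    for (int i = 0; i < 16; ++i)
        m |= (unsigned)(v[i] >> 7) << i;
    return m;
}

/* Model of "psubd a, b" followed by "pmovmskb": four little-endian dword
   subtractions, then the byte sign mask of the difference. */
unsigned psubd_then_movemask(const uint8_t a[16], const uint8_t b[16])
{
    uint8_t diff[16];

    for (int d = 0; d < 4; ++d) {
        uint32_t x = 0, y = 0;
        for (int k = 3; k >= 0; --k) {
            x = (x << 8) | a[4 * d + k];
            y = (y << 8) | b[4 * d + k];
        }
        uint32_t r = x - y;                        /* wraps modulo 2^32 */
        for (int k = 0; k < 4; ++k)
            diff[4 * d + k] = (uint8_t)(r >> (8 * k));
    }
    return movemask_model(diff);
}

/* Model of "pcmpeqb a, b" followed by "pmovmskb": the result is 0xFFFF
   exactly when all 16 bytes are equal. */
unsigned pcmpeqb_then_movemask(const uint8_t a[16], const uint8_t b[16])
{
    uint8_t eq[16];

    for (int i = 0; i < 16; ++i)
        eq[i] = (a[i] == b[i]) ? 0xFF : 0x00;
    return movemask_model(eq);
}
```

For example, registers filled with 02h and 01h differ everywhere, yet the subtract-based mask is zero; the compare-based mask is not 0FFFFh, so the mismatch is detected.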
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 27, 2012, 11:48:54 AM
Thanks Jochen, I'll arrange a new algo to test with your
suggestion.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 27, 2012, 09:37:20 PM
I wrote a new CheckDestX PROC to use Jochen's suggestion:
Code: [Select]
; -----------------------------------------------------------------------------------------------
CheckDestX proc

    lea eax, Dest
    mov ebx, 32323232h
   
    mov ecx, (4096/16)

    movd xmm0, ebx
    pshufd xmm0, xmm0, 0

 @@:

    movdqa xmm1, [eax]

    psubd xmm1, xmm0
    pmovmskb edx, xmm1   ; set byte mask in edx
    test edx, edx   

    jne CheckErr
   
       
    add eax, 16
    dec ecx
    jnz @B
 
 CheckOK:

    lea eax, Check
    movq mm0, qword ptr TestOK
    movq qword ptr [eax], mm0
    jmp  EndCheck

 CheckErr:

    lea eax, Check
    movq mm0, qword ptr TestERR
    movq qword ptr [eax], mm0

 EndCheck:

    ret
 
CheckDestX endp

It gives the same results as the CheckDest PROC and
is probably quite fast, but I haven't tested its performance yet.

But I'm still not satisfied with the CPUID results:
Code: [Select]
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
---------------------------------------------------------
13876   cycles for MOV AX - Test OK
8740    cycles for LEA - Test OK
4131    cycles for MMX/PUNPCKLBW - Test OK
3153    cycles for XMM/PSHUFB - I shot - Test OK
2376    cycles for XMM/PSHUFB - II shot - Test OK
12336   STOSB - Test OK
---------------------------------------------------------
9242    cycles for MOV AX - Test OK
8731    cycles for LEA - Test OK
4131    cycles for MMX/PUNPCKLBW - Test OK
3153    cycles for XMM/PSHUFB - I shot - Test OK
2376    cycles for XMM/PSHUFB - II shot - Test OK
12330   STOSB - Test OK
---------------------------------------------------------

--- ok ---

This time I've used PrintCpu and the MasmBasic include,
but the results are still not accurate. My PC has SSSE3
capability, not SSE4.

Only Alex's code, which I used a couple of years ago, gives
a more accurate result:
Code: [Select]
┌─────────────────────────────────────────────────────────────[27-Nov-2012 at 10:57 GMT]─┐
│OS  : Microsoft Windows 7 Ultimate Edition, 64-bit Service Pack 1 (build 7601)          │
│CPU : Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz with 2 logical core(s) with SSSE3           │

I've read the thread about the CPUID code, but didn't find anything new.
Should I still use Alex's code, or is there a more accurate routine for modern
CPUs?

Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: dedndave on November 27, 2012, 10:16:07 PM
CPU's may have changed a lot
but, operating systems change at a slower rate   :biggrin:
i have a p4, which supports SSE3, running XP
XP does not support AVX instructions, nor does vista, as far as i know

our CPUID programs don't have to be updated very often, either - lol
while we might detect AVX support on a CPU (pretty easy),
it is another thing to judge the level of support offered by the OS (not so easy)

i would guess 97% of the ibm-compatible pc's in use today probably support SSE2
if you go any higher than SSE2, it might be a good idea to provide a fallback routine
it depends on what range of platforms you want your program to run on
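
A fallback scheme of the kind described here can be sketched as a simple priority list. This is a small Python model, not MASM. The routine labels are borrowed from the timing output in this thread, and the feature sets passed in are made-up examples, not readings from a real CPU.

```python
# Routine labels are taken from this thread's timing output; the pairing of
# each label with a required feature is an assumption made for illustration.
ROUTINES = [
    # (required feature, routine label), fastest first
    ("SSSE3", "XMM/PSHUFB"),      # PSHUFB on XMM registers needs SSSE3
    ("MMX", "MMX/PUNPCKLBW"),     # the MMX unpack version
    (None, "MOV AX"),             # plain integer code, always available
]

def pick_routine(features):
    """Return the label of the first routine whose required feature is present."""
    for required, label in ROUTINES:
        if required is None or required in features:
            return label

# Made-up feature sets: one without SSSE3, one with it.
print(pick_routine({"MMX", "SSE", "SSE2", "SSE3"}))            # MMX/PUNPCKLBW
print(pick_routine({"MMX", "SSE", "SSE2", "SSE3", "SSSE3"}))   # XMM/PSHUFB
```

The last entry has no requirement, so the selection always ends on a routine that runs everywhere.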
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 27, 2012, 10:37:52 PM
Yes Dave, the reasoning is quite fair.
I'm talking about the incorrect data shown by old routines
while we have newer routines, like Alex's one, that are more
accurate, even if they don't go above SSE4.x.
Jochen's library is quite up to date and uses many SSE opcodes [I imagine],
but the PrintCpu macro [I think] should be updated to be more
correct; it doesn't matter if it doesn't cover the latest AVX code or the like.

Well it is just my opinion, of course. Even the CPUID utility that Intel gives us
http://downloadcenter.intel.com/Detail_Desc.aspx?DwnldID=7838
doesn't show that my PC has SSSE3 capabilities, but at least it doesn't say I have
SSE4.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: dedndave on November 27, 2012, 10:59:01 PM
oh - i see what you mean
well - there have been a few that report erroneously
but, to programmatically determine if a specific extension is supported is pretty easy
i.e., i wouldn't use "Alex's" or "Jochen's" or even "Dave's" routine
their purpose is to identify the CPU and capabilities, primarily for forum comparisons

that is a different function than identifying extension support for a program to select routines
what you want to do is actually much simpler   :t
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: dedndave on November 27, 2012, 11:33:13 PM
Code: [Select]
;               0_1 values come from CPUID function 1
;               8_1 values come from CPUID function 80000001h
;
;                Source        Description
;
;                0_1edx:23     MMX
;                8_1edx:22     MMX+    (AMD only)
;                8_1edx:31     3DNow!  (AMD only)
;                8_1edx:30     3DNow!+ (AMD only)
;                0_1edx:25     SSE
;                0_1edx:26     SSE2
;                0_1ecx:00     SSE3
;                0_1ecx:09     SSSE3
;                0_1ecx:19     SSE4.1
;                0_1ecx:20     SSE4.2
;                8_1ecx:06     SSE4a   (AMD only)
;                8_1ecx:11     XOP     (AMD only) - what became of the SSE5 proposal

you can get most of what you want to know by examining ECX and EDX after this...
Code: [Select]
        mov     eax,1
        cpuid
for example, ECX bit 0 will be 1 if SSE3 is supported
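
To make the bit positions concrete, here is a small Python model (not MASM) that decodes the function 1 rows of the table above. The ECX and EDX values are made up by setting individual bits. They are not a dump from a real CPU, and the function name is hypothetical.

```python
# (name, register, bit) for CPUID function 1, copied from the table above.
FEATURE_BITS = [
    ("MMX",    "edx", 23),
    ("SSE",    "edx", 25),
    ("SSE2",   "edx", 26),
    ("SSE3",   "ecx", 0),
    ("SSSE3",  "ecx", 9),
    ("SSE4.1", "ecx", 19),
    ("SSE4.2", "ecx", 20),
]

def decode_features(ecx, edx):
    """List the extensions whose feature bit is set in the given registers."""
    regs = {"ecx": ecx, "edx": edx}
    return [name for name, reg, bit in FEATURE_BITS if (regs[reg] >> bit) & 1]

# Made-up register values: every bit up to SSSE3 set, SSE4.x clear.
ecx = (1 << 0) | (1 << 9)
edx = (1 << 23) | (1 << 25) | (1 << 26)
print(decode_features(ecx, edx))   # ['MMX', 'SSE', 'SSE2', 'SSE3', 'SSSE3']
```

In a real program the same test is a single `bt` or `test` on the register right after `cpuid`.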
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 27, 2012, 11:40:19 PM
Thanks Dave.

CPUID is still unknown territory for me; I've never been in that bit area.
Your introduction to the matter looks interesting, I'll give it a try.  :t
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: dedndave on November 27, 2012, 11:41:13 PM
i updated it a little Frank - you may want to reload the page   :P

oh - and you have to use .586 or higher  to use CPUID   :t
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 27, 2012, 11:46:28 PM
I SEE SSE on the SEASHORE  :icon_eek: 8)
Good to know.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: dedndave on November 27, 2012, 11:50:06 PM
say that 5 times real fast   :lol:
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 28, 2012, 12:59:33 AM
Code: [Select]
psubd xmm0, xmm1
pmovmskb eax, xmm0 ; set byte mask in eax
test eax, eax


This code is a little bit faster on my Core 2 duo:
Code: [Select]
    psubd xmm1, xmm0
    pmovmskb edx, xmm1   ; set byte mask in dx
    cmp   dx, 0 

Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on November 28, 2012, 02:01:18 AM
Code: [Select]
.nolist
include \masm32\include\masm32rt.inc
.686
.xmm
include \masm32\macros\timers.asm ; get them from the Masm32 Laboratory: http://www.masm32.com/board/index.php?topic=770.0

MAIN_COUNT = 3
LOOP_COUNT = 2000

.data

align 16
v1 dd 0,1,0,1
v2 dd 1,0,1,0

.code
start:
push 1
call ShowCpu ; print brand string and SSE level
print "---------------------------------------------------------", 13, 10

mov ecx,MAIN_COUNT
main_loop:
push ecx

test_start macro
invoke Sleep, 100
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
endm

test_end macro text
counter_end
print str$(eax), 9, text, 13, 10
endm

;----------------------------------------------

test_start
mov edx,offset v1
mov ebx,offset v2
mov ecx,LOOP_COUNT
movdqa xmm1,[ebx]
    @@:
movdqa xmm0,[edx]
pcmpeqd xmm0,xmm1
pmovmskb eax,xmm0
movdqa xmm0,[ebx]
pcmpeqd xmm0,xmm1
pmovmskb eax,xmm0
; cmp ax,0FFFFh
dec ecx
jnz @b
test_end "cycles for XMM/pcmpeqd"

;----------------------------------------------

test_start
mov edx,offset v1
mov ebx,offset v2
mov ecx,LOOP_COUNT
movdqa xmm1,[ebx]
    @@:
movdqa xmm0,[edx]
psubd xmm0, xmm1
pmovmskb eax, xmm0   ; set byte mask in eax
movdqa xmm0,[ebx]
psubd xmm0, xmm1
pmovmskb eax, xmm0   ; set byte mask in eax
; test eax,eax
dec ecx
jnz @b
test_end "cycles for XMM/psubd"

print "---------------------------------------------------------", 13, 10
pop ecx
dec ecx
jz @F
jmp main_loop
      @@:
inkey chr$(13, 10, "--- ok ---", 13)
exit

ShowCpu proc ; mode:DWORD
COMMENT @ Usage:
  push 0, call ShowCpu ; simple, no printing, just returns SSE level
  push 1, call ShowCpu ; prints the brand string and returns SSE level@
  pushad
  sub esp, 80 ; create a buffer for the brand string
  mov edi, esp ; point edi to it
  xor ebp, ebp
  .Repeat
  lea eax, [ebp+80000002h]
db 0Fh, 0A2h ; cpuid 80000002h-80000004h
stosd
mov eax, ebx
stosd
mov eax, ecx
stosd
mov eax, edx
stosd
inc ebp
  .Until ebp>=3
  push 1
  pop eax
  db 0Fh, 0A2h ; cpuid 1
  xor ebx, ebx ; CpuSSE
  xor esi, esi ; add zero plus the carry flag
  bt edx, 25 ; edx bit 25, SSE1
  adc ebx, esi
  bt edx, 26 ; edx bit 26, SSE2
  adc ebx, esi
  bt ecx, esi ; ecx bit 0, SSE3
  adc ebx, esi
  bt ecx, 9 ; ecx bit 9, SSSE3 (not SSE4)
  adc ebx, esi
  dec dword ptr [esp+4+32+80] ; dec mode in stack
  .if Zero?
mov edi, esp ; restore pointer to brand string
  .Repeat
.Break .if byte ptr [edi]!=32 ; mode was 1, so show a string but skip leading blanks
inc edi
.Until 0
.if byte ptr [edi]<32
print chr$("pre-P4")
.else
print edi ; CpuBrand
.endif
.if ebx
print chr$(32, 40, "SSE") ; info on SSE level, 40=(
print str$(ebx), 41, 13, 10 ; 41=)
.endif
  .endif
  add esp, 80 ; discard brand buffer (after printing!)
  mov [esp+32-4], ebx ; move ebx into eax stack position - returns eax to main for further use
  ifdef MbBufferInit
call MbBufferInit
  endif
  popad
  ret 4
ShowCpu endp

end start

Quote
Intel(R) Core(TM) i3-2367M CPU @ 1.40GHz (SSE4)
---------------------------------------------------------
4010    cycles for XMM/pcmpeqd
4069    cycles for XMM/psubd
---------------------------------------------------------
4012    cycles for XMM/pcmpeqd
4062    cycles for XMM/psubd
---------------------------------------------------------
4010    cycles for XMM/pcmpeqd
4041    cycles for XMM/psubd
---------------------------------------------------------
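
A note on the "(SSE4)" in the header line: ShowCpu in the listing above adds one to its result for each of four CPUID function 1 bits (EDX 25, EDX 26, ECX 0, ECX 9) and prints "SSE" followed by that count. ECX bit 9 is the SSSE3 flag, so a count of 4 does not by itself say anything about SSE4.1 or SSE4.2. Here is a small Python model of that counting (not MASM), with made-up register values and a hypothetical function name:

```python
def showcpu_level(ecx, edx):
    """Model of the bt/adc chain in ShowCpu: add 1 per feature bit found."""
    level = 0
    level += (edx >> 25) & 1   # SSE
    level += (edx >> 26) & 1   # SSE2
    level += (ecx >> 0) & 1    # SSE3
    level += (ecx >> 9) & 1    # SSSE3
    return level

# Made-up values: a CPU with SSSE3 but no SSE4.x ...
no_sse4 = showcpu_level((1 << 0) | (1 << 9), (1 << 25) | (1 << 26))
# ... and one that also has SSE4.1 (ECX bit 19) and SSE4.2 (ECX bit 20).
with_sse4 = showcpu_level((1 << 0) | (1 << 9) | (1 << 19) | (1 << 20),
                          (1 << 25) | (1 << 26))
print("SSE%d" % no_sse4, "SSE%d" % with_sse4)   # SSE4 SSE4
```

Both cases print the same level, which would explain how a CPU that stops at SSSE3 can show up as "SSE4".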
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 28, 2012, 07:26:30 AM
Well nidud  :t

this seems to work as well as psubd, at the same performance.
So we have a couple of alternatives, at least.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: habran on November 28, 2012, 08:50:00 AM
nidud's code:

Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (SSE4)
---------------------------------------------------------
2988    cycles for XMM/pcmpeqd
3004    cycles for XMM/psubd
---------------------------------------------------------
2987    cycles for XMM/pcmpeqd
3012    cycles for XMM/psubd
---------------------------------------------------------
2978    cycles for XMM/pcmpeqd
3001    cycles for XMM/psubd
---------------------------------------------------------

--- ok ---
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 28, 2012, 09:48:05 AM
Code: [Select]
----------------------------------------------------
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
----------------------------------------------------
9242    cycles for MOV AX - Test OK
8731    cycles for LEA - Test OK
4144    cycles for MMX/PUNPCKLBW - Test OK
3158    cycles for XMM/PSHUFB - I shot - Test OK
2368    cycles for XMM/PSHUFB - II shot - Test OK
12328   cycles for STOSB - Test OK
2070    cycles for CheckDest - Test OK
547     cycles for CheckDestC - Test OK
544     cycles for CheckDestX - Test OK
----------------------------------------------------
9241    cycles for MOV AX - Test OK
8728    cycles for LEA - Test OK
4130    cycles for MMX/PUNPCKLBW - Test OK
3153    cycles for XMM/PSHUFB - I shot - Test OK
2379    cycles for XMM/PSHUFB - II shot - Test OK
12335   cycles for STOSB - Test OK
2069    cycles for CheckDest - Test OK
548     cycles for CheckDestC - Test OK
543     cycles for CheckDestX - Test OK
----------------------------------------------------

CheckDestC is nidud's code modified. For the CPU and SSE level
I used Alex's routine.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on November 28, 2012, 05:53:33 PM
I rewrote the test file with a common loop count for all tests to even out the results. I was wondering if using the xmm0 register might be faster than xmm1, but the test seems to give random results, at least on this machine.

Code: [Select]
; SSETEST.ASM--
; http://www.masm32.com/board/index.php?topic=770.0
;
; make:
; jwasm /c /coff ssetest.asm
; link /SUBSYSTEM:CONSOLE ssetest.obj
;
.nolist
include \masm32\include\masm32rt.inc
.686
.xmm
include \masm32\macros\timers.asm

AxCPUid_Print PROTO

MAIN_COUNT = 4
LOOP_COUNT = 4096/16

.data

align 16
mask1 dd  0C080400h,01010101h,01010101h,01010101h     ; for PSHUFB: dest bytes 0..3 = source bytes 0,4,8,12
ptrmask dd  mask1
PtrDest dd  Dest
PtrSource dd  Source
CPU_Count dd  0

align 8
Check db  8  dup(20h),0,0,0,0
PtrCheck dd  Check

align 8
TestOK db  "Test OK ",0,0,0,0
align 8
TestERR db  "Test ERR",0,0,0,0

.data?

align 16
Dest db 4096 dup(?)
Source dd 4096 dup(?)

.code

start:

;-------------------------------------------------------------------------------
; Before starting the test, the Dest buffer is blanked and the source buffer
; is initialized with dwords of "X000" to make it possible to check the results
; at the end of each tested algo.
;-------------------------------------------------------------------------------

call  BlankDest
call  InitSource
print "---------------------------------------------------------", 13, 10

;     PrintCpu
invoke AxCPUid_Print
print "---------------------------------------------------------", 13, 10

mov ecx,MAIN_COUNT
main_loop:
push ecx

test_start macro
invoke Sleep, 100
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
mov ecx,LOOP_COUNT
mov ebx,offset Source
mov eax,offset Dest
endm

test_end macro text
counter_end
mov CPU_Count, eax
call CheckDestX
print str$(CPU_Count), 9, text
print PtrCheck, 13, 10
call BlankDest
endm

;----------------------------------------------
if 1
test_start
mov esi,ebx
mov edi,eax
      @@:
test_stosb macro
lodsd
stosb
lodsd
stosb
lodsd
stosb
lodsd
stosb
endm
test_stosb
test_stosb
test_stosb
test_stosb
dec ecx
jnz @B
test_end "cycles for STOSB - "

;----------------------------------------------

test_start
      @@:
test_lea macro o_des, o_src
      mov dh,[ebx+o_src+12]
mov dl,[ebx+o_src+8]
shl edx,16
mov dh,[ebx+o_src+4]
mov dl,[ebx+o_src]
mov [eax+o_des],edx
endm
test_lea  0,0
test_lea  4,16
test_lea  8,32
test_lea 12,48
lea eax,[eax+16]
lea ebx,[ebx+64]
dec ecx
jnz @B
test_end "cycles for LEA - "

;----------------------------------------------

test_start
      @@:
test_mov_dx macro o_des, o_src
mov dl,[ebx+o_src]
mov dh,[ebx+o_src+4]
mov [eax+o_des],dx
mov dl,[ebx+o_src+8]
mov dh,[ebx+o_src+12]
mov [eax+o_des+2],dx
endm
test_mov_dx  0,0
test_mov_dx  4,16
test_mov_dx  8,32
test_mov_dx 12,48
add ebx,64
add eax,16
dec ecx
jnz @B
test_end "cycles for MOV DX - "

;----------------------------------------------

test_start
mov esi,ebx
mov edi,eax
      @@:
test_mov_ax macro o_des, o_src
mov al,[esi+o_src]
mov ah,[esi+o_src+4]
mov [edi+o_des],ax
mov al,[esi+o_src+8]	; esi, not ebx: only esi advances in this loop
mov ah,[esi+o_src+12]
mov [edi+o_des+2],ax
endm
test_mov_ax  0,0
test_mov_ax  4,16
test_mov_ax  8,32
test_mov_ax 12,48
add esi,64
add edi,16
dec ecx
jnz @B
test_end "cycles for MOV AX - "

;----------------------------------------------

test_start
      @@:
test_punpcklbw macro o_des, o_src
movd mm0,dword ptr [ebx+o_src]
movd mm1,dword ptr [ebx+o_src+4]
movd mm2,dword ptr [ebx+o_src+8]
movd mm3,dword ptr [ebx+o_src+12]
punpcklbw mm0,mm2
punpcklbw mm1,mm3
punpcklbw mm0,mm1
movd dword ptr [eax+o_des],mm0
endm
test_punpcklbw  0,0
test_punpcklbw  4,16
test_punpcklbw  8,32
test_punpcklbw 12,48
add ebx,64
add eax,16
dec ecx
jnz @B
test_end "cycles for MMX/PUNPCKLBW - "
endif
;----------------------------------------------

test_start
mov edx,ptrmask
movdqa xmm1,[edx]
      @@:
test_pshufb0 macro o_des, o_src
movdqa xmm0,[ebx+o_src]
pshufb xmm0,xmm1
movd dword ptr [eax+o_des],xmm0
endm
test_pshufb0  0,0
test_pshufb0  4,16
test_pshufb0  8,32
test_pshufb0 12,48
add ebx,64
add eax,16
dec ecx
jnz @B
test_end "cycles for XMM/PSHUFB - xmm0,xmm1 - "

;----------------------------------------------

test_start
mov edx,ptrmask
movdqa xmm2,[edx]
      @@:
test_pshufb macro o_des, o_src
movdqa xmm1,[ebx+o_src]
pshufb xmm1,xmm2
movd dword ptr [eax+o_des],xmm1
endm
test_pshufb  0,0
test_pshufb  4,16
test_pshufb  8,32
test_pshufb 12,48
add ebx,64
add eax,16
dec ecx
jnz @B
test_end "cycles for XMM/PSHUFB - I shot - "

;----------------------------------------------

test_start
mov edx,ptrmask
movdqa xmm4,[edx]
      @@:
movdqa xmm0, [ebx]
pshufb xmm0, xmm4
movdqa xmm1, [ebx + 16]
pshufb xmm1, xmm4
movdqa xmm2, [ebx + 32]
pshufb xmm2, xmm4
movdqa xmm3, [ebx + 48]
pshufb xmm3, xmm4
movd dword ptr [eax], xmm0
movd dword ptr [eax + 4], xmm1
movd dword ptr [eax + 8], xmm2
movd dword ptr [eax + 12], xmm3
add ebx, 64
add eax, 16
dec ecx
jnz @b
test_end "cycles for XMM/PSHUFB - II shot - "

;----------------------------------------------
if 0
invoke Sleep, 100
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS

    call CheckDest

counter_end

    mov CPU_Count, eax

    print str$(CPU_Count), 9, "cycles for CheckDest - "
    print PtrCheck, 13, 10

;----------------------------------------------

invoke Sleep, 100
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS

    call CheckDestC

counter_end

    mov CPU_Count, eax

    print str$(CPU_Count), 9, "cycles for CheckDestC - "
    print PtrCheck, 13, 10


;----------------------------------------------

invoke Sleep, 100
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS

    call CheckDestX

counter_end

    mov CPU_Count, eax
    call CheckDestX
print str$(CPU_Count), 9, "cycles for CheckDestX - "
    print PtrCheck, 13, 10
endif
    print "---------------------------------------------------------", 13, 10

;----------------------------------------------
pop ecx
dec ecx
jz @F
jmp main_loop
      @@:

inkey chr$(13, 10, "--- ok ---", 13)
exit



; -----------------------------------------------------------------------------------------------
BlankDest proc

    lea eax, Dest
    mov ebx, 20202020h

    mov ecx, (4096/16)

    movd xmm0, ebx
    pshufd xmm0, xmm0, 0


 @@:

    movdqa [eax], xmm0
    add eax, 16
    dec ecx
    jnz @B

    ret

BlankDest endp

; -----------------------------------------------------------------------------------------------
InitSource proc

    lea eax, Source
    mov ebx, 20202032h

    mov ecx, (4096/4)

    movd xmm0, ebx
    pshufd xmm0, xmm0, 0


 @@:

    movdqa [eax], xmm0
    add eax, 16
    dec ecx
    jnz @B
 ret

InitSource endp

; -----------------------------------------------------------------------------------------------
CheckDest proc

    lea eax, Dest
    mov ebx, 32323232h

    mov ecx, (4096/4)

 @@:

     mov  edx, [eax]
     cmp  edx, ebx


    jne CheckErr


    add eax, 4
    dec ecx
    jnz @B

 CheckOK:

    lea eax, Check
    movq mm0, qword ptr TestOK
    movq qword ptr [eax], mm0
    jmp  EndCheck

 CheckErr:

    lea eax, Check
    movq mm0, qword ptr TestERR
    movq qword ptr [eax], mm0

 EndCheck:

    ret

CheckDest endp

; -----------------------------------------------------------------------------------------------
CheckDestX proc

    lea eax, Dest
    mov ebx, 32323232h

    mov ecx, (4096/16)

    movd xmm0, ebx
    pshufd xmm0, xmm0, 0

 @@:

    movdqa xmm1, [eax]

    psubd xmm1, xmm0

    pmovmskb edx, xmm1   ; set byte mask in dx
    cmp   dx, 0

    jne CheckErr


    add eax, 16
    dec ecx
    jnz @B

 CheckOK:

    lea eax, Check
    movq mm0, qword ptr TestOK
    movq qword ptr [eax], mm0
    jmp  EndCheck

 CheckErr:

    lea eax, Check
    movq mm0, qword ptr TestERR
    movq qword ptr [eax], mm0

 EndCheck:

    ret

CheckDestX endp

; -----------------------------------------------------------------------------------------------
CheckDestC proc

    lea eax, Dest
    mov ebx, 32323232h

    mov ecx, (4096/16)

    movd xmm0, ebx
    pshufd xmm0, xmm0, 0

 @@:

    movdqa xmm1, [eax]

    pcmpeqd xmm1,xmm0

    pmovmskb edx, xmm1   ; set byte mask in dx
    cmp   dx, 0FFFFh

    jne CheckErr

    add eax, 16
    dec ecx
    jnz @B

 CheckOK:

    lea eax, Check
    movq mm0, qword ptr TestOK
    movq qword ptr [eax], mm0
    jmp  EndCheck

 CheckErr:

    lea eax, Check
    movq mm0, qword ptr TestERR
    movq qword ptr [eax], mm0

 EndCheck:

    ret

CheckDestC endp


;#############################################################################
; Instructions detection code by Alex aka Antariy
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

AxCPUid_Features struct DWORD
Is486capable dd ?
IsP1 dd ?
IsP1MMX dd ?
IsPPro dd ?
IsSSE1 dd ?
IsSSE2 dd ?
IsSSE3 dd ?
IsSSSE3 dd ?
IsSSE41 dd ?
IsSSE42 dd ?
BrandName db 64 dup (?)
AxCPUid_Features ends

AxCPUidCodeSizeStart EQU $


; Return values:
; Not zero - if entire structure was filled
; 0 - if CPUID is not supported or the struct was not entirely filled.
;     For clearness, check the Is486capable - if 0 then CPUID is not supported,
;     otherwise CPU family less than PIV, but structure are filled properly.

align 16
AxCPUid_FillStructure proc lpAxCPUid_Features:PTR AxCPUid_Features
.data?
ifrunnedAndSupportCPUid dd ?
.code

push ebp
push ebx

xor eax,eax
mov ecx,sizeof AxCPUid_Features

mov ebp,[esp+4+8]

@@:
mov [ebp+ecx-4].AxCPUid_Features.Is486capable,eax
add ecx,-4
jnz @B

mov ebx,ifrunnedAndSupportCPUid
dec ebx
jz @done
dec ebx
jz @is486capable

inc ifrunnedAndSupportCPUid

pushfd
pop ecx
mov ebx,ecx
xor ebx,200000h ; switch ID flag
push ebx
popfd ; change EFLAGS
pushfd
pop ebx
xor ebx,ecx ; if previous ID bit was equal to current
test ebx,200000h; then it would be dropped
jz @done ; otherwise - it set

inc ifrunnedAndSupportCPUid

@is486capable:

or [ebp].AxCPUid_Features.Is486capable,1

xor eax,eax
cpuid
test eax,eax
jz @done

mov eax,1
cpuid

shr edx,5 ; has RDTSC - so, this is PI at least
adc [ebp].AxCPUid_Features.IsP1,0

shr edx,2 ; has PAE - so, PPro at least
adc [ebp].AxCPUid_Features.IsPPro,0

shr edx,17 ; has MMX
adc [ebp].AxCPUid_Features.IsP1MMX,0

shr edx,2 ; has SSE1
adc [ebp].AxCPUid_Features.IsSSE1,0

shr edx,1 ; has SSE2
adc [ebp].AxCPUid_Features.IsSSE2,0



shr ecx,1 ; has SSE3
adc [ebp].AxCPUid_Features.IsSSE3,0

shr ecx,9 ; has SSSE3
adc [ebp].AxCPUid_Features.IsSSSE3,0

shr ecx,10 ; has SSE4.1
adc [ebp].AxCPUid_Features.IsSSE41,0

shr ecx,1 ; has SSE4.2
adc [ebp].AxCPUid_Features.IsSSE42,0


; get CPU brand name, if exist


; fix for PIII, not SSE2 capable CPU cannot have brand name
cmp [ebp].AxCPUid_Features.IsSSE2,0
jz @done

mov eax,80000000h
cpuid
;pushad
;print hex$(eax),9,"Debug message: extended functions count return",13,10,13,10
;popad
add eax,eax ; check for zero and no any extended functions
jz @done

lea ebp,[ebp].AxCPUid_Features.BrandName
cmp eax,8 ; needed at least 80000004h function
mov eax,0
jb @done

push esi
mov esi,80000002h
@@:
mov eax,esi
cpuid
inc esi
mov [ebp],eax
mov [ebp+4],ebx
mov [ebp+8],ecx
mov [ebp+12],edx
add ebp,16
cmp esi,80000005h
jb @B
pop esi

or eax,1 ; in case of terminated brand string

@done:
pop ebx
pop ebp
ret 4
AxCPUid_FillStructure endp


; Return values:
; 0 - need upgrade
; above 0 - then supported:
; 1 - MMX
; 2 - SSE1
; 3 - SSE2
; 4 - SSE3
; 5 - SSSE3
; 6 - SSE4.1
; 7 - SSE4.2
align 16
AxCPUid_Print proc
push ebx
push esi
push edi
add esp,-1028

invoke GetStdHandle,STD_OUTPUT_HANDLE
xchg eax,ebx

invoke AxCPUid_FillStructure,esp
cmp [esp].AxCPUid_Features.Is486capable,0
jnz @F

push eax
mov edx,esp

push 0
push edx
push sizeof @@needupgrade
push offset @@needupgrade
push ebx
call WriteFile
pop ecx
xor eax,eax
jmp @done

@@:

mov esi,esp

mov edi,[esi].AxCPUid_Features.IsP1MMX
add edi,[esi].AxCPUid_Features.IsSSE1
add edi,[esi].AxCPUid_Features.IsSSE2
add edi,[esi].AxCPUid_Features.IsSSE3
add edi,[esi].AxCPUid_Features.IsSSSE3
add edi,[esi].AxCPUid_Features.IsSSE41
add edi,[esi].AxCPUid_Features.IsSSE42

mov eax,[esi].AxCPUid_Features.IsSSE42
lea eax,[eax*2+offset @@sse42]

mov edx,[esi].AxCPUid_Features.IsSSE41
lea edx,[edx*2+offset @@sse41]

mov ecx,[esi].AxCPUid_Features.IsSSSE3
lea ecx,[ecx*2+offset @@ssse3]

push eax
push edx
push ecx

mov eax,[esi].AxCPUid_Features.IsSSE3
lea eax,[eax*2+offset @@sse3]

mov edx,[esi].AxCPUid_Features.IsSSE2
lea edx,[edx*2+offset @@sse2]

mov ecx,[esi].AxCPUid_Features.IsSSE1
lea ecx,[ecx*2+offset @@sse1]

push eax
push edx
push ecx

mov eax,[esi].AxCPUid_Features.IsP1MMX
lea eax,[eax*2+offset @@mmx]

lea edx,[esi].AxCPUid_Features.BrandName
cmp dword ptr [edx],0
jnz @hasbrandname

mov ecx,[esi].AxCPUid_Features.Is486capable
add ecx,[esi].AxCPUid_Features.IsP1
add ecx,[esi].AxCPUid_Features.IsP1MMX
add ecx,[esi].AxCPUid_Features.IsPPro

cmp ecx,4
jb @F
add ecx,[esi].AxCPUid_Features.IsP1MMX ; PII is PPro with MMX
add ecx,[esi].AxCPUid_Features.IsSSE1 ; PIII is PPro with MMX and SSE1
@@:

mov edx,[ecx*4+offset @@earlycpus]

@hasbrandname:
add edx,1
cmp byte ptr [edx-1]," "
jz @hasbrandname

dec edx

push eax
push edx

push offset @@fmtFeatures
push esi
call wsprintf

mov edx,esp

invoke WriteFile,ebx,esi,eax,edx,0

add esp,10*4

xchg eax,edi

@done:
add esp,1028
pop edi
pop esi
pop ebx
ret

even
@@needupgrade db "This is time for upgrade indeed, i386 or early i486 :)"
even
@@i486 db "Old-good 80486",0
even
@@p1 db "Pentium",0
even
@@pmmx db "Pentium with MMX Technology",0
even
@@ppro db "Pentium Pro",0
even
@@p2 db "Pentium II",0
even
@@p3 db "Pentium III",0

align 4
@@earlycpus dd 0
dd offset @@i486
dd offset @@p1
dd offset @@pmmx
dd offset @@ppro
dd offset @@p2
dd offset @@p3

even
@@fmtFeatures db "%s",13,10,13,10
db "Instructions: %s%s%s%s%s%s%s",13,10,0
even
@@mmx db 0,0
db "MMX",0
even
@@sse1 db 0,0
db ", SSE1",0
even
@@sse2 db 0,0
db ", SSE2",0
even
@@sse3 db 0,0
db ", SSE3",0
even
@@ssse3 db 0,0
db ", SSSE3",0
even
@@sse41 db 0,0
db ", SSE4.1",0
even
@@sse42 db 0,0
db ", SSE4.2",0


AxCPUid_Print endp
AxCPUidCodeSize EQU $-AxCPUidCodeSizeStart

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
;#############################################################################

end start

Quote
---------------------------------------------------------
Intel(R) Core(TM) i3-2367M CPU @ 1.40GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
---------------------------------------------------------
14854   cycles for STOSB - Test OK
9466    cycles for LEA - Test OK
7749    cycles for MOV DX - Test OK
7776    cycles for MOV AX - Test OK
4565    cycles for MMX/PUNPCKLBW - Test OK
2074    cycles for XMM/PSHUFB - xmm0,xmm1 - Test OK
2045    cycles for XMM/PSHUFB - I shot - Test OK
2258    cycles for XMM/PSHUFB - II shot - Test OK
---------------------------------------------------------
14823   cycles for STOSB - Test OK
9056    cycles for LEA - Test OK
7850    cycles for MOV DX - Test OK
7787    cycles for MOV AX - Test OK
4672    cycles for MMX/PUNPCKLBW - Test OK
2013    cycles for XMM/PSHUFB - xmm0,xmm1 - Test OK
2028    cycles for XMM/PSHUFB - I shot - Test OK
2014    cycles for XMM/PSHUFB - II shot - Test OK
---------------------------------------------------------
14836   cycles for STOSB - Test OK
8997    cycles for LEA - Test OK
7851    cycles for MOV DX - Test OK
7784    cycles for MOV AX - Test OK
4748    cycles for MMX/PUNPCKLBW - Test OK
1992    cycles for XMM/PSHUFB - xmm0,xmm1 - Test OK
1137    cycles for XMM/PSHUFB - I shot - Test OK
2006    cycles for XMM/PSHUFB - II shot - Test OK
---------------------------------------------------------
14947   cycles for STOSB - Test OK
9250    cycles for LEA - Test OK
7838    cycles for MOV DX - Test OK
7791    cycles for MOV AX - Test OK
4565    cycles for MMX/PUNPCKLBW - Test OK
1985    cycles for XMM/PSHUFB - xmm0,xmm1 - Test OK
1984    cycles for XMM/PSHUFB - I shot - Test OK
2034    cycles for XMM/PSHUFB - II shot - Test OK
---------------------------------------------------------

With regard to pcmpeqd versus psubd, I think the latter would be the better choice, since it returns 0 when the operands are equal.

Edit: renamed test_pshufb to test_pshufb0
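
For anyone following the PSHUFB tests, here is a small Python model (not MASM) of how the shuffle mask picks bytes. The source dwords are made up so that their low bytes spell CIAO, and the helper names are hypothetical. The thing to watch is that a `dd` constant is stored little-endian, so the lowest byte of the constant is mask byte 0.

```python
import struct

def pshufb(data, mask):
    """Model of PSHUFB on 16 bytes: mask byte i picks the source byte for
    result byte i, or forces a zero when its top bit is set."""
    return bytes(0 if m & 0x80 else data[m & 0x0F] for m in mask)

def dd(*dwords):
    """Lay out dd constants the way they sit in memory (little-endian)."""
    return struct.pack("<%dI" % len(dwords), *dwords)

# Made-up source: four dwords whose low bytes are 'C', 'I', 'A', 'O'.
source = dd(0x11111143, 0x22222249, 0x33333341, 0x4444444F)

# 0C080400h is stored as bytes 00h,04h,08h,0Ch; 0004080Ch as 0Ch,08h,04h,00h.
in_order = dd(0x0C080400, 0x80808080, 0x80808080, 0x80808080)
reverse = dd(0x0004080C, 0x80808080, 0x80808080, 0x80808080)

print(pshufb(source, in_order)[:4])   # b'CIAO'
print(pshufb(source, reverse)[:4])    # b'OAIC'
```

With InitSource filling every dword with the same value (20202032h), both masks produce the same Dest buffer, so the "Test OK" check cannot tell the two byte orders apart.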
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: habran on November 28, 2012, 08:02:29 PM
nidud's latest code produces this on my laptop:

Code: [Select]
---------------------------------------------------------
Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
---------------------------------------------------------
6675    cycles for STOSB - Test OK
4240    cycles for LEA - Test OK
3353    cycles for MOV DX - Test OK
3276    cycles for MOV AX - Test OK
1924    cycles for MMX/PUNPCKLBW - Test OK
1213    cycles for XMM/PSHUFB - xmm0,xmm1 - Test OK
832     cycles for XMM/PSHUFB - I shot - Test OK
1539    cycles for XMM/PSHUFB - II shot - Test OK
---------------------------------------------------------
6093    cycles for STOSB - Test OK
3806    cycles for LEA - Test OK
3403    cycles for MOV DX - Test OK
3277    cycles for MOV AX - Test OK
1945    cycles for MMX/PUNPCKLBW - Test OK
808     cycles for XMM/PSHUFB - xmm0,xmm1 - Test OK
904     cycles for XMM/PSHUFB - I shot - Test OK
1490    cycles for XMM/PSHUFB - II shot - Test OK
---------------------------------------------------------
6289    cycles for STOSB - Test OK
3805    cycles for LEA - Test OK
3668    cycles for MOV DX - Test OK
3684    cycles for MOV AX - Test OK
3044    cycles for MMX/PUNPCKLBW - Test OK
888     cycles for XMM/PSHUFB - xmm0,xmm1 - Test OK
832     cycles for XMM/PSHUFB - I shot - Test OK
901     cycles for XMM/PSHUFB - II shot - Test OK
---------------------------------------------------------
6289    cycles for STOSB - Test OK
3805    cycles for LEA - Test OK
3240    cycles for MOV DX - Test OK
3255    cycles for MOV AX - Test OK
2527    cycles for MMX/PUNPCKLBW - Test OK
833     cycles for XMM/PSHUFB - xmm0,xmm1 - Test OK
832     cycles for XMM/PSHUFB - I shot - Test OK
858     cycles for XMM/PSHUFB - II shot - Test OK
---------------------------------------------------------

--- ok ---
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 28, 2012, 10:51:37 PM
Since you changed the structure of some routines, the results are
quite a lot different.
I still don't understand the logic of comparing two XMM registers with PSUBD.
If they are equal the result is zero, and after the PMOVMSKB it is possible to
test the final result register for zero.
But what happens if the source register is 1 greater than the destination one?
Does PMOVMSKB detect the difference or not? According to what I've
got up to now, it shouldn't.  ::)
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: jj2007 on November 28, 2012, 11:05:21 PM
The PMOVMSKB does or doesn't detect the difference?

It does. Launch some tests with Olly to see what happens. Anyway, PCM*** does the same job as PSUBD, and they are equally fast (e.g. one cycle on my AMD).
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 29, 2012, 12:22:22 AM
I compared two XMM registers, with one of them greater
than the other.
According to this test, PSUBD doesn't detect it ::)
Code: [Select]
------------------------------------
Test on PCMPEQD - Test ERR
------------------------------------
Test on PSUBD   - Test OK
------------------------------------

Press any key to continue ...

This is the code I used. Did I make any error?

Code: [Select]
; ---------------------------------------------------------------------
; TEST_PSUBD.ASM--
; http://www.masm32.com/board/index.php?topic=770.0
;-------------------------------------------------------------------------------
; Test the difference between PCMPEQD and PSUBD when comparing two XMM
; registers.
; 28/Nov/2012 - MASM FORUM - frktons
;-------------------------------------------------------------------------------



.nolist
include \masm32\include\masm32rt.inc
.686
.xmm


.data

align 8
Check db  8  dup(20h),0,0,0,0
PtrCheck dd  Check

align 8
TestOK db  "Test OK ",0,0,0,0
align 8
TestERR db  "Test ERR",0,0,0,0


.code

start:


print "---------------------------------------------------------", 13, 10
      print "Test on PCMPEQD - "
      call  PCMP_TEST
      print PtrCheck, 13, 10
print "---------------------------------------------------------", 13, 10

      print "Test on PSUBD   - "
      call  PSUB_TEST
      print PtrCheck, 13, 10
print "---------------------------------------------------------", 13, 10, 13, 10
      inkey

      exit
     
; -----------------------------------------------------------------------------------------------
PSUB_TEST proc


    mov ebx, 32323232h
    mov edx, 00000001h

    movd xmm2, edx
    pshufd xmm2, xmm2, 0

    movd xmm0, ebx
    pshufd xmm0, xmm0, 0

    movdqa xmm1, xmm0

    paddd  xmm1, xmm2

    psubd xmm1,xmm0

    pmovmskb edx, xmm1   ; set byte mask in dx
    cmp   dx, 0

    jne CheckErr

 CheckOK:

    lea eax, Check
    movq mm0, qword ptr TestOK
    movq qword ptr [eax], mm0
    jmp  EndCheck

 CheckErr:

    lea eax, Check
    movq mm0, qword ptr TestERR
    movq qword ptr [eax], mm0

 EndCheck:

    ret

PSUB_TEST endp

; -----------------------------------------------------------------------------------------------
PCMP_TEST proc


    mov ebx, 32323232h

    mov edx, 00000001h

    movd xmm2, edx
    pshufd xmm2, xmm2, 0

    movd xmm0, ebx
    pshufd xmm0, xmm0, 0

    movdqa xmm1, xmm0

    paddd  xmm1, xmm2

    pcmpeqd xmm1,xmm0

    pmovmskb edx, xmm1   ; set byte mask in dx
    cmp   dx, 0FFFFh

    jne CheckErr

 CheckOK:

    lea eax, Check
    movq mm0, qword ptr TestOK
    movq qword ptr [eax], mm0
    jmp  EndCheck

 CheckErr:

    lea eax, Check
    movq mm0, qword ptr TestERR
    movq qword ptr [eax], mm0

 EndCheck:

    ret

PCMP_TEST endp


end start
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: jj2007 on November 29, 2012, 01:08:17 AM
The logic is inverted:
pcmpeqb for xmm1=xmm0: xmm1 becomes ffffffffffffffffh
psubd  for xmm1=xmm0: xmm1 becomes 0h
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 29, 2012, 01:44:12 AM
The logic is inverted:
pcmpeqb for xmm1=xmm0: xmm1 becomes ffffffffffffffffh
psubd  for xmm1=xmm0: xmm1 becomes 0h


So what is my error? I was aware that the logic is inverted
and I tested:
Code: [Select]
    cmp    dx, 0

    jne CheckErr
for PSUBD, and

Code: [Select]
    cmp   dx, 0FFFFh

    jne CheckErr

for PCMPEQD. ::)
 
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: jj2007 on November 29, 2012, 03:12:22 AM
It seems pcmpeqb always returns zero, unless the xmm bytes are FFh...
Code: [Select]
---------------------------------------------------------
Test on PCMPEQD -
pcmpeqd in
xmm1            3617008641903833650
xmm0            3617008641903833650
pcmpeqd out     xmm1            -1

pmovmskb
xmm1            -1
edx             65535
Test OK
---------------------------------------------------------
Test on PSUBD   -
PSubD in
xmm1            3617008641903833650
xmm0            3617008641903833650
PSubD out       xmm1            0

pmovmskb
xmm1            0
dx              0
Test OK
---------------------------------------------------------

---------------------------------------------------------
Test on PCMPEQD -
pcmpeqd in
xmm1            3617008646198800947
xmm0            3617008641903833650
pcmpeqd out     xmm1            0

pmovmskb
xmm1            0
edx             0
Test ERR
---------------------------------------------------------
Test on PSUBD   -
PSubD in
xmm1            3617008646198800947
xmm0            3617008641903833650
PSubD out       xmm1            4294967297

pmovmskb
xmm1            4294967297
dx              0
Test OK
---------------------------------------------------------
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on November 29, 2012, 04:17:03 AM
Since you changed the structure of some routines, the results are a little
bit different, I mean quite a lot different.

This:
Code: [Select]
mov al,[esi]
mov dl,[esi+4]
mov cl,[esi+8]
mov bl,[esi+12]
mov [edi],al
mov [edi+1],dl
mov [edi+2],cl
mov [edi+3],bl
Is faster than this:
Code: [Select]
mov ecx,4
@@:
mov al,[esi]
mov [edi],al
add esi,4
add edi,1
dec ecx
jnz @B
Since the loop itself will add extra time to the test.

To even the result, the extra loop was removed:
Code: [Select]
mov al,[esi]
mov [edi],al
mov al,[esi+4]
mov [edi+1],al
mov al,[esi+8]
mov [edi+2],al
mov al,[esi+12]
mov [edi+3],al

I still don't understand the logic of comparing two XMM with PSUBD.
If they are equal they return zero and after the PMOVMSKB it is possible to
test for zero the final result register.
But what happens if the source register is 1 greater than destination one?
The PMOVMSKB does or doesn't detect the difference? According to what I've
got up to now, it shouldn't.  ::)

Hmm, there seems to be something wrong with this logic.

What does pxor xmm0, xmm0 do?
The same thing as xor rax, rax?

If so, the result is 00000000000000000000000000000000h in the first, and 0000000000000000h in the latter.

The compare function (read back with PMOVMSKB) returns 16 bits representing the result.
If equal, the result is FFFFh.

Maybe you have to use the cmp function after all.

Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on November 29, 2012, 04:37:01 AM
Maybe you could use CMPNEQPS
The result should then be zero if equal
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 29, 2012, 04:41:35 AM
When I read the Intel manuals about PMOVMSKB,
I found something that doesn't fit with using it to
compare two XMM registers for equality:
Code: [Select]
Creates a mask made up of the most significant bit of each byte of the source
operand (second operand) and stores the result in the low byte or word of the destination
operand (first operand).
If only the MSBits are stored into the destination operand, and the difference is in other
bits, it will not be detected.
So my idea is that after PSUBD we have to use a different opcode to
detect whether there are differences other than in the MSBits of the xmm we are testing.

On the other hand, using PCMPEQD we can test both the equality and the difference
between the xmm registers, using PMOVMSKB.
This is what I've understood so far.
Using PSUBD is a smart solution but it needs to be followed by something
other than PMOVMSKB, in my opinion.

So far nidud's solution is the one I understand. Waiting for some other solution.
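
To make the PMOVMSKB point above concrete, here is a small Python model of the three instructions. It is Python rather than MASM only so that it runs anywhere; the helper names are made up for this sketch, and the 128-bit registers are modelled as plain integers. It reproduces the case from the test program, where every dword of xmm1 is one greater than the matching dword of xmm0.

```python
# Model of PSUBD, PCMPEQD and PMOVMSKB on 128-bit values held as Python ints.
MASK32 = 0xFFFFFFFF

def dwords(x):
    """Split a 128-bit integer into four 32-bit lanes, lowest lane first."""
    return [(x >> (32 * i)) & MASK32 for i in range(4)]

def join_dwords(lanes):
    """Pack four 32-bit lanes back into one 128-bit integer."""
    return sum((lane & MASK32) << (32 * i) for i, lane in enumerate(lanes))

def psubd(dst, src):
    """Lane-wise 32-bit subtraction with wraparound, like PSUBD dst, src."""
    return join_dwords([(d - s) & MASK32
                        for d, s in zip(dwords(dst), dwords(src))])

def pcmpeqd(dst, src):
    """Lane-wise equality: all ones where equal, zero elsewhere."""
    return join_dwords([MASK32 if d == s else 0
                        for d, s in zip(dwords(dst), dwords(src))])

def pmovmskb(x):
    """Collect the most significant bit of each of the 16 bytes."""
    return sum(((x >> (8 * i + 7)) & 1) << i for i in range(16))

def broadcast(dword):
    """Same effect as MOVD followed by PSHUFD xmm, xmm, 0."""
    return join_dwords([dword] * 4)

xmm0 = broadcast(0x32323232)
xmm1 = broadcast(0x32323233)              # every lane differs from xmm0 by 1

sub_mask = pmovmskb(psubd(xmm1, xmm0))    # difference is 1 per lane, no MSB set
eq_mask = pmovmskb(pcmpeqd(xmm1, xmm0))   # no lane is equal, so no bit is set
print(hex(sub_mask), hex(eq_mask))        # prints: 0x0 0x0
```

PSUBD + PMOVMSKB gives 0 here, exactly as it would for equal registers, so a `cmp dx, 0` check cannot see the difference. PCMPEQD + PMOVMSKB gives 0 instead of 0FFFFh, so the `cmp dx, 0FFFFh` check does catch it.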
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 29, 2012, 04:43:35 AM
Maybe you could use CMPNEQPS
The result should then be zero if equal


Yes, probably this opcode will work as well.

Quote
What does pxor xmm0, xmm0 do?
The same thing as xor rax, rax?

yes again. So far I think the PCMPEQD variant is the complete one
for testing equality. Something is missing, in my opinion for PSUBD.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: qWord on November 29, 2012, 04:50:32 AM
What do you want to compare? FP or integer value?
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 29, 2012, 04:54:25 AM
What do you want to compare? FP or integer value?

xmm integer packed data is what I'd like to compare.

I want to know, for example, if xmm0 is equal to xmm1.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: qWord on November 29, 2012, 05:00:07 AM
What do you want to compare? FP or integer value?

xmm integer packed data is what I'd like to compare.

I want to know, for example, if xmm0 is equal to xmm1.
so, you are right with PCMPEQxx + PMOVMSKB.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: jj2007 on November 29, 2012, 05:02:03 AM
What do you want to compare? FP or integer value?

Test for equality only, so FP or INT won't make a difference. Although I wonder how CMPNEQPS aka CMPPS xmmDest, xmmSrc, 4 handles exotic cases (NaN vs 0 etc). In any case, PCMPEQxx is the right choice, as qWord already wrote.

Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: qWord on November 29, 2012, 05:07:05 AM
Although I wonder how CMPNEQPS aka CMPPS xmmDest, xmmSrc, 4 handles exotic cases (NaN vs 0 etc).
can't work because there are N possibilities to represent a NaN, whereas CMPxxPS returns true for all pairs of NaNs.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: jj2007 on November 29, 2012, 05:13:34 AM
can't work because there are N possibilities to represent a NaN, whereas CMPxxPS returns true for all pairs of NaNs.

Grazie, good to know :t
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: qWord on November 29, 2012, 05:15:49 AM
can't work because there are N possibilities to represent a NaN, whereas CMPxxPS returns true for all pairs of NaNs.

Grazie, good to know :t
oops..., that applies to the unordered compare; for CMPEQPS it always returns false.
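
A rough Python model of the NaN discussion above, one lane at a time. It is only a sketch of the compare predicates as I understand them from the Intel manual: the function names are invented here, and each 32-bit pattern is reinterpreted as an IEEE-754 single the way CMPPS would see it. It also shows a second case where the floating-point and integer compares disagree, +0.0 versus -0.0.

```python
import math
import struct

def as_float(bits):
    """Reinterpret a 32-bit pattern as an IEEE-754 single precision value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def cmpeqps_lane(a_bits, b_bits):
    """Ordered equality (CMPEQPS): false whenever either operand is NaN."""
    return as_float(a_bits) == as_float(b_bits)

def cmpneqps_lane(a_bits, b_bits):
    """CMPNEQPS is the negation of CMPEQPS, so it is true for NaN operands."""
    return not cmpeqps_lane(a_bits, b_bits)

def cmpunordps_lane(a_bits, b_bits):
    """Unordered compare (CMPUNORDPS): true if at least one operand is NaN."""
    return math.isnan(as_float(a_bits)) or math.isnan(as_float(b_bits))

def pcmpeqd_lane(a_bits, b_bits):
    """Integer equality (PCMPEQD): a plain bit-pattern comparison."""
    return a_bits == b_bits

QNAN = 0x7FC00000        # one of many quiet NaN encodings
POS_ZERO = 0x00000000    # +0.0
NEG_ZERO = 0x80000000    # -0.0

# Identical bit patterns, yet the floating-point compare reports "not equal".
print(cmpeqps_lane(QNAN, QNAN), pcmpeqd_lane(QNAN, QNAN))
# Different bit patterns, yet the floating-point compare reports "equal".
print(cmpeqps_lane(POS_ZERO, NEG_ZERO), pcmpeqd_lane(POS_ZERO, NEG_ZERO))
```

Both cases give the wrong answer for packed integer data, which is the reason to stay with PCMPEQxx for this job.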
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 29, 2012, 05:24:12 AM
I'm glad to see everybody agreed eventually.  :t

Now let's go further.

How do I compare for greater than?
 :P

Same registers, same data type.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: dedndave on November 29, 2012, 06:18:43 AM
reverse the operands ?
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: qWord on November 29, 2012, 06:20:50 AM
Now let's go further.

How do I compare for greater than?
 :P

Same registers, same data type.
PCMPGTD  ::)
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 29, 2012, 06:22:44 AM
Now let's go further.

How do I compare for greater than?
 :P

Same registers, same data type.
PCMPGTD  ::)

Thanks qWord, maybe this time I'll take less time to get the info.
Well it looks quite simple to manage:
Quote
If a data element in the destination operand is greater
than the corresponding data element in the source operand, the corresponding data
element in the destination operand is set to all 1s; otherwise, it is set to all 0s.
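
The manual text quoted above can be sketched in Python like this. The helper names are hypothetical and the four dword lanes are kept in a list, lowest lane first. Two things the sketch tries to show: the compare is signed, and the result is decided per lane, so the mask can be mixed.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(value):
    """Interpret a 32-bit pattern as a signed integer, as PCMPGTD does."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value

def pcmpgtd(dst_lanes, src_lanes):
    """Per-lane signed greater-than: all ones where dst > src, else zero."""
    return [MASK32 if to_signed32(d) > to_signed32(s) else 0
            for d, s in zip(dst_lanes, src_lanes)]

def pmovmskb_lanes(lanes):
    """PMOVMSKB over four dword lanes: one result bit per byte."""
    mask = 0
    for lane_index, lane in enumerate(lanes):
        for byte_index in range(4):
            msb = (lane >> (8 * byte_index + 7)) & 1
            mask |= msb << (4 * lane_index + byte_index)
    return mask

dst = [5, 1, 7, 0x80000000]   # last lane is negative when read as signed
src = [3, 1, 9, 1]
result = pcmpgtd(dst, src)
print([hex(lane) for lane in result], hex(pmovmskb_lanes(result)))
# prints: ['0xffffffff', '0x0', '0x0', '0x0'] 0xf
```

Only the lowest lane is greater, so only the low four mask bits are set. The top lane holds 80000000h, which is larger than 1 as an unsigned number but smaller as a signed one, so it does not count as greater.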
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on November 29, 2012, 08:32:39 AM
To conclude:

pcmpeqb compares 16 bytes, pcmpeqd compares 4 doublewords, result in xmm0 is the same.
pcmpgtb compares 16 bytes, pcmpgtd compares 4 doublewords, result in xmm0 is not the same.

Code: [Select]
pcmpeqb xmm0,xmm1 ; xmm0 to FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFh if equal
pmovmskb eax,xmm0 ; eax to 0000FFFFh
cmp ax,0FFFFh
je is_equal

pcmpgtb xmm0,xmm1 ; xmm0 to FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFh if greater
pmovmskb eax,xmm0 ; eax to 0000FFFFh
cmp ax,8000h ; ?
jle is_great
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on November 29, 2012, 08:52:13 AM
Hmm, maybe:
Code: [Select]
cmp ax,8000h
jae is_great
..
cmp ax,-1
jle is_great
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 29, 2012, 09:52:49 AM
To conclude:

pcmpeqb compares 16 bytes, pcmpeqd compares 4 doublewords, result in xmm0 is the same.
pcmpgtb compares 16 bytes, pcmpgtd compares 4 doublewords, result in xmm0 is not the same.

Code: [Select]
pcmpeqb xmm0,xmm1 ; xmm0 to FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFh if equal
pmovmskb eax,xmm0 ; eax to 0000FFFFh
cmp ax,0FFFFh
je is_equal

pcmpgtb xmm0,xmm1 ; xmm0 to FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFh if greater
pmovmskb eax,xmm0 ; eax to 0000FFFFh
cmp ax,8000h ; ?
jle is_great


I don't think the test for GT is right. What I've understood so far:

when we use pcmpgtb
we get FF only in the bytes that are greater, not in all of them,
and the same goes for the dwords: with pcmpgtd, only the dwords that are greater
are set to FFFFFFFF, and the remaining ones are set to
00000000 if they are equal or less.

Instead of this:
Code: [Select]
pcmpgtb xmm0,xmm1 ; xmm0 to FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFh if greater
pmovmskb eax,xmm0 ; eax to 0000FFFFh
cmp ax,8000h ; ?
jle is_great

Something like:

Code: [Select]
pcmpgtd xmm0,xmm1
pmovmskb eax,xmm0
        .if bit ax, 15
            jmp IsGreater
        .endif

If bit 15 is not 1, the way is longer, and we have to test
other things:

Code: [Select]
pcmpgtd xmm2,xmm3; same values in reverse order
pmovmskb ebx,xmm2
        .if bit bx, 15
            jmp IsGreater;  The second original value tested
        .endif
        .if ax == bx
            jmp AreEqual
        .elseif ax > bx
            jmp IsGreater
        .else
            jmp IsLessThan
        .endif

Not tested yet, but this is the idea.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on November 29, 2012, 10:59:37 AM
we have FF only in the bytes that are greater, not all of them,
and the same for the dword, with pcmpgtd,only the dword that are greater

The upper byte should always be the same (FF,? or FFFFFFFF,? if greater)

Code: [Select]
pcmpgtd xmm2,xmm3; same values in reverse order
pmovmskb ebx,xmm2
        .if bit bx, 15
            jmp IsGreater;  The second original value tested
        .endif
        .if ax == bx
            jmp AreEqual
        .else
            jmp IsLessThan; The second original value tested
        .endif
Not tested yet, but this is the idea.

assuming ax is zero, I think that is correct

same as:
Code: [Select]
test ah,80h
jnz is_great
test ax,ax
jz is_equal
jmp is_less
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 29, 2012, 11:07:16 AM

The upper byte should always be the same (FF,? or FFFFFFFF,? if greater)

Yes and no. It depends on whether the byte or dword tested is greater, not the whole xmm register.
But if the upper byte, word or dword is FF, you know the entire xmm register is greater; otherwise
you have to check other things.

Code: [Select]
pcmpgtd xmm2,xmm3; same values in reverse order
pmovmskb ebx,xmm2
        .if bit bx, 15
            jmp IsGreater;  The second original value tested
        .endif
        .if ax == bx
            jmp AreEqual
        .else
            jmp IsLessThan; The second original value tested
        .endif
Not tested yet, but this is the idea.

assuming ax is zero, I think that is correct

I modified the code, there was a logical error, have a look.
It should work with whatever value ax and bx assume.

Quote
same as:
Code: [Select]
test ah,80h
jnz is_great
test ax,ax
jz is_equal
jmp is_less


I think you have to use 2 registers, not only ax. As I understand it,
you cannot check for greater, equal or less than in a single pass. Only
if you are lucky do you find the answer in the first check: if the upper byte is FF
you can say xmm0 is greater than xmm1.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 29, 2012, 11:46:50 AM
So the complete test for greater, less than or equal should
be something like this:

Code: [Select]
        pcmpgtd   xmm0,xmm1
pmovmskb eax,xmm0
        .if bit ax, 15
            jmp IsGreater
        .endif

pcmpgtd xmm2,xmm3; same values in reverse order
pmovmskb ebx,xmm2
        .if bit bx, 15
            jmp IsLessThan
        .endif
        .if ax == bx
            jmp AreEqual
        .elseif ax > bx
            jmp IsGreater
        .else
            jmp IsLessThan
        .endif

I'll try the code and let you know.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on November 29, 2012, 12:00:35 PM
ahh, we dont know if it's equal or less  :redface:
Code: [Select]
movdqa xmm2,xmm0 ; save dest
pcmpeqb xmm0,xmm1 ; test equal first
pmovmskb eax,xmm0
cmp ax,-1
je is_equal
movdqa xmm0,xmm2
pcmpgtb xmm0,xmm1
pmovmskb eax,xmm0
test ah,80h
jnz is_great
jmp is_less

Edit: movq --> movdqa  ::)
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 29, 2012, 12:15:18 PM
Read my last posts again: yes, we can tell whether the
compare gives GT, LT or EQ.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on December 03, 2012, 08:39:08 AM
Using LEA in the test code not only seems to give random results, the timing also changes with the value of IP on entry to the test. I did some testing on the atol function, which uses LEA to multiply by 10, and I noticed that changing code unrelated to this function had an effect on the result. This is the test code used:
Code: [Select]
; ATOL.ASM--
; http://www.masm32.com/
;
; Test case for using LEA
;
; make:
; jwasm /c /coff atol.asm
; link /subsystem:console atol.obj
;
.xlist
include \masm32\include\masm32rt.inc
.686
.xmm
include \masm32\macros\timers.asm
.list

MAIN_COUNT = 2
LOOP_COUNT = 3000

atol1 proto string:dword
atol2 proto string:dword
atol3 proto string:dword

.data

align 16
;db 0
v1 db "65636",0
v2 db "2147483647",0

.code
start:
push 1
call ShowCpu ; print brand string and SSE level
print "---------------------------------------------------------", 13, 10

mov ecx,MAIN_COUNT
main_loop:
push ecx

test_start macro
align 16
invoke Sleep, 100
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
endm

test_end macro text
counter_end
print str$(eax), 9, text, 13, 10
endm

;----------------------------------------------

test_start
mov esi,LOOP_COUNT
    @@:
invoke atol3,addr v1
invoke atol3,addr v2
dec esi
jnz @b
test_end "cycles for atol LODSB"

;----------------------------------------------

test_start
mov esi,LOOP_COUNT
    @@:
invoke atol2,addr v1
invoke atol2,addr v2
dec esi
jnz @b
test_end "cycles for atol SHL"

;----------------------------------------------

test_start
mov esi,LOOP_COUNT
    @@:
invoke atol1,addr v1
invoke atol1,addr v2
dec esi
jnz @b
test_end "cycles for atol LEA"

;----------------------------------------------

print "---------------------------------------------------------", 13, 10
pop ecx
dec ecx
jz @F
jmp main_loop
      @@:
inkey chr$(13, 10, "--- ok ---", 13)
exit

ShowCpu proc ; mode:DWORD
COMMENT @ Usage:
  push 0, call ShowCpu ; simple, no printing, just returns SSE level
  push 1, call ShowCpu ; prints the brand string and returns SSE level@
  pushad
  sub esp, 80 ; create a buffer for the brand string
  mov edi, esp ; point edi to it
  xor ebp, ebp
  .Repeat
  lea eax, [ebp+80000002h]
db 0Fh, 0A2h ; cpuid 80000002h-80000004h
stosd
mov eax, ebx
stosd
mov eax, ecx
stosd
mov eax, edx
stosd
inc ebp
  .Until ebp>=3
  push 1
  pop eax
  db 0Fh, 0A2h ; cpuid 1
  xor ebx, ebx ; CpuSSE
  xor esi, esi ; add zero plus the carry flag
  bt edx, 25 ; edx bit 25, SSE1
  adc ebx, esi
  bt edx, 26 ; edx bit 26, SSE2
  adc ebx, esi
  bt ecx, esi ; ecx bit 0, SSE3
  adc ebx, esi
  bt ecx, 9 ; ecx bit 9, SSE4
  adc ebx, esi
  dec dword ptr [esp+4+32+80] ; dec mode in stack
  .if Zero?
mov edi, esp ; restore pointer to brand string
  .Repeat
.Break .if byte ptr [edi]!=32 ; mode was 1, so show a string but skip leading blanks
inc edi
.Until 0
.if byte ptr [edi]<32
print chr$("pre-P4")
.else
print edi ; CpuBrand
.endif
.if ebx
print chr$(32, 40, "SSE") ; info on SSE level, 40=(
print str$(ebx), 41, 13, 10 ; 41=)
.endif
  .endif
  add esp, 80 ; discard brand buffer (after printing!)
  mov [esp+32-4], ebx ; move ebx into eax stack position - returns eax to main for further use
  ifdef MbBufferInit
call MbBufferInit
  endif
  popad
  ret 4
ShowCpu endp

align 16

atol1 proc lpSrc:DWORD

    xor eax, eax ; zero EAX
    mov edx, lpSrc
    movzx ecx, BYTE PTR [edx]
    add edx, 1
    cmp ecx, "-" ; test for sign
    jne lbl0
    add eax, 1  ; set EAX if sign
    movzx ecx, BYTE PTR [edx]
    add edx, 1

  lbl0:
    push eax    ; store sign on stack
    xor eax, eax ; so eax*10 will be 0 for first digit
;-----------------------------------
; normal: 198057
if 1 ; makes it fast: 186000
nop
endif
; not using align 16:
if 0 ; makes it slow: 294520
nop
nop
nop
nop
nop
nop
nop
 if 0 ; makes it fast: 195044
 nop
 nop
 nop
 nop
 nop
 nop
 endif
endif
;-----------------------------------
  lbl1:
    sub ecx, 48
    jc  lbl2
    lea eax, [eax+eax*4] ; mul eax by 5
    lea eax, [ecx+eax*2] ; mul eax by 2 and add digit value
    movzx ecx, BYTE PTR [edx]   ; get next digit
    add edx, 1
    jmp lbl1

  lbl2:
    pop ecx      ; retrieve sign
    test ecx, ecx
    jnz lbl3
    ret

  lbl3:
    neg eax      ; negate the return value if the sign was set
    ret
atol1 endp

align 4

atol2 proc uses ebx string:dword
mov ebx,string
sub ecx,ecx
      @@:
mov cl,[ebx]
inc ebx
cmp cl,' '
je @B
push ecx
cmp cl,'-'
je @F
cmp cl,'+'
jne atol_set
      @@:
mov cl,[ebx]
inc ebx
    atol_set:
    sub eax,eax
;-----------------------------------
; normal: 252233
if 1  ; : 237000
nop
nop
endif
;-----------------------------------
    atol_loop:
    sub cl,'0'
jc @F
mov edx,eax
shl eax,3
add eax,edx
add eax,edx
add eax,ecx
mov cl,[ebx]
inc ebx
jmp atol_loop
      @@:
pop edx
cmp dl,'-'
je atol_neg
    atol_end:
ret
    atol_neg:
neg eax
jmp atol_end
atol2 endp

align 16

atol3 proc uses esi string:dword
mov esi,string
sub eax,eax
      @@:
lodsb
cmp al,' '
je @B
push eax
cmp al,'-'
je @F
cmp al,'+'
jne atol_set
      @@:
lodsb
    atol_set:
    sub ecx,ecx
    atol_loop:
    sub al,'0'
jc @F
mov edx,ecx
shl ecx,3
add ecx,edx
add ecx,edx
add ecx,eax
lodsb
jmp atol_loop
      @@:
mov eax,ecx
pop edx
cmp dl,'-'
je atol_neg
    atol_end:
ret
    atol_neg:
neg eax
jmp atol_end
atol3 endp

end start

The first result from the test:
Quote
AMD Athlon(tm) II X2 245 Processor (SSE3)
---------------------------------------------------------
312187   cycles for atol LODSB
243102   cycles for atol SHL
294036   cycles for atol LEA
---------------------------------------------------------
312118   cycles for atol LODSB
244057   cycles for atol SHL
294086   cycles for atol LEA
---------------------------------------------------------
312117   cycles for atol LODSB
243131   cycles for atol SHL
294214   cycles for atol LEA
---------------------------------------------------------

I then aligned the code entry of the functions, and tuned the actual loop code to get the best result:
Quote
---------------------------------------------------------
297479   cycles for atol LODSB
237504   cycles for atol SHL
195149   cycles for atol LEA
---------------------------------------------------------
297359   cycles for atol LODSB
237311   cycles for atol SHL
197737   cycles for atol LEA
---------------------------------------------------------

The order of the test code could then have an effect on the result.
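
For anyone following along without an assembler at hand, the two multiply-by-10 sequences being timed above can be modelled in Python as below. This is just a sketch to check that both steps compute the same thing with 32-bit wraparound; the function names are made up, and it says nothing about the timing differences, which come from code placement.

```python
MASK32 = 0xFFFFFFFF

def step_lea(acc, digit):
    """lea eax,[eax+eax*4] then lea eax,[ecx+eax*2], with 32-bit wraparound."""
    acc = (acc + acc * 4) & MASK32        # acc * 5
    return (digit + acc * 2) & MASK32     # acc * 10 + digit

def step_shl(acc, digit):
    """mov edx,eax / shl eax,3 / add eax,edx (twice) / add eax,ecx."""
    saved = acc
    acc = (acc << 3) & MASK32             # acc * 8
    acc = (acc + saved + saved) & MASK32  # acc * 10
    return (acc + digit) & MASK32

def atol_model(text, step):
    """Convert an unsigned decimal string using the given accumulate step."""
    acc = 0
    for ch in text:
        digit = ord(ch) - 48              # same as sub ecx,48
        if digit < 0:                     # character below '0' ends the loop,
            break                         # which is what the jc does
        acc = step(acc, digit)
    return acc

for text in ("65636", "2147483647"):      # the two strings used in the test
    print(text, atol_model(text, step_lea), atol_model(text, step_shl))
```

Like the assembly, the model only stops on a character below '0'; it does not reject characters above '9'.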
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on December 03, 2012, 09:37:11 AM
A test case for memcpy:
Code: [Select]
; MEMCPY.ASM--
; http://www.masm32.com/
;
; make:
; jwasm /c /coff memcpy.asm
; link /subsystem:console memcpy.obj
;
.xlist
include \masm32\include\masm32rt.inc
.686
.xmm
include \masm32\macros\timers.asm
.list

MAIN_COUNT = 2
LOOP_COUNT = 1000

memcpy proto :ptr byte, :ptr byte, :dword
memcpyxmm1 proto :ptr byte, :ptr byte, :dword
memcpyxmm2 proto :ptr byte, :ptr byte, :dword
memcpyxmm3 proto :ptr byte, :ptr byte, :dword

.data

align 16
b1 db 4096 dup(?)
b2 db 4096 dup(?)
db 1
b3 db 4096 dup(?)
b4 db 4096 dup(?)

.code
start:
push 1
call ShowCpu ; print brand string and SSE level
print "---------------------------------------------------------", 13, 10

mov ecx,MAIN_COUNT
main_loop:
push ecx

test_start macro
invoke Sleep, 100
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
endm

test_end macro text
counter_end
print str$(eax), 9, text, 13, 10
endm

;----------------------------------------------
if 1
test_start
mov esi,LOOP_COUNT
    @@:
invoke memcpy,addr b1,addr b2,4096
dec esi
jnz @b
test_end "cycles for memcpy A"
endif
;----------------------------------------------

test_start
mov esi,LOOP_COUNT
    @@:
invoke memcpyxmm1,addr b1,addr b2,4096
dec esi
jnz @b
test_end "cycles for memcpy movdqa xmm0 A"

;----------------------------------------------

test_start
mov esi,LOOP_COUNT
    @@:
invoke memcpyxmm2,addr b1,addr b2,4096
dec esi
jnz @b
test_end "cycles for memcpy movdqu xmm0 A"

;----------------------------------------------

test_start
mov esi,LOOP_COUNT
    @@:
invoke memcpyxmm3,addr b1,addr b2,4096
dec esi
jnz @b
test_end "cycles for memcpy movdqu xmm0..xmm7 A"

;----------------------------------------------

test_start
mov esi,LOOP_COUNT
    @@:
invoke memcpyxmm2,addr b3,addr b4,4096
dec esi
jnz @b
test_end "cycles for memcpy movdqu xmm0 U"

;----------------------------------------------

test_start
mov esi,LOOP_COUNT
    @@:
invoke memcpyxmm3,addr b3,addr b4,4096
dec esi
jnz @b
test_end "cycles for memcpy movdqu xmm0..xmm7 U"

;----------------------------------------------

print "---------------------------------------------------------", 13, 10
pop ecx
dec ecx
jz @F
jmp main_loop
      @@:
inkey chr$(13, 10, "--- ok ---", 13)
exit

align 16
memcpy proc uses esi edi s1:ptr byte, s2:ptr byte, count:dword
mov edi,s1
mov esi,s2
mov ecx,count
mov eax,edi
rep movsb
ret
memcpy endp

align 16
memcpyxmm1 proc uses esi edi s1:ptr byte, s2:ptr byte, count:dword
mov edi,s1
mov esi,s2
mov ecx,count
shr ecx,7
mov eax,edi
      @@:
movdqa xmm0,[esi]
movdqa [edi],xmm0
movdqa xmm0,[esi+16]
movdqa [edi+16],xmm0
movdqa xmm0,[esi+32]
movdqa [edi+32],xmm0
movdqa xmm0,[esi+48]
movdqa [edi+48],xmm0
movdqa xmm0,[esi+64]
movdqa [edi+64],xmm0
movdqa xmm0,[esi+80]
movdqa [edi+80],xmm0
movdqa xmm0,[esi+96]
movdqa [edi+96],xmm0
movdqa xmm0,[esi+112]
movdqa [edi+112],xmm0
add esi,128
add edi,128
dec ecx
jnz @B
ret
memcpyxmm1 endp

align 16
memcpyxmm2 proc uses esi edi s1:ptr byte, s2:ptr byte, count:dword
mov edi,s1
mov esi,s2
mov ecx,count
mov eax,ecx
shr eax,7
jz memcpyxmm2_tail
      @@:
movdqu xmm0,[esi]
movdqu [edi],xmm0
movdqu xmm0,[esi+16]
movdqu [edi+16],xmm0
movdqu xmm0,[esi+32]
movdqu [edi+32],xmm0
movdqu xmm0,[esi+48]
movdqu [edi+48],xmm0
movdqu xmm0,[esi+64]
movdqu [edi+64],xmm0
movdqu xmm0,[esi+80]
movdqu [edi+80],xmm0
movdqu xmm0,[esi+96]
movdqu [edi+96],xmm0
movdqu xmm0,[esi+112]
movdqu [edi+112],xmm0
add esi,128
add edi,128
dec eax
jnz @B
    memcpyxmm2_tail:
    and ecx,7Fh
rep movsb
    memcpyxmm2_end:
mov eax,s1
ret
memcpyxmm2 endp

align 16
memcpyxmm3 proc uses esi edi s1:ptr byte, s2:ptr byte, count:dword
mov edi,s1
mov esi,s2
mov ecx,count
shr ecx,7
mov eax,edi
      @@:
movdqu xmm0,[esi]
movdqu xmm1,[esi+16]
movdqu xmm2,[esi+32]
movdqu xmm3,[esi+48]
movdqu xmm4,[esi+64]
movdqu xmm5,[esi+80]
movdqu xmm6,[esi+96]
movdqu xmm7,[esi+112]
movdqu [edi],xmm0
movdqu [edi+16],xmm1
movdqu [edi+32],xmm2
movdqu [edi+48],xmm3
movdqu [edi+64],xmm4
movdqu [edi+80],xmm5
movdqu [edi+96],xmm6
movdqu [edi+112],xmm7
add esi,128
add edi,128
dec ecx
jnz @B
ret
memcpyxmm3 endp

ShowCpu proc ; mode:DWORD
COMMENT @ Usage:
  push 0, call ShowCpu ; simple, no printing, just returns SSE level
  push 1, call ShowCpu ; prints the brand string and returns SSE level@
  pushad
  sub esp, 80 ; create a buffer for the brand string
  mov edi, esp ; point edi to it
  xor ebp, ebp
  .Repeat
  lea eax, [ebp+80000002h]
db 0Fh, 0A2h ; cpuid 80000002h-80000004h
stosd
mov eax, ebx
stosd
mov eax, ecx
stosd
mov eax, edx
stosd
inc ebp
  .Until ebp>=3
  push 1
  pop eax
  db 0Fh, 0A2h ; cpuid 1
  xor ebx, ebx ; CpuSSE
  xor esi, esi ; add zero plus the carry flag
  bt edx, 25 ; edx bit 25, SSE1
  adc ebx, esi
  bt edx, 26 ; edx bit 26, SSE2
  adc ebx, esi
  bt ecx, esi ; ecx bit 0, SSE3
  adc ebx, esi
  bt ecx, 9 ; ecx bit 9, SSE4
  adc ebx, esi
  dec dword ptr [esp+4+32+80] ; dec mode in stack
  .if Zero?
mov edi, esp ; restore pointer to brand string
  .Repeat
.Break .if byte ptr [edi]!=32 ; mode was 1, so show a string but skip leading blanks
inc edi
.Until 0
.if byte ptr [edi]<32
print chr$("pre-P4")
.else
print edi ; CpuBrand
.endif
.if ebx
print chr$(32, 40, "SSE") ; info on SSE level, 40=(
print str$(ebx), 41, 13, 10 ; 41=)
.endif
  .endif
  add esp, 80 ; discard brand buffer (after printing!)
  mov [esp+32-4], ebx ; move ebx into eax stack position - returns eax to main for further use
  ifdef MbBufferInit
call MbBufferInit
  endif
  popad
  ret 4
ShowCpu endp

end start

result:
Quote
AMD Athlon(tm) II X2 245 Processor (SSE3)
---------------------------------------------------------
609815   cycles for memcpy A
608249   cycles for memcpy movdqa xmm0 A
579394   cycles for memcpy movdqu xmm0 A
547453   cycles for memcpy movdqu xmm0..xmm7 A
1175825   cycles for memcpy movdqu xmm0 U
1011253   cycles for memcpy movdqu xmm0..xmm7 U
---------------------------------------------------------
610739   cycles for memcpy A
605121   cycles for memcpy movdqa xmm0 A
580058   cycles for memcpy movdqu xmm0 A
541764   cycles for memcpy movdqu xmm0..xmm7 A
1173293   cycles for memcpy movdqu xmm0 U
1010530   cycles for memcpy movdqu xmm0..xmm7 U
---------------------------------------------------------

Intel(R) Core(TM) i3 CPU    540  @ 3.07GHz (SSE4)
---------------------------------------------------------
449020  cycles for memcpy A
343031  cycles for memcpy movdqa xmm0 A
274136  cycles for memcpy movdqu xmm0 A
270389  cycles for memcpy movdqu xmm0..xmm7 A
481695  cycles for memcpy movdqu xmm0 U
484069  cycles for memcpy movdqu xmm0..xmm7 U
---------------------------------------------------------
417787  cycles for memcpy A
271078  cycles for memcpy movdqa xmm0 A
322979  cycles for memcpy movdqu xmm0 A
270214  cycles for memcpy movdqu xmm0..xmm7 A
427182  cycles for memcpy movdqu xmm0 U
420321  cycles for memcpy movdqu xmm0..xmm7 U
---------------------------------------------------------
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on December 04, 2012, 05:06:14 AM
Most programs these days use some kind of compression (sound, video, images), and hacking the memcpy() function causes problems in this case. Compressed data is often expanded using memcpy(data+size, data, count). If data is 'abcd', size is 1, and count is 4, the expected output is 'aaaaa'. Using movsd to improve speed is then the hack that causes this problem.
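
The 'abcd' example can be played through with a small Python model. It is a sketch only: the two helpers are invented names that imitate a forward byte copy (REP MOVSB style) and a forward copy that moves four bytes per step (REP MOVSD style), both working inside one buffer so that source and destination overlap.

```python
def copy_bytes_forward(buf, dst, src, count):
    """Byte-at-a-time forward copy inside one buffer."""
    for i in range(count):
        buf[dst + i] = buf[src + i]

def copy_dwords_forward(buf, dst, src, count):
    """Four bytes per step; count is assumed to be a multiple of 4."""
    for i in range(0, count, 4):
        chunk = bytes(buf[src + i:src + i + 4])   # load all 4 bytes first
        buf[dst + i:dst + i + 4] = chunk          # then store them

# data = 'abcd' plus one spare byte, size = 1, count = 4
by_bytes = bytearray(b"abcd\x00")
copy_bytes_forward(by_bytes, 1, 0, 4)

by_dwords = bytearray(b"abcd\x00")
copy_dwords_forward(by_dwords, 1, 0, 4)

print(bytes(by_bytes), bytes(by_dwords))   # prints: b'aaaaa' b'aabcd'
```

The byte copy re-reads what it has just written, so the first byte is replicated through the run. The dword copy reads all four source bytes before storing any of them, so the run is lost.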

I think it would be better to create different versions of memcpy[w|d|q], since the user usually knows the type of data he is copying. Making one version handle all cases gets complicated, and all the test code needed makes it rather big and in some cases also slower.

In the test I made above I was wondering whether movdqu was faster than movdqa on aligned data, which seems a bit odd. I rewrote the test code, but the result is still random. The test (at least in this case) shows how little there is to gain from using movsd to improve speed. Using SSE to copy aligned data has some benefits, but it is not a huge improvement.
Code: [Select]
; MEMCPY2.ASM--
; http://www.masm32.com/
;
; make:
; jwasm /coff memcpy2.asm
; link /subsystem:console memcpy2.obj
;
.xlist
include \masm32\include\masm32rt.inc
.686
.xmm
include \masm32\macros\timers.asm
.list

MAIN_COUNT = 2
LOOP_COUNT = 100
MAXMEMORY  = 40000h

memcpy proto :ptr byte, :ptr byte, :dword
memcpyd proto :ptr byte, :ptr byte, :dword
memcpyxmmA proto :ptr byte, :ptr byte, :dword
memcpyxmmU proto :ptr byte, :ptr byte, :dword

.data
a1 dd ?
m1 dd ?
u1 dd ?

.code
start:
invoke GlobalAlloc,GMEM_FIXED,MAXMEMORY+128
mov a1,eax
test eax,eax
jnz @F
exit
      @@:
mov edx,eax
add eax,128-1 ; round up, so the aligned pointer stays inside the block
and eax,not (128-1)
mov m1,eax
inc edx
mov u1,edx

push 1
call ShowCpu ; print brand string and SSE level
print "---------------------------------------------------------", 13, 10

mov ecx,MAIN_COUNT
main_loop:
push ecx

test_start macro
invoke Sleep, 100
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
endm

test_end macro text
counter_end
print str$(eax), 9, text, 13, 10
endm

;----------------------------------------------

test_start
mov esi,LOOP_COUNT
    @@:
invoke crt_memcpy,m1,m1,MAXMEMORY
dec esi
jnz @b
test_end "cycles for crt_memcpy A"

;----------------------------------------------

test_start
mov esi,LOOP_COUNT
    @@:
invoke memcpy,m1,m1,MAXMEMORY
dec esi
jnz @b
test_end "cycles for memcpy A"

;----------------------------------------------

test_start
mov esi,LOOP_COUNT
    @@:
invoke memcpyd,m1,m1,MAXMEMORY
dec esi
jnz @b
test_end "cycles for memcpyd A"

;----------------------------------------------

test_start
mov esi,LOOP_COUNT
    @@:
invoke memcpyxmmA,m1,m1,MAXMEMORY
dec esi
jnz @b
test_end "cycles for memcpy movdqa A"

;----------------------------------------------

test_start
mov esi,LOOP_COUNT
    @@:
invoke memcpyxmmU,m1,m1,MAXMEMORY
dec esi
jnz @b
test_end "cycles for memcpy movdqu A"

;----------------------------------------------

test_start
mov esi,LOOP_COUNT
    @@:
invoke memcpyxmmU,u1,u1,MAXMEMORY
dec esi
jnz @b
test_end "cycles for memcpy movdqu U"

;----------------------------------------------

print "---------------------------------------------------------", 13, 10
pop ecx
dec ecx
jz @F
jmp main_loop
      @@:
      invoke GlobalFree,a1
inkey chr$(13, 10, "--- ok ---", 13)
exit

align 16

memcpy proc uses esi edi s1:ptr byte, s2:ptr byte, count:dword
mov edi,s1
mov esi,s2
mov ecx,count
mov eax,edi
rep movsb
ret
memcpy endp

align 16

memcpyd proc uses esi edi s1:ptr byte, s2:ptr byte, count:dword
mov edi,s1
mov esi,s2
mov ecx,count
shr ecx,2
mov eax,edi
rep movsd
ret
memcpyd endp

align 16

memcpyxmmA proc uses ebx s1:ptr byte, s2:ptr byte, count:dword
mov edx,s1
mov ebx,s2
mov eax,count
neg eax
add eax,127
align 16
      @@:
movdqa xmm0,[ebx]
movdqa xmm1,[ebx+16]
movdqa xmm2,[ebx+32]
movdqa xmm3,[ebx+48]
movdqa xmm4,[ebx+64]
movdqa xmm5,[ebx+80]
movdqa xmm6,[ebx+96]
movdqa xmm7,[ebx+112]
movdqa [edx],xmm0
movdqa [edx+16],xmm1
movdqa [edx+32],xmm2
movdqa [edx+48],xmm3
movdqa [edx+64],xmm4
movdqa [edx+80],xmm5
movdqa [edx+96],xmm6
movdqa [edx+112],xmm7
add ebx,128
add edx,128
add eax,128
jnc @B
mov eax,s1
ret
memcpyxmmA endp

align 16

memcpyxmmU proc uses ebx s1:ptr byte, s2:ptr byte, count:dword
mov edx,s1
mov ebx,s2
mov eax,count
neg eax
add eax,127
jbe memcpyxmmU_16
align 16
      @@:
movdqu xmm0,[ebx]
movdqu xmm1,[ebx+16]
movdqu xmm2,[ebx+32]
movdqu xmm3,[ebx+48]
movdqu xmm4,[ebx+64]
movdqu xmm5,[ebx+80]
movdqu xmm6,[ebx+96]
movdqu xmm7,[ebx+112]
movdqu [edx],xmm0
movdqu [edx+16],xmm1
movdqu [edx+32],xmm2
movdqu [edx+48],xmm3
movdqu [edx+64],xmm4
movdqu [edx+80],xmm5
movdqu [edx+96],xmm6
movdqu [edx+112],xmm7
add ebx,128
add edx,128
add eax,128
jnc @B
    memcpyxmmU_16:
    sub eax,127-15
jns memcpyxmmU_tail
      @@:
movdqu xmm0,[ebx]
movdqu [edx],xmm0
add ebx,16
add edx,16
add eax,16
jnc @B
    memcpyxmmU_tail:
    sub eax,15
jz memcpyxmmU_end
neg eax
    mov ecx,eax
xchg esi,ebx
xchg edi,edx
rep movsb
mov esi,ebx
mov edi,edx
    memcpyxmmU_end:
mov eax,s1
ret
memcpyxmmU endp

ShowCpu proc ; mode:DWORD
COMMENT @ Usage:
  push 0, call ShowCpu ; simple, no printing, just returns SSE level
  push 1, call ShowCpu ; prints the brand string and returns SSE level@
  pushad
  sub esp, 80 ; create a buffer for the brand string
  mov edi, esp ; point edi to it
  xor ebp, ebp
  .Repeat
  lea eax, [ebp+80000002h]
db 0Fh, 0A2h ; cpuid 80000002h-80000004h
stosd
mov eax, ebx
stosd
mov eax, ecx
stosd
mov eax, edx
stosd
inc ebp
  .Until ebp>=3
  push 1
  pop eax
  db 0Fh, 0A2h ; cpuid 1
  xor ebx, ebx ; CpuSSE
  xor esi, esi ; add zero plus the carry flag
  bt edx, 25 ; edx bit 25, SSE1
  adc ebx, esi
  bt edx, 26 ; edx bit 26, SSE2
  adc ebx, esi
  bt ecx, esi ; ecx bit 0, SSE3
  adc ebx, esi
  bt ecx, 9 ; ecx bit 9, SSSE3 (counted as level 4, printed as "SSE4")
  adc ebx, esi
  dec dword ptr [esp+4+32+80] ; dec mode in stack
  .if Zero?
mov edi, esp ; restore pointer to brand string
  .Repeat
.Break .if byte ptr [edi]!=32 ; mode was 1, so show a string but skip leading blanks
inc edi
.Until 0
.if byte ptr [edi]<32
print chr$("pre-P4")
.else
print edi ; CpuBrand
.endif
.if ebx
print chr$(32, 40, "SSE") ; info on SSE level, 40=(
print str$(ebx), 41, 13, 10 ; 41=)
.endif
  .endif
  add esp, 80 ; discard brand buffer (after printing!)
  mov [esp+32-4], ebx ; move ebx into eax stack position - returns eax to main for further use
  ifdef MbBufferInit
call MbBufferInit
  endif
  popad
  ret 4
ShowCpu endp

end start

Quote
AMD Athlon(tm) II X2 245 Processor (SSE3)
---------------------------------------------------------
5436621   cycles for crt_memcpy A
5451494   cycles for memcpy A
5430749   cycles for memcpyd A
5130181   cycles for memcpy movdqa A
5137260   cycles for memcpy movdqu A
9398746   cycles for memcpy movdqu U
---------------------------------------------------------
5424911   cycles for crt_memcpy A
5429803   cycles for memcpy A
5424371   cycles for memcpyd A
5147542   cycles for memcpy movdqa A
5139047   cycles for memcpy movdqu A
9419693   cycles for memcpy movdqu U
---------------------------------------------------------
Intel(R) Core(TM) i3 CPU    540  @ 3.07GHz (SSE4)
---------------------------------------------------------
3768758 cycles for crt_memcpy A
3601358 cycles for memcpy A
3611729 cycles for memcpyd A
3665437 cycles for memcpy movdqa A
3527944 cycles for memcpy movdqu A
4053850 cycles for memcpy movdqu U
---------------------------------------------------------
3910008 cycles for crt_memcpy A
3616456 cycles for memcpy A
3675379 cycles for memcpyd A
4250390 cycles for memcpy movdqa A
3348694 cycles for memcpy movdqu A
4051784 cycles for memcpy movdqu U
---------------------------------------------------------
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on December 04, 2012, 05:11:03 AM
My test for atol:

Code: [Select]
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
--------------------------------------------------------
429117  cycles for atol LODSB
350976  cycles for atol SHL
411231  cycles for atol LEA
--------------------------------------------------------
430242  cycles for atol LODSB
509282  cycles for atol SHL
395102  cycles for atol LEA
--------------------------------------------------------

Well, they are a bit random, as you said.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on December 04, 2012, 05:20:42 AM
Memcpy:

Code: [Select]
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
--------------------------------------------------------
757223  cycles for memcpy A
288975  cycles for memcpy movdqa xmm0 A
1352024 cycles for memcpy movdqu xmm0 A
1367569 cycles for memcpy movdqu xmm0..xmm7 A
5668726 cycles for memcpy movdqu xmm0 U
4563076 cycles for memcpy movdqu xmm0..xmm7 U
--------------------------------------------------------
749649  cycles for memcpy A
302916  cycles for memcpy movdqa xmm0 A
1737163 cycles for memcpy movdqu xmm0 A
1841807 cycles for memcpy movdqu xmm0..xmm7 A
6136384 cycles for memcpy movdqu xmm0 U
4055501 cycles for memcpy movdqu xmm0..xmm7 U
--------------------------------------------------------

and the last routines:

Code: [Select]
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
--------------------------------------------------------
5825483 cycles for crt_memcpy A
7352188 cycles for memcpy A
7269901 cycles for memcpyd A
4146083 cycles for memcpy movdqa A
8700368 cycles for memcpy movdqu A
27212513 cycles for memcpy movdqu U
--------------------------------------------------------
6180618 cycles for crt_memcpy A
9100718 cycles for memcpy A
7028934 cycles for memcpyd A
4151090 cycles for memcpy movdqa A
11923657 cycles for memcpy movdqu A
29665081 cycles for memcpy movdqu U
--------------------------------------------------------

Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on December 04, 2012, 05:35:31 AM
You might find the test we did a couple of years ago interesting:



Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on December 04, 2012, 06:04:11 AM
Strange results..

The unaligned test is somewhat understandable, and there is some consistency in the movdqa function, but the large time gap between the movdqu runs seems odd.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: jj2007 on December 04, 2012, 06:25:32 AM
memcpy?? Looks familiar (http://www.masmforum.com/board/index.php?topic=11454.msg87610#msg87610) :biggrin:
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on December 04, 2012, 06:48:06 AM
memcpy?? Looks familiar (http://www.masmforum.com/board/index.php?topic=11454.msg87610#msg87610) :biggrin:

 :P

Quote
AMD Athlon(tm) II X2 245 Processor (SSE3)
8.972.644   cycles for RtlZeroMemory
10.275.565   cycles for FrkTons
8.971.074   cycles for rep stosd
9.041.781   cycles for movdqa
9.046.257   cycles for movaps
9.074.924   cycles for FrkTons New
8.937.746   cycles for movups
9.068.905   cycles for movupd
6.556.637   cycles for MOVNTDQ

8.967.443   cycles for RtlZeroMemory
10.272.111   cycles for FrkTons
8.964.736   cycles for rep stosd
9.044.940   cycles for movdqa
9.044.951   cycles for movaps
9.077.441   cycles for FrkTons New
8.940.185   cycles for movups
9.072.967   cycles for movupd
6.562.758   cycles for MOVNTDQ
---------------------------------------------
Intel(R) Core(TM) i3 CPU         540  @ 3.07GHz (SSE4)
6.065.640       cycles for RtlZeroMemory
8.193.809       cycles for FrkTons
5.992.196       cycles for rep stosd
8.180.437       cycles for movdqa
8.180.221       cycles for movaps
8.159.784       cycles for FrkTons New
8.147.703       cycles for movups
8.157.409       cycles for movupd
7.240.756       cycles for MOVNTDQ

6.016.674       cycles for RtlZeroMemory
8.220.004       cycles for FrkTons
5.995.494       cycles for rep stosd
8.172.316       cycles for movdqa
8.172.054       cycles for movaps
8.167.484       cycles for FrkTons New
8.286.491       cycles for movups
8.270.679       cycles for movupd
7.305.659       cycles for MOVNTDQ

We either all use the same CPU or stick to the basics then  :lol:

Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on December 04, 2012, 07:18:43 AM

We either all use the same CPU or stick to the basics then  :lol:


Optimization is quite a strange beast indeed, it comes and goes
depending on many [maybe too many] factors.
Nevertheless we can try and find something that we didn't expect :lol:
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: jj2007 on December 04, 2012, 07:36:25 AM
Intel(R) Core(TM) i3 CPU         540  @ 3.07GHz (SSE4)
6.065.640       cycles for RtlZeroMemory
5.992.196       cycles for rep stosd
7.240.756       cycles for MOVNTDQ

It seems Intel is still working on rep stosd..!
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: GabrielRavier on October 27, 2018, 12:47:53 PM
I tried it for fun (haven't even read messages here though so this is prolly bad code) :
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: hutch-- on October 27, 2018, 10:22:17 PM
Gabriel,

I tried to use the 3 algos to see what they did, but I could not get them working. Would it be possible for you to put the algos in a test piece?
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: GabrielRavier on October 28, 2018, 05:16:05 AM
I think I did basically what the original post asked for, which was to convert an array of 4096 dwords to an array of 4096 bytes (src is the 4096 dword array and dst the 4096 byte array)

I'll try making a more flexible algorithm that doesn't assume size nor alignment I guess lol.

The different functions do the same thing, just with different extensions (progressing through them)
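
As a rough illustration of what "the same thing with a different extension" can look like, here is a minimal SSSE3 sketch of the inner step. It is not the attached code. The register choices and the label name next4 are assumptions: esi is assumed to point at the dwords, edi at the bytes, and ecx to hold a dword count that is a non-zero multiple of 4. It also assumes .686 / .xmm and an assembler that knows pshufb.

Code: [Select]
    mov eax, 0C080400h          ; shuffle indices 0, 4, 8, 12 (hypothetical setup)
    movd xmm7, eax              ; upper mask bytes are 0, only the low dword of the result is used
next4:
    movdqu xmm0, [esi]          ; load 4 dwords, no alignment assumed
    pshufb xmm0, xmm7           ; SSSE3: low byte of each dword lands in bytes 0..3
    movd dword ptr [edi], xmm0  ; store the 4 extracted bytes
    add esi, 16
    add edi, 4
    sub ecx, 4
    jnz next4

Building the mask in a register avoids needing a 16-byte aligned constant in the data section.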
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: GabrielRavier on October 28, 2018, 11:57:00 PM
So I made a few new revised ones (attached), which should probably be a lot easier to test.

Basically they do what the original post asked to do, which is to move sz dwords from the array pointed to by src, transform them into bytes, and then put them into the array pointed to by dst.
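
For readers without the attachment, a minimal SSE2 sketch of that operation could look like the following. This is an illustration under stated assumptions, not the attached routine: the name dwordToByteSketch is made up, sz is assumed to be a multiple of 16, and the direction of the arguments follows the description above (dst receives sz bytes, src holds sz dwords).

Code: [Select]
dwordToByteSketch proc uses esi edi dst:ptr byte, src:ptr dword, sz:dword
    mov esi, src
    mov edi, dst
    mov ecx, sz
    shr ecx, 4              ; 16 dwords per iteration (sz assumed to be a multiple of 16)
    jz done
    pcmpeqd xmm7, xmm7      ; all bits set
    psrld xmm7, 24          ; 000000FFh in every dword
  @@:
    movdqu xmm0, [esi]      ; 16 dwords in, no alignment assumed
    movdqu xmm1, [esi+16]
    movdqu xmm2, [esi+32]
    movdqu xmm3, [esi+48]
    pand xmm0, xmm7         ; keep only the low byte of each dword
    pand xmm1, xmm7
    pand xmm2, xmm7
    pand xmm3, xmm7
    packssdw xmm0, xmm1     ; 8 words, values are 0..255 so nothing saturates
    packssdw xmm2, xmm3
    packuswb xmm0, xmm2     ; 16 bytes, original order preserved
    movdqu [edi], xmm0      ; 16 bytes out
    add esi, 64
    add edi, 16
    dec ecx
    jnz @B
  done:
    ret
dwordToByteSketch endp

The masking step is what makes the pack instructions safe here: without it they would saturate any dword above 7FFFh instead of truncating it.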
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: hutch-- on October 29, 2018, 06:26:26 AM
What have you built this with ? I downloaded the latest version of ML.EXE but it throws an error on "movd".

Microsoft (R) Macro Assembler Version 14.15.26730.0
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: K:\asm32\gabriel\2\dwordToByte.asm
K:\asm32\gabriel\2\dwordToByte.asm(90) : error A2070:invalid instruction operands
K:\asm32\gabriel\2\dwordToByte.asm(91) : error A2070:invalid instruction operands
K:\asm32\gabriel\2\dwordToByte.asm(182) : error A2070:invalid instruction operands

OK, that was easy to fix, just added DWORD PTR to the three lines and it builds into an object module with no problems.
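
For illustration only, the change has this shape. The operands on lines 90, 91 and 182 are not shown in this post, so the register and addressing below are a made-up example:

Code: [Select]
; movd xmm0, [esi]              ; no operand size given: the form reported as error A2070 (hypothetical operand)
movd xmm0, dword ptr [esi]      ; explicit DWORD PTR, as added in the fix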

I prototyped the three procedures so it should be callable using normal MASM "invoke".

    dwordToByte      PROTO dst:dword, src:dword, sz:dword
    dwordToByteSSE2  PROTO dst:dword, src:dword, sz:dword
    dwordToByteSSSE3 PROTO dst:dword, src:dword, sz:dword

Now all I need is the data format that calls the three procedures.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: GabrielRavier on October 29, 2018, 11:15:41 PM
I used JWasm, I guess it's more flexible and knows that movd always uses dword ptr for memory operands

Also, I used dword instead of ptr for dst and src lel.

But um dst is a pointer to an array of sz bytes (so its size is sz), src is a pointer to an array of sz dwords (so its size is sz * 4), and sz is the size.
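
To make the data format concrete, a call could be set up roughly like this. The buffer names srcBuf and dstBuf are placeholders, 4096 is the size from the original post, and the prototype is assumed to be the one posted above:

Code: [Select]
.data?
srcBuf dd 4096 dup(?)       ; input: sz dwords (sz * 4 bytes)
dstBuf db 4096 dup(?)       ; output: sz bytes

.code
    ; argument order is dst, src, sz
    invoke dwordToByteSSE2, addr dstBuf, addr srcBuf, 4096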

Also I attached a version that should work with MASM. I also just found Uasm, so I'll prolly be able to make versions for AVX2 and later.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: hutch-- on November 01, 2018, 09:31:56 AM
You will certainly do better with UASM as John has done some very good work there; JWASM was rough around the edges and was not properly MASM compatible. Don't be afraid to have a look at nidud's ASMC either as he has done a lot of good work as well. It is pretty much the case that ML.EXE is the only 32-bit MASM compatible assembler as it is the reference.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: AW on November 01, 2018, 09:39:21 PM
In the Intel 64 and IA-32 Architectures Optimization Reference Manual there is a chapter on Data Gather and Scatter, which is actually what we are talking about here.