The MASM Forum

General => The Laboratory => Topic started by: frktons on November 25, 2012, 02:48:06 AM

Title: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 25, 2012, 02:48:06 AM
Hi everybody.

It has been a long time since the last MASM FOR FUN adventure.

Time to be back?

I hope so, at least for a while.

Let's start  with a question.

I have a buffer of 4096 consecutive dwords, and I'd like to extract the low byte of each dword
and put it in a second buffer.

The operation in itself is not that difficult, but I'd like to do it with SIMD instructions, just to
have a little bit of fun and discover some SSE2/SSE3 instructions.
With SIMD instructions I can work on 8 or 16 bytes at a time, which should speed up the process as well.

I had a look at the Intel manuals, but [as n00bs do] I didn't find any suitable SIMD opcode.

Has anybody gone through this problem and found an SSE2/SSE3 solution?

In pseudocode:


xmm0 = XFDC GFTI DEWA HYTO
pckwhat? eax, xmm0,  n

And we have CIAO in eax.

CIAO
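
To pin down the wanted result before any SIMD is involved, the operation can be written as a few lines of Python. This is only a reference sketch for checking output, not part of the attached source: `low_bytes` is a name made up here, and it assumes the four groups in the pseudocode are consecutive dwords, each written high byte first.

```python
def low_bytes(buf: bytes) -> bytes:
    """Return the low-order byte of every little-endian dword in buf."""
    if len(buf) % 4:
        raise ValueError("buffer length must be a multiple of 4")
    # On x86 the low byte of a dword is the first of its four bytes in memory.
    return buf[0::4]

# The register picture from the post: "XFDC" is the dword whose low byte is "C".
words = [b"XFDC", b"GFTI", b"DEWA", b"HYTO"]
# Reverse each group to get memory order (low byte first).
memory = b"".join(w[::-1] for w in words)
print(low_bytes(memory))   # b'CIAO'
```

Any faster routine can be compared against this result on the same input.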

Title: Re: MASM FOR FUN is back?
Post by: Magnum on November 25, 2012, 03:05:34 AM
Great to see you back.

Title: Re: MASM FOR FUN is back?
Post by: frktons on November 25, 2012, 04:06:39 AM
Quote from: Magnum on November 25, 2012, 03:05:34 AM
Great to see you back.

Thanks, it's good to be back.

With the old instructions I get:


Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE3)
28098   cycles for MOV DL

28009   cycles for MOV DL


--- ok ---


Attached the source and exe.

Frank
Title: Re: MASM FOR FUN is back?
Post by: Magnum on November 25, 2012, 04:18:08 AM
Intel(R) Core(TM)2 Duo CPU     P8600  @ 2.40GHz (SSE4)
8127    cycles for MOV DL

8372    cycles for MOV DL

Does having an extra core really make that much difference ?

Andy

Title: Re: MASM FOR FUN is back?
Post by: frktons on November 25, 2012, 04:29:02 AM
Quote from: Magnum on November 25, 2012, 04:18:08 AM
Intel(R) Core(TM)2 Duo CPU     P8600  @ 2.40GHz (SSE4)
8127    cycles for MOV DL

8372    cycles for MOV DL

Does having an extra core really make that much difference ?

Andy

Well, the processor you are using is about 5 years younger, which
means it runs faster than the old one I'm using.
RAM clock, data bus, cache memory... there are many things that speed things up, indeed.

Frank
Title: Re: MASM FOR FUN is back?
Post by: frktons on November 25, 2012, 04:56:19 AM
I'm not sure if a single SIMD OPCODE is able to accomplish the task, but
probably more than one could do it:

if I have four xmm registers with the dword read from the dword buffer



    movd      xmm0,  [ebx]
    movd      xmm1,  [ebx + 4]   
    movd      xmm2,  [ebx + 8] 
    movd      xmm3,  [ebx + 12] 
   



interleaving the bytes of the xmm registers and leaving out the bytes not
used should be possible.

While I wait for some suggestions, I carry on thinking and reading  :icon_rolleyes:

Frank
Title: Re: MASM FOR FUN is back?
Post by: CommonTater on November 25, 2012, 06:20:01 AM
Quote from: Magnum on November 25, 2012, 04:18:08 AM
Intel(R) Core(TM)2 Duo CPU     P8600  @ 2.40GHz (SSE4)
8127    cycles for MOV DL

8372    cycles for MOV DL

Does having an extra core really make that much difference ?

Andy

Hi Andy...
Multicore will perform better on an OS like Windows that is multi-tasking as some processes are assigned to each core, permitting much better multitasking than on single core machines.

However... inside your application, 1 core or 512 cores won't make much difference unless you are writing simultaneously executing multithreaded code so that some of your tasks can be spread out across different cores.  In that case simply writing two simultaneous threads can (often) significantly increase the speed of your code.
Title: Re: MASM FOR FUN is back?
Post by: frktons on November 25, 2012, 06:57:46 AM
Maybe I've found a solution. I'll try it and test its performance.
If it works, it is a first step towards translating the code the SSE way.


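    mov eax, offset Dest
    mov ebx, offset Source
    mov ecx, 4096/4                 ; four dwords per pass
@@:
    movd mm0, dword ptr [ebx]       ; load four consecutive dwords
    movd mm1, dword ptr [ebx + 4]
    movd mm2, dword ptr [ebx + 8]
    movd mm3, dword ptr [ebx + 12]

    punpcklbw mm0, mm2              ; low bytes of dwords 0 and 2 side by side
    punpcklbw mm1, mm3              ; low bytes of dwords 1 and 3 side by side
    punpcklbw mm0, mm1              ; low dword of mm0 = the four low bytes in order

    movd dword ptr [eax], mm0       ; store the four extracted bytes

    ; loop control (assumed, not shown in the original snippet)
    add ebx, 16
    add eax, 4
    dec ecx
    jnz @B
    emms                            ; leave MMX state before any FPU code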
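
Why three PUNPCKLBW steps end up with the four low bytes in order is easier to see in a small model. The following Python sketch mimics MOVD and PUNPCKLBW on 8-byte register images (lowest byte first); the helper names are made up for illustration and the input string is just sample data.

```python
def punpcklbw(a: bytes, b: bytes) -> bytes:
    """Model of MMX PUNPCKLBW: interleave the low 4 bytes of a and b.

    a and b are 8-byte register images, lowest byte first.
    """
    assert len(a) == 8 and len(b) == 8
    out = bytearray()
    for i in range(4):
        out += bytes([a[i], b[i]])
    return bytes(out)

def movd_load(dword: bytes) -> bytes:
    """Model of MOVD mm, m32: load 4 bytes and zero the upper half."""
    return dword + bytes(4)

src = b"CxxxIxxxAxxxOxxx"   # four dwords, low byte first
mm0, mm1, mm2, mm3 = (movd_load(src[i:i + 4]) for i in (0, 4, 8, 12))

mm0 = punpcklbw(mm0, mm2)   # C A x x x x x x
mm1 = punpcklbw(mm1, mm3)   # I O x x x x x x
mm0 = punpcklbw(mm0, mm1)   # C I A O x x x x
print(mm0[:4])              # the dword that the final MOVD stores: b'CIAO'
```

Pairing dword 0 with dword 2 (and 1 with 3) in the first round is what makes the second round fall into the right order.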



Title: Re: MASM FOR FUN is back?
Post by: frktons on November 25, 2012, 07:16:57 AM
Well, it apparently works with some speed improvement:


Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE3)
---------------------------------------------------
28315   cycles for MOV DL

16479   cycles for MOVD DWORD PTR

28394   cycles for MOV DL

16637   cycles for MOVD DWORD PTR


--- ok ---



I'll see if there is a better method, with fewer SIMD opcodes, to get
better performance.

Frank
Title: Re: MASM FOR FUN is back?
Post by: jj2007 on November 25, 2012, 07:50:00 AM
Ciao Frank,
Welcome to the Forum, and thanks for the message in the other thread ;-)

What you are looking for is pshufb. There is a testpiece by Hutch here (http://www.masmforum.com/board/index.php?topic=15974.0).

You need to add this:
    include \masm32\include\masm32rt.inc
    .686p
    .xmm


... and you need a modern CPU ;-)
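
For anyone without an SSSE3 machine at hand, the effect of PSHUFB can be modelled in a few lines of Python. This is a sketch of the instruction's documented behaviour, not code from the linked testpiece, and the mask shown is an assumption chosen for this particular task.

```python
def pshufb(src: bytes, mask: bytes) -> bytes:
    """Model of SSSE3 PSHUFB on 16-byte registers, lowest byte first.

    For each destination byte i: if bit 7 of mask[i] is set the result
    is 0, otherwise it is src[mask[i] & 0x0F].
    """
    assert len(src) == 16 and len(mask) == 16
    return bytes(0 if m & 0x80 else src[m & 0x0F] for m in mask)

# Hypothetical mask: gather bytes 0, 4, 8 and 12 into the low dword
# and zero the remaining twelve bytes.
MASK = bytes([0, 4, 8, 12] + [0x80] * 12)

src = b"CxxxIxxxAxxxOxxx"   # four dwords, low byte first
print(pshufb(src, MASK)[:4])   # b'CIAO'
```

With such a mask, one 16-byte load, one shuffle and one 4-byte store handle four dwords per pass.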
Title: Re: MASM FOR FUN - REBORN
Post by: frktons on November 25, 2012, 08:04:02 AM
Quote from: jj2007 on November 25, 2012, 07:50:00 AM
Ciao Frank,
Welcome to the Forum, and thanks for the message in the other thread ;-)

What you are looking for is pshufb. There is a testpiece by Hutch here (http://www.masmforum.com/board/index.php?topic=15974.0).

You need to add this:
    include \masm32\include\masm32rt.inc
    .686p
    .xmm


... and you need a modern CPU ;-)

Ciao jj.

I have to switch to my second PC in order to use SSSE3 opcodes; I'll do it as soon
as I'm ready.
What do you think about the solution I used? It is MMX but not very fast, at
least on my P IV dual core 3.2 GHz.

Frank
Title: Re: MASM FOR FUN - REBORN
Post by: jj2007 on November 25, 2012, 09:44:27 AM
Quote from: frktons on November 25, 2012, 08:04:02 AM
What do you think about the solution I used?

Not fast indeed. PSHUFB must be much better, but I can't test it here...

See http://www.rz.uni-karlsruhe.de/rz/docs/VTune/reference/About_IA-32_Instructions.htm#P-Instructions for some useful info.

Did you try the straightforward solution?

include \masm32\include\masm32rt.inc
.data
src   db "xxxCxxxIxxxAxxxO", 0
dest   db 20 dup(?)
.code
start:
   mov esi, offset src+3
   mov edi, offset dest
   REPEAT 4
      lodsd
      stosb
   ENDM
   inkey offset dest
   exit
end start
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 25, 2012, 11:52:26 AM
I've started some tests on PSHUFB.
In the meantime, here is your proposal along with
the previous ones. The MMX code still leads.


Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
---------------------------------------------------------
12412   cycles for MOV DL
6199    cycles for MMX/MOVD DWORD PTR
21114   cycles for STOSB
---------------------------------------------------------
12401   cycles for MOV DL
6235    cycles for MMX/MOVD DWORD PTR
21137   cycles for STOSB
---------------------------------------------------------

--- ok ---


Frank
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: dedndave on November 25, 2012, 11:56:38 AM
hiyas Frank - good to see you   :t

this might have fewer dependencies...

@@:     mov     edx,[ebx]
        add     ebx,4
        mov     [eax],dl
        dec     ecx
        lea     eax,[eax+1]
        jnz     @B
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 25, 2012, 12:22:09 PM
Quote from: dedndave on November 25, 2012, 11:56:38 AM
hiyas Frank - good to see you   :t

this might have fewer dependencies...

@@:     mov     edx,[ebx]
        add     ebx,4
        mov     [eax],dl
        dec     ecx
        lea     eax,[eax+1]
        jnz     @B


Hi Dave. Nice to see you too.
The sequence you have used makes me wonder about the JNZ:


        dec     ecx
        lea     eax,[eax+1]
        jnz     @B

does the jnz refer to ECX
or to EAX?


By the way, I inserted your code inside the pgm, but I get strange results:




invoke Sleep, 100
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS

mov eax, offset Dest
mov ecx, 4096
mov ebx, offset Source

    @@:     mov     edx,[ebx]
            add     ebx,4
            mov     [eax],dl
            dec     ecx           
            lea     eax,[eax+1]           

            jnz     @B
           
            print str$(eax), 9, "cycles for DAVE ", 13, 10, 13, 10



and:


Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
---------------------------------------------------------
12396   cycles for MOV DL
6193    cycles for MMX/MOVD DWORD PTR
21098   cycles for STOSB
20677248        cycles for DAVE

---------------------------------------------------------
12394   cycles for MOV DL
6207    cycles for MMX/MOVD DWORD PTR
21093   cycles for STOSB
20677248        cycles for DAVE

---------------------------------------------------------

--- ok ---


Did I misspell something, or what?



Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: dedndave on November 25, 2012, 12:45:29 PM
nope - i just write bad code   :lol:

LEA does not affect the flags, so
        lea     eax,[eax+1]
adds one to EAX without altering the flags that were set by DEC ECX
the idea was to put something in between the instruction that sets the flags and the one that examines them
but, LEA is not a great performer on older CPU's

still, it shouldn't be that slow - lol
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: dedndave on November 25, 2012, 12:52:01 PM
i got a slight improvement on my p4 prescott

counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
    mov eax, offset Dest
    mov ecx, 4096
    mov ebx, offset Source

@@:     mov     edx,[ebx]
        add     ebx,4
        mov     [eax],dl
        dec     ecx
        lea     eax,[eax+1]
        jnz     @B

; @@:
;     mov edx, [ebx]
;     mov byte ptr [eax], dl
;     add eax, 1
;     add ebx, 4
;     dec ecx
;     jnz @B
counter_end
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: sinsi on November 25, 2012, 01:34:26 PM
Maybe try

  mov dl,[ebx]
  mov dh,[ebx+4]
  mov [eax],dx

Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: hutch-- on November 25, 2012, 01:44:48 PM
Dave,

It was only the PIV that was a poor performer with LEA, PIII and earlier and Core2 onwards are fine.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: dedndave on November 25, 2012, 01:51:21 PM
i knew it was something like that, Hutch - lol

sinsi has the right idea, i think...
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
    mov ebx, offset Source
    mov eax, offset Dest
    mov ecx, 4096/4

@@:     mov     dh,[ebx+12]
        mov     dl,[ebx+8]
        shl     edx,16
        mov     dh,[ebx+4]
        mov     dl,[ebx]
        add     ebx,16
        mov     [eax],edx
        dec     ecx
        lea     eax,[eax+4]
        jnz     @B

counter_end


it would help if the destination array is 4-aligned - maybe the source, too
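
The byte packing in that loop can be checked with plain integer arithmetic. The Python sketch below follows the same steps (DH/DL, shift by 16, DH/DL again); `pack_low_bytes` is a made-up name and the four input dwords are arbitrary sample values.

```python
def pack_low_bytes(d0: int, d1: int, d2: int, d3: int) -> int:
    """Combine the low bytes of four dwords into one dword."""
    edx = ((d3 & 0xFF) << 8) | (d2 & 0xFF)    # mov dh,[ebx+12] / mov dl,[ebx+8]
    edx = (edx << 16) & 0xFFFFFFFF            # shl edx,16
    edx |= ((d1 & 0xFF) << 8) | (d0 & 0xFF)   # mov dh,[ebx+4]  / mov dl,[ebx]
    return edx

# Sample dwords whose low bytes are 'C', 'I', 'A', 'O'.
value = pack_low_bytes(0x11223343, 0x55667749, 0x99AABB41, 0xCCDDEE4F)
print(hex(value), value.to_bytes(4, "little"))   # 0x4f414943 b'CIAO'
```

Storing the result as one dword puts the low byte of the first source dword at the lowest address, which is the order wanted in Dest.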
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on November 25, 2012, 07:52:55 PM
deleted
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: jj2007 on November 25, 2012, 09:04:52 PM
So on your puter Sinsi's solution is clearly the fastest. That's what I suspected ;-)

Not on mine, however:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
9968    cycles for MOV AX
9622    cycles for LEA
5181    cycles for MMX/MOVD DWORD PTR
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 25, 2012, 10:19:15 PM
On a newer machine there is no contest:


Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
---------------------------------------------------------
13874   cycles for MOV AX
13124   cycles for LEA
6193    cycles for MMX/MOVD DWORD PTR
18486   cycles for STOSB
---------------------------------------------------------
13856   cycles for MOV AX
13087   cycles for LEA
4129    cycles for MMX/MOVD DWORD PTR
18516   cycles for STOSB
---------------------------------------------------------

--- ok ---



later the pshufb solution that should win the race.

Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 25, 2012, 11:15:09 PM
Here it is, the first quick shot with PSHUFB:


Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
---------------------------------------------------------
13866   cycles for MOV AX
13084   cycles for LEA
6205    cycles for MMX/MOVD DWORD PTR
4848    cycles for PSHUFB / I shot
18487   cycles for STOSB
---------------------------------------------------------
13852   cycles for MOV AX
13083   cycles for LEA
6194    cycles for MMX/MOVD DWORD PTR
4730    cycles for PSHUFB / I shot
18518   cycles for STOSB
---------------------------------------------------------

--- ok ---


later I'll try to improve it.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on November 26, 2012, 01:13:49 AM
deleted
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 26, 2012, 02:19:29 AM
Some more tests:


Intel(R) Core(TM)2 CPU  6600  @ 2.40GHz (SSSE3)
---------------------------------------------------------
13927   cycles for MOV AX
13097   cycles for LEA
6203    cycles for MMX/PUNPCKLBW
4729    cycles for XMM/PSHUFB - I shot
3518    cycles for XMM/PSHUFB - II shot
15364   cycles for XMM/MASKMOVDQU - I shot
18506   cycles for STOSB
---------------------------------------------------------
13868   cycles for MOV AX
13096   cycles for LEA
6198    cycles for MMX/PUNPCKLBW
4732    cycles for XMM/PSHUFB - I shot
3520    cycles for XMM/PSHUFB - II shot
15360   cycles for XMM/MASKMOVDQU - I shot
18503   cycles for STOSB
---------------------------------------------------------

--- ok ---
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: jj2007 on November 26, 2012, 03:25:54 AM
Ciao Frank,
Apart from being slow, maskmovdqu does not do what you want:

source     xxxCxxxIxxxAxxxO
wanted     CIAO
effective     C   I   A   O
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on November 26, 2012, 04:34:28 AM
deleted
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 26, 2012, 06:12:38 AM
Quote from: jj2007 on November 26, 2012, 03:25:54 AM
Ciao Frank,
Apart from being slow, maskmovdqu does not do what you want:

source     xxxCxxxIxxxAxxxO
wanted     CIAO
effective     C   I   A   O


If the mask is correctly set, maskmovdqu should do the job:

F0h = byte to move, 00h = byte not moved, according to the Intel docs:


The most significant bit in each byte of the mask operand determines whether the
corresponding byte in the source operand is written to the corresponding byte location
in memory: 0 indicates no write and 1 indicates write.



At least my previous test showed it can do the job, but maybe I didn't try it enough. ::)

Edit: it only works fine with consecutive bytes; probably, as you said, it is not the one I need,
besides being slow.
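
The difference between "wanted" and "effective" follows from how MASKMOVDQU is defined: each selected byte is written at its own offset, so nothing gets packed together. A minimal Python model of that behaviour, with a made-up helper name and a dot-filled buffer standing in for Dest:

```python
def maskmovdqu(memory: bytearray, addr: int, src: bytes, mask: bytes) -> None:
    """Model of MASKMOVDQU: conditionally store the 16 bytes of src at addr.

    Byte i is written to memory[addr + i] only if bit 7 of mask[i] is set.
    Selected bytes keep their position; nothing is compacted.
    """
    assert len(src) == 16 and len(mask) == 16
    for i in range(16):
        if mask[i] & 0x80:
            memory[addr + i] = src[i]

dest = bytearray(b"." * 16)
src = b"xxxCxxxIxxxAxxxO"
mask = bytes([0, 0, 0, 0xF0] * 4)   # select every fourth byte
maskmovdqu(dest, 0, src, mask)
print(dest.decode())                # ...C...I...A...O
```

The four letters land at offsets 3, 7, 11 and 15, so a separate packing step would still be needed.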
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 26, 2012, 06:17:17 AM
Quote from: nidud on November 26, 2012, 04:34:28 AM

Intel(R) Core(TM) i3-2367M CPU @ 1.40GHz (SSE4)
---------------------------------------------------------
6334    cycles for MOV AX
5171    cycles for LEA
3123    cycles for MMX/MOVD DWORD PTR
2189    cycles for PSHUFB / I shot
10503   cycles for STOSB
---------------------------------------------------------
5243    cycles for MOV AX
5488    cycles for LEA
3150    cycles for MMX/MOVD DWORD PTR
2060    cycles for PSHUFB / I shot
9276    cycles for STOSB
---------------------------------------------------------


These SIMD instructions work a lot better with modern tech.
Try the last version only on I3.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on November 26, 2012, 06:52:59 AM
deleted
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on November 26, 2012, 07:28:02 AM
deleted
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 26, 2012, 08:01:46 AM
Quote from: nidud on November 26, 2012, 07:28:02 AM
With adjustment for the loop (mov ecx,4096/16):
---------------------------------------------------------
1125    cycles for XMM/PSHUFB - I shot
1088    cycles for XMM/PSHUFB - II shot
---------------------------------------------------------
1124    cycles for XMM/PSHUFB - I shot
1140    cycles for XMM/PSHUFB - II shot
---------------------------------------------------------


In the first shot each iteration processes 4 dwords, so the count is 4096/4
and cannot be 4096/16.
The second one works on 16 dwords at a time, so it is 4096/16.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on November 26, 2012, 08:19:47 AM
deleted
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 26, 2012, 08:57:04 AM
I'm going to prepare a real test, with some data to make the masks
a little bit more accurate. They are not tested for the time being, and
were used just to have an idea of their performances.

After testing on real data and adjusting the masks accordingly, the
test could be considered valid.

Up to now I've worked on uninitialized data, so there is no way to
know if the sequence of bits/bytes in the masks is correct.  ::)
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on November 26, 2012, 09:43:40 AM
deleted
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 26, 2012, 10:09:04 AM
Quote from: nidud on November 26, 2012, 09:43:40 AM
I think it does what it is supposed to do

Well, a good test should start with 4096 dwords initialized with 00000001h,
then run each routine on them, checking that at the end the Dest buffer holds all 01h;
to verify it you could use your routine.
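
A variation on that test plan, offered as a sketch only: with a constant 00000001h pattern, a routine that writes the right bytes in the wrong order would still pass, so a source whose low byte changes from dword to dword catches more. The Python below builds such a pattern and checks a result against it; the filler bytes AAh/BBh/CCh and all function names are assumptions for illustration.

```python
def make_source(count: int = 4096) -> bytes:
    """Build `count` dwords whose low byte is i & 0xFF and whose upper
    three bytes are filler that must never reach the output."""
    return b"".join(bytes([i & 0xFF, 0xAA, 0xBB, 0xCC]) for i in range(count))

def expected_dest(count: int = 4096) -> bytes:
    """The byte buffer a correct routine should produce."""
    return bytes(i & 0xFF for i in range(count))

def check(dest: bytes, count: int = 4096) -> bool:
    """True if dest matches the expected low-byte sequence."""
    return dest == expected_dest(count)

source = make_source()
good = source[0::4]   # correct extraction: byte 0 of each dword
bad = source[1::4]    # a buggy routine that picks byte 1 instead
print(check(good), check(bad))   # True False
```

The same idea carries over to MASM by filling Source in a loop and comparing Dest against a counter.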
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: jj2007 on November 26, 2012, 10:29:39 AM
Hi Frank,

Here is a testfile. The exe shows it, *.asc is the source in RTF/RichMasm format.

Hope it helps,
Jochen
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on November 26, 2012, 11:54:27 AM
deleted
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 27, 2012, 05:49:25 AM
Quote from: nidud on November 26, 2012, 11:54:27 AM
This was implemented using macros for each test. You need to reset the byte buffer (Dest) for each test, using 0 if source is 1, or 1 if source is 0 (as in this case).

Yes nidud, thanks.

Quote from: jj2007 on November 26, 2012, 10:29:39 AM
Hi Frank,

Here is a testfile. The exe shows it, *.asc is the source in RTF/RichMasm format.

Hope it helps,
Jochen

Thanks Jochen, your help is always welcome.

I'll give it a look as soon as I finish a couple of preliminary
things I'm working on.

Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 27, 2012, 09:15:37 AM
Is there an opcode to compare two xmm registers to verify
if they have the same content?

Again, SIMD instructions are a bit tricky for simple operations.

Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: qWord on November 27, 2012, 09:24:01 AM
See the CMPxxx, COMxxx and PCMPxxx instructions: AMD64 Architecture Programmer's Manual Volume 4: 128-bit and 256 bit media instructions (http://support.amd.com/us/Processor_TechDocs/26568_APM_v4.pdf)
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 27, 2012, 09:37:57 AM
Quote from: qWord on November 27, 2012, 09:24:01 AM
See the CMPxxx, COMxxx and PCMPxxx instructions: AMD64 Architecture Programmer's Manual Volume 4: 128-bit and 256 bit media instructions (http://support.amd.com/us/Processor_TechDocs/26568_APM_v4.pdf)

Yes qWord,

Let's assume I use:


   PCMPEQD xmm0,xmm1


considering this and the others don't affect the flags,
how do I jmp somewhere after the test?
If they are equal or not, what tells me that?

PTEST affects the Zero Flag, but the opcode is out of my league (SSE4.1).

Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 27, 2012, 10:17:09 AM
The first correct test for SSE instructions with proc to check the results:


Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
---------------------------------------------------------
13862   cycles for MOV AX - Test OK
13114   cycles for LEA - Test OK
6195    cycles for MMX/PUNPCKLBW - Test OK
3157    cycles for XMM/PSHUFB - I shot - Test OK
2375    cycles for XMM/PSHUFB - II shot - Test OK
12327   STOSB - Test OK
---------------------------------------------------------
9238    cycles for MOV AX - Test OK
8723    cycles for LEA - Test OK
4130    cycles for MMX/PUNPCKLBW - Test OK
3150    cycles for XMM/PSHUFB - I shot - Test OK
2375    cycles for XMM/PSHUFB - II shot - Test OK
16701   STOSB - Test OK
---------------------------------------------------------

--- ok ---


Attached last version.

Enjoy
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on November 27, 2012, 11:12:38 AM
deleted
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 27, 2012, 11:25:12 AM
Quote from: nidud on November 27, 2012, 11:12:38 AM
Seems to be possible to compare the low 8 bytes:
COMISD dest,source

The destination operand is an XMM register.
The source can be either an XMM register or a memory location.

The flags are set according to the following rules:
Result Flags  Values
Unordered ZF,PF,CF  111
Greater than ZF,PF,CF  000
Less than ZF,PF,CF  001
Equal ZF,PF,CF  100


Maybe it's possible to shift (or rotate) the regs and then compare the high 8 bytes?

Probably there are many ways to do it in more than 1 step.
I'm trying to find a single SIMD instruction, like PTEST, for the task
included in level SSE3.
Some more checking and I'll see.

Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: jj2007 on November 27, 2012, 11:27:06 AM
Quote from: frktons on November 27, 2012, 09:37:57 AM
Let's assume I use:


   PCMPEQD xmm0,xmm1


considering this and the others don't affect the flags,
how do I jmp somewhere after the test?
If they are equal or not, what tells me that?

psubd xmm0, xmm1
pmovmskb eax, xmm0   ; set byte mask in eax
test eax, eax
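
One caveat with the subtract-and-mask idea, going by the instruction definitions: PMOVMSKB only collects bit 7 of each byte, so a nonzero difference whose bytes are all below 80h gives a mask of 0 and looks like a match. The Python model below shows the case, next to a compare-based variant (PCMPEQB, PMOVMSKB, then compare the mask with 0FFFFh) that is offered here as a sketch, not as anyone's posted code.

```python
def pmovmskb(reg: bytes) -> int:
    """Model of PMOVMSKB: collect bit 7 of each of the 16 bytes."""
    return sum(((b >> 7) & 1) << i for i, b in enumerate(reg))

def psubd(a: bytes, b: bytes) -> bytes:
    """Model of PSUBD: four independent 32-bit subtractions, wrapping."""
    out = bytearray()
    for i in range(0, 16, 4):
        x = int.from_bytes(a[i:i + 4], "little")
        y = int.from_bytes(b[i:i + 4], "little")
        out += ((x - y) & 0xFFFFFFFF).to_bytes(4, "little")
    return bytes(out)

def pcmpeqb(a: bytes, b: bytes) -> bytes:
    """Model of PCMPEQB: 0xFF where bytes match, 0x00 where they differ."""
    return bytes(0xFF if x == y else 0x00 for x, y in zip(a, b))

def equal_by_sub(a: bytes, b: bytes) -> bool:
    # psubd / pmovmskb / test eax,eax
    return pmovmskb(psubd(a, b)) == 0

def equal_by_cmp(a: bytes, b: bytes) -> bool:
    # pcmpeqb / pmovmskb / cmp eax,0FFFFh
    return pmovmskb(pcmpeqb(a, b)) == 0xFFFF

want = bytes([0x32] * 16)
got = bytes([0x33] + [0x32] * 15)   # one wrong byte, difference of 1
print(equal_by_sub(got, want), equal_by_cmp(got, want))   # True False
```

The subtract version reports the two registers as equal even though one byte differs; the compare version does not.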
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 27, 2012, 11:48:54 AM
Quote from: jj2007 on November 27, 2012, 11:27:06 AM
Quote from: frktons on November 27, 2012, 09:37:57 AM
Let's assume I use:


   PCMPEQD xmm0,xmm1


considering this and the others don't affect the flags,
how do I jmp somewhere after the test?
If they are equal or not, what tells me that?

psubd xmm0, xmm1
pmovmskb eax, xmm0   ; set byte mask in eax
test eax, eax

Thanks Jochen, I'll arrange a new algo to test with your
suggestion.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 27, 2012, 09:37:20 PM
I wrote a new CheckDestX PROC to use Jochen's suggestion:

; -----------------------------------------------------------------------------------------------
CheckDestX proc

    lea eax, Dest
    mov ebx, 32323232h
   
    mov ecx, (4096/16)

    movd xmm0, ebx
    pshufd xmm0, xmm0, 0

@@:

    movdqa xmm1, [eax]

    psubd xmm1, xmm0
    pmovmskb edx, xmm1   ; set byte mask in edx
    test edx, edx   

    jne CheckErr
   
       
    add eax, 16
    dec ecx
    jnz @B

CheckOK:

    lea eax, Check
    movq mm0, qword ptr TestOK
    movq qword ptr [eax], mm0
    jmp  EndCheck

CheckErr:

    lea eax, Check
    movq mm0, qword ptr TestERR
    movq qword ptr [eax], mm0

EndCheck:

    ret

CheckDestX endp


It gives the same results as the CheckDest PROC and
is probably quite fast, but I haven't tested its performance yet.

But I'm still not satisfied with the CPUID results:

Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
---------------------------------------------------------
13876   cycles for MOV AX - Test OK
8740    cycles for LEA - Test OK
4131    cycles for MMX/PUNPCKLBW - Test OK
3153    cycles for XMM/PSHUFB - I shot - Test OK
2376    cycles for XMM/PSHUFB - II shot - Test OK
12336   STOSB - Test OK
---------------------------------------------------------
9242    cycles for MOV AX - Test OK
8731    cycles for LEA - Test OK
4131    cycles for MMX/PUNPCKLBW - Test OK
3153    cycles for XMM/PSHUFB - I shot - Test OK
2376    cycles for XMM/PSHUFB - II shot - Test OK
12330   STOSB - Test OK
---------------------------------------------------------

--- ok ---


This time I've used PrintCpu and the MasmBasic include,
but the results are still not accurate. My PC has SSSE3
capability, not SSE4.

Only Alex's code, which I used a couple of years ago, gives
a more accurate result:

┌─────────────────────────────────────────────────────────────[27-Nov-2012 at 10:57 GMT]─┐
│OS  : Microsoft Windows 7 Ultimate Edition, 64-bit Service Pack 1 (build 7601)          │
│CPU : Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz with 2 logical core(s) with SSSE3           │


I've read the thread about the CPUID code, but didn't find anything new.
Should I still use Alex's code, or is there a more accurate routine for modern
CPUs?

Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: dedndave on November 27, 2012, 10:16:07 PM
CPU's may have changed a lot
but, operating systems change at a slower rate   :biggrin:
i have a p4, which supports SSE3, running XP
XP does not support AVX instructions, nor does vista, as far as i know

our CPUID programs don't have to be updated very often, either - lol
while we might detect AVX support on a CPU (pretty easy),
it is another thing to judge the level of support offered by the OS (not so easy)

i would guess 97% of the ibm-compatible pc's in use today probably support SSE2
if you go any higher than SSE2, it might be a good idea to provide a fallback routine
it depends on what range of platforms you want your program to run on
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 27, 2012, 10:37:52 PM
Quote from: dedndave on November 27, 2012, 10:16:07 PM
CPU's may have changed a lot
but, operating systems change at a slower rate   :biggrin:
i have a p4, which supports SSE3, running XP
XP does not support AVX instructions, nor does vista, as far as i know

our CPUID programs don't have to be updated very often, either - lol
while we might detect AVX support on a CPU (pretty easy),
it is another thing to judge the level of support offered by the OS (not so easy)

i would guess 97% of the ibm-compatible pc's in use today probably support SSE2
if you go any higher than SSE2, it might be a good idea to provide a fallback routine
it depends on what range of platforms you want your program to run on

Yes Dave, the reasoning is quite fair.
I'm talking about the incorrect data shown by old routines
while we have newer routines, like Alex's one, that are more
accurate, even if they don't go above SSE4.x.
Jochen's library is quite up to date and uses many SSE opcodes [I imagine],
but the PrintCpu macro [I think] should be updated to be more
correct; it doesn't matter if it doesn't cover the latest AVX code or the like.

Well, it is just my opinion, of course. Even the CPUID utility that Intel gives us
http://downloadcenter.intel.com/Detail_Desc.aspx?DwnldID=7838
doesn't show that my PC has SSSE3 capabilities, but at least it doesn't say I have
SSE4.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: dedndave on November 27, 2012, 10:59:01 PM
oh - i see what you mean
well - there have been a few that report erroneously
but, to programatically determine if a specific extension is supported is pretty easy
i.e., i wouldn't use "Alex's" or "Jochen's" or even "Dave's" routine
their purpose is to identify the CPU and capabilities, primarily for forum comparisons

that is a different function than identifying extension support for a program to select routines
what you want to do is actually much simpler   :t
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: dedndave on November 27, 2012, 11:33:13 PM
;               0_1 values come from CPUID function 1
;               8_1 values come from CPUID function 80000001h
;
;                Source        Description
;
;                0_1edx:23     MMX
;                8_1edx:22     MMX+    (AMD only)
;                8_1edx:31     3DNow!  (AMD only)
;                8_1edx:30     3DNow!+ (AMD only)
;                0_1edx:25     SSE
;                0_1edx:26     SSE2
;                0_1ecx:00     SSE3
;                0_1ecx:09     SSSE3
;                0_1ecx:19     SSE4.1
;                0_1ecx:20     SSE4.2
;                8_1ecx:06     SSE4a   (AMD only)
;                8_1ecx:11     SSE5    (AMD only) - this became one of the AVX feature bits


you can get most of what you want to know by examining ECX and EDX after this...
        mov     eax,1
        cpuid

for example, ECX bit 0 will be 1 if SSE3 is supported
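
The bit tests themselves are just shifts and masks. As a sketch, here is the function-1 part of the table above decoded in Python; the ECX/EDX values are made up for illustration (Python cannot execute CPUID itself), and `decode_features` is a hypothetical helper name.

```python
# (name, register, bit) entries for CPUID function 1, taken from the table above.
FEATURES = [
    ("MMX",    "edx", 23),
    ("SSE",    "edx", 25),
    ("SSE2",   "edx", 26),
    ("SSE3",   "ecx", 0),
    ("SSSE3",  "ecx", 9),
    ("SSE4.1", "ecx", 19),
    ("SSE4.2", "ecx", 20),
]

def decode_features(ecx: int, edx: int) -> list:
    """Return the names of the features whose bit is set."""
    regs = {"ecx": ecx, "edx": edx}
    return [name for name, reg, bit in FEATURES if (regs[reg] >> bit) & 1]

# Made-up register values with the bits up to SSSE3 set.
ecx = (1 << 0) | (1 << 9)
edx = (1 << 23) | (1 << 25) | (1 << 26)
print(", ".join(decode_features(ecx, edx)))   # MMX, SSE, SSE2, SSE3, SSSE3
```

In MASM the same check is a `test ecx, 1 shl 9` (or `bt ecx, 9`) after the CPUID call, followed by a conditional jump to the fallback routine.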
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 27, 2012, 11:40:19 PM
Thanks Dave.

CPUID is still an unknown land, I've never been in those bit-area.
Your introduction to the matter looks interesting, I'll give it a try.  :t
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: dedndave on November 27, 2012, 11:41:13 PM
i updated it a little Frank - you may want to reload the page   :P

oh - and you have to use .586 or higher  to use CPUID   :t
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 27, 2012, 11:46:28 PM
I SEE SSE on the SEASHORE  :icon_eek: 8)
Good to know.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: dedndave on November 27, 2012, 11:50:06 PM
say that 5 times real fast   :lol:
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 28, 2012, 12:59:33 AM
Quote from: jj2007 on November 27, 2012, 11:27:06 AM

psubd xmm0, xmm1
pmovmskb eax, xmm0 ; set byte mask in eax
test eax, eax



This code is a little bit faster on my Core 2 duo:

    psubd xmm1, xmm0
    pmovmskb edx, xmm1   ; set byte mask in dx
    cmp   dx, 0 


Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on November 28, 2012, 02:01:18 AM
deleted
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 28, 2012, 07:26:30 AM
Well nidud  :t

this seems to work as well as psubd, at the same performance.
So we have a couple of alternatives, at least.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: habran on November 28, 2012, 08:50:00 AM
nidud's code:

Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (SSE4)
---------------------------------------------------------
2988    cycles for XMM/pcmpeqd
3004    cycles for XMM/psubd
---------------------------------------------------------
2987    cycles for XMM/pcmpeqd
3012    cycles for XMM/psubd
---------------------------------------------------------
2978    cycles for XMM/pcmpeqd
3001    cycles for XMM/psubd
---------------------------------------------------------

--- ok ---
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 28, 2012, 09:48:05 AM

----------------------------------------------------
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
----------------------------------------------------
9242    cycles for MOV AX - Test OK
8731    cycles for LEA - Test OK
4144    cycles for MMX/PUNPCKLBW - Test OK
3158    cycles for XMM/PSHUFB - I shot - Test OK
2368    cycles for XMM/PSHUFB - II shot - Test OK
12328   cycles for STOSB - Test OK
2070    cycles for CheckDest - Test OK
547     cycles for CheckDestC - Test OK
544     cycles for CheckDestX - Test OK
----------------------------------------------------
9241    cycles for MOV AX - Test OK
8728    cycles for LEA - Test OK
4130    cycles for MMX/PUNPCKLBW - Test OK
3153    cycles for XMM/PSHUFB - I shot - Test OK
2379    cycles for XMM/PSHUFB - II shot - Test OK
12335   cycles for STOSB - Test OK
2069    cycles for CheckDest - Test OK
548     cycles for CheckDestC - Test OK
543     cycles for CheckDestX - Test OK
----------------------------------------------------


CheckDestC is nidud's code modified. For the CPU and SSE level
I used Alex's routine.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on November 28, 2012, 05:53:33 PM
deleted
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: habran on November 28, 2012, 08:02:29 PM
nidud's latest code produces this on my laptop:


---------------------------------------------------------
Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
---------------------------------------------------------
6675    cycles for STOSB - Test OK
4240    cycles for LEA - Test OK
3353    cycles for MOV DX - Test OK
3276    cycles for MOV AX - Test OK
1924    cycles for MMX/PUNPCKLBW - Test OK
1213    cycles for XMM/PSHUFB - xmm0,xmm1 - Test OK
832     cycles for XMM/PSHUFB - I shot - Test OK
1539    cycles for XMM/PSHUFB - II shot - Test OK
---------------------------------------------------------
6093    cycles for STOSB - Test OK
3806    cycles for LEA - Test OK
3403    cycles for MOV DX - Test OK
3277    cycles for MOV AX - Test OK
1945    cycles for MMX/PUNPCKLBW - Test OK
808     cycles for XMM/PSHUFB - xmm0,xmm1 - Test OK
904     cycles for XMM/PSHUFB - I shot - Test OK
1490    cycles for XMM/PSHUFB - II shot - Test OK
---------------------------------------------------------
6289    cycles for STOSB - Test OK
3805    cycles for LEA - Test OK
3668    cycles for MOV DX - Test OK
3684    cycles for MOV AX - Test OK
3044    cycles for MMX/PUNPCKLBW - Test OK
888     cycles for XMM/PSHUFB - xmm0,xmm1 - Test OK
832     cycles for XMM/PSHUFB - I shot - Test OK
901     cycles for XMM/PSHUFB - II shot - Test OK
---------------------------------------------------------
6289    cycles for STOSB - Test OK
3805    cycles for LEA - Test OK
3240    cycles for MOV DX - Test OK
3255    cycles for MOV AX - Test OK
2527    cycles for MMX/PUNPCKLBW - Test OK
833     cycles for XMM/PSHUFB - xmm0,xmm1 - Test OK
832     cycles for XMM/PSHUFB - I shot - Test OK
858     cycles for XMM/PSHUFB - II shot - Test OK
---------------------------------------------------------

--- ok ---
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 28, 2012, 10:51:37 PM
Quote from: nidud on November 28, 2012, 05:53:33 PM
I rewrote the test file with a common loop count for all tests to even the result. I was wondering if using xmm0 register might be faster than xmm1, but the test seems to have random results, at least on this machine.

With regards to using pcmpeqd or psubd , I think the last one would be the better choice since this returns 0.

Edit: renamed test_pshufb to test_pshufb0

Since you changed the structure of some routines, the results are
quite a lot different.
I still don't understand the logic of comparing two XMM registers with PSUBD.
If they are equal, the result is zero, and after PMOVMSKB the final result
register can be tested for zero.
But what happens if the source register is 1 greater than the destination one?
Does PMOVMSKB detect the difference or not? According to what I've
got up to now, it shouldn't.  ::)
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: jj2007 on November 28, 2012, 11:05:21 PM
Quote from: frktons on November 28, 2012, 10:51:37 PM
Does PMOVMSKB detect the difference or not?

It does. Launch some tests with Olly to see what happens. Anyway, PCM*** does the same job as PSUBD, and they are equally fast (e.g. one cycle on my AMD).
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 29, 2012, 12:22:22 AM
I compared two XMM registers, one of them greater
than the other.
According to this test, PSUBD followed by PMOVMSKB doesn't detect the difference ::)

------------------------------------
Test on PCMPEQD - Test ERR
------------------------------------
Test on PSUBD   - Test OK
------------------------------------

Press any key to continue ...


This is the code I used. Did I make any error?


; ---------------------------------------------------------------------
; TEST_PSUBD.ASM--
; http://www.masm32.com/board/index.php?topic=770.0
;-------------------------------------------------------------------------------
; Test the difference between PCMPEQD and PSUBD when comparing two XMM
; registers.
; 28/Nov/2012 - MASM FORUM - frktons
;-------------------------------------------------------------------------------



.nolist
include \masm32\include\masm32rt.inc
.686
.xmm


.data

align 8
Check db  8  dup(20h),0,0,0,0
PtrCheck dd  Check

align 8
TestOK db  "Test OK ",0,0,0,0
align 8
TestERR db  "Test ERR",0,0,0,0


.code

start:


print "---------------------------------------------------------", 13, 10
      print "Test on PCMPEQD - "
      call  PCMP_TEST
      print PtrCheck, 13, 10
print "---------------------------------------------------------", 13, 10

      print "Test on PSUBD   - "
      call  PSUB_TEST
      print PtrCheck, 13, 10
print "---------------------------------------------------------", 13, 10, 13, 10
      inkey

      exit
     
; -----------------------------------------------------------------------------------------------
PSUB_TEST proc


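    mov ebx, 32323232h
    mov edx, 00000001h

    movd xmm2, edx
    pshufd xmm2, xmm2, 0    ; xmm2 = 4 x 00000001h

    movd xmm0, ebx
    pshufd xmm0, xmm0, 0    ; xmm0 = 4 x 32323232h

    movdqa xmm1, xmm0

    paddd  xmm1, xmm2       ; xmm1 = xmm0 + 1 in every dword

    psubd xmm1,xmm0         ; xmm1 = 4 x 00000001h (the difference)

    pmovmskb edx, xmm1      ; copies only the top bit of each byte, so dx = 0 here
    cmp   dx, 0

    jne CheckErr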

CheckOK:

    lea eax, Check
    movq mm0, qword ptr TestOK
    movq qword ptr [eax], mm0
    jmp  EndCheck

CheckErr:

    lea eax, Check
    movq mm0, qword ptr TestERR
    movq qword ptr [eax], mm0

EndCheck:

    ret

PSUB_TEST endp

; -----------------------------------------------------------------------------------------------
PCMP_TEST proc


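    mov ebx, 32323232h

    mov edx, 00000001h

    movd xmm2, edx
    pshufd xmm2, xmm2, 0    ; xmm2 = 4 x 00000001h

    movd xmm0, ebx
    pshufd xmm0, xmm0, 0    ; xmm0 = 4 x 32323232h

    movdqa xmm1, xmm0

    paddd  xmm1, xmm2       ; xmm1 = xmm0 + 1 in every dword

    pcmpeqd xmm1,xmm0       ; no dword is equal, so xmm1 = 0

    pmovmskb edx, xmm1      ; dx = 0, which is not 0FFFFh: difference detected
    cmp   dx, 0FFFFh

    jne CheckErr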

CheckOK:

    lea eax, Check
    movq mm0, qword ptr TestOK
    movq qword ptr [eax], mm0
    jmp  EndCheck

CheckErr:

    lea eax, Check
    movq mm0, qword ptr TestERR
    movq qword ptr [eax], mm0

EndCheck:

    ret

PCMP_TEST endp


end start

Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: jj2007 on November 29, 2012, 01:08:17 AM
The logic is inverted:
pcmpeqb for xmm1=xmm0: xmm1 becomes ffffffffffffffffh
psubd  for xmm1=xmm0: xmm1 becomes 0h
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 29, 2012, 01:44:12 AM
Quote from: jj2007 on November 29, 2012, 01:08:17 AM
The logic is inverted:
pcmpeqb for xmm1=xmm0: xmm1 becomes ffffffffffffffffh
psubd  for xmm1=xmm0: xmm1 becomes 0h


So what is my error? I was aware that the logic is inverted
and I tested:

    cmp    dx, 0

    jne CheckErr

for PSUBD, and


    cmp   dx, 0FFFFh

    jne CheckErr


for PCMPEQD. ::)
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: jj2007 on November 29, 2012, 03:12:22 AM
It seems pcmpeqb returns always zero, unless the xmm bytes are FFh...
---------------------------------------------------------
Test on PCMPEQD -
pcmpeqd in
xmm1            3617008641903833650
xmm0            3617008641903833650
pcmpeqd out     xmm1            -1

pmovmskb
xmm1            -1
edx             65535
Test OK
---------------------------------------------------------
Test on PSUBD   -
PSubD in
xmm1            3617008641903833650
xmm0            3617008641903833650
PSubD out       xmm1            0

pmovmskb
xmm1            0
dx              0
Test OK
---------------------------------------------------------

---------------------------------------------------------
Test on PCMPEQD -
pcmpeqd in
xmm1            3617008646198800947
xmm0            3617008641903833650
pcmpeqd out     xmm1            0

pmovmskb
xmm1            0
edx             0
Test ERR
---------------------------------------------------------
Test on PSUBD   -
PSubD in
xmm1            3617008646198800947
xmm0            3617008641903833650
PSubD out       xmm1            4294967297

pmovmskb
xmm1            4294967297
dx              0
Test OK
---------------------------------------------------------
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on November 29, 2012, 04:17:03 AM
deleted
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on November 29, 2012, 04:37:01 AM
deleted
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 29, 2012, 04:41:35 AM
When I read the Intel manuals about PMOVMSKB,
I found something that doesn't fit with using it to
compare two XMM registers for equality:

Creates a mask made up of the most significant bit of each byte of the source
operand (second operand) and stores the result in the low byte or word of the destination
operand (first operand).

If only the most significant bits are stored into the destination operand, and the difference is in other
bits, it will not be detected.
So my idea is that after PSUBD we have to use a different opcode to
detect whether there are differences other than in the MSBs of the xmm register we are testing.

On the other hand, using PCMPEQD we can test both the equality and the difference
between the xmm registers, using PMOVMSKB.
This is what I've understood so far.
Using PSUBD is a smart solution, but it needs to be followed by something
other than PMOVMSKB, in my opinion.

So far nidud's solution is the one I understand. Waiting for some other solution.
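
One possible follow-up for PSUBD, sketched under assumptions: it is not code from this thread. The difference is compared byte by byte against a zeroed register, so PMOVMSKB sees FFh for every byte that is zero. The program assumes the MASM32 package and reuses the test values from the posts above; the label names are made up.

```masm
; Sketch only: make PSUBD usable for an equality test by checking
; every byte of the difference, not just the top bits.
include \masm32\include\masm32rt.inc
.686
.xmm

.code
start:
    mov   eax, 32323232h
    movd  xmm0, eax
    pshufd xmm0, xmm0, 0        ; xmm0 = 4 x 32323232h
    mov   eax, 1
    movd  xmm2, eax
    pshufd xmm2, xmm2, 0        ; xmm2 = 4 x 00000001h
    movdqa xmm1, xmm0
    paddd xmm1, xmm2            ; xmm1 = xmm0 + 1 in every dword

    psubd xmm1, xmm0            ; difference: 00000001h per dword, top bits clear
    pxor  xmm3, xmm3            ; xmm3 = 0
    pcmpeqb xmm1, xmm3          ; FFh for every byte of the difference that is zero
    pmovmskb eax, xmm1          ; one mask bit per byte
    cmp   eax, 0FFFFh           ; all 16 bytes zero means the registers were equal
    jne   Differ
    print "equal", 13, 10
    jmp   Finished
Differ:
    print "different", 13, 10
Finished:
    exit
end start
```

With these values the low byte of every dword of the difference is 01h, so the mask is 0EEEEh rather than 0FFFFh and the difference is reported. PCMPEQD on the original registers gets there in fewer instructions, which is the point made in the post.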
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 29, 2012, 04:43:35 AM
Quote from: nidud on November 29, 2012, 04:37:01 AM
Maybe you could use CMPNEQPS
The result should then be zero if equal


Yes, probably this opcode will work as well.

Quote
What does pxor xmm0, xmm0 do ?
The same thing that xor rax, rax ?

Yes again. So far I think the PCMPEQD variant is the complete one
for testing equality. In my opinion, something is missing for PSUBD.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: qWord on November 29, 2012, 04:50:32 AM
What do you want to compare? FP or integer value?
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 29, 2012, 04:54:25 AM
Quote from: qWord on November 29, 2012, 04:50:32 AM
What do you want to compare? FP or integer value?

xmm integer packed data is what I'd like to compare.

I want to know, for example, if xmm0 is equal to xmm1.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: qWord on November 29, 2012, 05:00:07 AM
Quote from: frktons on November 29, 2012, 04:54:25 AM
Quote from: qWord on November 29, 2012, 04:50:32 AM
What do you want to compare? FP or integer value?

xmm integer packed data is what I'd like to compare.

I want to know, for example, if xmm0 is equal to xmm1.
So, you are right with PCMPEQxx + PMOVMSKB.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: jj2007 on November 29, 2012, 05:02:03 AM
Quote from: qWord on November 29, 2012, 04:50:32 AM
What do you want to compare? FP or integer value?

Test for equality only, so FP or INT won't make a difference. Although I wonder how CMPNEQPS aka CMPPS xmmDest, xmmSrc, 4 handles exotic cases (NaN vs 0 etc). In any case, PCMPEQxx is the right choice, as qWord already wrote.

Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: qWord on November 29, 2012, 05:07:05 AM
Quote from: jj2007 on November 29, 2012, 05:02:03 AM
Although I wonder how CMPNEQPS aka CMPPS xmmDest, xmmSrc, 4 handles exotic cases (NaN vs 0 etc).
can't work because there are N possibilities to represent a NaN, whereas CMPxxPS returns true for all pairs of NaNs.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: jj2007 on November 29, 2012, 05:13:34 AM
Quote from: qWord on November 29, 2012, 05:07:05 AM
can't work because there are N possibilities to represent a NaN, whereas CMPxxPS returns true for all pairs of NaNs.

Grazie, good to know :t
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: qWord on November 29, 2012, 05:15:49 AM
Quote from: jj2007 on November 29, 2012, 05:13:34 AM
Quote from: qWord on November 29, 2012, 05:07:05 AM
can't work because there are N possibilities to represent a NaN, whereas CMPxxPS returns true for all pairs of NaNs.

Grazie, good to know :t
Oops... that applies to the unordered compare; for CMPEQPS it always returns false.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 29, 2012, 05:24:12 AM
I'm glad to see everybody agreed eventually.  :t

Now let's go further.

How do I compare for greater than?
:P

Same registers, same data type.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: dedndave on November 29, 2012, 06:18:43 AM
reverse the operands ?
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: qWord on November 29, 2012, 06:20:50 AM
Quote from: frktons on November 29, 2012, 05:24:12 AM
Now let's go further.

How do I compare for greater than?
:P

Same registers, same data type.
PCMPGTD  ::)
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 29, 2012, 06:22:44 AM
Quote from: qWord on November 29, 2012, 06:20:50 AM
Quote from: frktons on November 29, 2012, 05:24:12 AM
Now let's go further.

How do I compare for greater than?
:P

Same registers, same data type.
PCMPGTD  ::)

Thanks qWord, maybe this time I'll take less time to get the info.
Well it looks quite simple to manage:
Quote
If a data element in the destination operand is greater
than the corresponding data element in the source operand, the corresponding data
element in the destination operand is set to all 1s; otherwise, it is set to all 0s.
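
One detail worth keeping in mind: PCMPGTD compares the dwords as signed integers. The sketch below assumes the MASM32 package (including its str$ macro) and uses made-up variable names. It shows the signed result for 80000000h against 1, followed by a common workaround that is not from this thread: flipping the sign bit of both operands so that the same instruction orders them as unsigned values.

```masm
; Sketch only: PCMPGTD is a signed compare; flipping bit 31 of both
; operands first turns it into an unsigned compare.
include \masm32\include\masm32rt.inc
.686
.xmm

.data?
SignedMask   dd ?
UnsignedMask dd ?

.code
start:
    mov   eax, 80000000h
    movd  xmm0, eax
    pshufd xmm0, xmm0, 0        ; A = 4 x 80000000h
    mov   eax, 1
    movd  xmm1, eax
    pshufd xmm1, xmm1, 0        ; B = 4 x 00000001h
    movdqa xmm7, xmm0           ; xmm7 = 4 x 80000000h, reused as sign-bit mask

    movdqa xmm2, xmm0
    pcmpgtd xmm2, xmm1          ; signed: 80000000h is negative, so A > B is false
    pmovmskb eax, xmm2
    mov   SignedMask, eax

    pxor  xmm0, xmm7            ; flip the sign bit of both operands...
    pxor  xmm1, xmm7
    pcmpgtd xmm0, xmm1          ; ...so the signed compare orders them as unsigned
    pmovmskb eax, xmm0
    mov   UnsignedMask, eax

    print "signed   A > B mask: "
    print str$(SignedMask), 13, 10
    print "unsigned A > B mask: "
    print str$(UnsignedMask), 13, 10
    exit
end start
```

The first mask should come out as 0 and the second as 65535 (0FFFFh), since 80000000h is larger than 1 only when read as unsigned.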
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on November 29, 2012, 08:32:39 AM
deleted
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on November 29, 2012, 08:52:13 AM
deleted
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 29, 2012, 09:52:49 AM
Quote from: nidud on November 29, 2012, 08:32:39 AM
To conclude:

pcmpeqb compares 16 bytes, pcmpeqd compares 4 doublewords, result in xmm0 is the same.
pcmpgtb compares 16 bytes, pcmpgtd compares 4 doublewords, result in xmm0 is not the same.


pcmpeqb xmm0,xmm1 ; xmm0 to FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFh if equal
pmovmskb eax,xmm0 ; eax to 0000FFFFh
cmp ax,0FFFFh
je is_equal

pcmpgtb xmm0,xmm1 ; xmm0 to FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFh if greater
pmovmskb eax,xmm0 ; eax to 0000FFFFh
cmp ax,8000h ; ?
jle is_great



I don't think that is right for the GT test. So far I got:

when we use pcmpgtb
we get FF only in the bytes that are greater, not in all of them,
and the same goes for dwords with pcmpgtd: only the dwords that are greater
are set to FF; the rest are set to
00 if they are equal or less.

Instead of this:

pcmpgtb xmm0,xmm1 ; xmm0 to FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFh if greater
pmovmskb eax,xmm0 ; eax to 0000FFFFh
cmp ax,8000h ; ?
jle is_great


Something like:


pcmpgtd xmm0,xmm1
pmovmskb eax,xmm0
        .if ax & 8000h          ; bit 15 set: the top dword is greater
            jmp IsGreater
        .endif


If bit 15 is not set, the way is longer, and we have to test
other things:


pcmpgtd xmm2,xmm3; same values in reverse order
pmovmskb ebx,xmm2
        .if bx & 8000h          ; bit 15 set
            jmp IsGreater;  The second original value tested
        .endif
        .if ax == bx
            jmp AreEqual
        .elseif ax > bx
            jmp IsGreater
        .else
            jmp IsLessThan
        .endif


Not tested yet, but this is the idea.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on November 29, 2012, 10:59:37 AM
deleted
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 29, 2012, 11:07:16 AM
Quote from: nidud on November 29, 2012, 10:59:37 AM

The upper byte should always be the same (FF,? or FFFFFFFF,? if greater)

Yes and no. It depends on whether the individual byte or dword tested is greater, not the whole xmm register.
But if the upper byte, word or dword is FF, you know the entire xmm register is greater; otherwise
you have to check other things.

Quote from: frktons on November 29, 2012, 09:52:49 AM

pcmpgtd xmm2,xmm3; same values in reverse order
pmovmskb ebx,xmm2
        .if bx & 8000h
            jmp IsGreater;  The second original value tested
        .endif
        .if ax == bx
            jmp AreEqual
        .else
            jmp IsLessThan; The second original value tested
        .endif

Not tested yet, but this is the idea.

assuming ax is zero, I think that is correct

I modified the code, there was a logical error, have a look.
It should work with whatever value ax and bx assume.

Quote
same as:
test ah,80h
jnz is_great
test ax,ax
jz is_equal
jmp is_less



I think you have to use 2 registers, not only ax. According to my understanding
you cannot check for greater, equal or less than in a single pass. Only
if you are lucky can you find the answer in the first check: if the upper byte is FF
you can say xmm0 is greater than xmm1.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 29, 2012, 11:46:50 AM
So the complete test for greater, less than or equal should
be something like this:


        pcmpgtd  xmm0,xmm1
        pmovmskb eax,xmm0
        .if ax & 8000h          ; top dword of the first value is greater
            jmp IsGreater
        .endif

        pcmpgtd  xmm2,xmm3      ; same values in reverse order
        pmovmskb ebx,xmm2
        .if bx & 8000h          ; top dword of the second value is greater
            jmp IsLessThan
        .endif
        .if ax == bx
            jmp AreEqual
        .elseif ax > bx
            jmp IsGreater
        .else
            jmp IsLessThan
        .endif


I'll try the code and let you know.
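
A self-contained sketch of the same idea, under stated assumptions: the dwords are signed, dword 3 is the most significant one, and CmpXmm, ValA and ValB are names invented for this example. Since no dword can be both greater and less, the two masks never share a bit, and an unsigned compare of the masks is decided by the highest dword that differs.

```masm
; Sketch only: three-way compare of two XMM registers holding 4 signed dwords.
include \masm32\include\masm32rt.inc
.686
.xmm

.data
; dwords are listed low to high; dword 3 is treated as most significant
ValA dd 9, 0, 0, 1
ValB dd 0, 0, 0, 2

.code
start:
    lea   eax, ValA
    movdqu xmm0, [eax]          ; A
    lea   eax, ValB
    movdqu xmm1, [eax]          ; B
    call  CmpXmm
    .if eax == 0
        print "A = B", 13, 10
    .elseif eax == 1
        print "A > B", 13, 10
    .else
        print "A < B", 13, 10
    .endif
    exit

; In:  xmm0 = A, xmm1 = B.  Out: EAX = 1 (A > B), 0 (equal), -1 (A < B).
; Destroys xmm0, xmm2 and EDX.
CmpXmm proc
    movdqa  xmm2, xmm1
    pcmpgtd xmm2, xmm0          ; FFFFFFFFh in every dword where B > A
    pcmpgtd xmm0, xmm1          ; FFFFFFFFh in every dword where A > B
    pmovmskb eax, xmm0          ; "greater" mask, 4 bits per dword
    pmovmskb edx, xmm2          ; "less" mask
    cmp   eax, edx              ; unsigned: the highest differing dword decides
    ja    IsGreater
    jb    IsLess
    xor   eax, eax              ; both masks are zero: all dwords equal
    ret
IsGreater:
    mov   eax, 1
    ret
IsLess:
    mov   eax, -1
    ret
CmpXmm endp

end start
```

With the sample values the "greater" mask is 000Fh (dword 0) and the "less" mask is 0F000h (dword 3), so the program reports A < B even though the lowest dword of A is larger.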
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on November 29, 2012, 12:00:35 PM
deleted
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on November 29, 2012, 12:15:18 PM
Read again my last posts, yes we can know if the
compare gives GT, LT or EQ.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on December 03, 2012, 08:39:08 AM
deleted
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on December 03, 2012, 09:37:11 AM
deleted
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on December 04, 2012, 05:06:14 AM
deleted
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on December 04, 2012, 05:11:03 AM
my test for atol:


Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
--------------------------------------------------------
429117  cycles for atol LODSB
350976  cycles for atol SHL
411231  cycles for atol LEA
--------------------------------------------------------
430242  cycles for atol LODSB
509282  cycles for atol SHL
395102  cycles for atol LEA
--------------------------------------------------------


well they are a bit random, as you said.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on December 04, 2012, 05:20:42 AM
Memcpy:


Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
--------------------------------------------------------
757223  cycles for memcpy A
288975  cycles for memcpy movdqa xmm0 A
1352024 cycles for memcpy movdqu xmm0 A
1367569 cycles for memcpy movdqu xmm0..xmm7 A
5668726 cycles for memcpy movdqu xmm0 U
4563076 cycles for memcpy movdqu xmm0..xmm7 U
--------------------------------------------------------
749649  cycles for memcpy A
302916  cycles for memcpy movdqa xmm0 A
1737163 cycles for memcpy movdqu xmm0 A
1841807 cycles for memcpy movdqu xmm0..xmm7 A
6136384 cycles for memcpy movdqu xmm0 U
4055501 cycles for memcpy movdqu xmm0..xmm7 U
--------------------------------------------------------


and the last routines:


Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
--------------------------------------------------------
5825483 cycles for crt_memcpy A
7352188 cycles for memcpy A
7269901 cycles for memcpyd A
4146083 cycles for memcpy movdqa A
8700368 cycles for memcpy movdqu A
27212513 cycles for memcpy movdqu U
--------------------------------------------------------
6180618 cycles for crt_memcpy A
9100718 cycles for memcpy A
7028934 cycles for memcpyd A
4151090 cycles for memcpy movdqa A
11923657 cycles for memcpy movdqu A
29665081 cycles for memcpy movdqu U
--------------------------------------------------------


Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on December 04, 2012, 05:35:31 AM
You might find the test we did a couple of years ago interesting:



Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on December 04, 2012, 06:04:11 AM
deleted
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: jj2007 on December 04, 2012, 06:25:32 AM
memcpy?? Looks familiar (http://www.masmforum.com/board/index.php?topic=11454.msg87610#msg87610) :biggrin:
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: nidud on December 04, 2012, 06:48:06 AM
deleted
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: frktons on December 04, 2012, 07:18:43 AM
Quote from: nidud on December 04, 2012, 06:48:06 AM

We either all use the same CPU or stick to the basic then  :lol:


Optimization is quite a strange beast indeed, it comes and goes
depending on many [maybe too many] factors.
Nevertheless we can try and find something that we didn't expect :lol:
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: jj2007 on December 04, 2012, 07:36:25 AM
Quote from: nidud on December 04, 2012, 06:48:06 AM
Intel(R) Core(TM) i3 CPU         540  @ 3.07GHz (SSE4)
6.065.640       cycles for RtlZeroMemory
5.992.196       cycles for rep stosd
7.240.756       cycles for MOVNTDQ

It seems Intel is still working on rep stosd..!
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: GabrielRavier on October 27, 2018, 12:47:53 PM
I tried it for fun (haven't even read messages here though so this is prolly bad code) :
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: hutch-- on October 27, 2018, 10:22:17 PM
Gabriel,

I tried to use the 3 algos to see what they did but I could not get it working, would it be possible for you to put the algos in a test piece ?
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: GabrielRavier on October 28, 2018, 05:16:05 AM
I think I did basically what the original post asked for, which was to convert an array of 4096 dwords to an array of 4096 bytes (src is the 4096 dword array and dst the 4096 byte array)

I'll try making a more flexible algorithm that doesn't assume size nor alignment I guess lol.

The different functions do the same thing, just with different extensions (progressing through them)
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: GabrielRavier on October 28, 2018, 11:57:00 PM
So I made a few new revised ones (attached), which should probably be a lot easier to test.

Basically they do what the original post asked to do, which is to move sz dwords from the array pointed to by src, transform them into bytes, and then put them into the array pointed to by dst.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: hutch-- on October 29, 2018, 06:26:26 AM
What have you built this with ? I downloaded the latest version of ML.EXE but it throws an error on "movd".

Microsoft (R) Macro Assembler Version 14.15.26730.0
Copyright (C) Microsoft Corporation.  All rights reserved.

Assembling: K:\asm32\gabriel\2\dwordToByte.asm
K:\asm32\gabriel\2\dwordToByte.asm(90) : error A2070:invalid instruction operands
K:\asm32\gabriel\2\dwordToByte.asm(91) : error A2070:invalid instruction operands
K:\asm32\gabriel\2\dwordToByte.asm(182) : error A2070:invalid instruction operands

OK, that was easy to fix, just added DWORD PTR to the three lines and it builds into an object module with no problems.

I prototyped the three procedures so it should be callable using normal MASM "invoke".

    dwordToByte      PROTO dst:dword, src:dword, sz:dword
    dwordToByteSSE2  PROTO dst:dword, src:dword, sz:dword
    dwordToByteSSSE3 PROTO dst:dword, src:dword, sz:dword

Now all I need is the data format that calls the three procedures.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: GabrielRavier on October 29, 2018, 11:15:41 PM
I used JWasm, I guess it's more flexible and knows that movd always uses dword ptr for memory operands

Also, I used dword instead of ptr for dst and src lel.

But um dst is a pointer to an array of sz bytes (so its size is sz), src is a pointer to an array of sz dwords (so its size is sz * 4), and sz is the size.

Also I attached a version that should work with MASM. I also just found Uasm, so I'll prolly be able to make versions for AVX2 and later.
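
For readers without the attachment, here is a hypothetical SSE2 sketch of a routine with the interface described above (dst points to count bytes, src points to count dwords). It is not Gabriel's attached code: LowBytesSSE2 and the buffer names are invented, and the MASM32 package is assumed. Each dword is masked down to its low byte, then four registers are packed into one with PACKSSDW and PACKUSWB. Neither pack saturates here, because every value is already in the range 0 to 255.

```masm
; Sketch only: copy the low byte of each dword in src to dst (SSE2).
include \masm32\include\masm32rt.inc
.686
.xmm

LowBytesSSE2 PROTO :DWORD, :DWORD, :DWORD

.data
Src dd 20 dup(0)
Dst db 21 dup(0)                ; 20 bytes plus a terminating zero

.code
start:
    xor   ecx, ecx
@@: lea   eax, [ecx + 0ABCDEF41h] ; low byte runs from 'A' upward, rest is junk
    mov   Src[ecx*4], eax
    inc   ecx
    cmp   ecx, 20
    jb    @B
    invoke LowBytesSSE2, offset Dst, offset Src, 20
    print offset Dst, 13, 10
    exit

LowBytesSSE2 proc uses esi edi dst:DWORD, src:DWORD, count:DWORD
    mov   esi, src
    mov   edi, dst
    mov   eax, 0FFh
    movd  xmm7, eax
    pshufd xmm7, xmm7, 0        ; xmm7 = 4 x 000000FFh
    mov   ecx, count
    shr   ecx, 4                ; 16 dwords per iteration
    jz    Tail
Block:
    movdqu xmm0, [esi]          ; unaligned loads: no alignment assumed
    movdqu xmm1, [esi+16]
    movdqu xmm2, [esi+32]
    movdqu xmm3, [esi+48]
    pand  xmm0, xmm7            ; keep only the low byte of every dword
    pand  xmm1, xmm7
    pand  xmm2, xmm7
    pand  xmm3, xmm7
    packssdw xmm0, xmm1         ; 8 dwords -> 8 words
    packssdw xmm2, xmm3
    packuswb xmm0, xmm2         ; 16 words -> 16 bytes, original order kept
    movdqu [edi], xmm0
    add   esi, 64
    add   edi, 16
    dec   ecx
    jnz   Block
Tail:
    mov   ecx, count
    and   ecx, 15               ; leftover dwords, one at a time
    jz    Finished
@@: mov   al, [esi]
    mov   [edi], al
    add   esi, 4
    inc   edi
    dec   ecx
    jnz   @B
Finished:
    ret
LowBytesSSE2 endp

end start
```

The test data gives the 20 dwords the low bytes 'A' through 'T', so the program should print ABCDEFGHIJKLMNOPQRST. The PSHUFB versions timed earlier in the thread need SSSE3; this variant stays within SSE2.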
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: hutch-- on November 01, 2018, 09:31:56 AM
You will certainly do better with UASM as John has done some very good work there, JWASM was rough around the edges and was not properly MASM compatible. Don't be afraid to have a look at nidud's ASMC either as he has done a lot of good work as well. It is pretty much the case that ML.EXE is the only 32 bit MASM compatible assembler as it is the reference.
Title: Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
Post by: aw27 on November 01, 2018, 09:39:21 PM
In the Intel 64 and IA-32 Architectures Optimization Reference Manual there is a chapter on Data Gather and Scatter, which is actually what we are talking about here.