MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords

Started by frktons, November 25, 2012, 02:48:06 AM

frktons

Hi everybody.

It has been a long time since the last MASM FOR FUN adventure.

Time to be back?

I hope so, at least for a while.

Let's start with a question.

I have a buffer of 4096 consecutive dwords, and I'd like to extract the low byte of each dword
and put it in a second buffer.

The operation in itself is not that difficult, but I'd like to do it with SIMD instructions, just to
have a little bit of fun and discover some SSE2/SSE3 instructions.
With SIMD instructions I can work with 8/16 bytes at a time, speeding up the process as well.

I had a look at the Intel manuals, but [as n00bs do] I didn't find any suitable SIMD opcode.

Has anybody gone through this problem and found an SSE2/SSE3 solution?

In pseudocode:


xmm0 = XFDC GFTI DEWA HYTO
pckwhat? eax, xmm0,  n

And we have CIAO in eax.

CIAO

There are only two days a year when you can't do anything: one is called yesterday, the other is called tomorrow, so today is the right day to love, believe, do and, above all, live.

Dalai Lama

Magnum

Great to see you back.

Take care,
                   Andy

Ubuntu-mate-18.04-desktop-amd64

http://www.goodnewsnetwork.org

frktons

Quote from: Magnum on November 25, 2012, 03:05:34 AM
Great to see you back.

Thanks, it's good to be back.

With the old instructions I get:


Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE3)
28098   cycles for MOV DL

28009   cycles for MOV DL


--- ok ---


Attached the source and exe.

Frank

Magnum

Intel(R) Core(TM)2 Duo CPU     P8600  @ 2.40GHz (SSE4)
8127    cycles for MOV DL

8372    cycles for MOV DL

Does having an extra core really make that much difference ?

Andy

frktons

Quote from: Magnum on November 25, 2012, 04:18:08 AM
Intel(R) Core(TM)2 Duo CPU     P8600  @ 2.40GHz (SSE4)
8127    cycles for MOV DL

8372    cycles for MOV DL

Does having an extra core really make that much difference ?

Andy

Well, the processor you are using is about 5 years younger, which
means it runs faster than the old one I'm using.
RAM clock, data bus, cache memory... there are many things that speed things up, indeed.

Frank

frktons

I'm not sure if a single SIMD opcode is able to accomplish the task, but
probably more than one could do it:

If I have four xmm registers with the dwords read from the dword buffer



    movd      xmm0,  [ebx]
    movd      xmm1,  [ebx + 4]   
    movd      xmm2,  [ebx + 8] 
    movd      xmm3,  [ebx + 12] 
   



interleaving the bytes of the xmm registers and leaving out the unused
bytes should be possible.

While I wait for some suggestions, I'll carry on thinking and reading.
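
One way this idea is often done with plain SSE2, sketched here as an assumption and not as anything posted in this thread: mask each dword down to its low byte with PAND, then narrow twice with PACKSSDW and PACKUSWB. The test data, the register choice and the block count are made up for illustration. MOVDQU is used so that no alignment is assumed, and the assembler must understand SSE2.

```asm
include \masm32\include\masm32rt.inc
.686p
.xmm

.data
; made-up test data: 16 dwords whose low bytes spell "CIAO" four times
Source  dd 4 dup(0AABBCC43h, 0AABBCC49h, 0AABBCC41h, 0AABBCC4Fh)
Dest    db 16 dup(0), 0             ; 16 result bytes + string terminator

.code
start:
    mov     esi, offset Source
    mov     edi, offset Dest
    mov     ecx, 1                  ; number of 16-dword blocks (hypothetical)
    pcmpeqd xmm7, xmm7              ; all bits set
    psrld   xmm7, 24                ; 000000FFh in every dword
@@:
    movdqu  xmm0, [esi]             ; dwords 0..3
    movdqu  xmm1, [esi+16]          ; dwords 4..7
    movdqu  xmm2, [esi+32]          ; dwords 8..11
    movdqu  xmm3, [esi+48]          ; dwords 12..15
    pand    xmm0, xmm7              ; keep only the low byte of each dword
    pand    xmm1, xmm7
    pand    xmm2, xmm7
    pand    xmm3, xmm7
    packssdw xmm0, xmm1             ; 8 words, dwords 0..7 in order
    packssdw xmm2, xmm3             ; 8 words, dwords 8..15 in order
    packuswb xmm0, xmm2             ; 16 bytes, dwords 0..15 in order
    movdqu  [edi], xmm0
    add     esi, 64
    add     edi, 16
    dec     ecx
    jnz     @B
    print   offset Dest, 13, 10
    exit
end start
```

Because the mask leaves only values 0..255 in each lane, neither pack instruction saturates. With ECX set to 4096/16 and suitably sized buffers, the same loop would cover the whole 4096-dword buffer.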

Frank

CommonTater

Quote from: Magnum on November 25, 2012, 04:18:08 AM
Intel(R) Core(TM)2 Duo CPU     P8600  @ 2.40GHz (SSE4)
8127    cycles for MOV DL

8372    cycles for MOV DL

Does having an extra core really make that much difference ?

Andy

Hi Andy...
Multicore will perform better on an OS like Windows that is multi-tasking as some processes are assigned to each core, permitting much better multitasking than on single core machines.

However... inside your application, 1 core or 512 cores won't make much difference unless you are writing simultaneously executing multithreaded code so that some of your tasks can be spread out across different cores.  In that case simply writing two simultaneous threads can (often) significantly increase the speed of your code.

frktons

Maybe I've found a solution. I'll try it and test its performance.
If it works, it is a first step toward translating the code the SSE way.


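    mov     eax, offset Dest
    mov     ebx, offset Source
    mov     ecx, 4096 / 4               ; counter added here: four dwords per pass
@@:
    movd    mm0, dword ptr [ebx]        ; dword A
    movd    mm1, dword ptr [ebx + 4]    ; dword B
    movd    mm2, dword ptr [ebx + 8]    ; dword C
    movd    mm3, dword ptr [ebx + 12]   ; dword D

    punpcklbw mm0, mm2                  ; a0 c0 a1 c1 ...
    punpcklbw mm1, mm3                  ; b0 d0 b1 d1 ...
    punpcklbw mm0, mm1                  ; a0 b0 c0 d0 ...

    movd    dword ptr [eax], mm0        ; store the four low bytes
    add     ebx, 16                     ; loop tail below is not in the original post
    add     eax, 4
    dec     ecx
    jnz     @B
    emms                                ; leave MMX state before any FPU code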




frktons

Well, it apparently works with some speed improvement:


Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE3)
---------------------------------------------------
28315   cycles for MOV DL

16479   cycles for MOVD DWORD PTR

28394   cycles for MOV DL

16637   cycles for MOVD DWORD PTR


--- ok ---



I'll see if there is a better method, with fewer SSE opcodes, to get
better performance.

Frank

jj2007

Ciao Frank,
Welcome to the Forum, and thanks for the message in the other thread ;-)

What you are looking for is pshufb. There is a testpiece by Hutch here.

You need to add this:
    include \masm32\include\masm32rt.inc
    .686p
    .xmm


... and you need a modern CPU ;-)
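
A minimal sketch of how PSHUFB could do the job for four dwords at a time, assuming an SSSE3-capable CPU and an assembler that knows the instruction. The test data, the ShufMask name and the register choice are invented for illustration and are not taken from Hutch's testpiece.

```asm
include \masm32\include\masm32rt.inc
.686p
.xmm

.data
; made-up test data: the low bytes spell "CIAO"
Source   dd 0AABBCC43h, 0AABBCC49h, 0AABBCC41h, 0AABBCC4Fh
; PSHUFB control: each byte is the index of the source byte to copy,
; a byte with the top bit set (80h) writes zero instead
ShufMask db 0, 4, 8, 12, 12 dup(80h)
Dest     db 4 dup(0), 0             ; 4 result bytes + string terminator

.code
start:
    mov     esi, offset Source
    mov     edx, offset ShufMask
    mov     edi, offset Dest
    movdqu  xmm0, [esi]             ; four source dwords
    movdqu  xmm1, [edx]             ; shuffle control bytes
    pshufb  xmm0, xmm1              ; bytes 0, 4, 8, 12 move to positions 0..3
    movd    dword ptr [edi], xmm0   ; store the four gathered bytes
    print   offset Dest, 13, 10     ; should show CIAO
    exit
end start
```

The control bytes are loaded with MOVDQU so that no 16-byte alignment of ShufMask has to be assumed.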

frktons

Quote from: jj2007 on November 25, 2012, 07:50:00 AM
Ciao Frank,
Welcome to the Forum, and thanks for the message in the other thread ;-)

What you are looking for is pshufb. There is a testpiece by Hutch here.

You need to add this:
    include \masm32\include\masm32rt.inc
    .686p
    .xmm


... and you need a modern CPU ;-)

Ciao jj.

I have to switch to the second PC in order to use SSSE3 opcodes. I'll do it as soon
as I'm ready.
What do you think about the solution I used? It is MMX but not very fast, at
least on my P IV dual core 3.2 GHz.

Frank

jj2007

Quote from: frktons on November 25, 2012, 08:04:02 AM
What do you think about the solution I used?

Not fast indeed. PSHUFB must be much better, but I can't test it here...

See http://www.rz.uni-karlsruhe.de/rz/docs/VTune/reference/About_IA-32_Instructions.htm#P-Instructions for some useful info.

Did you try the straightforward solution?

include \masm32\include\masm32rt.inc
.data
src   db "xxxCxxxIxxxAxxxO", 0
dest   db 20 dup(?)
.code
start:
   mov esi, offset src+3
   mov edi, offset dest
   REPEAT 4
      lodsd
      stosb
   ENDM
   inkey offset dest
   exit
end start

frktons

I've started some tests on PSHUFB.
In the meantime, here is your proposal alongside
the previous ones. The MMX code still leads.


Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
---------------------------------------------------------
12412   cycles for MOV DL
6199    cycles for MMX/MOVD DWORD PTR
21114   cycles for STOSB
---------------------------------------------------------
12401   cycles for MOV DL
6235    cycles for MMX/MOVD DWORD PTR
21137   cycles for STOSB
---------------------------------------------------------

--- ok ---


Frank

dedndave

hiyas Frank - good to see you

this might have fewer dependencies...

@@:     mov     edx,[ebx]
        add     ebx,4
        mov     [eax],dl
        dec     ecx
        lea     eax,[eax+1]
        jnz     @B

frktons

Quote from: dedndave on November 25, 2012, 11:56:38 AM
hiyas Frank - good to see you

this might have fewer dependencies...

@@:     mov     edx,[ebx]
        add     ebx,4
        mov     [eax],dl
        dec     ecx
        lea     eax,[eax+1]
        jnz     @B


Hi Dave. Nice to see you too.
The sequence you used makes me unsure about the JNZ:


        dec     ecx
        lea     eax,[eax+1]
        jnz     @B

Does the jnz refer to ECX
or to EAX?
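
For reference, a conditional jump only looks at the flags, and LEA never changes them, so the JNZ in that sequence tests the zero flag left by DEC ECX. A small self-contained sketch of the same loop, with made-up test data:

```asm
include \masm32\include\masm32rt.inc

.data
; made-up test data: the low bytes spell "CIAO"
Source  dd 0AABBCC43h, 0AABBCC49h, 0AABBCC41h, 0AABBCC4Fh
Dest    db 4 dup(0), 0          ; 4 result bytes + string terminator

.code
start:
    mov     eax, offset Dest
    mov     ebx, offset Source
    mov     ecx, 4              ; number of dwords to process
@@: mov     edx, [ebx]
    add     ebx, 4              ; sets flags, but DEC rewrites ZF afterwards
    mov     [eax], dl           ; MOV leaves the flags alone
    dec     ecx                 ; ZF = 1 when ECX reaches zero
    lea     eax, [eax+1]        ; LEA does not modify any flag
    jnz     @B                  ; so this tests the result of DEC ECX
    print   offset Dest, 13, 10 ; should show CIAO
    exit
end start
```

Using LEA in place of INC or ADD for the pointer is what lets the increment sit between the DEC and the JNZ.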


By the way, I inserted your code into the program, but I get strange results:




invoke Sleep, 100
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS

mov eax, offset Dest
mov ecx, 4096
mov ebx, offset Source

    @@:     mov     edx,[ebx]
            add     ebx,4
            mov     [eax],dl
            dec     ecx           
            lea     eax,[eax+1]           

            jnz     @B
           
            print str$(eax), 9, "cycles for DAVE ", 13, 10, 13, 10



and:


Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
---------------------------------------------------------
12396   cycles for MOV DL
6193    cycles for MMX/MOVD DWORD PTR
21098   cycles for STOSB
20677248        cycles for DAVE

---------------------------------------------------------
12394   cycles for MOV DL
6207    cycles for MMX/MOVD DWORD PTR
21093   cycles for STOSB
20677248        cycles for DAVE

---------------------------------------------------------

--- ok ---


Did I misspell something, or what?
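
One possible explanation, offered as a guess since the full source is not shown: the fragment has no counter_end before the print, so EAX would still hold the destination pointer after the loop, which would explain why the same number shows up in both runs. A sketch with the timing loop closed, assuming the counter_begin/counter_end pair comes from timers.asm at the path below, and with a hypothetical LOOP_COUNT and buffers:

```asm
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm   ; assumed location of the timing macros

LOOP_COUNT equ 1000000              ; hypothetical repeat count

.data?
Source  dd 4096 dup(?)              ; hypothetical buffers
Dest    db 4096 dup(?)

.code
start:
    invoke  Sleep, 100
    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS

        mov     eax, offset Dest
        mov     ecx, 4096
        mov     ebx, offset Source
    @@: mov     edx, [ebx]
        add     ebx, 4
        mov     [eax], dl
        dec     ecx
        lea     eax, [eax+1]
        jnz     @B

    counter_end                     ; closes the timed loop, result is returned in EAX
    print   str$(eax), 9, "cycles for DAVE", 13, 10
    exit
end start
```

The only change to the timed code itself is the added counter_end line. Everything around it is scaffolding to make the fragment stand alone.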


