Author Topic: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords  (Read 46530 times)

frktons

  • Member
  • ***
  • Posts: 491
MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
« on: November 25, 2012, 02:48:06 AM »
Hi everybody.

It has been a long time since the last MASM FOR FUN adventure.

Time to be back?

I hope so, at least for a while.

Let's start with a question.

I have a buffer of 4096 consecutive dwords, and I'd like to extract the low byte of each dword
and put it in a second buffer.

The operation in itself is not that difficult, but I'd like to do it with SIMD instructions, just to
have a bit of fun and discover some SSE2/SSE3 instructions.
With SIMD instructions I can work with 8/16 bytes at a time, speeding up the process as well.

I had a look at the Intel manuals, but [as n00bs do] I didn't find any suitable SIMD opcode.

Has anybody gone through this problem and found an SSE2/SSE3 solution?

In pseudocode:

Code: [Select]
xmm0 = XFDC GFTI DEWA HYTO
pckwhat? eax, xmm0,  n
And we have CIAO in eax.

CIAO

« Last Edit: November 25, 2012, 07:54:20 AM by frktons »

Magnum

  • Member
  • *****
  • Posts: 2354
Re: MASM FOR FUN is back?
« Reply #1 on: November 25, 2012, 03:05:34 AM »
Great to see you back.

Take care,
                   Andy

Ubuntu-mate-18.04-desktop-amd64

http://www.goodnewsnetwork.org

frktons

  • Member
  • ***
  • Posts: 491
Re: MASM FOR FUN is back?
« Reply #2 on: November 25, 2012, 04:06:39 AM »
Great to see you back.

Thanks, it's good to be back.

With the old instructions I get:

Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE3)
28098   cycles for MOV DL

28009   cycles for MOV DL


--- ok ---

Attached the source and exe.

Frank

Magnum

  • Member
  • *****
  • Posts: 2354
Re: MASM FOR FUN is back?
« Reply #3 on: November 25, 2012, 04:18:08 AM »
Intel(R) Core(TM)2 Duo CPU     P8600  @ 2.40GHz (SSE4)
8127    cycles for MOV DL

8372    cycles for MOV DL

Does having an extra core really make that much difference ?

Andy

frktons

  • Member
  • ***
  • Posts: 491
Re: MASM FOR FUN is back?
« Reply #4 on: November 25, 2012, 04:29:02 AM »
Well, the processor you are using is about 5 years younger, so
it runs faster than the old one I'm using.
RAM clock, data bus, cache memory... many things contribute to the speed-up, indeed.

Frank

frktons

  • Member
  • ***
  • Posts: 491
Re: MASM FOR FUN is back?
« Reply #5 on: November 25, 2012, 04:56:19 AM »
I'm not sure whether a single SIMD opcode can accomplish the task, but
a short sequence of them probably could.

If I load four xmm registers with dwords read from the source buffer:

Code: [Select]

    movd      xmm0,  [ebx]
    movd      xmm1,  [ebx + 4]   
    movd      xmm2,  [ebx + 8] 
    movd      xmm3,  [ebx + 12] 
   


interleaving the bytes of the xmm registers and leaving out the bytes not
used should be possible.
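
For reference, one SSE2-only way to "leave out the bytes not used" is to mask and pack rather than interleave. What follows is only a sketch in C intrinsics, not code from this thread: the name low_bytes_sse2 is made up for illustration, and each intrinsic corresponds to the instruction named in its comment (pand, packssdw, packuswb). It assumes the dword count is a multiple of 16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Hypothetical helper: copy the low byte of each of n dwords to dst.
   n must be a multiple of 16 (16 dwords in, 16 bytes out per pass). */
static void low_bytes_sse2(const uint32_t *src, uint8_t *dst, size_t n)
{
    assert(n % 16 == 0);
#if defined(__SSE2__)
    const __m128i mask = _mm_set1_epi32(0xFF);   /* 000000FFh in every dword */
    for (size_t i = 0; i < n; i += 16) {
        /* movdqu + pand: keep only the low byte of each dword */
        __m128i a = _mm_and_si128(_mm_loadu_si128((const __m128i *)(src + i)),      mask);
        __m128i b = _mm_and_si128(_mm_loadu_si128((const __m128i *)(src + i + 4)),  mask);
        __m128i c = _mm_and_si128(_mm_loadu_si128((const __m128i *)(src + i + 8)),  mask);
        __m128i d = _mm_and_si128(_mm_loadu_si128((const __m128i *)(src + i + 12)), mask);
        /* packssdw: 4+4 dwords -> 8 words (values are 0..255, so nothing saturates) */
        __m128i ab = _mm_packs_epi32(a, b);
        __m128i cd = _mm_packs_epi32(c, d);
        /* packuswb: 8+8 words -> 16 bytes, then one 16-byte store */
        _mm_storeu_si128((__m128i *)(dst + i), _mm_packus_epi16(ab, cd));
    }
#else
    /* Portable fallback so the sketch still builds without SSE2 */
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)src[i];
#endif
}
```

For the 4096-dword buffer this would be called as low_bytes_sse2(Source, Dest, 4096). Whether it beats the other approaches on a given CPU would have to be measured.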

While I wait for some suggestions, I carry on thinking and reading  :icon_rolleyes:

Frank
« Last Edit: November 25, 2012, 06:08:21 AM by frktons »

CommonTater

  • Guest
Re: MASM FOR FUN is back?
« Reply #6 on: November 25, 2012, 06:20:01 AM »
Hi Andy...
Multicore performs better under a multitasking OS like Windows because processes can be assigned to different cores, permitting much better multitasking than on a single-core machine.

However... inside your application, 1 core or 512 cores won't make much difference unless you write multithreaded code whose threads execute simultaneously, so that some of your tasks can be spread across different cores.  In that case, simply running two simultaneous threads can (often) significantly increase the speed of your code.

frktons

  • Member
  • ***
  • Posts: 491
Re: MASM FOR FUN is back?
« Reply #7 on: November 25, 2012, 06:57:46 AM »
Maybe I've found a solution. I'll try it and test its performance.
If it works, it is a first step towards translating the code the SSE way.

Code: [Select]
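mov eax, offset Dest
mov ebx, offset Source
mov ecx, 4096 / 4               ; assumed loop count: 4 dwords per pass
@@:
movd mm0, dword ptr [ebx]       ; mm0 = dword 0 (a)
movd mm1, dword ptr [ebx + 4]   ; mm1 = dword 1 (b)
movd mm2, dword ptr [ebx + 8]   ; mm2 = dword 2 (c)
movd mm3, dword ptr [ebx + 12]  ; mm3 = dword 3 (d)

punpcklbw mm0, mm2              ; mm0 = a0 c0 a1 c1 ...
punpcklbw mm1, mm3              ; mm1 = b0 d0 b1 d1 ...
punpcklbw mm0, mm1              ; mm0 = a0 b0 c0 d0 ...

movd dword ptr [eax], mm0       ; store the four low bytes
add ebx, 16                     ; loop tail below is assumed, not in the original post
add eax, 4
dec ecx
jnz @B
emms                            ; clear MMX state before any FPU code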
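
To see why three punpcklbw instructions are enough, it may help to trace the bytes. The following is a scalar model in C, not real MMX code: punpcklbw, movd_load and low_bytes_mmx_model are illustrative names that mimic the instructions in the same order as the post.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of MMX "punpcklbw dst, src": interleave the low four
   bytes of dst and src, dst byte first, producing eight bytes. */
static void punpcklbw(uint8_t dst[8], const uint8_t src[8])
{
    uint8_t r[8];
    for (int i = 0; i < 4; i++) {
        r[2 * i]     = dst[i];
        r[2 * i + 1] = src[i];
    }
    memcpy(dst, r, 8);
}

/* Model of "movd mm, m32": low four bytes from the dword
   (little-endian order), upper four bytes zeroed. */
static void movd_load(uint8_t mm[8], uint32_t v)
{
    memset(mm, 0, 8);
    for (int i = 0; i < 4; i++)
        mm[i] = (uint8_t)(v >> (8 * i));
}

/* Hypothetical driver: four dwords in, their four low bytes out. */
static void low_bytes_mmx_model(const uint32_t src[4], uint8_t out[4])
{
    uint8_t mm0[8], mm1[8], mm2[8], mm3[8];
    assert(src && out);
    movd_load(mm0, src[0]);   /* a */
    movd_load(mm1, src[1]);   /* b */
    movd_load(mm2, src[2]);   /* c */
    movd_load(mm3, src[3]);   /* d */
    punpcklbw(mm0, mm2);      /* a0 c0 a1 c1 a2 c2 a3 c3 */
    punpcklbw(mm1, mm3);      /* b0 d0 b1 d1 b2 d2 b3 d3 */
    punpcklbw(mm0, mm1);      /* a0 b0 c0 d0 a1 b1 c1 d1 */
    memcpy(out, mm0, 4);      /* movd [eax], mm0 stores the low dword */
}
```

Pairing dword 0 with dword 2 (and 1 with 3) in the first two steps is what puts the low bytes back in source order after the final interleave.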



frktons

  • Member
  • ***
  • Posts: 491
Re: MASM FOR FUN is back?
« Reply #8 on: November 25, 2012, 07:16:57 AM »
Well, it apparently works with some speed improvement:

Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE3)
---------------------------------------------------
28315   cycles for MOV DL

16479   cycles for MOVD DWORD PTR

28394   cycles for MOV DL

16637   cycles for MOVD DWORD PTR


--- ok ---


I'll see if there is a better method, with fewer SSE opcodes, to get
better performance.

Frank

jj2007

  • Member
  • *****
  • Posts: 10544
  • Assembler is fun ;-)
    • MasmBasic
Re: MASM FOR FUN is back?
« Reply #9 on: November 25, 2012, 07:50:00 AM »
Ciao Frank,
Welcome to the forum, and thanks for the message in the other thread ;-)

What you are looking for is pshufb. There is a testpiece by Hutch here.

You need to add this:
    include \masm32\include\masm32rt.inc
    .686p
    .xmm


... and you need a modern CPU ;-)
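
As a rough illustration of what pshufb would do here, below is a scalar model in C rather than real SSSE3 code. The names pshufb_model, LOW_BYTE_CTL and low_bytes_pshufb_model are invented for this sketch, and the control mask is an assumption about how one might set it up: it gathers bytes 0, 4, 8 and 12 into the low dword and zeroes everything else.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of SSSE3 "pshufb xmm, xmm/m128": each control byte either
   selects a source byte (low 4 bits) or, if bit 7 is set, writes zero. */
static void pshufb_model(uint8_t reg[16], const uint8_t ctl[16])
{
    uint8_t r[16];
    for (int i = 0; i < 16; i++)
        r[i] = (ctl[i] & 0x80) ? 0 : reg[ctl[i] & 0x0F];
    memcpy(reg, r, 16);
}

/* Assumed control mask: low byte of each of the four dwords goes to
   bytes 0..3; the remaining twelve result bytes are zeroed. */
static const uint8_t LOW_BYTE_CTL[16] = {
    0, 4, 8, 12,
    0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80
};

/* Hypothetical driver: 16 source bytes (four dwords) in, four bytes out. */
static void low_bytes_pshufb_model(const uint8_t src[16], uint8_t out[4])
{
    uint8_t xmm0[16];
    assert(src && out);
    memcpy(xmm0, src, 16);              /* movdqu xmm0, [src] */
    pshufb_model(xmm0, LOW_BYTE_CTL);   /* pshufb xmm0, ctl   */
    memcpy(out, xmm0, 4);               /* movd   [dst], xmm0 */
}
```

One load, one shuffle and one store handle four dwords, which is why pshufb looks attractive for this job on CPUs that support it.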

frktons

  • Member
  • ***
  • Posts: 491
Re: MASM FOR FUN - REBORN
« Reply #10 on: November 25, 2012, 08:04:02 AM »
Ciao jj.

I have to switch to my second PC in order to use SSSE3 opcodes; I'll do it as soon
as I'm ready.
What do you think about the solution I used? It is MMX but not very fast, at
least on my P IV dual core 3.2 GHz.

Frank

jj2007

  • Member
  • *****
  • Posts: 10544
  • Assembler is fun ;-)
    • MasmBasic
Re: MASM FOR FUN - REBORN
« Reply #11 on: November 25, 2012, 09:44:27 AM »
What do you think about the solution I used?

Not fast indeed. PSHUFB must be much better, but I can't test it here...

See http://www.rz.uni-karlsruhe.de/rz/docs/VTune/reference/About_IA-32_Instructions.htm#P-Instructions for some useful info.

Did you try the straightforward solution?

include \masm32\include\masm32rt.inc
.data
src   db "xxxCxxxIxxxAxxxO", 0
dest   db 20 dup(?)
.code
start:
   mov esi, offset src+3
   mov edi, offset dest
   REPEAT 4
      lodsd
      stosb
   ENDM
   inkey offset dest
   exit
end start

frktons

  • Member
  • ***
  • Posts: 491
Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
« Reply #12 on: November 25, 2012, 11:52:26 AM »
I've started some tests on PSHUFB.
In the meantime, here is your proposal timed alongside
the previous ones. The MMX code still leads.

Code: [Select]
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
---------------------------------------------------------
12412   cycles for MOV DL
6199    cycles for MMX/MOVD DWORD PTR
21114   cycles for STOSB
---------------------------------------------------------
12401   cycles for MOV DL
6235    cycles for MMX/MOVD DWORD PTR
21137   cycles for STOSB
---------------------------------------------------------

--- ok ---

Frank

dedndave

  • Member
  • *****
  • Posts: 8827
  • Still using Abacus 2.0
    • DednDave
Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
« Reply #13 on: November 25, 2012, 11:56:38 AM »
hiyas Frank - good to see you   :t

this might have fewer dependencies...

Code: [Select]
@@:     mov     edx,[ebx]
        add     ebx,4
        mov     [eax],dl
        dec     ecx
        lea     eax,[eax+1]
        jnz     @B

frktons

  • Member
  • ***
  • Posts: 491
Re: MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords
« Reply #14 on: November 25, 2012, 12:22:09 PM »
Hi Dave. Nice to see you too.
The sequence you used makes me wonder about JNZ:

Code: [Select]
        dec     ecx
        lea     eax,[eax+1]
        jnz     @B

Does the jnz refer to ECX
or to EAX?
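
On the flag question: DEC updates the zero flag, while LEA does not modify any flags, so the JNZ still sees the result of DEC ECX. The toy model below, in C, only illustrates that ordering. The Cpu struct and the helper names are made up for the sketch and model just the two registers and the one flag involved.

```c
#include <assert.h>
#include <stdint.h>

/* Minimal made-up CPU state: two registers and the zero flag. */
typedef struct {
    uint32_t eax;
    uint32_t ecx;
    int zf;
} Cpu;

/* dec ecx: updates ZF according to the result */
static void dec_ecx(Cpu *c)
{
    c->ecx -= 1;
    c->zf = (c->ecx == 0);
}

/* lea eax, [eax+1]: pure address arithmetic, flags left untouched */
static void lea_eax_plus_1(Cpu *c)
{
    c->eax += 1;
}

/* Runs the dec / lea / jnz tail and returns the number of passes. */
static uint32_t run_loop(uint32_t count)
{
    Cpu c = { 0, count, 0 };
    uint32_t passes = 0;
    assert(count != 0);
    do {
        passes++;
        dec_ecx(&c);          /* sets ZF when ECX reaches zero      */
        lea_eax_plus_1(&c);   /* bumps EAX without disturbing ZF    */
    } while (!c.zf);          /* jnz @B: branches on DEC's result   */
    assert(c.eax == count);   /* EAX advanced once per pass         */
    return passes;
}
```

With ECX starting at 4096 the model makes 4096 passes, which is what the loop needs to cover the whole buffer.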


By the way, I inserted your code into the program, but I get strange results:


Code: [Select]

invoke Sleep, 100
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS

mov eax, offset Dest
mov ecx, 4096
mov ebx, offset Source

    @@:     mov     edx,[ebx]
            add     ebx,4
            mov     [eax],dl
            dec     ecx           
            lea     eax,[eax+1]           

            jnz     @B
           
            print str$(eax), 9, "cycles for DAVE ", 13, 10, 13, 10


and:

Code: [Select]
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
---------------------------------------------------------
12396   cycles for MOV DL
6193    cycles for MMX/MOVD DWORD PTR
21098   cycles for STOSB
20677248        cycles for DAVE

---------------------------------------------------------
12394   cycles for MOV DL
6207    cycles for MMX/MOVD DWORD PTR
21093   cycles for STOSB
20677248        cycles for DAVE

---------------------------------------------------------

--- ok ---

Did I misspell something, or what?