MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords

frktons · November 25, 2012, 02:48:06 AM

Hi everybody.

It has been a long time since the last MASM FOR FUN adventure.

Time to be back?

I hope so, at least for a while.

Let's start with a question.

I have a buffer of 4096 consecutive dwords, and I'd like to extract the low byte of each dword
and put it in a second buffer.

The operation in itself is not that difficult, but I'd like to do it with SIMD instructions, just to
have a little fun and discover some SSE2/SSE3 instructions.
With SIMD instructions I can work with 8/16 bytes at a time, speeding up the process as well.

I had a look at the Intel manuals, but [as n00bs do] I didn't find any suitable SIMD opcode.

Has anybody gone through this problem and found an SSE2/SSE3 solution?

In pseudocode:

xmm0 = XFDC GFTI DEWA HYTO
pckwhat? eax, xmm0,  n

And we have CIAO in eax.

CIAO
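
Editor's sketch: one SSE2-only way to get this effect is to mask each dword down to its low byte with PAND and then narrow twice, first with PACKSSDW and then with PACKUSWB. This is not a method taken from the thread. The labels ByteMask, Source and Dest, the constant NDWORDS and the test data are made up for the example, and an assembler that accepts SSE2 mnemonics is assumed.

```asm
include \masm32\include\masm32rt.inc
.686p
.xmm

NDWORDS equ 16                      ; hypothetical size; the thread's buffer has 4096

.data
ByteMask dd 4 dup(000000FFh)        ; keeps only the low byte of each dword
; hypothetical test data: 16 dwords, the low byte is the first byte in memory
Source   db "CxxxIxxxAxxxOxxxCxxxIxxxAxxxOxxx"
         db "CxxxIxxxAxxxOxxxCxxxIxxxAxxxOxxx"
Dest     db NDWORDS dup(0), 0       ; one output byte per dword, plus a terminator

.code
start:
    mov     esi, offset Source
    mov     edi, offset Dest
    mov     ecx, NDWORDS/16         ; 16 dwords in, 16 bytes out per pass
    mov     eax, offset ByteMask
    movdqu  xmm7, [eax]             ; xmm7 = 000000FFh in all four dwords
@@: movdqu  xmm0, [esi]             ; unaligned loads keep the sketch simple;
    movdqu  xmm1, [esi+16]          ; with 16-byte aligned buffers MOVDQA
    movdqu  xmm2, [esi+32]          ; would be the usual choice
    movdqu  xmm3, [esi+48]
    pand    xmm0, xmm7              ; every dword is now in the range 0..255
    pand    xmm1, xmm7
    pand    xmm2, xmm7
    pand    xmm3, xmm7
    packssdw xmm0, xmm1             ; 8 words; values 0..255 are never saturated
    packssdw xmm2, xmm3
    packuswb xmm0, xmm2             ; 16 bytes, still in source order
    movdqu  [edi], xmm0
    add     esi, 64
    add     edi, 16
    dec     ecx
    jnz     @B
    print   offset Dest, 13, 10     ; expected text: CIAOCIAOCIAOCIAO
    exit
end start
```

Each pass reads 64 bytes and writes 16, so a 4096-dword buffer would take 256 passes. Whether this beats the scalar loop on a given CPU would have to be measured.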

Magnum · November 25, 2012, 03:05:34 AM

Great to see you back.

frktons · November 25, 2012, 04:06:39 AM

Quote from: Magnum on November 25, 2012, 03:05:34 AM
Great to see you back.

Thanks, it's good to be back.

With the old instructions I get:

Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE3)
28098   cycles for MOV DL

28009   cycles for MOV DL


--- ok ---

Attached the source and exe.

Frank

Magnum · November 25, 2012, 04:18:08 AM

Intel(R) Core(TM)2 Duo CPU P8600 @ 2.40GHz (SSE4)
8127 cycles for MOV DL

8372 cycles for MOV DL

Does having an extra core really make that much difference ?

Andy

frktons · November 25, 2012, 04:29:02 AM

Quote from: Magnum on November 25, 2012, 04:18:08 AM
Intel(R) Core(TM)2 Duo CPU P8600 @ 2.40GHz (SSE4)
8127 cycles for MOV DL

8372 cycles for MOV DL

Does having an extra core really make that much difference ?

Andy

Well, the processor you are using is about 5 years younger, which
means it runs faster than the old one I'm using.
RAM clock, data bus, cache memory... there are many things that speed it up, indeed.

Frank

frktons · November 25, 2012, 04:56:19 AM

I'm not sure if a single SIMD OPCODE is able to accomplish the task, but
probably more than one could do it:

If I have four xmm registers, each holding a dword read from the dword buffer

    movd      xmm0,  [ebx]
    movd      xmm1,  [ebx + 4]    
    movd      xmm2,  [ebx + 8]  
    movd      xmm3,  [ebx + 12]

interleaving the bytes of the xmm registers and leaving out the bytes not
used should be possible.

While I wait for some suggestions, I carry on thinking and reading :icon_rolleyes:

Frank

CommonTater · November 25, 2012, 06:20:01 AM

Quote from: Magnum on November 25, 2012, 04:18:08 AM
Intel(R) Core(TM)2 Duo CPU P8600 @ 2.40GHz (SSE4)
8127 cycles for MOV DL

8372 cycles for MOV DL

Does having an extra core really make that much difference ?

Andy

Hi Andy...
Multicore will perform better on an OS like Windows that is multi-tasking as some processes are assigned to each core, permitting much better multitasking than on single core machines.

However... inside your application, 1 core or 512 cores won't make much difference unless you are writing simultaneously executing multithreaded code so that some of your tasks can be spread out across different cores. In that case simply writing two simultaneous threads can (often) significantly increase the speed of your code.

frktons · November 25, 2012, 06:57:46 AM

Maybe I've found a solution. I'll try it and test its performance.
If it works, it is a first step toward translating the code the SSE way.

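        mov     eax, offset Dest
        mov     ebx, offset Source
        mov     ecx, 4096/4                 ; assumed count: 4 dwords per pass
@@:
        movd    mm0, dword ptr [ebx]        ; dword 0
        movd    mm1, dword ptr [ebx + 4]    ; dword 1
        movd    mm2, dword ptr [ebx + 8]    ; dword 2
        movd    mm3, dword ptr [ebx + 12]   ; dword 3

        punpcklbw mm0, mm2                  ; bytes of dwords 0 and 2, interleaved
        punpcklbw mm1, mm3                  ; bytes of dwords 1 and 3, interleaved
        punpcklbw mm0, mm1                  ; low dword = low bytes of dwords 0,1,2,3

        movd    dword ptr [eax], mm0        ; store the 4 extracted bytes

        ; loop tail not shown in the post; assumed completion:
        add     ebx, 16
        add     eax, 4
        dec     ecx
        jnz     @B
        emms                                ; clear the MMX state before any FPU code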

frktons · November 25, 2012, 07:16:57 AM

Well, it apparently works with some speed improvement:

Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE3)
---------------------------------------------------
28315   cycles for MOV DL

16479   cycles for MOVD DWORD PTR

28394   cycles for MOV DL

16637   cycles for MOVD DWORD PTR


--- ok ---

I'll see if there is a better method, with fewer SSE opcodes, to get
better performance.

Frank

jj2007 · November 25, 2012, 07:50:00 AM

Ciao Frank,
Welcome to the Forum, and thanks for the message in the other thread ;-)

What you are looking for is pshufb. There is a testpiece by Hutch here.

You need to add this:
include \masm32\include\masm32rt.inc
.686p
.xmm

... and you need a modern CPU ;-)
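
Editor's sketch of the PSHUFB idea, handling four dwords at a time: a 16-byte control mask picks source bytes 0, 4, 8 and 12 into the low dword of the register and zeroes the rest. The labels ShufMask, Source and Dest and the test data are made up for the example. It assumes a CPU with SSSE3 and an assembler that recognizes the PSHUFB mnemonic, which older ML versions do not.

```asm
include \masm32\include\masm32rt.inc
.686p
.xmm

.data
; shuffle control: result bytes 0..3 take source bytes 0, 4, 8, 12;
; a control byte with bit 7 set (80h) writes a zero byte
ShufMask db 0, 4, 8, 12, 12 dup(80h)
Source   db "CxxxIxxxAxxxOxxx"      ; hypothetical test data: 4 dwords
Dest     db 5 dup(0)                ; 4 result bytes plus a terminator

.code
start:
    mov     eax, offset ShufMask
    movdqu  xmm7, [eax]             ; xmm7 = shuffle control
    mov     esi, offset Source
    movdqu  xmm0, [esi]             ; xmm0 = 4 source dwords
    pshufb  xmm0, xmm7              ; low dword = the 4 low-order bytes
    movd    dword ptr [Dest], xmm0  ; store only those 4 bytes
    print   offset Dest, 13, 10     ; expected text: CIAO
    exit
end start
```

One PSHUFB replaces the three PUNPCKLBW steps, and a full 16-byte load replaces the four MOVD loads. How much that gains in practice would need timing on an SSSE3 machine.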

frktons · November 25, 2012, 08:04:02 AM

Quote from: jj2007 on November 25, 2012, 07:50:00 AM
Ciao Frank,
Welcome to the Forum, and thanks for the message in the other thread ;-)

What you are looking for is pshufb. There is a testpiece by Hutch here.

You need to add this:
include \masm32\include\masm32rt.inc
.686p
.xmm

... and you need a modern CPU ;-)

Ciao jj.

I have to switch to my second PC in order to use SSSE3 opcodes; I'll do it as soon
as I'm ready.
What do you think about the solution I used? It is MMX but not very fast, at
least on my P IV dual core 3.2 GHz.

Frank

jj2007 · November 25, 2012, 09:44:27 AM

Quote from: frktons on November 25, 2012, 08:04:02 AM
What do you think about the solution I used?

Not fast indeed. PSHUFB must be much better, but I can't test it here...

See http://www.rz.uni-karlsruhe.de/rz/docs/VTune/reference/About_IA-32_Instructions.htm#P-Instructions for some useful info.

Did you try the straightforward solution?

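include \masm32\include\masm32rt.inc
.data
src   db "xxxCxxxIxxxAxxxO", 0
dest   db 20 dup(?)
.code
start:
   mov esi, offset src+3      ; point at the wanted byte of the first group
   mov edi, offset dest
   REPEAT 4
      lodsd                   ; eax = dword at [esi], esi += 4
      stosb                   ; store al at [edi], edi += 1
   ENDM
   inkey offset dest          ; print the result and wait for a key
   exit
end start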

frktons · November 25, 2012, 11:52:26 AM

I've started some tests on PSHUFB.
In the meantime, here is your proposal timed alongside
the previous ones. The MMX code still leads.

Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
---------------------------------------------------------
12412   cycles for MOV DL
6199    cycles for MMX/MOVD DWORD PTR
21114   cycles for STOSB
---------------------------------------------------------
12401   cycles for MOV DL
6235    cycles for MMX/MOVD DWORD PTR
21137   cycles for STOSB
---------------------------------------------------------

--- ok ---

Frank

dedndave · November 25, 2012, 11:56:38 AM

hiyas Frank - good to see you :t

this might have fewer dependencies...

@@:     mov     edx,[ebx]
        add     ebx,4
        mov     [eax],dl
        dec     ecx
        lea     eax,[eax+1]
        jnz     @B

frktons · November 25, 2012, 12:22:09 PM

Quote from: dedndave on November 25, 2012, 11:56:38 AM
hiyas Frank - good to see you :t

this might have fewer dependencies...

@@:     mov     edx,[ebx]
        add     ebx,4
        mov     [eax],dl
        dec     ecx
        lea     eax,[eax+1]
        jnz     @B

Hi Dave. Nice to see you too.
The sequence you used makes me wonder about JNZ:

        dec     ecx
        lea     eax,[eax+1]
        jnz     @B

Does the JNZ refer to ECX
or to EAX?
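
Editor's note: LEA does not write any flags, so the JNZ still sees the zero flag left by DEC ECX. A small test program can show this. The register values and the label `done` are chosen only for the demonstration.

```asm
include \masm32\include\masm32rt.inc

.code
start:
    mov     ecx, 1
    mov     eax, 5
    dec     ecx                 ; ECX = 0, so DEC sets ZF = 1
    lea     eax, [eax+1]        ; EAX = 6 (nonzero); LEA leaves the flags alone
    jnz     @F                  ; not taken: ZF still reflects DEC ECX
    print   "JNZ followed ECX", 13, 10
    jmp     done
@@: print   "JNZ followed EAX", 13, 10
done:
    exit
end start
```

If LEA updated the flags, the nonzero result in EAX would clear ZF and the second message would appear. The program prints the first one.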

By the way, I inserted your code inside the pgm, but I get strange results:

		invoke Sleep, 100
		counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS

		mov eax, offset Dest
		mov ecx, 4096
		mov ebx, offset Source 

    @@:     mov     edx,[ebx]
            add     ebx,4
            mov     [eax],dl
            dec     ecx            
            lea     eax,[eax+1]            

            jnz     @B
            
            print str$(eax), 9, "cycles for DAVE ", 13, 10, 13, 10

and:

Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
---------------------------------------------------------
12396   cycles for MOV DL
6193    cycles for MMX/MOVD DWORD PTR
21098   cycles for STOSB
20677248        cycles for DAVE

---------------------------------------------------------
12394   cycles for MOV DL
6207    cycles for MMX/MOVD DWORD PTR
21093   cycles for STOSB
20677248        cycles for DAVE

---------------------------------------------------------

--- ok ---

Did I misspell something, or what?
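
Editor's note on a possible cause, offered as a guess from the snippet as posted: the loop is followed directly by `print str$(eax)`, and at that point EAX holds the advanced destination pointer, not a cycle count. That would be consistent with the value being identical in both runs. If the timing macros in use pair `counter_begin` with a `counter_end` that returns the count in EAX (an assumption, since the macro file is not shown in the post), that call would need to sit between the `jnz` and the `print`. The sketch below leaves the timing macros out and only shows what EAX contains after the loop. The buffer declarations are made up for the example.

```asm
include \masm32\include\masm32rt.inc

.data?
Source  dd 4096 dup(?)              ; hypothetical buffers, same sizes as in the thread
Dest    db 4096 dup(?)

.code
start:
    mov     eax, offset Dest
    mov     ecx, 4096
    mov     ebx, offset Source
@@: mov     edx, [ebx]
    add     ebx, 4
    mov     [eax], dl
    dec     ecx
    lea     eax, [eax+1]
    jnz     @B
    sub     eax, offset Dest        ; distance the destination pointer has moved
    print   str$(eax), 13, 10       ; prints 4096
    exit
end start
```

Without the `sub`, the printed number would be the address `offset Dest + 4096`, which depends only on where the buffer is placed and not on how long the loop took.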

The MASM Forum
