MASM FOR FUN - REBORN - #0 Extract low order bytes from dwords

Started by frktons, November 25, 2012, 02:48:06 AM

dedndave

nope - i just write bad code   :lol:

LEA does not affect the flags, so
        lea     eax,[eax+1]
adds one to EAX without altering the flags that were set by DEC ECX
the idea was to put something in between the instruction that sets the flags and the one that examines them
but LEA is not a great performer on older CPUs

still, it shouldn't be that slow - lol
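
A minimal sketch of the point about flags (the loop count and register values are made up for illustration): LEA can sit between DEC and JNZ because it leaves ZF alone, while ADD in the same slot would overwrite it.

```asm
        mov     ecx,3           ; illustrative loop count
        xor     eax,eax
@@:     dec     ecx             ; ZF = 1 only when ECX reaches 0
        lea     eax,[eax+1]     ; EAX = EAX + 1, flags unchanged
        jnz     @B              ; still sees the ZF produced by DEC
        ; falls through here with ECX = 0 and EAX = 3

        ; with ADD EAX,1 in place of the LEA, JNZ would test the flags
        ; from ADD instead, so the loop would no longer stop when ECX hits 0
```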

dedndave

i got a slight improvement on my p4 prescott

counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
        mov     eax,offset Dest
        mov     ecx,4096
        mov     ebx,offset Source

@@:     mov     edx,[ebx]       ; load one dword
        add     ebx,4
        mov     [eax],dl        ; store its low byte
        dec     ecx             ; sets ZF for the JNZ below
        lea     eax,[eax+1]     ; advance Dest without touching the flags
        jnz     @B

; previous version, for comparison:
; @@:     mov     edx,[ebx]
;         mov     byte ptr [eax],dl
;         add     eax,1
;         add     ebx,4
;         dec     ecx
;         jnz     @B
counter_end

sinsi


hutch--

Dave,

It was only the PIV that was a poor performer with LEA; the PIII and earlier, and Core2 onwards, are fine.

dedndave

i knew it was something like that, Hutch - lol

sinsi has the right idea, i think...
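counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
        mov     ebx,offset Source
        mov     eax,offset Dest
        mov     ecx,4096/4      ; 4 dwords per iteration

@@:     mov     dh,[ebx+12]     ; low byte of dword 3
        mov     dl,[ebx+8]      ; low byte of dword 2
        shl     edx,16          ; move both into the upper half of EDX
        mov     dh,[ebx+4]      ; low byte of dword 1
        mov     dl,[ebx]        ; low byte of dword 0
        add     ebx,16
        mov     [eax],edx       ; write 4 result bytes with one store
        dec     ecx
        lea     eax,[eax+4]     ; advance Dest without touching the flags
        jnz     @B

counter_end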


it would help if the destination array is 4-aligned - maybe the source, too
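
One way to get that alignment, as a sketch: the real declarations of Source and Dest are not shown in the thread, so the sizes below are assumptions inferred from the loop counts (4096 dwords in, 4096 bytes out).

```asm
.data?
align 4                         ; start Source on a dword boundary
Source  dd 4096 dup(?)          ; assumed: 4096 dwords of input
align 4                         ; start Dest on a dword boundary
Dest    db 4096 dup(?)          ; assumed: 4096 bytes of output
```

With both buffers aligned this way, neither the dword loads from Source nor the dword stores to Dest straddle a 4-byte boundary.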

nidud

deleted

jj2007

So on your puter Sinsi's solution is clearly the fastest. That's what I suspected ;-)

Not on mine, however:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
9968    cycles for MOV AX
9622    cycles for LEA
5181    cycles for MMX/MOVD DWORD PTR

frktons

On a newer machine there is no contest:


Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
---------------------------------------------------------
13874   cycles for MOV AX
13124   cycles for LEA
6193    cycles for MMX/MOVD DWORD PTR
18486   cycles for STOSB
---------------------------------------------------------
13856   cycles for MOV AX
13087   cycles for LEA
4129    cycles for MMX/MOVD DWORD PTR
18516   cycles for STOSB
---------------------------------------------------------

--- ok ---



The PSHUFB solution, which should win the race, comes later.

There are only two days a year when you can't do anything: one is called yesterday, the other is called tomorrow, so today is the right day to love, believe, do and, above all, live.

Dalai Lama

frktons

Here it is, the first quick shot with PSHUFB:


Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
---------------------------------------------------------
13866   cycles for MOV AX
13084   cycles for LEA
6205    cycles for MMX/MOVD DWORD PTR
4848    cycles for PSHUFB / I shot
18487   cycles for STOSB
---------------------------------------------------------
13852   cycles for MOV AX
13083   cycles for LEA
6194    cycles for MMX/MOVD DWORD PTR
4730    cycles for PSHUFB / I shot
18518   cycles for STOSB
---------------------------------------------------------

--- ok ---


later I'll try to improve it.
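
The code behind the PSHUFB timings is not posted in this part of the thread. As a rough idea of what such a loop can look like, here is a sketch (not frktons' routine): the names ShufMask, Source and Dest and the buffer size are assumptions, and it needs an assembler with SSSE3 support plus .686 and .xmm in effect.

```asm
.data
ShufMask db 0,4,8,12            ; pick the low byte of each of the 4 dwords
         db 12 dup(80h)         ; bit 7 set = write zero to that result byte

.code
        mov     eax,offset ShufMask
        movdqu  xmm7,[eax]      ; shuffle control, loaded once
        mov     esi,offset Source
        mov     edi,offset Dest
        mov     ecx,4096/4      ; 4 dwords per iteration

@@:     movdqu  xmm0,[esi]      ; load 4 dwords (16 bytes)
        add     esi,16
        pshufb  xmm0,xmm7       ; low bytes gathered into bytes 0..3
        movd    dword ptr [edi],xmm0 ; store those 4 bytes
        add     edi,4
        dec     ecx
        jnz     @B
```

The unaligned loads (MOVDQU) are used so the sketch does not depend on how the buffers are aligned.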

nidud

deleted

frktons

Some more tests:


Intel(R) Core(TM)2 CPU  6600  @ 2.40GHz (SSSE3)
---------------------------------------------------------
13927   cycles for MOV AX
13097   cycles for LEA
6203    cycles for MMX/PUNPCKLBW
4729    cycles for XMM/PSHUFB - I shot
3518    cycles for XMM/PSHUFB - II shot
15364   cycles for XMM/MASKMOVDQU - I shot
18506   cycles for STOSB
---------------------------------------------------------
13868   cycles for MOV AX
13096   cycles for LEA
6198    cycles for MMX/PUNPCKLBW
4732    cycles for XMM/PSHUFB - I shot
3520    cycles for XMM/PSHUFB - II shot
15360   cycles for XMM/MASKMOVDQU - I shot
18503   cycles for STOSB
---------------------------------------------------------

--- ok ---

jj2007

Ciao Frank,
Apart from being slow, maskmovdqu does not do what you want:

source     xxxCxxxIxxxAxxxO
wanted     CIAO
effective     C   I   A   O
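
To make the point concrete, a small sketch with made-up data (Src16, Mask16 and Dst16 are hypothetical names, and .686 and .xmm are assumed to be in effect): MASKMOVDQU stores each selected byte at its own offset from EDI, so the gaps stay.

```asm
.data
Src16   db "xxxCxxxIxxxAxxxO"
Mask16  db 0,0,0,80h, 0,0,0,80h, 0,0,0,80h, 0,0,0,80h  ; bit 7 set = store this byte
Dst16   db 16 dup(" ")

.code
        mov     eax,offset Src16
        movdqu  xmm0,[eax]      ; bytes to store
        mov     eax,offset Mask16
        movdqu  xmm1,[eax]      ; byte mask
        mov     edi,offset Dst16 ; destination is implicitly [EDI]
        maskmovdqu xmm0,xmm1
        ; Dst16 now holds "   C   I   A   O", not "CIAO"
```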

nidud

deleted

frktons

Quote from: jj2007 on November 26, 2012, 03:25:54 AM
Ciao Frank,
Apart from being slow, maskmovdqu does not do what you want:

source     xxxCxxxIxxxAxxxO
wanted     CIAO
effective     C   I   A   O


If the mask is correctly set, maskmovdqu should do the job:

F0h = byte moved, 00h = byte not moved, according to the Intel docs:


The most significant bit in each byte of the mask operand determines whether the
corresponding byte in the source operand is written to the corresponding byte location
in memory: 0 indicates no write and 1 indicates write.



At least my previous test showed it can do the job, but maybe I didn't try it enough. ::)

Edit: it probably only works well with consecutive bytes; as you said, it is not the instruction I need,
besides being slow.

frktons

Quote from: nidud on November 26, 2012, 04:34:28 AM

Intel(R) Core(TM) i3-2367M CPU @ 1.40GHz (SSE4)
---------------------------------------------------------
6334    cycles for MOV AX
5171    cycles for LEA
3123    cycles for MMX/MOVD DWORD PTR
2189    cycles for PSHUFB / I shot
10503   cycles for STOSB
---------------------------------------------------------
5243    cycles for MOV AX
5488    cycles for LEA
3150    cycles for MMX/MOVD DWORD PTR
2060    cycles for PSHUFB / I shot
9276    cycles for STOSB
---------------------------------------------------------


These SIMD instructions work a lot better with modern tech.
Try just the latest version on the i3.