Fastest way to move 3 bytes into a dword

Started by frktons, January 21, 2013, 04:32:29 AM


frktons

Next challenge on the way to enlightenment:

What is the fastest way to move a 3-byte variable
into a dword variable/register?

And the reverse, of course.

Let's see what shows up this time  :P

I bet Dave will enjoy this one.  :lol:
There are only two days a year when you can't do anything: one is called yesterday, the other is called tomorrow, so today is the right day to love, believe, do and, above all, live.

Dalai Lama

dedndave

mov eax,dword ptr var3bytes
and eax,0FFFFFFh

:P
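
For readers who want to check the idea outside an assembler, here is a minimal C sketch of the same trick. The function names `load24_masked` and `load24_exact` are made up for this note, and a little-endian host (as on x86) is assumed. It also makes the one caveat visible: the 4-byte load touches one byte past the 3-byte variable, so that byte has to be readable.

```c
#include <stdint.h>
#include <string.h>

/* Model of: mov eax, dword ptr var3bytes / and eax, 0FFFFFFh
   Reads 4 bytes starting at p (so p[3] must be readable), then clears
   the top byte. Assumes a little-endian host, as on x86. */
uint32_t load24_masked(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);     /* the 4-byte load: one byte more than needed */
    return v & 0x00FFFFFFu;      /* and eax, 0FFFFFFh */
}

/* Alternative that touches exactly 3 bytes. It needs more instructions,
   but it is safe when the variable sits at the very end of a buffer. */
uint32_t load24_exact(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16);
}
```

With the bytes 01h, 02h, 03h followed by any fourth byte, both functions return 00030201h.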

frktons

Quote from: dedndave on January 21, 2013, 04:40:08 AM

Good idea Dave, I like it.  :t

The reverse operation is still missing. Dave, give us the light.

I'm very curious to see if more solutions come up, and what kind
of bit-imagination assembly programmers have developed  :lol:
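
One plain scalar answer to the reverse question, sketched in C. This is not a solution posted in the thread; `store24` is a hypothetical name, and the sequence in the comment is only one possible MASM equivalent. It writes exactly three bytes and leaves the byte after them alone.

```c
#include <stdint.h>

/* Writes the low 24 bits of v to dst[0..2] in little-endian order and
   leaves dst[3] untouched. Roughly what this sequence would do:
       mov word ptr [edi], ax
       shr eax, 16
       mov byte ptr [edi+2], al */
void store24(uint8_t *dst, uint32_t v)
{
    dst[0] = (uint8_t)(v);
    dst[1] = (uint8_t)(v >> 8);
    dst[2] = (uint8_t)(v >> 16);
}
```

If the destination has a spare byte after the last element, a single 4-byte store per element should also work when the elements are written in ascending address order, because each following store overwrites the byte the previous one spilled.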

Siekmanski

Quote from: Siekmanski on January 21, 2013, 07:30:41 AM

24-bit to 32-bit:


* code modified,


.data
align 16
Buffer32               db 32*1024 dup (0)
ByteAndMask32          db -1,-1,-1,0,-1,-1,-1,0,-1,-1,-1,0,-1,-1,-1,0
ByteMask24BitSSE3      db 0,1,2,0,3,4,5,0,6,7,8,0,9,10,11,0

Bytes24bit             db 1,2,3,4,5,6,7,8,9,10,11,12 ; etc.

.code

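lea eax,ByteMask24BitSSE3
movdqa xmm1,[eax]            ; xmm1 = shuffle mask (pshufb is an SSSE3 instruction)
lea eax,ByteAndMask32
movdqa xmm2,[eax]            ; xmm2 = AND mask that clears the 4th byte of each dword

lea esi,Bytes24bit
lea edi,Buffer32

; mov ecx,1024
align 16
conversion_loop:
movdqu xmm0,[esi]            ; load 16 bytes, only the first 12 are used
pshufb xmm0,xmm1             ; spread 4 x 3 bytes over 4 dwords
pand xmm0,xmm2               ; zero the top byte of each dword
movdqa [edi],xmm0            ; store 4 dwords
; movdqu xmm0,[esi+12]
; pshufb xmm0,xmm1
; pand xmm0,xmm2
; movdqa [edi+16],xmm0
; add esi,24
; add edi,32
; dec ecx
; jnz conversion_loop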



1,2,3,4,5,6,7,8,9,10,11,12
results in:
00030201h 00060504h 00090807h 000c0b0ah
Creative coders use backward thinking techniques as a strategy.
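
To see what the shuffle mask does without an SSSE3 machine, here is a small scalar model in C. The names `pshufb_model` and `expand24to32` are invented for this sketch. One deliberate difference from the posted code: the mask below uses 80h for every fourth byte. PSHUFB writes zero to any destination byte whose mask byte has bit 7 set, so with such a mask the separate `pand` step should not be needed. Treat that as a suggestion to test, not as something measured in this thread.

```c
#include <stdint.h>

/* Scalar model of PSHUFB dst, mask: output byte i is src[mask[i] & 0x0F],
   or 0 when bit 7 of mask[i] is set. */
void pshufb_model(uint8_t dst[16], const uint8_t src[16], const uint8_t mask[16])
{
    uint8_t tmp[16];                       /* allows dst and src to alias */
    for (int i = 0; i < 16; i++)
        tmp[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
    for (int i = 0; i < 16; i++)
        dst[i] = tmp[i];
}

/* Expands four packed 24-bit values (the first 12 bytes of a 16-byte
   window) into four little-endian dwords with a zero top byte. */
void expand24to32(uint8_t out[16], const uint8_t in[16])
{
    static const uint8_t mask[16] = {
        0, 1, 2, 0x80,   3, 4, 5, 0x80,   6, 7, 8, 0x80,   9, 10, 11, 0x80
    };
    pshufb_model(out, in, mask);
}
```

Feeding it the bytes 1..12 gives 1,2,3,0, 4,5,6,0, 7,8,9,0, 10,11,12,0, which read as dwords are 00030201h 00060504h 00090807h 000c0b0ah, the same result as posted.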

frktons

Quote from: Siekmanski on January 21, 2013, 07:43:45 AM



Very interesting Siekmanski, I want to test it to see the performance
of SSE code against traditional 32 bit code.   :t

Siekmanski

32-bit to 24-bit:

:biggrin:



.data
align 16
Bytes32bit             dd 00030201h,00060504h,00090807h,000c0b0ah ; etc.
ByteMask24BitSSE3_2    db 0,1,2,4,5,6,8,9,10,12,13,14,3,3,3,3
Buffer24        db 24*1024+4 dup (0)


.code

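lea eax,ByteMask24BitSSE3_2
movdqa xmm1,[eax]            ; xmm1 = shuffle mask (pshufb is an SSSE3 instruction)

lea esi,Bytes32bit
lea edi,Buffer24

; mov ecx,1024
align 16
conversion_loop2:
movdqa xmm0,[esi]            ; load 4 dwords
pshufb xmm0,xmm1             ; pack their low 3 bytes into bytes 0..11
movdqu [edi],xmm0            ; store 16 bytes: 12 packed bytes + 4 filler bytes
; movdqa xmm0,[esi+16]
; pshufb xmm0,xmm1
; movdqu [edi+12],xmm0       ; overwrites the 4 filler bytes of the previous store
; add esi,32
; add edi,24
; dec ecx
; jnz conversion_loop2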



00030201h 00060504h 00090807h 000c0b0ah
results in:
1,2,3,4,5,6,7,8,9,10,11,12
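
The packing direction can be modelled the same way. The C sketch below (the name `pack32to24` is made up here) applies the posted mask to each group of four dwords and then copies all 16 bytes, as `movdqu` does. That shows why the destination needs 4 spare bytes at the end, which is presumably the reason for the `+4` in `Buffer24`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Packs n little-endian dwords (n a multiple of 4) into 3*n bytes.
   Each group is written with a 16-byte copy, so dst must have room
   for 3*n + 4 bytes: the last store spills 4 filler bytes. */
void pack32to24(uint8_t *dst, const uint8_t *src, size_t n)
{
    /* Same index pattern as ByteMask24BitSSE3_2 in the post. */
    static const uint8_t mask[16] = {
        0, 1, 2,   4, 5, 6,   8, 9, 10,   12, 13, 14,   3, 3, 3, 3
    };
    for (size_t g = 0; g < n / 4; g++) {
        uint8_t block[16];
        for (int i = 0; i < 16; i++)
            block[i] = src[16 * g + mask[i]];   /* pshufb with the mask above */
        memcpy(dst + 12 * g, block, 16);        /* movdqu: 12 useful + 4 filler */
    }
}
```

The four filler bytes are copies of source byte 3 of the group, so they are zero whenever the top byte of the first dword is zero.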

frktons

Quote from: Siekmanski on January 21, 2013, 09:05:40 AM


Yes, this was the missing part. Very nice. I'll test them ASAP.   :t

Siekmanski

 :icon_redface: I made a stupid mistake:

dec ecx,1

must be: dec ecx (DEC takes a single operand)

Sources modified.   :biggrin:

Siekmanski

To get faster results, unroll the conversion loop 3 times so that the loop still fits in one 64-byte cache line.
The 24-bit to 32-bit routine is then 60 bytes long and fits.
(The 32-bit to 24-bit routine unrolled 3 times is 65 bytes, so it is 1 byte too big to fit.)


mov ecx,128
align 16
conversion_loop2:
movdqa xmm0,[esi]
pshufb xmm0,xmm1
movdqu [edi],xmm0
movdqa xmm0,[esi+16]
pshufb xmm0,xmm1
movdqu [edi+12],xmm0
movdqa xmm0,[esi+32]
pshufb xmm0,xmm1
movdqu [edi+24],xmm0
add esi,48
add edi,36
dec ecx
jnz conversion_loop2



frktons

Quote from: Siekmanski on January 21, 2013, 10:02:06 AM


I'm preparing the test program, but I'm not sure it'll be ready soon.
It is night and I'm almost sleeping.   :dazzled:

frktons

This is the structure of the test program:

;==============================================================================
; Test_mov3todw.asm
; ------------------------------------------------------------------------
; Test of instruction sequences that move 3-byte variables into dwords.
; The test reads a 48-byte string 3 bytes at a time into 16 dwords.
; ------------------------------------------------------------------------
; Frktons 20-jan-2013 @Masm32 Forum
;==============================================================================
include \masm32\include\masm32rt.inc
;==============================================================================
.nolist
.686
.xmm

include \masm32\macros\timers.asm
; get them from the Masm32 Laboratory:
; http://www.masm32.com/board/index.php?topic=770.0

AxCPUid_Print PROTO

LOOP_COUNT EQU 1000

include \masm32\include\MyLib.inc


;==============================================================================
.data

    align 16
    Area  DB "Here it is a string with 48 characters inside me",0
    AreaLn = ($ - Area - 1)
    align Four   
    AreaLen    dd 0
    Counter    dd 0
    PtrSource  dd Area
    PtrDest    dd ArrayDW
 

    align Four
    LineSep     db  72 dup("-"),0,0,0,0

    align Four
    PtrLineSep  dd  LineSep
 

.data?

    align 16
    ArrayDW  dd 16 DUP (?)
    align Four
    CPU_Count DD  ?             ; Number of Cycles elapsed   
   


   
.code
;==============================================================================
align Four
MovProc proc

    mov edx, 1000      ; Number of cycles to perform

align Four   
TotCycles:

    mov esi, PtrSource
    mov edi, PtrDest
    mov ecx, 16
align Four   
cycle:
    mov eax,   [esi]
    and eax,   00FFFFFFH
    mov [edi], eax
    add esi,   3
    add edi,   Four

    dec ecx
    jnz cycle

    dec edx
    jnz TotCycles


    ret

MovProc endp

;==============================================================================
align Four
DisplayArrayDW  proc

    mov  ecx, 0
    mov  edx, PtrDest

Display:

    pushad

    print DWORD PTR edx

    popad
   
    add   edx, Four
    inc   ecx
    cmp   ecx, 16
    jnz   Display

    ret

DisplayArrayDW  endp
;==============================================================================
align Four
Main proc



    invoke GetLocaleInfo,LOCALE_USER_DEFAULT,LOCALE_STHOUSAND,offset Tsep,Four
    invoke CharToOem,offset Tsep,offset Tsep

    CALL   FillMyArray

    CALL   FillMyArray0

    INVOKE ConsoleSize, 40, 100

    print PtrLineSep, 13, 10
   
    invoke AxCPUid_Print

    print PtrLineSep, 13, 10

    REPEAT Four

;---------------------------------------------------------------------------------

invoke Sleep, 100
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
       
      CALL MovProc

counter_end

      mov  edi, PtrFmtNum16
      lea  esi, InitString
      movdqa  xmm0, [esi]
      movdqa  [edi], xmm0

      INVOKE FormatNumDW, eax, PtrFmtNum16
     
print PtrFmtNum16, 9,  "cycles for Dave - MOV 4 bytes / AND", 13, 10

;---------------------------------------------------------------------------------

      print PtrLineSep, 13, 10
     
       
    ENDM 
   
;    CALL DisplayArrayDW
 

    ret

Main endp


;-------------------------------------------------------------

    include AxCPUid.inc

;-------------------------------------------------------------

;==============================================================================
start:
;==============================================================================

;==============================================================================
    call Main

    inkey
    exit
;==============================================================================
end start


If you are still awake, try to adapt your code for the task.
I've attached the files you need.

Frank

Siekmanski

Inserted my routines.

------------------------------------------------------------------------
Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
------------------------------------------------------------------------
       49.992   cycles for Dave - 48 bytes MOV 4 bytes / AND
------------------------------------------------------------------------
       48.133   cycles for Dave - 48 bytes MOV 4 bytes / AND
------------------------------------------------------------------------
       48.143   cycles for Dave - 48 bytes MOV 4 bytes / AND
------------------------------------------------------------------------
       48.144   cycles for Dave - 48 bytes MOV 4 bytes / AND
------------------------------------------------------------------------
       23.103   cycles for Siekmanski - 48 bytes SSSE3_24_32
------------------------------------------------------------------------
       23.152   cycles for Siekmanski - 48 bytes SSSE3_24_32
------------------------------------------------------------------------
       23.137   cycles for Siekmanski - 48 bytes SSSE3_24_32
------------------------------------------------------------------------
       23.136   cycles for Siekmanski - 48 bytes SSSE3_24_32
------------------------------------------------------------------------
       20.138   cycles for Siekmanski - 48 bytes SSSE3_24_32 unroled
------------------------------------------------------------------------
       20.124   cycles for Siekmanski - 48 bytes SSSE3_24_32 unroled
------------------------------------------------------------------------
       20.130   cycles for Siekmanski - 48 bytes SSSE3_24_32 unroled
------------------------------------------------------------------------
       20.137   cycles for Siekmanski - 48 bytes SSSE3_24_32 unroled
------------------------------------------------------------------------
       23.137   cycles for Siekmanski - 48 bytes SSSE3_32_24
------------------------------------------------------------------------
       23.126   cycles for Siekmanski - 48 bytes SSSE3_32_24
------------------------------------------------------------------------
       23.137   cycles for Siekmanski - 48 bytes SSSE3_32_24
------------------------------------------------------------------------
       23.137   cycles for Siekmanski - 48 bytes SSSE3_32_24
------------------------------------------------------------------------
       19.139   cycles for Siekmanski - 48 bytes SSSE3_32_24 unroled
------------------------------------------------------------------------
       19.130   cycles for Siekmanski - 48 bytes SSSE3_32_24 unroled
------------------------------------------------------------------------
       19.138   cycles for Siekmanski - 48 bytes SSSE3_32_24 unroled
------------------------------------------------------------------------
       19.138   cycles for Siekmanski - 48 bytes SSSE3_32_24 unroled
------------------------------------------------------------------------
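
The figures are easier to compare as ratios. The C sketch below is only arithmetic on numbers copied from this run. It assumes "48.143" means 48,143 cycles, since the test program fetches the locale thousands separator. It also assumes the timer reports cycles per call of the timed routine. The per-element figure is worked out only for the MOV/AND routine, because the posted `MovProc` is known to convert 1000 x 16 values per call. The function names are made up for this note.

```c
#include <stdio.h>

/* Figures copied from the Q6600 run in this post. Reading "48.143" as
   48,143 cycles (locale thousands separator) is an assumption. */
#define CYCLES_MOV_AND        48143.0   /* Dave, MOV 4 bytes / AND      */
#define CYCLES_SSSE3_24_32    23137.0   /* Siekmanski, SSSE3_24_32      */
#define CYCLES_SSSE3_UNROLLED 20130.0   /* Siekmanski, unrolled variant */

/* How many times faster the candidate is than the baseline. */
double speedup(double baseline_cycles, double candidate_cycles)
{
    return baseline_cycles / candidate_cycles;
}

/* Average cycles per converted value for a run of `elements` values. */
double cycles_per_element(double cycles, double elements)
{
    return cycles / elements;
}

/* Prints the ratios. MovProc converts 1000 x 16 values per timed call. */
void print_summary(void)
{
    printf("SSSE3 vs MOV/AND:          %.2fx\n",
           speedup(CYCLES_MOV_AND, CYCLES_SSSE3_24_32));
    printf("SSSE3 unrolled vs MOV/AND: %.2fx\n",
           speedup(CYCLES_MOV_AND, CYCLES_SSSE3_UNROLLED));
    printf("MOV/AND cycles per value:  %.2f\n",
           cycles_per_element(CYCLES_MOV_AND, 1000.0 * 16.0));
}
```

On these numbers the SSSE3 loop is a little over twice as fast as the scalar loop, and the scalar loop costs about 3 cycles per converted value.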

KeepingRealBusy

I looked at some of this code and noticed some instructions I had never seen before. Then it dawned on me: I don't have the manuals for my new quad-core A8 3520M CPU, only the ones for my old dual-core system.

Does anyone have a good link to AMD to get the CORRECT manuals for this CPU?

Dave.

sinsi

Try here: http://developer.amd.com/resources/documentation-articles/developer-guides-manuals/
Windows Eleven is better :biggrin:

dedndave

i have a prescott that supports SSE3, Marinus
crashes at PSHUFB XMM0,XMM1   :P

87,665 cycles for the first test, though
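
The crash is most likely an invalid-opcode fault: PSHUFB belongs to SSSE3 (Supplemental SSE3), which is a separate feature flag from SSE3, and a Prescott has only the latter. A program can check before using the instruction. The C sketch below decodes the two flags from the ECX value returned by CPUID with EAX = 1. Obtaining that value is left out on purpose because it is compiler-specific, and the helper names are made up for this note.

```c
#include <stdbool.h>
#include <stdint.h>

/* Feature bits in ECX after executing CPUID with EAX = 1. */
#define CPUID1_ECX_SSE3   (1u << 0)   /* SSE3                          */
#define CPUID1_ECX_SSSE3  (1u << 9)   /* SSSE3, required by PSHUFB     */

/* ecx is the ECX value returned by CPUID leaf 1. */
bool has_sse3(uint32_t ecx)
{
    return (ecx & CPUID1_ECX_SSE3) != 0;
}

bool has_ssse3(uint32_t ecx)
{
    return (ecx & CPUID1_ECX_SSSE3) != 0;
}
```

A CPU that reports bit 0 but not bit 9 should fall back to the MOV/AND loop instead of the PSHUFB routines.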