Sorting strings

Started by RuiLoureiro, May 29, 2014, 06:15:48 AM

guga

Hi Rui.

OK, many thanks. Here is an updated version that is a bit faster (for large data, of course).




Proc memcpy_SSE_V2:
    Arguments @pDest, @pSource, @Length
    Uses esi, edi, ecx, edx, eax

    mov edi D@pDest
    mov esi D@pSource
    ; We copy the memory 128 bits (16 bytes) at a time
    mov ecx D@Length
    mov eax ecx | shr ecx 4 ; block count. Divide by 16 (4 dwords)
    jz L0> ; The memory size is smaller than 16 bytes. Jump over the block loop

        ; Now we must compute the remainder, to see how many times the byte loop will run
        mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only be 0 to 15 bytes
        mov edx 0 ; here it is used as an index
        L1:
            movupd XMM1 X$esi+edx*8 ; load 4 dwords (16 bytes) from esi into XMM1
            movupd X$edi+edx*8 XMM1 ; store the 4 dwords from XMM1 to edi
            dec ecx
            lea edx D$edx+2 ; next block (2*8 = 16 bytes). lea keeps the flags set by dec
            jnz L1<
        ; emms ; clear the registers back for use on the FPU <--- Removed thanks to JJ. Old CPU instruction, unneeded
        test eax eax | jz L4> ; No remainder? Exit
        jmp L9> ; jump to the remainder copy

L0:
   ; If we are here, the data is smaller than 16 bytes, and we need to compute the remainder.
   mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only be 0 to 15 bytes

L2:

    ; If the length is not a multiple of 16 bytes we have some remainder here, so just copy it.
    test eax eax | jz L4>  ; No remainder? Exit
L9:
        lea edi D$edi+edx*8 ; multiply edx by 8 to get the position
        ; mov eax eax ; fix potential stalls <--- Not needed. There is no stall.
        lea esi D$esi+edx*8 ; multiply edx by 8 to get the position

L3:  movsb | dec eax | jnz L3<

L4:

EndP
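
For readers who do not use RosAsm syntax, the block/remainder split of the routine above can be sketched in C. This is only a model of the structure, not the routine itself: the name copy16_sketch and the macro SKETCH_HAVE_SSE2 are made up for illustration, and the compiler is free to choose which unaligned move instruction it emits for the intrinsics. When SSE2 is not available the sketch falls back to memcpy for each 16-byte block.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Use SSE2 intrinsics when the compiler targets SSE2; otherwise fall back to memcpy. */
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

/* copy16_sketch is a hypothetical name; it models the structure of the routine only. */
void *copy16_sketch(void *dest, const void *src, size_t len)
{
    unsigned char *d = (unsigned char *)dest;
    const unsigned char *s = (const unsigned char *)src;
    size_t blocks = len >> 4;          /* like "shr ecx 4": whole 16-byte blocks */
    size_t rem = len - (blocks << 4);  /* like "shl edx 4 | sub eax edx": 0..15 bytes */
    size_t i;

    assert(len == 0 || (dest != NULL && src != NULL));

    for (i = 0; i < blocks; i++) {
#if SKETCH_HAVE_SSE2
        /* Unaligned 128-bit load followed by an unaligned 128-bit store. */
        __m128i v = _mm_loadu_si128((const __m128i *)(s + 16 * i));
        _mm_storeu_si128((__m128i *)(d + 16 * i), v);
#else
        memcpy(d + 16 * i, s + 16 * i, 16);
#endif
    }

    /* Tail: copy the remaining 0..15 bytes one at a time, like the movsb loop. */
    d += 16 * blocks;
    s += 16 * blocks;
    for (i = 0; i < rem; i++)
        d[i] = s[i];

    return dest;
}
```

Calling copy16_sketch(dst, src, n) for every n from 0 to a few dozen, with deliberately misaligned pointers, and comparing against the source with memcmp is a quick way to exercise both the block loop and the tail loop.
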
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

Gunther

Hi Rui,

Quote from: RuiLoureiro on July 04, 2014, 03:52:27 AM
    Gunther, could you run it, please ?
    Thanks

Yes, of course. The results are in the attachment.

Gunther
You have to know the facts before you can distort them.

RuiLoureiro

Thank you, Gunther  :t

Here SSE is far better in all cases.

----------------------------
Results from Gunther
----------------------------

Quote
INSERTING AT POSITION 0 -string length=100
--------------------------------------------------------------
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
--------------------------------------------------------------
***** Time table *****

   7 milliseconds, INSAtoB_SSEE-   2 bytes
   8 milliseconds, INSAtoB_SSEE-   4 bytes
   8 milliseconds, INSAtoB_SSEE-   1 bytes
   9 milliseconds, INSAtoB_SSEE-  15 bytes
  10 milliseconds, INSAtoB_SSEE-   3 bytes
  13 milliseconds, INSAtoB_SSEE-  55 bytes

  14 milliseconds, INSAtoB_XZZE-   4 bytes
  15 milliseconds, INSAtoB_SSEE- 103 bytes
  15 milliseconds, INSAtoB_XZZE-   2 bytes
  20 milliseconds, INSAtoB_XZZF-   4 bytes
  21 milliseconds, INSAtoB_XZZF-   1 bytes
  21 milliseconds, INSAtoB_SSEE- 203 bytes
  21 milliseconds, INSAtoB_XZZF-   2 bytes
  21 milliseconds, INSAtoB_XZZE-   1 bytes
  22 milliseconds, INSAtoB_XZZE-   3 bytes
  22 milliseconds, INSAtoB_XZZE-  55 bytes
  22 milliseconds, INSAtoB_XZZF-   3 bytes
  24 milliseconds, INSAtoB_XZZF-  15 bytes
  25 milliseconds, INSAtoB_XZZE-  15 bytes
  29 milliseconds, INSAtoB_XZZF-  55 bytes
  34 milliseconds, INSAtoB_SSEE- 503 bytes
  34 milliseconds, INSAtoB_XZZE- 103 bytes
  35 milliseconds, INSAtoB_XZZF- 103 bytes
  44 milliseconds, INSAtoB_SSEE-1027 bytes

  45 milliseconds, INSAtoB_XZZE- 203 bytes
  52 milliseconds, INSAtoB_XZZF- 203 bytes
  59 milliseconds, INSAtoB_X-      3 bytes
  65 milliseconds, INSAtoB_X-      2 bytes
  66 milliseconds, INSAtoB_X-      4 bytes
  67 milliseconds, INSAtoB_X-     15 bytes
  72 milliseconds, INSAtoB_BB-     1 bytes
  72 milliseconds, INSAtoB_X-      1 bytes
  74 milliseconds, INSAtoB_BB-     3 bytes
  83 milliseconds, INSAtoB_BB-    15 bytes
  86 milliseconds, INSAtoB_BB-     2 bytes
  88 milliseconds, INSAtoB_BB-     4 bytes
  90 milliseconds, INSAtoB_XZZE- 503 bytes
  91 milliseconds, INSAtoB_XZZF- 503 bytes
  98 milliseconds, INSAtoB_X-     55 bytes
 115 milliseconds, INSAtoB_X-    103 bytes
 121 milliseconds, INSAtoB_BA-     1 bytes
 123 milliseconds, INSAtoB_BA-     3 bytes
 123 milliseconds, INSAtoB_BA-     2 bytes
 125 milliseconds, INSAtoB_BA-     4 bytes
 138 milliseconds, INSAtoB_BB-    55 bytes
 139 milliseconds, INSAtoB_BA-    15 bytes
 146 milliseconds, INSAtoB_BB-   103 bytes
 151 milliseconds, INSAtoB_XZZE-1027 bytes
 159 milliseconds, INSAtoB_XZZF-1027 bytes
 174 milliseconds, INSAtoB_X-    203 bytes
 190 milliseconds, INSAtoB_BA-    55 bytes
 230 milliseconds, INSAtoB_BA-   103 bytes
 234 milliseconds, INSAtoB_BB-   203 bytes
 321 milliseconds, INSAtoB_X-    503 bytes
 343 milliseconds, INSAtoB_BA-   203 bytes
 404 milliseconds, INSAtoB_BB-   503 bytes
 598 milliseconds, INSAtoB_X-   1027 bytes
 642 milliseconds, INSAtoB_BA-   503 bytes
 765 milliseconds, INSAtoB_BB-  1027 bytes
1192 milliseconds, INSAtoB_BA-  1027 bytes
********** END III **********

RuiLoureiro

Hi
        What is «movupd»?

        Where can I see these instructions (a good reference)?

        I am not able to assemble this:
        movups      xmm1, [esi+edx*8]

        I GET: instruction or register not accepted in current CPU mode


qWord

Quote from: RuiLoureiro on July 04, 2014, 06:15:32 AM
        Where can I see these instructions (a good reference)?

        I am not able to assemble this:
Have you never heard of Intel's and AMD's developer manuals?

Quote from: RuiLoureiro on July 04, 2014, 06:15:32 AM
        movups      xmm1, [esi+edx*8]

        I GET: instruction or register not accepted in current CPU mode
MOVUPD is an SSE2 instruction, and SSE2 has been supported since MASM v6.15 (v6.14 supports only SSE). I would also recommend using MOVDQU instead, because you are not dealing with FP data.
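
To illustrate the point about MOVDQU versus MOVUPD: both move the same 128 bits without interpreting them, so the choice is about matching the instruction to the kind of data, not about the result. The C sketch below assumes an SSE2-capable compiler target (with a memcpy fallback otherwise); the helper names move16_integer, move16_double and moves_agree are invented for this example, and the compiler decides which move instruction each intrinsic actually becomes.

```c
#include <assert.h>
#include <string.h>

/* Use SSE2 intrinsics when the compiler targets SSE2; otherwise fall back to memcpy. */
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

/* Hypothetical helper: move 16 bytes through an integer vector (MOVDQU-style). */
void move16_integer(void *dst, const void *src)
{
#if SKETCH_HAVE_SSE2
    _mm_storeu_si128((__m128i *)dst, _mm_loadu_si128((const __m128i *)src));
#else
    memcpy(dst, src, 16);
#endif
}

/* Hypothetical helper: move 16 bytes through a packed-double vector (MOVUPD-style). */
void move16_double(void *dst, const void *src)
{
#if SKETCH_HAVE_SSE2
    _mm_storeu_pd((double *)dst, _mm_loadu_pd((const double *)src));
#else
    memcpy(dst, src, 16);
#endif
}

/* Returns 1 when both moves reproduce the 16 input bytes exactly, 0 otherwise. */
int moves_agree(const unsigned char bytes[16])
{
    unsigned char a[16], b[16];

    assert(bytes != NULL);
    move16_integer(a, bytes);
    move16_double(b, bytes);
    return memcmp(a, bytes, 16) == 0 && memcmp(b, bytes, 16) == 0;
}
```

moves_agree should return 1 for any 16-byte input, including byte patterns that would be NaNs if read as doubles, because neither move performs arithmetic on the data.
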
MREAL macros - when you need floating point arithmetic while assembling!

RuiLoureiro

qWord,
             «Have you never heard of Intel's and AMD's developer manuals?»
              No, no one knows them! :P
              I want to use movupd because guga used it.
              Thanks.
Dave,
              Thanks  :t

guga

Great idea, qWord. Thanks for the reminder :t.

It is always worth keeping compatibility with older CPUs... but... I don't recall an unaligned move instruction in SSE other than movups, which is meant for floating-point data. Does SSE1 have an instruction similar to movdqu (SSE2)?

Rui, you may use movdqu instead. It won't change the performance (if I recall Agner Fog's recommendations correctly).

Quote from: Agner Fog's Optimization Manual
"Using unaligned read instructions
The instructions MOVDQU, MOVUPS, MOVUPD and LDDQU are all able to read unaligned vectors.
LDDQU is faster than the alternatives on P4E and PM processors, but not on any later processors. The unaligned read instructions are relatively slow on older Intel processors
and on Intel Atom, but fast on Nehalem and later Intel processors as well as on AMD and VIA processors.
; Example 13.10. Unaligned vector read
; esi contains pointer to unaligned array
movdqu xmm0, [esi] ; Read vector unaligned
On contemporary processors, there is no penalty for using the unaligned instruction MOVDQU rather than the aligned MOVDQA if the data are in fact aligned. Therefore, it is convenient to
use MOVDQU if you are not sure whether the data are aligned or not."

Ref: http://www.agner.org/optimize/blog/read.php?i=285

So, you can try using the following code:



Proc memcpy_SSE_V3:
    Arguments @pDest, @pSource, @Length
    Uses esi, edi, ecx, edx, eax

    mov edi D@pDest
    mov esi D@pSource
    ; We copy the memory 128 bits (16 bytes) at a time
    mov ecx D@Length
    mov eax ecx | shr ecx 4 ; block count. Divide by 16 (4 dwords)
    jz L0> ; The memory size is smaller than 16 bytes. Jump over the block loop

        ; Now we must compute the remainder, to see how many times the byte loop will run
        mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only be 0 to 15 bytes
        mov edx 0 ; here it is used as an index
        L1:
            movdqu XMM1 X$esi+edx*8 ; load 4 dwords (16 bytes) from esi into XMM1
            movdqu X$edi+edx*8 XMM1 ; store the 4 dwords from XMM1 to edi
            dec ecx
            lea edx D$edx+2 ; next block (2*8 = 16 bytes). lea keeps the flags set by dec
            jnz L1<
        test eax eax | jz L4> ; No remainder? Exit
        jmp L9> ; jump to the remainder copy

L0:
   ; If we are here, the data is smaller than 16 bytes, and we need to compute the remainder.
   mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only be 0 to 15 bytes

L2:

    ; If the length is not a multiple of 16 bytes we have some remainder here, so just copy it.
    test eax eax | jz L4>  ; No remainder? Exit
L9:
        lea edi D$edi+edx*8 ; multiply edx by 8 to get the position
        lea esi D$esi+edx*8 ; multiply edx by 8 to get the position

L3:  movsb | dec eax | jnz L3<

L4:

EndP



guga

Another version, which uses the SSE1 instruction movups for the store, seems a bit faster here (about 3% faster), but movups is still a floating-point instruction (I couldn't find an alternative for movups yet :( )

Try this....


Proc memcpy_SSE_V4:
    Arguments @pDest, @pSource, @Length
    Uses esi, edi, ecx, edx, eax

    mov edi D@pDest
    mov esi D@pSource
    ; We copy the memory 128 bits (16 bytes) at a time
    mov ecx D@Length
    mov eax ecx | shr ecx 4 ; block count. Divide by 16 (4 dwords)
    jz L0> ; The memory size is smaller than 16 bytes. Jump over the block loop

        ; Now we must compute the remainder, to see how many times the byte loop will run
        mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only be 0 to 15 bytes
        mov edx 0 ; here it is used as an index
        L1:
            lddqu XMM1 X$esi+edx*8 ; load 4 dwords (16 bytes) from esi into XMM1. Note: lddqu is an SSE3 instruction
            movups X$edi+edx*8 XMM1 ; store the 4 dwords from XMM1 to edi
            dec ecx
            lea edx D$edx+2 ; next block (2*8 = 16 bytes). lea keeps the flags set by dec
            jnz L1<
        test eax eax | jz L4> ; No remainder? Exit
        jmp L9> ; jump to the remainder copy

L0:
   ; If we are here, the data is smaller than 16 bytes, and we need to compute the remainder.
   mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only be 0 to 15 bytes

L2:

    ; If the length is not a multiple of 16 bytes we have some remainder here, so just copy it.
    test eax eax | jz L4>  ; No remainder? Exit
L9:
        lea edi D$edi+edx*8 ; multiply edx by 8 to get the position
        lea esi D$esi+edx*8 ; multiply edx by 8 to get the position

L3:  movsb | dec eax | jnz L3<

L4:

EndP


RuiLoureiro

Hi guga,
               The problem is this:

               I am not able to assemble this instruction
               movups      xmm1, [esi+edx*8]               
               I GET: instruction or register not accepted in current CPU mode

guga

I don't remember the MASM syntax completely, but you can try these initializers:

http://masm32.com/board/index.php?topic=1217.0
and
http://www.masmforum.com/board/index.php?PHPSESSID=786dd40408172108b65a5a36b09c88c0&topic=15872.0


;###############################################################################################

        .XCREF
        .NoList
        INCLUDE    \Masm32\Include\Masm32rt.inc
        .686p
        .MMX
        .XMM
        INCLUDE    \Masm32\Macros\Timers.asm
        .List

;###############################################################################################


And, of course, use v6.15.

guga

OK, found it... It assembled correctly, as JJ suggested here:
http://masm32.com/board/index.php?topic=3369.0;topicseen


Rui, try to assemble this


include \masm32\include\masm32rt.inc
.686
.xmm

.code

start:   MsgBox 0, "Hello World", "Hi", MB_OK
   exit
   movups      xmm0, [esi]   ; never executed (it follows exit); it only checks that the assembler accepts SSE
end start


If you succeed, you should be able to assemble the timing app with the SSE instructions.

dedndave

Prescott w/HTT, XP SP3
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
------------------------------------------------------
44875      cycles - a   1..256  (  0) crt_memcpy
37179      cycles - a   1..256  ( 89) regcopy
76425      cycles - a   1..256  ( 48) memcpy SSE
74412      cycles - a   1..256  (171) memcpyxmmU SSE
73665      cycles - a   1..256  ( 91) memcpy_SSE_V2

53738      cycles - u   1..256  (  0) crt_memcpy
70189      cycles - u   1..256  ( 89) regcopy
110211     cycles - u   1..256  ( 48) memcpy SSE
97812      cycles - u   1..256  (171) memcpyxmmU SSE
80674      cycles - u   1..256  ( 91) memcpy_SSE_V2

904814     cycles - a 400..2000 (  0) crt_memcpy
1144614    cycles - a 400..2000 ( 89) regcopy
1813848    cycles - a 400..2000 ( 48) memcpy SSE
1651549    cycles - a 400..2000 (171) memcpyxmmU SSE
1723459    cycles - a 400..2000 ( 91) memcpy_SSE_V2

1887691    cycles - u 400..2000 (  0) crt_memcpy
4090627    cycles - u 400..2000 ( 89) regcopy
4085473    cycles - u 400..2000 ( 48) memcpy SSE
4061232    cycles - u 400..2000 (171) memcpyxmmU SSE
3912585    cycles - u 400..2000 ( 91) memcpy_SSE_V2

guga

Can someone please test this ZeroMem routine?




Proc ZeroMem_SSE:
    Arguments @pMem, @Length
    Uses esi, edi, ecx, edx, eax

    mov edi D@pMem
    ; We zero the memory 128 bits (16 bytes) at a time
    mov ecx D@Length
    mov eax ecx | shr ecx 4 ; block count. Divide by 16 (4 dwords)
    jz L0> ; The memory size is smaller than 16 bytes. Jump over the block loop

        pxor XMM1 XMM1 ; clear the XMM1 register
        ; Now we must compute the remainder, to see how many times the byte loop will run
        mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only be 0 to 15 bytes
        mov edx 0 ; here it is used as an index
        L1:
            movdqu X$edi+edx*8 XMM1 ; store 4 zero dwords (16 bytes) from XMM1 to edi
            dec ecx
            lea edx D$edx+2 ; next block (2*8 = 16 bytes). lea keeps the flags set by dec
            jnz L1<
        test eax eax | jz L4> ; No remainder? Exit
        jmp L9> ; jump to the remainder clearing

L0:
   ; If we are here, the data is smaller than 16 bytes, and we need to compute the remainder.
   mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only be 0 to 15 bytes

L2:

    ; If the length is not a multiple of 16 bytes we have some remainder here, so just clear it.
    test eax eax | jz L4>  ; No remainder? Exit
L9:
        lea edi D$edi+edx*8 ; multiply edx by 8 to get the position

L3:  mov B$edi+eax-1 0 | dec eax | jnz L3< ; clear the tail bytes from the last one backwards

L4:

EndP
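
For anyone who wants a reference to test against, the same zeroing scheme can be sketched in C. As before, this is only a model under assumptions: zero16_sketch and SKETCH_HAVE_SSE2 are made-up names, the intrinsics stand in for pxor and the unaligned store, and a memset fallback is used when the compiler does not target SSE2.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Use SSE2 intrinsics when the compiler targets SSE2; otherwise fall back to memset. */
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

/* zero16_sketch is a hypothetical name; it models the structure of the routine only. */
void *zero16_sketch(void *mem, size_t len)
{
    unsigned char *p = (unsigned char *)mem;
    size_t blocks = len >> 4;          /* whole 16-byte blocks */
    size_t rem = len - (blocks << 4);  /* 0..15 tail bytes */
    size_t i;
#if SKETCH_HAVE_SSE2
    const __m128i zero = _mm_setzero_si128();  /* plays the role of pxor XMM1 XMM1 */
#endif

    assert(len == 0 || mem != NULL);

    for (i = 0; i < blocks; i++) {
#if SKETCH_HAVE_SSE2
        _mm_storeu_si128((__m128i *)(p + 16 * i), zero);  /* unaligned 16-byte store */
#else
        memset(p + 16 * i, 0, 16);
#endif
    }

    /* Tail: clear the remaining bytes from the last one backwards, as the L3 loop does. */
    p += 16 * blocks;
    for (i = rem; i > 0; i--)
        p[i - 1] = 0;

    return mem;
}
```

A simple check is to fill a buffer with 0xFF, call zero16_sketch on an interior range of every length from 0 upwards, and verify that the range is zero while the bytes just before and just after it are untouched.
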