Sorting strings

guga · July 04, 2014, 04:00:40 AM

Hi Rui.

OK, many thanks. Here is an updated version that is a bit faster (for large data, of course).



Proc memcpy_SSE_V2:
    Arguments @pDest, @pSource, @Length
    Uses esi, edi, ecx, edx, eax

    mov edi D@pDest
    mov esi D@pSource
    ; we copy memory 128 bits (16 bytes) at a time
    mov ecx D@Length
    mov eax ecx | shr ecx 4 ; integer count. Divide by 16 (4 dwords)
    jz L0> ; The memory size is smaller than 16 bytes. Jump over

        ; Now we must compute the remainder, i.e. the bytes left over after the 16-byte blocks
        mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only be 0 to 15 bytes
        mov edx 0 ; here it is used as an index
        L1:
            movupd XMM1 X$esi+edx*8 ; copy 4 dwords (16 bytes) from esi+edx*8 to register XMM1
            movupd X$edi+edx*8 XMM1 ; copy 4 dwords (16 bytes) from register XMM1 to edi+edx*8
            dec ecx
            lea edx D$edx+2 ; advance the index by 2 (2*8 = 16 bytes). lea leaves the flags from dec ecx untouched
            jnz L1<
        ; emms ; clear the registers back to use on FPU <--- Removed thanks to JJ. Old CPU instruction, unneeded
        test eax eax | jz L4> ; No remainder? Exit
        jmp L9> ; jump to the remainder computation

L0:
   ; If we are here, it means that the data is smaller than 16 bytes, and we need to compute the remainder.
   ; ecx is 0 here, so edx becomes 0 and eax keeps the full length (0 to 15)
   mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only be 0 to 15 bytes

L2:

    ; If the length is not a multiple of 16 bytes (4 dwords), some bytes remain here, so copy them one by one.
    test eax eax | jz L4>  ; No remainder? Exit
L9:
        lea edi D$edi+edx*8 ; edx*8 = number of bytes already copied
        ; mov eax eax ; fix potential stallings <--- Not needed. There is no stall.
        lea esi D$esi+edx*8 ; edx*8 = number of bytes already copied

L3:  movsb | dec eax | jnz L3<

L4:

EndP
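
For readers who do not use RosAsm syntax, the routine above can be sketched in C. This is only an illustration, not guga's code: the name copy16_sketch is made up, the SSE2 intrinsics stand in for the unaligned 16-byte moves, and a plain memcpy fallback is assumed when the compiler does not target SSE2. It keeps the same split into 16-byte blocks plus a 0 to 15 byte tail, and the same index that grows by 2 and is scaled by 8.

```c
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics (assumed available on x86 targets) */
#endif

/* Hypothetical C counterpart of memcpy_SSE_V2: 16-byte blocks, then a byte tail. */
void copy16_sketch(void *dest, const void *src, size_t length)
{
    unsigned char *d = (unsigned char *)dest;
    const unsigned char *s = (const unsigned char *)src;
    size_t blocks = length >> 4;                /* shr ecx 4 */
    size_t remainder = length - (blocks << 4);  /* sub eax edx, always 0..15 */
    size_t index = 0;                           /* edx, scaled by 8 as in the asm */

    for (; blocks != 0; blocks--, index += 2) { /* index += 2 means 16 bytes */
#if defined(__SSE2__)
        __m128i v = _mm_loadu_si128((const __m128i *)(s + index * 8));
        _mm_storeu_si128((__m128i *)(d + index * 8), v);
#else
        memcpy(d + index * 8, s + index * 8, 16); /* portable fallback */
#endif
    }

    d += index * 8;                             /* lea edi D$edi+edx*8 */
    s += index * 8;                             /* lea esi D$esi+edx*8 */
    while (remainder != 0) {                    /* movsb loop */
        *d++ = *s++;
        remainder--;
    }
}
```

Calling copy16_sketch(dst, src, 37) would copy two 16-byte blocks and then 5 single bytes, which is the same path the assembly takes for a length of 37.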

Gunther · July 04, 2014, 04:07:52 AM

Hi Rui,

Quote from: RuiLoureiro on July 04, 2014, 03:52:27 AM
Gunther, could you run it, please ?
Thanks

Yes, of course. The results are in the attachment.

Gunther

RuiLoureiro · July 04, 2014, 05:45:57 AM

Thank you, Gunther :t

Here SSE is far better in all cases.

----------------------------
Results from Gunther
----------------------------

Quote
INSERTING AT POSITION 0 -string length=100
--------------------------------------------------------------
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
--------------------------------------------------------------
***** Time table *****

7 milliseconds, INSAtoB_SSEE- 2 bytes
8 milliseconds, INSAtoB_SSEE- 4 bytes
8 milliseconds, INSAtoB_SSEE- 1 bytes
9 milliseconds, INSAtoB_SSEE- 15 bytes
10 milliseconds, INSAtoB_SSEE- 3 bytes
13 milliseconds, INSAtoB_SSEE- 55 bytes

14 milliseconds, INSAtoB_XZZE- 4 bytes
15 milliseconds, INSAtoB_SSEE- 103 bytes
15 milliseconds, INSAtoB_XZZE- 2 bytes
20 milliseconds, INSAtoB_XZZF- 4 bytes
21 milliseconds, INSAtoB_XZZF- 1 bytes
21 milliseconds, INSAtoB_SSEE- 203 bytes
21 milliseconds, INSAtoB_XZZF- 2 bytes
21 milliseconds, INSAtoB_XZZE- 1 bytes
22 milliseconds, INSAtoB_XZZE- 3 bytes
22 milliseconds, INSAtoB_XZZE- 55 bytes
22 milliseconds, INSAtoB_XZZF- 3 bytes
24 milliseconds, INSAtoB_XZZF- 15 bytes
25 milliseconds, INSAtoB_XZZE- 15 bytes
29 milliseconds, INSAtoB_XZZF- 55 bytes
34 milliseconds, INSAtoB_SSEE- 503 bytes
34 milliseconds, INSAtoB_XZZE- 103 bytes
35 milliseconds, INSAtoB_XZZF- 103 bytes
44 milliseconds, INSAtoB_SSEE-1027 bytes

45 milliseconds, INSAtoB_XZZE- 203 bytes
52 milliseconds, INSAtoB_XZZF- 203 bytes
59 milliseconds, INSAtoB_X- 3 bytes
65 milliseconds, INSAtoB_X- 2 bytes
66 milliseconds, INSAtoB_X- 4 bytes
67 milliseconds, INSAtoB_X- 15 bytes
72 milliseconds, INSAtoB_BB- 1 bytes
72 milliseconds, INSAtoB_X- 1 bytes
74 milliseconds, INSAtoB_BB- 3 bytes
83 milliseconds, INSAtoB_BB- 15 bytes
86 milliseconds, INSAtoB_BB- 2 bytes
88 milliseconds, INSAtoB_BB- 4 bytes
90 milliseconds, INSAtoB_XZZE- 503 bytes
91 milliseconds, INSAtoB_XZZF- 503 bytes
98 milliseconds, INSAtoB_X- 55 bytes
115 milliseconds, INSAtoB_X- 103 bytes
121 milliseconds, INSAtoB_BA- 1 bytes
123 milliseconds, INSAtoB_BA- 3 bytes
123 milliseconds, INSAtoB_BA- 2 bytes
125 milliseconds, INSAtoB_BA- 4 bytes
138 milliseconds, INSAtoB_BB- 55 bytes
139 milliseconds, INSAtoB_BA- 15 bytes
146 milliseconds, INSAtoB_BB- 103 bytes
151 milliseconds, INSAtoB_XZZE-1027 bytes
159 milliseconds, INSAtoB_XZZF-1027 bytes
174 milliseconds, INSAtoB_X- 203 bytes
190 milliseconds, INSAtoB_BA- 55 bytes
230 milliseconds, INSAtoB_BA- 103 bytes
234 milliseconds, INSAtoB_BB- 203 bytes
321 milliseconds, INSAtoB_X- 503 bytes
343 milliseconds, INSAtoB_BA- 203 bytes
404 milliseconds, INSAtoB_BB- 503 bytes
598 milliseconds, INSAtoB_X- 1027 bytes
642 milliseconds, INSAtoB_BA- 503 bytes
765 milliseconds, INSAtoB_BB- 1027 bytes
1192 milliseconds, INSAtoB_BA- 1027 bytes
********** END III **********

RuiLoureiro · July 04, 2014, 06:15:32 AM

Hi
What is «movupd»?

Where can I look up these instructions (a good reference)?

I am not able to assemble this:
movups xmm1, [esi+edx*8]

I get: instruction or register not accepted in current CPU mode

dedndave · July 04, 2014, 06:23:17 AM

http://x86.renejeschke.de/html/file_module_x86_id_207.html

http://x86.renejeschke.de/

qWord · July 04, 2014, 06:38:53 AM

Quote from: RuiLoureiro on July 04, 2014, 06:15:32 AM
Where can I look up these instructions (a good reference)?

I am not able to assemble this:

Have you never heard of Intel's and AMD's developer manuals?

Quote from: RuiLoureiro on July 04, 2014, 06:15:32 AM
movups xmm1, [esi+edx*8]

I GET: instruction or register not accepted in current CPU mode

MOVUPD is an SSE2 instruction, and SSE2 instructions are supported since MASM v6.15 (6.14 supports only SSE). Also, I would recommend using MOVDQU instead, because you are not dealing with FP data.

RuiLoureiro · July 04, 2014, 07:15:41 AM

qWord,
«Have you never heard of Intel's and AMD's developer manuals?»
No, no one knows them! :P
I want to use movupd because guga used it.
Thanks.
Dave,
Thanks :t

guga · July 04, 2014, 07:59:13 AM

Great idea, qWord. Thanks for the reminder :t

It is always worth keeping compatibility with older CPUs... but I don't recall an unaligned move instruction in SSE, except movups, which is meant for floating-point data. Does SSE1 have an instruction similar to movdqu (SSE2)?

Rui, you may use movdqu instead. It won't change the performance at all (if I recall Agner Fog's recommendations correctly).

Quote from: Agner Fog's optimization manual
"Using unaligned read instructions
The instructions MOVDQU, MOVUPS, MOVUPD and LDDQU are all able to read unaligned vectors.
LDDQU is faster than the alternatives on P4E and PM processors, but not on any later processors. The unaligned read instructions are relatively slow on older Intel processors
and on Intel Atom, but fast on Nehalem and later Intel processors as well as on AMD and VIA processors.
; Example 13.10. Unaligned vector read
; esi contains pointer to unaligned array
movdqu xmm0, [esi] ; Read vector unaligned
On contemporary processors, there is no penalty for using the unaligned instruction MOVDQU rather than the aligned MOVDQA if the data are in fact aligned. Therefore, it is convenient to
use MOVDQU if you are not sure whether the data are aligned or not."

Ref: http://www.agner.org/optimize/blog/read.php?i=285
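
To make the aligned/unaligned distinction in the quote concrete, here is a small C sketch. It is an assumption-laden illustration, not part of Agner Fog's text: is_aligned16 and copy_vec16 are made-up names, and the intrinsics _mm_loadu_si128 and _mm_storeu_si128 are the ones compilers typically translate to MOVDQU. A memcpy fallback is assumed for targets without SSE2.

```c
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics (assumed available on x86 targets) */
#endif

/* Returns 1 if p is 16-byte aligned (what MOVDQA would require), else 0. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical helper: copies one 16-byte vector from any address to any address. */
void copy_vec16(void *dest, const void *src)
{
#if defined(__SSE2__)
    __m128i v = _mm_loadu_si128((const __m128i *)src); /* unaligned load  */
    _mm_storeu_si128((__m128i *)dest, v);              /* unaligned store */
#else
    memcpy(dest, src, 16);                             /* portable fallback */
#endif
}
```

The point of the sketch is that copy_vec16 never has to call is_aligned16 first: the unaligned form is valid for both aligned and unaligned pointers, which is why the quote calls it the convenient choice when alignment is unknown.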

So, you can try using the following code:

Proc memcpy_SSE_V3:
    Arguments @pDest, @pSource, @Length
    Uses esi, edi, ecx, edx, eax

    mov edi D@pDest
    mov esi D@pSource
    ; we copy memory 128 bits (16 bytes) at a time
    mov ecx D@Length
    mov eax ecx | shr ecx 4 ; integer count. Divide by 16 (4 dwords)
    jz L0> ; The memory size is smaller than 16 bytes. Jump over

        ; Now we must compute the remainder, i.e. the bytes left over after the 16-byte blocks
        mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only be 0 to 15 bytes
        mov edx 0 ; here it is used as an index
        L1:
            movdqu XMM1 X$esi+edx*8 ; copy 4 dwords (16 bytes) from esi+edx*8 to register XMM1
            movdqu X$edi+edx*8 XMM1 ; copy 4 dwords (16 bytes) from register XMM1 to edi+edx*8
            dec ecx
            lea edx D$edx+2 ; advance the index by 2 (2*8 = 16 bytes). lea leaves the flags from dec ecx untouched
            jnz L1<
        test eax eax | jz L4> ; No remainder? Exit
        jmp L9> ; jump to the remainder computation

L0:
   ; If we are here, it means that the data is smaller than 16 bytes, and we need to compute the remainder.
   ; ecx is 0 here, so edx becomes 0 and eax keeps the full length (0 to 15)
   mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only be 0 to 15 bytes

L2:

    ; If the length is not a multiple of 16 bytes (4 dwords), some bytes remain here, so copy them one by one.
    test eax eax | jz L4>  ; No remainder? Exit
L9:
        lea edi D$edi+edx*8 ; edx*8 = number of bytes already copied
        lea esi D$esi+edx*8 ; edx*8 = number of bytes already copied

L3:  movsb | dec eax | jnz L3<

L4:

EndP

guga · July 04, 2014, 08:49:49 AM

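Another version, which uses the SSE1 movups for the store, seems a bit faster here (about 3%), but it still uses a floating-point instruction (I couldn't find an alternative to movups yet :( ). Note that the lddqu load is an SSE3 instruction, so the routine as a whole is not SSE1-only.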

Try this....

Proc memcpy_SSE_V4:
    Arguments @pDest, @pSource, @Length
    Uses esi, edi, ecx, edx, eax

    mov edi D@pDest
    mov esi D@pSource
    ; we copy memory 128 bits (16 bytes) at a time
    mov ecx D@Length
    mov eax ecx | shr ecx 4 ; integer count. Divide by 16 (4 dwords)
    jz L0> ; The memory size is smaller than 16 bytes. Jump over

        ; Now we must compute the remainder, i.e. the bytes left over after the 16-byte blocks
        mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only be 0 to 15 bytes
        mov edx 0 ; here it is used as an index
        L1:
            lddqu XMM1 X$esi+edx*8 ; copy 4 dwords (16 bytes) from esi+edx*8 to register XMM1 (lddqu is SSE3)
            movups X$edi+edx*8 XMM1 ; copy 4 dwords (16 bytes) from register XMM1 to edi+edx*8
            dec ecx
            lea edx D$edx+2 ; advance the index by 2 (2*8 = 16 bytes). lea leaves the flags from dec ecx untouched
            jnz L1<
        test eax eax | jz L4> ; No remainder? Exit
        jmp L9> ; jump to the remainder computation

L0:
   ; If we are here, it means that the data is smaller than 16 bytes, and we need to compute the remainder.
   ; ecx is 0 here, so edx becomes 0 and eax keeps the full length (0 to 15)
   mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only be 0 to 15 bytes

L2:

    ; If the length is not a multiple of 16 bytes (4 dwords), some bytes remain here, so copy them one by one.
    test eax eax | jz L4>  ; No remainder? Exit
L9:
        lea edi D$edi+edx*8 ; edx*8 = number of bytes already copied
        lea esi D$esi+edx*8 ; edx*8 = number of bytes already copied

L3:  movsb | dec eax | jnz L3<

L4:

EndP

nidud · July 04, 2014, 09:04:28 AM

deleted

RuiLoureiro · July 04, 2014, 07:57:32 PM

Hi guga,
The problem is this:

I am not able to assemble this instruction
movups xmm1, [esi+edx*8]
I GET: instruction or register not accepted in current CPU mode

guga · July 05, 2014, 01:19:56 AM

I don't remember the MASM syntax completely, but you can try these initializers:

http://masm32.com/board/index.php?topic=1217.0
and
http://www.masmforum.com/board/index.php?PHPSESSID=786dd40408172108b65a5a36b09c88c0&topic=15872.0

;###############################################################################################

        .XCREF
        .NoList
        INCLUDE    \Masm32\Include\Masm32rt.inc
        .686p
        .MMX
        .XMM
        INCLUDE    \Masm32\Macros\Timers.asm
        .List

;###############################################################################################

And, of course, use MASM v6.15 or later.

guga · July 05, 2014, 02:02:05 AM

OK, found it... It assembled correctly, as JJ suggested here:
http://masm32.com/board/index.php?topic=3369.0;topicseen

Rui, try to assemble this

include \masm32\include\masm32rt.inc
.686
.xmm

.code

start:   MsgBox 0, "Hello World", "Hi", MB_OK
   exit
   movups      xmm0, [esi]   ; placed after exit, so never executed; it only checks that the assembler accepts SSE
end start

If you succeed, then you should be able to assemble the timing app with the SSE instructions.

dedndave · July 05, 2014, 03:12:18 AM

prescott w/htt XP SP3

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
------------------------------------------------------
44875      cycles - a   1..256  (  0) crt_memcpy
37179      cycles - a   1..256  ( 89) regcopy
76425      cycles - a   1..256  ( 48) memcpy SSE
74412      cycles - a   1..256  (171) memcpyxmmU SSE
73665      cycles - a   1..256  ( 91) memcpy_SSE_V2

53738      cycles - u   1..256  (  0) crt_memcpy
70189      cycles - u   1..256  ( 89) regcopy
110211     cycles - u   1..256  ( 48) memcpy SSE
97812      cycles - u   1..256  (171) memcpyxmmU SSE
80674      cycles - u   1..256  ( 91) memcpy_SSE_V2

904814     cycles - a 400..2000 (  0) crt_memcpy
1144614    cycles - a 400..2000 ( 89) regcopy
1813848    cycles - a 400..2000 ( 48) memcpy SSE
1651549    cycles - a 400..2000 (171) memcpyxmmU SSE
1723459    cycles - a 400..2000 ( 91) memcpy_SSE_V2

1887691    cycles - u 400..2000 (  0) crt_memcpy
4090627    cycles - u 400..2000 ( 89) regcopy
4085473    cycles - u 400..2000 ( 48) memcpy SSE
4061232    cycles - u 400..2000 (171) memcpyxmmU SSE
3912585    cycles - u 400..2000 ( 91) memcpy_SSE_V2

guga · July 05, 2014, 03:54:11 AM

Can someone please test this zeromem routine ?

Proc ZeroMem_SSE:
    Arguments @pMem, @Length
    Uses esi, edi, ecx, edx, eax

    mov edi D@pMem
    ; we zero the memory 128 bits (16 bytes) at a time
    mov ecx D@Length
    mov eax ecx | shr ecx 4 ; integer count. Divide by 16 (4 dwords)
    jz L0> ; The memory size is smaller than 16 bytes. Jump over

        pxor XMM1 XMM1 ; clear XMM1 register
        ; Now we must compute the remainder, i.e. the bytes left over after the 16-byte blocks
        mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only be 0 to 15 bytes
        mov edx 0 ; here it is used as an index
        L1:
            movdqu X$edi+edx*8 XMM1 ; store 4 zeroed dwords (16 bytes) from register XMM1 to edi+edx*8
            dec ecx
            lea edx D$edx+2 ; advance the index by 2 (2*8 = 16 bytes). lea leaves the flags from dec ecx untouched
            jnz L1<
        test eax eax | jz L4> ; No remainder? Exit
        jmp L9> ; jump to the remainder computation

L0:
   ; If we are here, it means that the data is smaller than 16 bytes, and we need to compute the remainder.
   ; ecx is 0 here, so edx becomes 0 and eax keeps the full length (0 to 15)
   mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only be 0 to 15 bytes

L2:

    ; If the length is not a multiple of 16 bytes (4 dwords), some bytes remain here, so zero them one by one.
    test eax eax | jz L4>  ; No remainder? Exit
L9:
        lea edi D$edi+edx*8 ; edx*8 = number of bytes already zeroed

L3:  mov B$edi+eax-1 0 | dec eax | jnz L3<

L4:

EndP
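
For anyone who wants to test a routine like this, one possible approach is a guard-byte check, sketched below in C. Everything here is hypothetical and only illustrative: zero16_sketch is a stand-in with the same block-plus-tail structure as ZeroMem_SSE (it is not the assembly itself), and check_zero_routine is a made-up helper that verifies exactly the requested bytes were zeroed and nothing before or after the range was touched. A memset fallback is assumed when SSE2 is not available.

```c
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics (assumed available on x86 targets) */
#endif

/* Hypothetical stand-in: zero `length` bytes as 16-byte blocks plus a 0..15 byte tail. */
void zero16_sketch(void *mem, size_t length)
{
    unsigned char *p = (unsigned char *)mem;
    size_t blocks = length >> 4;
    size_t remainder = length & 15;          /* same value as length - blocks*16 */

    while (blocks != 0) {
#if defined(__SSE2__)
        _mm_storeu_si128((__m128i *)p, _mm_setzero_si128()); /* pxor + movdqu */
#else
        memset(p, 0, 16);                    /* portable fallback */
#endif
        p += 16;
        blocks--;
    }
    while (remainder != 0) {                 /* tail, from the last byte down */
        p[remainder - 1] = 0;
        remainder--;
    }
}

/* Returns 1 if `routine` zeroes exactly `length` bytes and leaves the guards intact. */
int check_zero_routine(void (*routine)(void *, size_t), size_t length)
{
    unsigned char buf[16 + 256 + 16];
    size_t i;

    if (length > 256) return 0;              /* larger sizes are outside this sketch */
    memset(buf, 0xAA, sizeof buf);
    routine(buf + 16, length);
    for (i = 0; i < 16; i++)
        if (buf[i] != 0xAA) return 0;        /* leading guard damaged */
    for (i = 0; i < length; i++)
        if (buf[16 + i] != 0) return 0;      /* byte not zeroed */
    for (i = 16 + length; i < sizeof buf; i++)
        if (buf[i] != 0xAA) return 0;        /* wrote past the end */
    return 1;
}
```

Running check_zero_routine over every length from 0 to 256 exercises the small-size path (under 16 bytes), exact multiples of 16, and all 15 possible tail sizes.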

The MASM Forum
