News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Faster Memcopy ...

Started by rrr314159, March 03, 2015, 02:40:50 PM

Previous topic - Next topic

Gunther

Hi nidud,

Quote from: nidud on March 24, 2015, 01:13:21 AM
I converted and tested both of them.
The Intel files are 8.asm and 9.asm in the archive

thank you, that saves me a bit work. Here are the results:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (AVX)
----------------------------------------------
-- string length 0..40
444941    cycles - (  0) proc_0: crt_strlen
231197    cycles - (105) proc_1: SSE 32
218271    cycles - (104) proc_2: AxStrLenSSE
210172    cycles - ( 86) proc_3: AxStrLenSSE (rrr)
228546    cycles - (673) proc_8: SSE Intel Silvermont
374264    cycles - (818) proc_9: SSE Intel Atom
-- string length 40..80
493818    cycles - (  0) proc_0: crt_strlen
155729    cycles - (105) proc_1: SSE 32
152824    cycles - (104) proc_2: AxStrLenSSE
148059    cycles - ( 86) proc_3: AxStrLenSSE (rrr)
141848    cycles - (673) proc_8: SSE Intel Silvermont
263988    cycles - (818) proc_9: SSE Intel Atom
-- string length 600..1000
341526    cycles - (  0) proc_0: crt_strlen
69545     cycles - (105) proc_1: SSE 32
85672     cycles - (104) proc_2: AxStrLenSSE
87101     cycles - ( 86) proc_3: AxStrLenSSE (rrr)
43498     cycles - (673) proc_8: SSE Intel Silvermont
57624     cycles - (818) proc_9: SSE Intel Atom

result:
   413892 cycles - (673) proc_8: SSE Intel Silvermont
   445332 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
   456471 cycles - (105) proc_1: SSE 32
   456767 cycles - (104) proc_2: AxStrLenSSE
   695876 cycles - (818) proc_9: SSE Intel Atom
  1280285 cycles - (  0) proc_0: crt_strlen
hit any key to continue...


Gunther
You have to know the facts before you can distort them.

dedndave

prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
----------------------------------------------
-- string length 0..40
1592748   cycles - (  0) proc_0: crt_strlen
1342550   cycles - (105) proc_1: SSE 32
1361220   cycles - (104) proc_2: AxStrLenSSE
1319353   cycles - ( 86) proc_3: AxStrLenSSE (rrr)
1256349   cycles - (673) proc_8: SSE Intel Silvermont
1586403   cycles - (818) proc_9: SSE Intel Atom
-- string length 40..80
1711483   cycles - (  0) proc_0: crt_strlen
927442    cycles - (105) proc_1: SSE 32
962485    cycles - (104) proc_2: AxStrLenSSE
968221    cycles - ( 86) proc_3: AxStrLenSSE (rrr)
895419    cycles - (673) proc_8: SSE Intel Silvermont
1476721   cycles - (818) proc_9: SSE Intel Atom
-- string length 600..1000
659013    cycles - (  0) proc_0: crt_strlen
286781    cycles - (105) proc_1: SSE 32
333387    cycles - (104) proc_2: AxStrLenSSE
339087    cycles - ( 86) proc_3: AxStrLenSSE (rrr)
255315    cycles - (673) proc_8: SSE Intel Silvermont
320034    cycles - (818) proc_9: SSE Intel Atom

result:
  2407083 cycles - (673) proc_8: SSE Intel Silvermont
  2556773 cycles - (105) proc_1: SSE 32
  2626661 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
  2657092 cycles - (104) proc_2: AxStrLenSSE
  3383158 cycles - (818) proc_9: SSE Intel Atom
  3963244 cycles - (  0) proc_0: crt_strlen
hit any key to continue...

I:\nidudStrlen2\strlen2 => timeit
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
----------------------------------------------
-- string length 0..40
1596428   cycles - (  0) proc_0: crt_strlen
1353058   cycles - (105) proc_1: SSE 32
1362316   cycles - (104) proc_2: AxStrLenSSE
1351833   cycles - ( 86) proc_3: AxStrLenSSE (rrr)
1251872   cycles - (673) proc_8: SSE Intel Silvermont
1600414   cycles - (818) proc_9: SSE Intel Atom
-- string length 40..80
1711735   cycles - (  0) proc_0: crt_strlen
926670    cycles - (105) proc_1: SSE 32
952517    cycles - (104) proc_2: AxStrLenSSE
982818    cycles - ( 86) proc_3: AxStrLenSSE (rrr)
906223    cycles - (673) proc_8: SSE Intel Silvermont
1443038   cycles - (818) proc_9: SSE Intel Atom
-- string length 600..1000
681045    cycles - (  0) proc_0: crt_strlen
279759    cycles - (105) proc_1: SSE 32
326521    cycles - (104) proc_2: AxStrLenSSE
339426    cycles - ( 86) proc_3: AxStrLenSSE (rrr)
263106    cycles - (673) proc_8: SSE Intel Silvermont
319237    cycles - (818) proc_9: SSE Intel Atom

result:
  2421201 cycles - (673) proc_8: SSE Intel Silvermont
  2559487 cycles - (105) proc_1: SSE 32
  2641354 cycles - (104) proc_2: AxStrLenSSE
  2674077 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
  3362689 cycles - (818) proc_9: SSE Intel Atom
  3989208 cycles - (  0) proc_0: crt_strlen

Antariy

Quote from: nidud on March 24, 2015, 02:47:14 AM
Quotewhy does SSE 32 run faster with the longer string?

It's a matter of priority for the end result: A 40 byte string is more important than a 1000 byte string in this case.

I think Jochen meant that your code is faster with longer string, so, it's faster with 80 40 bytes string than with 40 bytes string, so it faster with less important string.

The timings of your latest zip (strlen2)

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
----------------------------------------------
-- string length 0..40
1631130   cycles - (  0) proc_0: crt_strlen
1438090   cycles - (105) proc_1: SSE 32
2560055   cycles - (104) proc_2: AxStrLenSSE
2488369   cycles - ( 86) proc_3: AxStrLenSSE (rrr)
1332828   cycles - (673) proc_8: SSE Intel Silvermont
1697691   cycles - (818) proc_9: SSE Intel Atom
-- string length 40..80
1756723   cycles - (  0) proc_0: crt_strlen
999911    cycles - (105) proc_1: SSE 32
1511301   cycles - (104) proc_2: AxStrLenSSE
1492232   cycles - ( 86) proc_3: AxStrLenSSE (rrr)
972191    cycles - (673) proc_8: SSE Intel Silvermont
1551460   cycles - (818) proc_9: SSE Intel Atom
-- string length 600..1000
704575    cycles - (  0) proc_0: crt_strlen
305959    cycles - (105) proc_1: SSE 32
416240    cycles - (104) proc_2: AxStrLenSSE
418694    cycles - ( 86) proc_3: AxStrLenSSE (rrr)
280099    cycles - (673) proc_8: SSE Intel Silvermont
348494    cycles - (818) proc_9: SSE Intel Atom

result:
  2585118 cycles - (673) proc_8: SSE Intel Silvermont
  2743960 cycles - (105) proc_1: SSE 32
  3597645 cycles - (818) proc_9: SSE Intel Atom
  4092428 cycles - (  0) proc_0: crt_strlen
  4399295 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
  4487596 cycles - (104) proc_2: AxStrLenSSE
hit any key to continue...


No answers to any other posts since I get bored writing the long post posted on the lab - the "Research".
BTW, timings there are required :biggrin:

Antariy

Quote from: Gunther on March 24, 2015, 04:17:22 AM
   413892 cycles - (673) proc_8: SSE Intel Silvermont
   445332 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
   456471 cycles - (105) proc_1: SSE 32
   456767 cycles - (104) proc_2: AxStrLenSSE
   695876 cycles - (818) proc_9: SSE Intel Atom
  1280285 cycles - (  0) proc_0: crt_strlen

Quote from: dedndave on March 24, 2015, 06:57:32 AM
result:
  2407083 cycles - (673) proc_8: SSE Intel Silvermont
  2556773 cycles - (105) proc_1: SSE 32
  2626661 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
  2657092 cycles - (104) proc_2: AxStrLenSSE
  3383158 cycles - (818) proc_9: SSE Intel Atom
  3963244 cycles - (  0) proc_0: crt_strlen
hit any key to continue...

result:
  2421201 cycles - (673) proc_8: SSE Intel Silvermont
  2559487 cycles - (105) proc_1: SSE 32
  2641354 cycles - (104) proc_2: AxStrLenSSE
  2674077 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
  3362689 cycles - (818) proc_9: SSE Intel Atom
  3989208 cycles - (  0) proc_0: crt_strlen

Not bad for simple (and short rrr's version) single-reg XMM code, taking in account the crazy unrollement of Intel's code and higher chunk size of the nidud's code.

Antariy

Well, I think, that if advance the algo posted in the lab in the "Research" topic, make the short strings 64...256 bytes workaround using more simpler code, like nidud's 32 bytes fetching one is, then it will be two times smaller than intels Silvermont and probably faster than it.

BTW, nidud, you may try to use such a code in your inner loop:


@mainloop:
pcmpeqb xmm0,[ebx]
add ebx,16
pcmpeqb xmm1,[ebx]
add ebx,16
pmovmskb eax,xmm0
pmovmskb edx,xmm1

cmp eax,edx
jz @mainloop

shl edx,16
add edx,eax
bsf ecx,edx


It avoids unnecessary shifts and combining in the loop.

nidud

#80
deleted

Antariy

Quote from: nidud on March 24, 2015, 08:29:21 AM
Quote from: Antariy on March 24, 2015, 08:07:25 AM
BTW, nidud, you may try to use such a code in your inner loop:

I think it will fail if they are equal but not zero.

Yes... silly idea, bored with coding of prototype for the thread with test on the lab, and typing for two days ::)

nidud

#82
deleted

Antariy

Then try it so (I'm not sure it will be faster, just suggesting to try)

@mainloop:
pcmpeqb xmm0,[ebx]
add ebx,16
pcmpeqb xmm1,[ebx]
add ebx,16
pmovmskb eax,xmm0
pmovmskb edx,xmm1

add eax,edx
jz @mainloop
jc @mainloop

pmovmskb eax,xmm0

shl edx,16
add edx,eax
bsf ecx,edx


The changed/added instructions without indent.
Take notice that jz misprediction possibility since the both EAX and EDX are 80000000h is incredible small, so jc instruction will almost always be executed only once - and only after proper exit, so it will not cost much, but the loop itself will decrease number of operations.

Antariy

Quote from: nidud on March 24, 2015, 09:14:22 AM
I tried a few combinations yesterday
I first used this one

lea eax,[eax+20h]
pcmpeqb xmm1,[eax]
pmovmskb edx,xmm1
pcmpeqb xmm1,[eax+16]
pmovmskb ecx,xmm1

but this seems to faster for some reason

movdqa  xmm1,[eax+30h]
movdqa  xmm0,[eax+20h]
pcmpeqb xmm0,xmm2
pmovmskb edx,xmm0
lea eax,[eax+20h]
pcmpeqb xmm1,xmm2
pmovmskb ecx,xmm1


Reversed order of the data fetch - the higher offset, then lower offset - I did also used that in the big code on the "Research" thread, too, in the first part which handles 64 bytes misalignment with two 32 bytes part :biggrin:

I think that first piece you showed here worked slower, and with possibility of incorrect work (i.e. it did not worked at all as required, hence the slowdowns probably) - because you compare the same XMM1 reg - but if the first comparsion changed some byte(s) of it to 0xFF, then the second comparsion actually compares not with zeroed reg, but with a mess - so, being changed randomly from time to time the reg slowing the code down and moreover it will produce wrong results with second comparsion if the first comparsion found zero.

rrr314159

There's a problem with this whole exercise: same code runs very different on different processors.

Quote from: nidudbut this seems to faster for some reason

I do the same thing: try different combinations and select the fastest on my machine, often not knowing why it's faster. Problem is, what happens on another machine?

I put together a new "AxStrLenSSE" which beats the Silvermont on my i5. It was easy. First, notice that Antariy's original algo saves XMM and ECX, because of environment it was designed for; but Silvermont doesn't. So I took that out. Then, it was still slower on the long strings. So, I unrolled the loop 9 times, and it easily "won" on nidud's test bed. But then I tried it on my AMD, and it came in 3rd (SSE 32 also ahead)! (AMD is very bad for bsf, pcmeqpb and pmovmskb.)

Working at Intel, Torbjorn Granlund has access to hundreds of machines (probably) and can try his Silvermont algo on all of them. If you don't have similar facilities there's no point in trying to keep up with him. I also think it's ridiculous to use 700 bytes for a strLen algo. This is one reason Microsoft and other modern code is so bloated.

This reminds me of a Go proverb (an oriental board game, you've probably heard of it). Often a student will go to great lengths to keep a group alive, destroying his entire position in the process. It would have been much better to sacrifice the group, for advantage elsewhere; this way he's guaranteed to lose. Then the teacher says, "If that's what you have to do to stay alive, why don't you just kill yourself?" That's how I feel about 700 lines to save a few nanoseconds looking for the end of a string!

Just the same, Torbjorn might have some good ideas. There is only one thing he does I've never seen before, uses pminub to take care of the situation you (Antariy and nidud) are discussing right now. That is, integrating the results from 2 or more XMM's reading the string. Perhaps you should take a look at his technique.

nidud your test bed is very good, this is the first time I used it much, the timings are very stable and reliable compared to other methods. It's also very convenient (once u get used to it) to just change 3.asm and recompile, without needing to recompile timeit.asm or any other algo's.

Anyway, here's this new version of Antariy's algo, which is better on i5's than Silvermont, but worse on AMD ... I know there are various ways to improve it, but I think I'll do something else instead!

.686
.model flat, stdcall
.code
.xmm

OPTION PROLOGUE:NONE, EPILOGUE:NONE

AxStrLenSSE2 proc STDCALL lpszStr:DWORD  ; fast with unaligned strings
; original by Antariy modified by rrr314159
    mov eax,[esp+4]
    mov edx, eax
    and edx,-10h
    pxor xmm0,xmm0
    mov ecx,eax
    and ecx,0fh
    pcmpeqb xmm0,[edx]
    pmovmskb eax,xmm0
    shr eax,cl
    bsf eax,eax
    jnz @ret
    xorps xmm0,xmm0
    pxor xmm1, xmm1
    pxor xmm2, xmm2
    @@:                                ; naturally aligned to 16
        pcmpeqb xmm0,[edx+10h]
        pmovmskb eax,xmm0
        test eax,eax
        jnz gotat10h

        pcmpeqb xmm1,[edx+20h]
        pmovmskb eax,xmm1
        test eax,eax
        jnz gotat20h
   
        pcmpeqb xmm2,[edx+30h]
        pmovmskb eax,xmm2
        test eax,eax
        jnz gotat30h

        pcmpeqb xmm0,[edx+40h]
        pmovmskb eax,xmm0
        test eax,eax
        jnz gotat40h

        pcmpeqb xmm1,[edx+50h]
        pmovmskb eax,xmm1
        test eax,eax
        jnz gotat50h
   
        pcmpeqb xmm2,[edx+60h]
        pmovmskb eax,xmm2
        test eax,eax
        jnz gotat60h

        pcmpeqb xmm0,[edx+70h]
        pmovmskb eax,xmm0
        test eax,eax
        jnz gotat70h

        pcmpeqb xmm1,[edx+80h]
        pmovmskb eax,xmm1
        test eax,eax
        jnz gotat80h
   
        pcmpeqb xmm2,[edx+90h]
        pmovmskb eax,xmm2
        test eax,eax
        jnz gotat90h

        add edx, 90h
        jmp @B

gotat90h:
    add edx, 10h
gotat80h:
    add edx, 10h
gotat70h:
    add edx, 10h
gotat60h:
    add edx, 10h
gotat50h:
    add edx, 10h
gotat40h:
    add edx, 10h
gotat30h:
    add edx, 10h
gotat20h:
    add edx, 10h
gotat10h:
    add edx, 10h

    bsf eax,eax
    sub edx,[esp+4]
    add eax, edx

    @ret:
    ret 4
AxStrLenSSE2 endp
END


Here are the timings:

Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz (AVX)
----------------------------------------------
-- string length 0..40
462421    cycles - (  0) proc_0: crt_strlen
239532    cycles - (105) proc_1: SSE 32
232826    cycles - (104) proc_2: AxStrLenSSE
217896    cycles - (229) proc_3: AxStrLenSSE (rrr)
242578    cycles - (673) proc_8: SSE Intel Silvermont
395083    cycles - (818) proc_9: SSE Intel Atom
-- string length 40..80
446781    cycles - (  0) proc_0: crt_strlen
167097    cycles - (105) proc_1: SSE 32
164663    cycles - (104) proc_2: AxStrLenSSE
147078    cycles - (229) proc_3: AxStrLenSSE (rrr)
151820    cycles - (673) proc_8: SSE Intel Silvermont
282262    cycles - (818) proc_9: SSE Intel Atom
-- string length 600..1000
349538    cycles - (  0) proc_0: crt_strlen
73973     cycles - (105) proc_1: SSE 32
94474     cycles - (104) proc_2: AxStrLenSSE
55565     cycles - (229) proc_3: AxStrLenSSE (rrr)
46600     cycles - (673) proc_8: SSE Intel Silvermont
61976     cycles - (818) proc_9: SSE Intel Atom

result:
   420539 cycles - (229) proc_3: AxStrLenSSE (rrr)
   440998 cycles - (673) proc_8: SSE Intel Silvermont
   480602 cycles - (105) proc_1: SSE 32
   491963 cycles - (104) proc_2: AxStrLenSSE
   739321 cycles - (818) proc_9: SSE Intel Atom
  1258740 cycles - (  0) proc_0: crt_strlen
hit any key to continue...

AMD A6-6310 APU with AMD Radeon R4 Graphics (AVX)
----------------------------------------------
-- string length 0..40
749682    cycles - (  0) proc_0: crt_strlen
478036    cycles - (105) proc_1: SSE 32
537144    cycles - (104) proc_2: AxStrLenSSE
542685    cycles - (229) proc_3: AxStrLenSSE (rrr)
443542    cycles - (673) proc_8: SSE Intel Silvermont
618257    cycles - (818) proc_9: SSE Intel Atom
-- string length 40..80
876412    cycles - (  0) proc_0: crt_strlen
321733    cycles - (105) proc_1: SSE 32
401381    cycles - (104) proc_2: AxStrLenSSE
361959    cycles - (229) proc_3: AxStrLenSSE (rrr)
310989    cycles - (673) proc_8: SSE Intel Silvermont
448527    cycles - (818) proc_9: SSE Intel Atom
-- string length 600..1000
520175    cycles - (  0) proc_0: crt_strlen
109789    cycles - (105) proc_1: SSE 32
113737    cycles - (104) proc_2: AxStrLenSSE
102691    cycles - (229) proc_3: AxStrLenSSE (rrr)
79685     cycles - (673) proc_8: SSE Intel Silvermont
95800     cycles - (818) proc_9: SSE Intel Atom

result:
   834216 cycles - (673) proc_8: SSE Intel Silvermont
   909558 cycles - (105) proc_1: SSE 32
  1007335 cycles - (229) proc_3: AxStrLenSSE (rrr)
  1052262 cycles - (104) proc_2: AxStrLenSSE
  1162584 cycles - (818) proc_9: SSE Intel Atom
  2146269 cycles - (  0) proc_0: crt_strlen
hit any key to continue...
I am NaN ;)

Antariy

Hi rrr :t

I don't ignore your messages, just was too tired to answer thorougly.
The one note right now - yes, AMDs look very slow with BSF, though that instuction is executed not in the inner loop so it has noticeable influence only with short strings, but you may speed it up on the AMD just using BSF AX,AX instead of EAX,EAX. On Intel that has no much difference (it's a bit slower), but for AMD makes sense. With the one-reg using algo which provides 16 bit mask the usage of 16 bit BSF on AMD is a right choice.

nidud, here is a tweak of your algo which in this form comes close to the algo posted on the "Research" thread on my machine:


strlen proc STDCALL lpszStr:DWORD
mov ecx,[esp+4]
xorps xmm2,xmm2
mov eax,ecx
and ecx,32-1
and eax,-32
movdqa  xmm0,[eax]
movdqa  xmm1,[eax+16]
or edx,-1
shl edx,cl
pcmpeqb xmm0,xmm2
pmovmskb ecx,xmm0
and ecx,edx
jnz done
pcmpeqb xmm1,xmm2
pmovmskb ecx,xmm1
shl ecx,16
and ecx,edx
jnz done
loop_32:
lea eax,[eax+32]
movdqa  xmm0,[eax]
movdqa  xmm1,[eax+16]
pcmpeqb xmm0,xmm2
pcmpeqb xmm1,xmm2
pmovmskb edx,xmm0
pmovmskb ecx,xmm1
add edx,ecx
jbe loop_32
shl ecx,16
pmovmskb edx,xmm0
add ecx,edx
done:

bsf ecx,ecx
lea eax,[ecx+eax]
sub eax,[esp+4]
ret 4
strlen endp


Actually I think that if you will use an entry like in mine StrLen (just shift right overhead bits) and this inner loop (that is strange, but the movdqa/pcmpeqb sequence seems faster than just direct pcmpeqb) and the checking with exit from the loop as I suggested, this code will be probably the one of the best solutions for StrLen on the 32 bit machines. The crazy-unrolled Intel's solution has no any gain in real world.
Probably the rrr's unrolling is the maximum which has sense, but it's strange to look on the algos which are 600+ or even 800+ bytes long, just to check the string length on the, after all, real world machine in the real world app.

nidud

#87
deleted

jj2007

i5 test with a 200 MB file - 50 bibles concatenated:
Len(): 55896 us for 1519150 lines, 200850450 bytes
len(): 109353 us for 1519150 lines, 200850450 bytes
AxStrLenSSE: 51731 us for 1519150 lines, 200850450 bytes
AxStrLenSSE1: 51359 us for 1519150 lines, 200850450 bytes

Len(): 50408 us for 1519150 lines, 200850450 bytes
len(): 108360 us for 1519150 lines, 200850450 bytes
AxStrLenSSE: 51540 us for 1519150 lines, 200850450 bytes
AxStrLenSSE1: 52424 us for 1519150 lines, 200850450 bytes

Len(): 52240 us for 1519150 lines, 200850450 bytes
len(): 108979 us for 1519150 lines, 200850450 bytes
AxStrLenSSE: 50288 us for 1519150 lines, 200850450 bytes
AxStrLenSSE1: 51747 us for 1519150 lines, 200850450 bytes

Len(): 51231 us for 1519150 lines, 200850450 bytes
len(): 107123 us for 1519150 lines, 200850450 bytes
AxStrLenSSE: 49613 us for 1519150 lines, 200850450 bytes
AxStrLenSSE1: 51625 us for 1519150 lines, 200850450 bytes

Len(): 50967 us for 1519150 lines, 200850450 bytes
len(): 107987 us for 1519150 lines, 200850450 bytes
AxStrLenSSE: 51297 us for 1519150 lines, 200850450 bytes
AxStrLenSSE1: 51591 us for 1519150 lines, 200850450 bytes


AxStrLenSSE is a clear winner.

nidud

#89
deleted