Author Topic: Faster Memcopy ...

dedndave

Re: Faster Memcopy ...
« Reply #15 on: March 05, 2015, 10:47:47 AM »
from my experience, code that aligns for a loop can be either "pretty" or "fast" - lol
you never seem to get both

in the past, i have done something like this:
Code: [Select]
1) test lower bits of address register for desired alignment
   branch if aligned
2) perform operation on a byte (copy or whatever)
3) test lower bits of address again
   branch if aligned
4) perform operation on a byte
and so on, until alignment is achieved (the lower address bits are 0's)

there's no good way to make that pretty   ;)
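
For illustration, a literal sketch of that pattern, unrolled to reach a 4-byte boundary (hypothetical register assignments: ESI = source, EDI = destination; the check against the remaining count is left out for brevity):
Code: [Select]
; hypothetical sketch - byte-copy until EDI reaches a 4-byte boundary, unrolled
        test edi, 3             ; lower 2 bits clear ?
        jz   aligned            ; yes - already 4-aligned
        mov  al, [esi]          ; copy one byte
        mov  [edi], al
        inc  esi
        inc  edi
        test edi, 3             ; test again
        jz   aligned
        mov  al, [esi]
        mov  [edi], al
        inc  esi
        inc  edi
        test edi, 3             ; at most 3 byte copies needed for dword alignment
        jz   aligned
        mov  al, [esi]
        mov  [edi], al
        inc  esi
        inc  edi
aligned: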

rrr314159

Re: Faster Memcopy ...
« Reply #16 on: March 05, 2015, 02:38:53 PM »
Don't forget you also need to check whether you're exceeding the length to copy! What if it takes 15 bytes to align, but they only asked you to copy 3? Anyway, the way I did it: first figure how many bytes are needed to reach alignment, then do that many. keepingrealbusy had a totally different way (see his post). In this case it's not really important how you do it, since once you get there you'll probably copy many thousands of bytes; a few instruction cycles more or less to align really don't matter. Interesting to see 3 different ways to do such a seemingly straightforward task; I wonder how many more there are!
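
A sketch of that clamp, using the same neg/and idiom that shows up later in the thread (hypothetical registers: EDI = destination, ECX = requested count, EDX = bytes to spend on alignment):
Code: [Select]
; hypothetical sketch - how many bytes to copy before the 16-byte boundary
        mov  edx, edi
        neg  edx
        and  edx, 15            ; distance to the next 16-byte boundary (0..15)
        cmp  edx, ecx
        jbe  @F                 ; enough bytes requested - keep EDX
        mov  edx, ecx           ; otherwise clamp to the requested count
@@:
; copy EDX bytes, subtract EDX from ECX, then do the aligned bulk copy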

KeepingRealBusy

Re: Faster Memcopy ...
« Reply #17 on: March 08, 2015, 06:16:36 AM »
rrr,

The primary concern is where the data is located and where it has to reside. What I usually run into is that I need to read in a file of data. I allocate the buffers with VirtualAlloc and read the data in unbuffered (for all but the last partial disk sector); then I close the file, reopen it buffered, position to the last partial sector from the end of the file, read that tail in buffered, and add the size read to the size of the prior blocks.

If you need to keep the same alignment with data that is on an unaligned boundary, then use my method: allocate a buffer of the required size rounded up to a multiple of 16, offset the start of the destination data area to match the unaligned source, copy the starting 16 bytes and the ending 16 bytes with unaligned moves, then copy all of the middle bytes (starting at the first aligned location) using aligned moves.
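
A minimal sketch of that allocation step (hypothetical names pSrc, pDst, pBuf and nRounded16; VirtualAlloc returns page-aligned memory, so the buffer itself starts 16-aligned):
Code: [Select]
; hypothetical sketch - give the destination the same misalignment as the source
        invoke VirtualAlloc, NULL, nRounded16, MEM_COMMIT, PAGE_READWRITE
        mov  pBuf, eax          ; page-aligned, hence 16-aligned
        mov  edx, pSrc
        and  edx, 15            ; source offset within its 16-byte block
        add  eax, edx           ; start the destination at the same offset
        mov  pDst, eax
; movdqu the first and last 16 bytes, movdqa everything in between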

Dave.

rrr314159

Re: Faster Memcopy ...
« Reply #18 on: March 08, 2015, 08:20:39 AM »
That approach sounds good IF you have the freedom to allocate your destination buffer, but that's not the point of this exercise. I'm given both buffers by the user and must write where he specifies. Normally dest won't have the same alignment as source, and I have to put up with it. If they happen to match, my method works out very similar to yours (BTW, note that movdqu is almost as fast as movdqa these days on aligned data). But normally they won't be aligned, so only half the moves can use movdqa; in my case it's the writes; you could do it with the reads instead; it seems to make little difference.
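
For reference, a sketch of the inner loop that pattern gives you when only the destination has been aligned (hypothetical registers: ESI = source, EDI = 16-aligned destination, ECX = byte count already rounded to a multiple of 16):
Code: [Select]
; hypothetical sketch - unaligned reads, aligned writes
copy16:
        movdqu xmm0, [esi]      ; source may be misaligned
        movdqa [edi], xmm0      ; destination is 16-aligned
        add    esi, 16
        add    edi, 16
        sub    ecx, 16
        jnz    copy16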

hutch--

Re: Faster Memcopy ...
« Reply #19 on: March 08, 2015, 09:22:54 AM »
> (BTW note that movdqu is almost as fast as movdqa these days, on aligned data)

This is where the action is with later hardware: while 16-byte aligned access is slightly faster on some hardware, it's not by much, and it's a lot of messing around to set it up. You tend to be restricted more by the memory bottleneck than by the instructions that move or copy memory. I mainly see the gain in later XMM in what it can do with a 128-bit block of data, as there are enough techniques that just grunt memory content around and are fast enough in most contexts. The actual data processing gives the main speed gain, not the memory copy speed.

There are many contexts where I happily use a byte-level copy, as it is unproblematic and often still a lot faster than the data processing being done on that data.

nidud

Re: Faster Memcopy ...
« Reply #20 on: March 08, 2015, 10:13:35 AM »
My conclusion from the last test was to keep three versions:
 - one for machines without SIMD functions
 - one using SIMD functions
 - and one for AVX++ using movsb.

This version uses SIMD functions and handles overlapping data:
Code: [Select]
memcpy  proc dst, src, count
push esi
push edi
push edx
mov eax,[esp+16]
mov esi,[esp+20]
mov ecx,[esp+24]
movdqu  xmm1,[esi] ; save head bytes
movdqu  xmm2,[esi+ecx-16] ; save tail bytes
test ecx,-32 ; need 32 bytes
jz copy_0_31
mov edi,eax
neg edi
and edi,16-1 ; align EDI 16
mov edx,esi ; get direction of copy
sub edx,eax
cmp edx,ecx
lea edx,[eax+ecx-16]
jbe overlapped ; move left if not overlapping
add esi,edi
sub ecx,edi
add edi,eax
and ecx,dword ptr -16 ; align EIP and ECX 16
align 16
loop_L:
REPEAT  0
sub ecx,16
movdqu  xmm0,[esi+ecx]
movdqa  [edi+ecx],xmm0
jz done
ENDM
sub ecx,16
movdqu  xmm0,[esi+ecx]
movdqa  [edi+ecx],xmm0
jnz loop_L
align 16
done:
movdqu  [eax],xmm1
movdqu  [edx],xmm2
toend:
pop edx
pop edi
pop esi
ret 12
;----------------------------------------------------------------------
; Overlapping buffers
;----------------------------------------------------------------------
align 16
overlapped:
sub ecx,edi
and ecx,dword ptr -16
add edi,ecx
add esi,edi
add edi,eax
neg ecx
align 16
loop_R:
movdqu  xmm0,[esi+ecx]
movdqa  [edi+ecx],xmm0
add ecx,16
jnz loop_R
jmp done
;----------------------------------------------------------------------
; Copy 0..31 byte
;----------------------------------------------------------------------
align 4
copy_0_31:
test ecx,ecx
jz toend
test ecx,-2
jz copy_1
test ecx,-4
jz copy_2_3
test ecx,-8
jz copy_4_7
test ecx,-16
jz copy_8_15
lea edx,[eax+ecx-16]
jmp done
align 4
copy_8_15:
movq xmm0,[esi+ecx-8]
movq [eax],xmm1
movq [eax+ecx-8],xmm0
jmp toend
align 4
copy_4_7:
mov edi,[esi]
mov esi,[esi+ecx-4]
mov [eax],edi
mov [eax+ecx-4],esi
jmp toend
align 4
copy_2_3:
mov di,[esi]
mov si,[esi+ecx-2]
mov [eax+ecx-2],si
mov [eax],di
jmp toend
align 4
copy_1:
mov cl,[esi]
mov [eax],cl
jmp toend
memcpy  endp

However, this was faster on newer machines:
Code: [Select]
memcpy  proc dst, src, count
push esi
push edi
mov eax,[esp+12]
mov esi,[esp+16]
mov ecx,[esp+20]
mov edi,eax
cmp eax,esi
ja @F
rep movsb
pop edi
pop esi
ret 12
@@:
lea esi,[esi+ecx-1]
lea edi,[edi+ecx-1]
std
rep movsb
cld
pop edi
pop esi
ret 12
memcpy  endp

I added the function copying 128 bytes to the test:
Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)
----------------------------------------------
--------- short: 31
560692    cycles - (  0) proc_0: ??? crt_memcpy *
1040554   cycles - ( 44) proc_1: AVX -   1 byte *
620459    cycles - (242) proc_2: 386 -   4 byte *
381270    cycles - (245) proc_3: SSE -  16 byte *
1822737   cycles - (164) proc_4: SSE - 128 byte
--------- short: 271
506692    cycles - (  0) proc_0: ??? crt_memcpy *
1101216   cycles - ( 44) proc_1: AVX -   1 byte *
455698    cycles - (242) proc_2: 386 -   4 byte *
210824    cycles - (245) proc_3: SSE -  16 byte *
394439    cycles - (164) proc_4: SSE - 128 byte
--------- short: 2014
931445    cycles - (  0) proc_0: ??? crt_memcpy *
3310250   cycles - ( 44) proc_1: AVX -   1 byte *
1262714   cycles - (242) proc_2: 386 -   4 byte *
448062    cycles - (245) proc_3: SSE -  16 byte *
574491    cycles - (164) proc_4: SSE - 128 byte
--------- aligned: 262159
1314077   cycles - (  0) proc_0: ??? crt_memcpy *
3795774   cycles - ( 44) proc_1: AVX -   1 byte *
1363916   cycles - (242) proc_2: 386 -   4 byte *
992821    cycles - (245) proc_3: SSE -  16 byte *
1061304   cycles - (164) proc_4: SSE - 128 byte
--------- unaligned: 262159
1312767   cycles - (  0) proc_0: ??? crt_memcpy *
3794592   cycles - ( 44) proc_1: AVX -   1 byte *
1352670   cycles - (242) proc_2: 386 -   4 byte *
995327    cycles - (245) proc_3: SSE -  16 byte *
1326086   cycles - (164) proc_4: SSE - 128 byte

result: * memcpy = memmove
  3028304 cycles - proc_3: SSE -  16 byte *
  4625673 cycles - proc_0: ??? crt_memcpy *
  5055457 cycles - proc_2: 386 -   4 byte *
  5179057 cycles - proc_4: SSE - 128 byte
 13042386 cycles - proc_1: AVX -   1 byte *

rrr314159

Re: Faster Memcopy ...
« Reply #21 on: March 08, 2015, 11:40:47 AM »
Hi nidud,

it's too bad you didn't use the version of the 128-byte mover that also works for larger memory blocks (20,000 or so). Originally I just wanted to show the technique for small jobs around 2048, but hutch suggested I try it out at higher numbers; the result was the second version posted in reply #7 (see discussion). So, if I get around to it I'll plug it into your tester and see how it does. My guess is it will come in second, behind proc_3, judging by my previous numbers. Still, I don't know if I can summon the energy - timing algos is very dull work!

As for the short 31 bytes, simple byte copy is much better for such jobs. In fact pencil and paper would probably be better.

I feel obligated to check it out - but I'd promised myself I wouldn't touch timing again for at least a year ... so if you feel like doing it, go for it! The code is in Reply #7.

hutch--

Re: Faster Memcopy ...
« Reply #22 on: March 08, 2015, 12:23:48 PM »
> but I'd promised myself I wouldn't touch timing again for at least a year

Don't give up on timing; it's good for you, and as an aside, you tend to pop out faster algos if you do it regularly. You get faster at it over time, and with more practice at tuning the timing to the task, the algo will perform a lot better.

nidud

Re: Faster Memcopy ...
« Reply #23 on: March 09, 2015, 02:04:50 AM »
here is the test of the new version:  :t
Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)
----------------------------------------------
--------- short: 31
520885   cycles - (  0) proc_0: ??? crt_memcpy *
1043349   cycles - ( 44) proc_1: AVX - 1 byte *
620460   cycles - (242) proc_2: 386 - 4 byte *
380509   cycles - (245) proc_3: SSE -  16 byte *
1902620   cycles - (164) proc_4: SSE - 128 byte
1540727   cycles - (210) proc_5: SSE - 128 byte new
--------- short: 271
506894   cycles - (  0) proc_0: ??? crt_memcpy *
1101241   cycles - ( 44) proc_1: AVX - 1 byte *
455809   cycles - (242) proc_2: 386 - 4 byte *
214328   cycles - (245) proc_3: SSE -  16 byte *
401273   cycles - (164) proc_4: SSE - 128 byte
312955   cycles - (210) proc_5: SSE - 128 byte new
--------- short: 2014
942392   cycles - (  0) proc_0: ??? crt_memcpy *
3310465   cycles - ( 44) proc_1: AVX - 1 byte *
1261069   cycles - (242) proc_2: 386 - 4 byte *
450311   cycles - (245) proc_3: SSE -  16 byte *
575949   cycles - (164) proc_4: SSE - 128 byte
483387   cycles - (210) proc_5: SSE - 128 byte new
--------- aligned: 262159
1311687   cycles - (  0) proc_0: ??? crt_memcpy *
3794303   cycles - ( 44) proc_1: AVX - 1 byte *
1369066   cycles - (242) proc_2: 386 - 4 byte *
992980   cycles - (245) proc_3: SSE -  16 byte *
1020577   cycles - (164) proc_4: SSE - 128 byte
1010543   cycles - (210) proc_5: SSE - 128 byte new
--------- unaligned: 262159
1312919   cycles - (  0) proc_0: ??? crt_memcpy *
3797085   cycles - ( 44) proc_1: AVX - 1 byte *
1355184   cycles - (242) proc_2: 386 - 4 byte *
993036   cycles - (245) proc_3: SSE -  16 byte *
1427699   cycles - (164) proc_4: SSE - 128 byte
1029638   cycles - (210) proc_5: SSE - 128 byte new

result: * memcpy = memmove
  3031164 cycles - proc_3: SSE -  16 byte *
  4377250 cycles - proc_5: SSE - 128 byte new
  4594777 cycles - proc_0: ??? crt_memcpy *
  5061588 cycles - proc_2: 386 -   4 byte *
  5328118 cycles - proc_4: SSE - 128 byte
 13046443 cycles - proc_1: AVX -   1 byte *

However, with some modifications the algo may even be faster  :biggrin:
Code: [Select]
memcpy  proc dst, src, count

push esi
push edi
mov eax,[esp+12] ; keep return value in eax
mov esi,[esp+16]
mov ecx,[esp+20]
;
; skip byte copy of aligned and tail bytes
;
movdqu  xmm0,[esi] ; save head bytes
cmp ecx,32 ; need 32 bytes
jb copy_0_31

movdqu  xmm1,[esi+ecx-16] ; copy aligned and tail bytes
movdqu  [eax],xmm0 ;
movdqu  [eax+ecx-16],xmm1
;
; now only chunks of 16 left to copy
;
mov edi,eax
neg edi
and edi,16-1 ; align EDI 16
add esi,edi
sub ecx,edi
add edi,eax

mov edx,ecx
shr ecx,7 ; divide by 128 (chunk size)
jz endloop128
;
; make sure the loop label is aligned 16
;
align 16

loop128: ; do chunks of 128
;
; this generates two extra jumps in the loop
;
; jle endloop128
; dec ecx ; dec ecx long b4 using it
sub ecx,dword ptr 1
movdqu  xmm1,[esi] ; read all 8 xmm's
movdqu  xmm2,[esi+10h]
movdqu  xmm3,[esi+20h]
movdqu  xmm4,[esi+30h]
lea esi, [esi+40h]  ; seems to work best adding here
movdqu  xmm5,[esi]
movdqu  xmm6,[esi+10h]
movdqu  xmm7,[esi+20h]
movdqu  xmm0,[esi+30h]
lea esi, [esi+40h]  ; add to esi long b4 use
;------------------------------------
movdqa  [edi],xmm1 ; write all 8 xmm's
movdqa  [edi+10h],xmm2
movdqa  [edi+20h],xmm3
movdqa  [edi+30h],xmm4
lea edi,[edi+40h]
movdqa  [edi],xmm5
movdqa  [edi+10h],xmm6
movdqa  [edi+20h],xmm7
movdqa  [edi+30h],xmm0
lea edi,[edi+40h]
jg loop128
; jmp loop128

align 16
endloop128:

and edx,127 ; tail bytes in edx
shr edx,4 ; edx now holds chunks of 16
jz toend

align 8
loop16:
dec edx
movdqu  xmm1,[esi] ; just one xmm could use bigger
lea esi,[esi+10h] ; chunks (32, 48 etc) but why bother
movdqa  [edi],xmm1
lea edi,[edi+10h]
jg loop16
align 4
toend:
pop edi
pop esi
ret 12
;
; Copy 0..31 byte
;
align 4
copy_0_31:
test ecx,ecx
jz toend
test ecx,-2
jz copy_1
test ecx,-4
jz copy_2_3
test ecx,-8
jz copy_4_7
test ecx,-16
jz copy_8_15
movdqu  xmm1,[esi+ecx-16]
movdqu  [eax],xmm0
movdqu  [eax+ecx-16],xmm1
jmp toend
align 4
copy_8_15:
movq xmm1,[esi+ecx-8]
movq [eax],xmm0
movq [eax+ecx-8],xmm1
jmp toend
align 4
copy_4_7:
mov edi,[esi]
mov esi,[esi+ecx-4]
mov [eax],edi
mov [eax+ecx-4],esi
jmp toend
align 4
copy_2_3:
mov di,[esi]
mov si,[esi+ecx-2]
mov [eax+ecx-2],si
mov [eax],di
jmp toend
align 4
copy_1:
mov cl,[esi]
mov [eax],cl
jmp toend
memcpy  endp

Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)
----------------------------------------------
--------- short: 31
520271    cycles - (  0) proc_0: ??? crt_memcpy *
1040219   cycles - ( 44) proc_1: AVX -   1 byte *
622641    cycles - (242) proc_2: 386 -   4 byte *
380032    cycles - (245) proc_3: SSE -  16 byte *
1821193   cycles - (164) proc_4: SSE - 128 byte
1571898   cycles - (209) proc_5: SSE - 128 byte new
340027    cycles - (314) proc_6: SSE - 128 byte modified
--------- short: 271
506636    cycles - (  0) proc_0: ??? crt_memcpy *
1101511   cycles - ( 44) proc_1: AVX -   1 byte *
455635    cycles - (242) proc_2: 386 -   4 byte *
212339    cycles - (245) proc_3: SSE -  16 byte *
394498    cycles - (164) proc_4: SSE - 128 byte
312829    cycles - (209) proc_5: SSE - 128 byte new
183604    cycles - (314) proc_6: SSE - 128 byte modified
--------- short: 2014
931433    cycles - (  0) proc_0: ??? crt_memcpy *
3312946   cycles - ( 44) proc_1: AVX -   1 byte *
1264095   cycles - (242) proc_2: 386 -   4 byte *
446437    cycles - (245) proc_3: SSE -  16 byte *
574598    cycles - (164) proc_4: SSE - 128 byte
485746    cycles - (209) proc_5: SSE - 128 byte new
419227    cycles - (314) proc_6: SSE - 128 byte modified
--------- aligned: 262159
1308439   cycles - (  0) proc_0: ??? crt_memcpy *
3793440   cycles - ( 44) proc_1: AVX -   1 byte *
1364127   cycles - (242) proc_2: 386 -   4 byte *
991860    cycles - (245) proc_3: SSE -  16 byte *
1044200   cycles - (164) proc_4: SSE - 128 byte
1043795   cycles - (209) proc_5: SSE - 128 byte new
1044381   cycles - (314) proc_6: SSE - 128 byte modified
--------- unaligned: 262159
1311390   cycles - (  0) proc_0: ??? crt_memcpy *
3797150   cycles - ( 44) proc_1: AVX -   1 byte *
1355516   cycles - (242) proc_2: 386 -   4 byte *
993870    cycles - (245) proc_3: SSE -  16 byte *
1389335   cycles - (164) proc_4: SSE - 128 byte
1061247   cycles - (209) proc_5: SSE - 128 byte new
1058086   cycles - (314) proc_6: SSE - 128 byte modified

result: * memcpy = memmove
  3024538 cycles - proc_3: SSE -  16 byte *
  3045325 cycles - proc_6: SSE - 128 byte modified
  4475515 cycles - proc_5: SSE - 128 byte new
  4578169 cycles - proc_0: ??? crt_memcpy *
  5062014 cycles - proc_2: 386 -   4 byte *
  5223824 cycles - proc_4: SSE - 128 byte
 13045266 cycles - proc_1: AVX -   1 byte *

rrr314159

Re: Faster Memcopy ...
« Reply #24 on: March 09, 2015, 06:18:24 AM »
You understand, I was merely demonstrating the efficacy of using 8 xmm reg's and paid no attention at all to other details ... simply slapped it together. However, thanks for the mods. I'll look at it (now you've got me interested) and see if any further improvements are available.

[edit] also thanks for going ahead and testing the "original new" algo; I know what a PITA all this can be ;)

rrr314159

Re: Faster Memcopy ...
« Reply #25 on: March 09, 2015, 01:24:59 PM »
@nidud,
I finally managed to get my memory copy algo faster than yours, but it's a close shave. I used some of your suggestions. Doing the tail that way hadn't occurred to me until keepingrealbusy suggested it. Also an adaptation of your short-copy handling (similar to Agner Fog's). Wouldn't have bothered without your example, which proved these ideas better, so thanks. Code is below.

I couldn't get your test bed working right; for one thing the eax return value wouldn't co-operate, and I'm not familiar with that non-standard environment, so I put together this extremely standard masm32 test bed with MichaelW's timing macros. I hope it will suit. I also included shorter tests and, more importantly, made runs both aligned and misaligned.
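
For anyone reconstructing it, the general shape of such a test bed (a sketch only: it assumes MichaelW's timers.asm with its counter_begin / counter_end macros is on the include path, and that mem4 plus the srcbuf / dstbuf buffers and thecount are declared and prototyped elsewhere):
Code: [Select]
    include \masm32\include\masm32rt.inc
    .686                              ; RDTSC needs at least .586
    include timers.asm                ; MichaelW's counter_begin / counter_end (assumed present)

LOOP_COUNT equ 1000000

.code
start:
    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
        invoke mem4, ADDR dstbuf, ADDR srcbuf, thecount
    counter_end                       ; cycle count returned in EAX
    print ustr$(eax), " cycles", 13, 10
    inkey
    exit
end start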

Here's a run where my algo is best on every number, but that's unusual, had to make 6 runs to get it :biggrin: Generally, tho, it seems overall best. Shorter too.

mem2 = nidud's improved version of rrr's first algo (was called mem4)
mem3 = nidud's algo (equals mem3 in his test prog)
mem4 = rrr latest (hopefully last) attempt

Code: [Select]
test.asm: 506 lines, 3 passes, 3688 ms, 0 warnings, 0 errors

BUFFERS ALIGNED

thecount        mem2(314)       mem3(245)       mem4(300)

     8        2387            2359            2337

    31        2396            2291            2193

   271        7372            9300            6269

  2014       20854           25887           13538

262159     5229387         5238006         5165092

BUFFERS MISALIGNED src 11 dest 7

thecount        mem2(314)       mem3(245)       mem4(300)

     8        1198            1190            1180

    31        1248            1198            1147

   271        3952            5367            3714

  2014       14134           25988           14074

262159     5266919         5485524         5263508

Code: [Select]
; «««««««««««««««««««««««««««««««««««««««««««««««««««
mem4 proc Dest:PTR BYTE, Source:PTR BYTE, ln:DWORD
; «««««««««««««««««««««««««««««««««««««««««««««««««««
; written by rrr314159 2015 03/08

push esi
push edi
mov edi, [Dest]
mov esi, [Source]
mov ecx, [ln]

; if very small get it over with quick
cmp ecx, 16
jge bigger_than_15
test ecx, 8
        jz @f
        movq xmm1, [esi]
        movq xmm0, [esi+ecx-8]
        movq [edi], xmm1
        movq [edi+ecx-8], xmm0
        jmp end_memcopy
@@:
test ecx, 4
        jz @f
        mov eax, [esi]
        mov ebx, [esi+ecx-4]
        mov [edi], eax
        mov [edi+ecx-4], ebx
        jmp end_memcopy
@@:
test ecx, 2
        jz @f
        mov ax, [esi]
        mov [edi], ax
@@:
test ecx, 1
        jz end_memcopy
        mov al, [esi+ecx-1]
        mov [edi+ecx-1], al
        jmp end_memcopy

bigger_than_15:
    movdqu xmm1, [esi]              ; 16 or greater it's safe
    movdqu xmm0, [esi+ecx-16]
    movdqu [edi], xmm1              ; to do head
    movdqu [edi+ecx-16], xmm0       ; and tail here

cmp ecx, 32
jg @f

end_memcopy:                        ; put end close to short copies, quicker
    pop edi
    pop esi
    ret 12
@@:

; continue for more than 32 bytes - check dest alignment

test edi, 15
jz destisaligned
    mov ebx, edi
    and ebx, -16
    add ebx, 16                     ; ebx gets bytes to align edi (1<= ebx <=15)
    sub ebx, edi                    ; don't need to check < ecx
    sub ecx, ebx                    ; how many bytes will be left to do in ecx
    lea edi, [edi+ebx]              ; align dest
    lea esi, [esi+ebx]              ; make source match
destisaligned:                      ; esi, edi and ecx should all be right

mov ebx, ecx
and ecx, 127                        ; tail bytes (up to 127) in ecx
shr ebx, 7                          ; divide by 128 (chunk size)
jle endloop128

align 4
loop128:                            ; do chunks of 128
    dec ebx                         ; dec ebx long b4 using it
    movdqu xmm1, [esi]              ; read all 8 xmm's
    movdqu xmm2, [esi+10h]
    movdqu xmm3, [esi+20h]
    movdqu xmm4, [esi+30h]
    lea esi, [esi+40h]              ; seems to work best adding here
   
    movdqu xmm5, [esi]
    movdqu xmm6, [esi+10h]
    movdqu xmm7, [esi+20h]
    movdqu xmm0, [esi+30h]
    lea esi, [esi+40h]              ; add to esi long b4 use
   
    movdqa [edi], xmm1              ; write all 8 xmm's
    movdqa [edi+10h], xmm2
    movdqa [edi+20h], xmm3
    movdqa [edi+30h], xmm4
    lea edi, [edi+40h]
   
    movdqa [edi], xmm5
    movdqa [edi+10h], xmm6
    movdqa [edi+20h], xmm7
    movdqa [edi+30h], xmm0
    lea edi, [edi+40h]
   
    jg loop128
endloop128:

shr ecx, 4                          ; ecx has chunks of 16
jle end_memcopy

loop16:
    dec ecx
    movdqu xmm1, [esi]              ; just one xmm could use bigger
    lea esi, [esi+10h]              ; chunks (32, 48 etc) but why bother
    movdqa [edi], xmm1
    lea edi, [edi+10h]
    jg loop16

jmp end_memcopy

mem4 endp

The zip contains test.asm with all 3 algos, and the exe
« Last Edit: March 09, 2015, 02:38:57 PM by rrr314159 »

KeepingRealBusy

Re: Faster Memcopy ...
« Reply #26 on: March 10, 2015, 04:37:53 AM »
rrr,

Thanks for giving me the tail end credit. Most would grab the code and run.

Dave

hutch--

Re: Faster Memcopy ...
« Reply #27 on: March 10, 2015, 05:43:08 AM »
This was basically the last time I played with memory copy.

Code: [Select]
    mov esi, pad1
    mov edi, pad2

  ; --------------------------------------
  ; if blen < 16 jump to tail end trimmer
  ; --------------------------------------
    mov eax, blen
    cmp eax, 16
    jae nxt
    mov scnt, eax
    xor edx, edx
    jmp bypass

  nxt:
    shr eax, 4                ; int div by 16 (block size)
    mov lcnt, eax
    shl eax, 4                ; int mul by 16
    mov ecx, blen
    sub ecx, eax
    mov scnt, ecx

    mov ecx, lcnt
    mov edx, -16

  ; -------------------------------------------------
  ; perform the bulk of the process in 16 byte blocks
  ; -------------------------------------------------
  align 16
  lbl1:
    add edx, 16
    prefetcht1 BYTE PTR [esi+edx]
    movdqa xmm0, XMMWORD PTR [esi+edx]
    movntdq [edi+edx], xmm0

    sub ecx, 1
    jnz lbl1

  ; -------------------------------
  ; if no uneven tail, jump to exit
  ; -------------------------------
    cmp scnt, 0
    je bye

  ; -------------------------------------------
  ; process any uneven tail as a BYTE operation
  ; -------------------------------------------
    add edx, 16

  bypass:
    mov ecx, scnt
    test ecx, ecx
    jz bye
  lbl2:
    movzx eax, BYTE PTR [esi+edx]
    mov [edi+edx], al
    add edx, 1
    sub ecx, 1
    jnz lbl2

  bye:

It was slower than the REP MOVSD code.

Code: [Select]
    cld
    mov esi, src
    mov edi, dst
    mov ecx, ln

    shr ecx, 2
    rep movsd

    mov ecx, ln
    and ecx, 3
    rep movsb

nidud

Re: Faster Memcopy ...
« Reply #28 on: March 10, 2015, 06:03:04 AM »
Quote
I couldn't get your test bed working right; for one thing the eax return value wouldn't co-operate, also I'm not familiar with that non-standard environment,

One of the problems with these tests is that the timings are sensitive to code location. When duplicating the same code you may get this result, depending on the offset of the proc:
Code: [Select]
1232712 cycles - 2048..4096  (164) A memcpy SSE 16
877928  cycles - 2048..4096  (164) A memcpy SSE 16
1232112 cycles - 2048..4096  (164) A memcpy SSE 16
878896  cycles - 2048..4096  (164) A memcpy SSE 16
1233148 cycles - 2048..4096  (164) A memcpy SSE 16
879296  cycles - 2048..4096  (164) A memcpy SSE 16

To compensate for this, the algo to be tested is copied to a buffer and executed there:
Code: [Select]
.code

align 16
proc_x  db size_p dup(0)
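
In other words (a hypothetical sketch; the real test bed may do this differently), the loaded image is installed into that fixed, aligned slot and called through an ordinary near call. The names pBin, nBin, oldProt, srcbuf, dstbuf and thecount are assumptions for illustration:
Code: [Select]
; hypothetical sketch - pBin points at the .BIN image already read from disk,
; nBin is its length (nBin <= size_p)
        invoke VirtualProtect, OFFSET proc_x, size_p, PAGE_EXECUTE_READWRITE, ADDR oldProt
        cld
        mov  esi, pBin
        mov  edi, OFFSET proc_x
        mov  ecx, nBin
        rep  movsb                    ; install the routine in the aligned slot

        push thecount                 ; count
        push OFFSET srcbuf            ; src
        push OFFSET dstbuf            ; dst
        mov  eax, OFFSET proc_x
        call eax                      ; routine does "ret 12", so no stack cleanup here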

The algo to test is then handled as data, and in this case as a .BIN file created by JWASM -bin ?.asm. To add a function to the test you simply add a new file.
Code: [Select]
.386
.model flat
.code
push esi
push edi
mov eax,[esp+12]
mov esi,[esp+16]
mov ecx,[esp+20]
mov edi,eax
rep movsb
pop edi
pop esi
ret 12
END

In testit.asm you need to add some text and a number:
Code: [Select]
info_7  db "SSE - 128 byte mem4(rrr)",0
info_8  db "x",0
info_9  db "x",0
info_p  dd info_0,info_1,info_2,info_3,info_4
dd info_5,info_6,info_7,info_8,info_9

procs equ <for x,<7,7,7,7,7,7>> ; add procs to test here..

The result should (in theory) then be more consistent:
Code: [Select]
1395909   cycles - (298) proc_7: SSE - 128 byte mem4(rrr)
1392582   cycles - (298) proc_7: SSE - 128 byte mem4(rrr)
1396457   cycles - (298) proc_7: SSE - 128 byte mem4(rrr)
1394577   cycles - (298) proc_7: SSE - 128 byte mem4(rrr)
1396294   cycles - (298) proc_7: SSE - 128 byte mem4(rrr)
1392589   cycles - (298) proc_7: SSE - 128 byte mem4(rrr)

On copying short values I get this:
Code: [Select]
  1864337 cycles - proc_6: SSE - 128 byte modified
  1940998 cycles - proc_7: SSE - 128 byte mem4(rrr)
  1956522 cycles - proc_3: SSE -  16 byte *
  5412838 cycles - proc_5: SSE - 128 byte xmemcpy(rrr)

On copying long values I get this:
Code: [Select]
--------- unaligned: 262159
4217994   cycles - (269) proc_3: SSE -  16 byte *
4295528   cycles - (209) proc_5: SSE - 128 byte xmemcpy(rrr)
4300409   cycles - (314) proc_6: SSE - 128 byte modified
4302474   cycles - (298) proc_7: SSE - 128 byte mem4(rrr)
--------- aligned: 262159
4220315   cycles - (269) proc_3: SSE -  16 byte *
4295777   cycles - (209) proc_5: SSE - 128 byte xmemcpy(rrr)
4299777   cycles - (314) proc_6: SSE - 128 byte modified
4302524   cycles - (298) proc_7: SSE - 128 byte mem4(rrr)

result: * memcpy = memmove
  8438309 cycles - proc_3: SSE -  16 byte *
  8591305 cycles - proc_5: SSE - 128 byte xmemcpy(rrr)
  8600186 cycles - proc_6: SSE - 128 byte modified
  8604998 cycles - proc_7: SSE - 128 byte mem4(rrr)

Copying larger chunks doesn't seem to impress this CPU, but this may not be the same for others. I tried inserting a 64-byte block copy into the function:

Code: [Select]
OPTION PROLOGUE:NONE, EPILOGUE:NONE

memcpy  proc dst, src, count

mov eax,esp ; save one byte
push esi
push edi
push edx
mov ecx,[eax+12]
mov esi,[eax+8]
mov eax,[eax+4] ; to land at loop_64 aligned..

movdqu  xmm6,[esi] ; save aligned bytes
cmp ecx,32 ; need 32 bytes
jb copy_0_31

mov edi,eax
neg edi
and edi,16-1 ; align EDI 16
movdqu  xmm7,[esi+ecx-16] ; save tail bytes

mov edx,esi ; get direction of copy
sub edx,eax
cmp edx,ecx
mov edx,ecx
jbe overlapped ; move left if not overlapping

add esi,edi
sub ecx,edi
add edi,eax
and ecx,-16
cmp ecx,64
jb loop_16

align 16
loop_64:
movdqu  xmm0,[esi+ecx-10h] ; copy 64 byte
movdqu  xmm1,[esi+ecx-20h]
movdqu  xmm2,[esi+ecx-30h]
movdqu  xmm3,[esi+ecx-40h]
movdqa  [edi+ecx-10h],xmm0
movdqa  [edi+ecx-20h],xmm1
movdqa  [edi+ecx-30h],xmm2
movdqa  [edi+ecx-40h],xmm3
sub ecx,dword ptr 64
je done
test ecx,dword ptr -64
jnz loop_64
align 16
loop_16:
sub ecx,16
movdqu  xmm0,[esi+ecx] ; copy 16 byte
movdqa  [edi+ecx],xmm0
jnz loop_16
align 16
done:
movdqu  [eax],xmm6
movdqu  [eax+edx-16],xmm7
pop edx
pop edi
pop esi
ret 12
;----------------------------------------------------------------------
; Overlapping buffers
;----------------------------------------------------------------------
align 16
overlapped:
sub ecx,edi
and ecx,dword ptr -16
add edi,ecx
add esi,edi
add edi,eax
neg ecx
align 16
loop_R:
movdqu  xmm0,[esi+ecx]
movdqa  [edi+ecx],xmm0
add ecx,16
jnz loop_R
jmp done
;----------------------------------------------------------------------
; Copy 0..31 byte
;----------------------------------------------------------------------
align 4
copy_0_31:
test ecx,ecx
jz toend
test ecx,-2
jz copy_1
test ecx,-4
jz copy_2
test ecx,-8
jz copy_4
test ecx,-16
jz copy_8
movdqu  xmm1,[esi+ecx-16]
movdqu  [eax],xmm6
movdqu  [eax+ecx-16],xmm1
toend:
pop edx
pop edi
pop esi
ret 12
copy_8:
movq xmm1,[esi+ecx-8]
movq [eax],xmm6
movq [eax+ecx-8],xmm1
jmp toend
copy_4:
mov edi,[esi]
mov esi,[esi+ecx-4]
mov [eax],edi
mov [eax+ecx-4],esi
jmp toend
copy_2:
mov di,[esi]
mov si,[esi+ecx-2]
mov [eax+ecx-2],si
mov [eax],di
jmp toend
copy_1:
mov cl,[esi]
mov [eax],cl
jmp toend
memcpy  endp

end

  8362882 cycles - proc_4: SSE -  64 byte *
  8441756 cycles - proc_3: SSE -  16 byte *
  8592957 cycles - proc_5: SSE - 128 byte xmemcpy(rrr)
  8601138 cycles - proc_6: SSE - 128 byte modified
  8601695 cycles - proc_7: SSE - 128 byte mem4(rrr)

testing a combination of all lengths gives this result:
Code: [Select]
11418933 cycles - proc_4: SSE -  64 byte *
 11641683 cycles - proc_3: SSE -  16 byte *
 11962870 cycles - proc_6: SSE - 128 byte modified
 12057471 cycles - proc_7: SSE - 128 byte mem4(rrr)
 15605422 cycles - proc_5: SSE - 128 byte xmemcpy(rrr)

These results depend on CPU and OS used. On newer hardware, as demonstrated here, a simple MOVSB-copy will be the fastest.
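
As a side note, the usual way to detect that case at run time is the "enhanced REP MOVSB/STOSB" (ERMSB) flag: CPUID leaf 7, sub-leaf 0, EBX bit 9. A sketch (fragment only; a full check would first confirm that leaf 7 is supported, and remember that CPUID clobbers EBX, which is callee-saved):
Code: [Select]
; returns AL = 1 if the CPU reports enhanced REP MOVSB/STOSB
        mov  eax, 7
        xor  ecx, ecx
        cpuid                   ; leaf 7, sub-leaf 0
        xor  eax, eax
        bt   ebx, 9             ; EBX bit 9 = ERMSB
        setc al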

rrr314159

Re: Faster Memcopy ...
« Reply #29 on: March 10, 2015, 05:03:59 PM »
Quote from: dedndave
from my experience, code that aligns for a loop can be either "pretty" or "fast" - lol

- when I first saw that a couple days ago I thought my approach was both pretty and fast, but I forebore to disagree: I was taught not to argue with my elders :biggrin: But now I see you're right, it wasn't all that fast; and since it's not the fastest, it's not pretty either

Quote from: keepingrealbusy
Thanks for giving me the tail end credit. Most would grab the code and run.
I have a stock answer when I'm told that. You can't blame those ("most") people: if they didn't grab credit for others' work, they'd never get credit for anything.

Quote from: hutch
It was slower than the REP MOVSD code.
I've got to check out REP MOVS* instructions. Fact is I've never used them! When I came here in January I was still using 1980 era techniques (Z80); I've learned a bunch of modern stuff, like macros and SIMD, but to me REP MOVSD is still "new-fangled"

Quote from: nidud
One of the problems with these tests is that code location is sensitive to timings. ... The algo to test is then handled as data, and in this case as a .BIN file created by JWASM -bin ?.asm. To add a function to the test you simply add a new file.
Sure, that seems like an excellent approach.

I switched the order of the algos in my test bed and may have experienced the phenomenon you're talking about: on one category, ALIGNED 2014, the two algos pretty much reversed their timings, as though it ONLY depended on location! Other numbers weren't affected, although the numbers are not very stable anyway, so it's hard to say. So, I simply copy-pasted a dozen runs before the final three, for which I report the numbers (still with my algo in what may be the "worst" position, first). It takes longer but it's a lot more stable. Here are the first 4, and only, runs I made:

Code: [Select]
  mem2 = rrr/nidud modified; mem3 = nidud; mem4 = rrr latest

BUFFERS ALIGNED

thecount        mem2(314)       mem3(245)       mem4(294)


     8        2969            2759            2566

    31        2706            2651            2559

   271        6891            8023            6884

  2014       15540           39733           15315

262159     5258012         5218352         5255524

BUFFERS MISALIGNED src 11 dest 7

thecount        mem2(314)       mem3(245)       mem4(294)


     8        1441            1350            1386

    31        1402            1346            1499

   271        3655            5698            3788

  2014       13863           26579           14449

262159     5318643         5385829         5317655
Press any key to continue ...

C:\Users\r\Desktop\nidud2\test_memcopy_algos change order>..\doj32 test

***********
ASCII build
***********

test.asm: 531 lines, 3 passes, 3656 ms, 0 warnings, 0 errors

BUFFERS ALIGNED

thecount        mem2(314)       mem3(245)       mem4(294)


     8        2688            2730            2538

    31        1452            1377            1279

   271        3468            4053            3338

  2014       13658           25588           13532

262159     5259190         5255524         5259200

BUFFERS MISALIGNED src 11 dest 7

thecount        mem2(314)       mem3(245)       mem4(294)


     8        1465            1510            1352

    31        1388            1327            1286

   271        3577            5337            3571

  2014       13922           25698           14171

262159     5370304         5417275         5319313
Press any key to continue ...

C:\Users\r\Desktop\nidud2\test_memcopy_algos change order>..\doj32 test

***********
ASCII build
***********

test.asm: 531 lines, 3 passes, 3672 ms, 0 warnings, 0 errors

BUFFERS ALIGNED

thecount        mem2(314)       mem3(245)       mem4(294)


     8        2689            2730            2539

    31        1428            1437            1429

   271        3441            4143            3521

  2014       13636           25594           13482

262159     5211287         5219878         5276594

BUFFERS MISALIGNED src 11 dest 7

thecount        mem2(314)       mem3(245)       mem4(294)


     8        1409            1378            1335

    31        1449            1387            1281

   271        3603            5482            3889

  2014       13870           25605           14170

262159     5372493         5509407         5361170
Press any key to continue ...

C:\Users\r\Desktop\nidud2\test_memcopy_algos change order>..\doj32 test

***********
ASCII build
***********

test.asm: 531 lines, 3 passes, 3672 ms, 0 warnings, 0 errors

BUFFERS ALIGNED

thecount        mem2(314)       mem3(245)       mem4(294)


     8        2709            2754            2550

    31        1439            1407            1353

   271        3529            4226            3449

  2014       14126           25590           13531

262159     5209791         5210645         5211581

BUFFERS MISALIGNED src 11 dest 7

thecount        mem2(314)       mem3(245)       mem4(294)


     8        1416            1385            1358

    31        1397            1338            1291

   271        3483            5454            3570

  2014       13910           25713           14117

262159     5315510         5381937         5314749
Press any key to continue ...

My latest is still ahead (27 out of 40) but the differences are pretty negligible. Not surprising since the two algos are now using, with 4 or 5 exceptions, the same techniques. The only significant conclusion is: both are definitely better than your original mem3 in the middle range (271 and 2014). BTW this reminds me of a comment someone made in an old post: seems everybody's algo is best when run on their own machines. If anyone wants to check my results, the code is posted above, but reverse the order of the algos and copy/paste a dozen copies or so; may make a little difference.

Apparently your machine doesn't do 128-bit (XMM) moves in one go; many older machines do them as two separate 64-bit operations, completely negating any advantage. That's why your 64-byte block worked so well; I'm glad you checked that out. I think the basic point is demonstrated: larger copy blocks (the largest a given machine supports efficiently) do give a bit more speed. BTW my machine has a similar problem with AVX 256; it does them as 2 separate 128s, so it's actually slower for this task (a little better for some others, though).

... But none of it matters if hutch, and the ref you mention later, are right: that MOVS* is better on newer hardware.

Well I reckon we've beaten memory-moving to death! I'm working on some much more interesting fast-algos, hope to post something in a couple days. Thanks for the exercise, and various tips!