Author Topic: Faster Memcopy ...

dedndave

Re: Faster Memcopy ...
« Reply #15 on: March 05, 2015, 10:47:47 AM »
from my experience, code that aligns for a loop can be either "pretty" or "fast" - lol
you never seem to get both

in the past, i have done something like this:
Code: [Select]
1) test lower bits of address register for desired alignment
   branch if aligned
2) perform operation on a byte (copy or whatever)
3) test lower bits of address again
   branch if aligned
4) perform operation on a byte
and so on, until alignment is achieved (the lower address bits are 0's)

there's no good way to make that pretty   ;)
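
For illustration, a literal sketch of that pattern, unrolled to reach a 4-byte boundary (hypothetical register assignments: ESI = source, EDI = destination; the check against the remaining count is left out for brevity):
Code: [Select]
; hypothetical sketch - byte-copy until EDI reaches a 4-byte boundary, unrolled
        test edi, 3             ; lower 2 bits clear ?
        jz   aligned            ; yes - already 4-aligned
        mov  al, [esi]          ; copy one byte
        mov  [edi], al
        inc  esi
        inc  edi
        test edi, 3             ; test again
        jz   aligned
        mov  al, [esi]
        mov  [edi], al
        inc  esi
        inc  edi
        test edi, 3             ; at most 3 byte copies needed for dword alignment
        jz   aligned
        mov  al, [esi]
        mov  [edi], al
        inc  esi
        inc  edi
aligned: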

rrr314159

Re: Faster Memcopy ...
« Reply #16 on: March 05, 2015, 02:38:53 PM »
Don't forget you also need to check whether you're exceeding the length to copy! What if it takes 15 bytes to align, but they only asked you to copy 3? Anyway, the way I did it: first figure how many bytes are needed to reach alignment, then do that many. keepingrealbusy had a totally different way (see his post). In this case it's not really important how you do it, since once you get there you'll probably copy many thousands of bytes; a few instruction cycles more or less to align really don't matter. Interesting to see 3 different ways to do such a seemingly straightforward task; I wonder how many more there are!
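
A sketch of that clamp, using the same neg/and idiom that shows up later in the thread (hypothetical registers: EDI = destination, ECX = requested count, EDX = bytes to spend on alignment):
Code: [Select]
; hypothetical sketch - how many bytes to copy before the 16-byte boundary
        mov  edx, edi
        neg  edx
        and  edx, 15            ; distance to the next 16-byte boundary (0..15)
        cmp  edx, ecx
        jbe  @F                 ; enough bytes requested - keep EDX
        mov  edx, ecx           ; otherwise clamp to the requested count
@@:
; copy EDX bytes, subtract EDX from ECX, then do the aligned bulk copy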

KeepingRealBusy

Re: Faster Memcopy ...
« Reply #17 on: March 08, 2015, 06:16:36 AM »
rrr,

The primary concern is where the data is located and where it has to reside. What I usually run into is that I need to read in a file of data. I allocate the buffers with VirtualAlloc and read the data in unbuffered (for all but the last partial disk sector); then I close the file, reopen it buffered, position to the last partial sector from the end of the file, read that tail in buffered, and add the size read to the size of the prior blocks.

If you need to keep the same alignment with data that is on an unaligned boundary, then use my method: allocate a buffer of the required size rounded up to a multiple of 16, offset the start of the destination data area to match the unaligned source, copy the starting 16 bytes and the ending 16 bytes with unaligned moves, then copy all of the middle bytes (starting at the first aligned location) using aligned moves.
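
A minimal sketch of that allocation step (hypothetical names pSrc, pDst, pBuf and nRounded16; VirtualAlloc returns page-aligned memory, so the buffer itself starts 16-aligned):
Code: [Select]
; hypothetical sketch - give the destination the same misalignment as the source
        invoke VirtualAlloc, NULL, nRounded16, MEM_COMMIT, PAGE_READWRITE
        mov  pBuf, eax          ; page-aligned, hence 16-aligned
        mov  edx, pSrc
        and  edx, 15            ; source offset within its 16-byte block
        add  eax, edx           ; start the destination at the same offset
        mov  pDst, eax
; movdqu the first and last 16 bytes, movdqa everything in between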

Dave.

rrr314159

Re: Faster Memcopy ...
« Reply #18 on: March 08, 2015, 08:20:39 AM »
That approach sounds good IF you have the freedom to allocate your destination buffer, but that's not the point of this exercise. I'm given both buffers by the user and must write where he specifies. Normally dest won't have the same alignment as source, and I have to put up with it. If they happen to match, my method works out very similar to yours (BTW, note that movdqu is almost as fast as movdqa these days on aligned data). But normally they won't be aligned, so only half the moves can use movdqa; in my case it's the writes; you could do it with the reads instead; it seems to make little difference.
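
For reference, a sketch of the inner loop that pattern gives you when only the destination has been aligned (hypothetical registers: ESI = source, EDI = 16-aligned destination, ECX = byte count already rounded to a multiple of 16):
Code: [Select]
; hypothetical sketch - unaligned reads, aligned writes
copy16:
        movdqu xmm0, [esi]      ; source may be misaligned
        movdqa [edi], xmm0      ; destination is 16-aligned
        add    esi, 16
        add    edi, 16
        sub    ecx, 16
        jnz    copy16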

hutch--

Re: Faster Memcopy ...
« Reply #19 on: March 08, 2015, 09:22:54 AM »
> (BTW note that movdqu is almost as fast as movdqa these days, on aligned data)

This is where the action is with later hardware: while 16-byte aligned access is slightly faster on some hardware, it's not by much, and it's a lot of messing around to set it up. You tend to be restricted more by the memory bottleneck than by the instructions that move or copy memory. I mainly see the gain in later XMM in what it can do with a 128-bit block of data, as there are enough techniques that just grunt memory content around and are fast enough in most contexts. The actual data processing gives the main speed gain, not the memory copy speed.

There are many contexts where I happily use a byte-level copy, as it is unproblematic and often still a lot faster than the data processing being done on that data.

nidud

Re: Faster Memcopy ...
« Reply #20 on: March 08, 2015, 10:13:35 AM »
My conclusion from the last test was to keep three versions:
 - one for machines without SIMD functions
 - one using SIMD functions
 - and one for AVX++ using movsb.

This version uses SIMD functions and handles overlapping data:
Code: [Select]
memcpy  proc dst, src, count
push esi
push edi
push edx
mov eax,[esp+16]
mov esi,[esp+20]
mov ecx,[esp+24]
movdqu  xmm1,[esi] ; save head bytes
movdqu  xmm2,[esi+ecx-16] ; save tail bytes
test ecx,-32 ; need 32 bytes
jz copy_0_31
mov edi,eax
neg edi
and edi,16-1 ; align EDI 16
mov edx,esi ; get direction of copy
sub edx,eax
cmp edx,ecx
lea edx,[eax+ecx-16]
jbe overlapped ; move left if not overlapping
add esi,edi
sub ecx,edi
add edi,eax
and ecx,dword ptr -16 ; align EIP and ECX 16
align 16
loop_L:
REPEAT  0
sub ecx,16
movdqu  xmm0,[esi+ecx]
movdqa  [edi+ecx],xmm0
jz done
ENDM
sub ecx,16
movdqu  xmm0,[esi+ecx]
movdqa  [edi+ecx],xmm0
jnz loop_L
align 16
done:
movdqu  [eax],xmm1
movdqu  [edx],xmm2
toend:
pop edx
pop edi
pop esi
ret 12
;----------------------------------------------------------------------
; Overlapping buffers
;----------------------------------------------------------------------
align 16
overlapped:
sub ecx,edi
and ecx,dword ptr -16
add edi,ecx
add esi,edi
add edi,eax
neg ecx
align 16
loop_R:
movdqu  xmm0,[esi+ecx]
movdqa  [edi+ecx],xmm0
add ecx,16
jnz loop_R
jmp done
;----------------------------------------------------------------------
; Copy 0..31 byte
;----------------------------------------------------------------------
align 4
copy_0_31:
test ecx,ecx
jz toend
test ecx,-2
jz copy_1
test ecx,-4
jz copy_2_3
test ecx,-8
jz copy_4_7
test ecx,-16
jz copy_8_15
lea edx,[eax+ecx-16]
jmp done
align 4
copy_8_15:
movq xmm0,[esi+ecx-8]
movq [eax],xmm1
movq [eax+ecx-8],xmm0
jmp toend
align 4
copy_4_7:
mov edi,[esi]
mov esi,[esi+ecx-4]
mov [eax],edi
mov [eax+ecx-4],esi
jmp toend
align 4
copy_2_3:
mov di,[esi]
mov si,[esi+ecx-2]
mov [eax+ecx-2],si
mov [eax],di
jmp toend
align 4
copy_1:
mov cl,[esi]
mov [eax],cl
jmp toend
memcpy  endp

However, this was faster on newer machines:
Code: [Select]
memcpy  proc dst, src, count
push esi
push edi
mov eax,[esp+12]
mov esi,[esp+16]
mov ecx,[esp+20]
mov edi,eax
cmp eax,esi
ja @F
rep movsb
pop edi
pop esi
ret 12
@@:
lea esi,[esi+ecx-1]
lea edi,[edi+ecx-1]
std
rep movsb
cld
pop edi
pop esi
ret 12
memcpy  endp

I added the function copying 128 bytes to the test:
Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)
----------------------------------------------
--------- short: 31
560692    cycles - (  0) proc_0: ??? crt_memcpy *
1040554   cycles - ( 44) proc_1: AVX -   1 byte *
620459    cycles - (242) proc_2: 386 -   4 byte *
381270    cycles - (245) proc_3: SSE -  16 byte *
1822737   cycles - (164) proc_4: SSE - 128 byte
--------- short: 271
506692    cycles - (  0) proc_0: ??? crt_memcpy *
1101216   cycles - ( 44) proc_1: AVX -   1 byte *
455698    cycles - (242) proc_2: 386 -   4 byte *
210824    cycles - (245) proc_3: SSE -  16 byte *
394439    cycles - (164) proc_4: SSE - 128 byte
--------- short: 2014
931445    cycles - (  0) proc_0: ??? crt_memcpy *
3310250   cycles - ( 44) proc_1: AVX -   1 byte *
1262714   cycles - (242) proc_2: 386 -   4 byte *
448062    cycles - (245) proc_3: SSE -  16 byte *
574491    cycles - (164) proc_4: SSE - 128 byte
--------- aligned: 262159
1314077   cycles - (  0) proc_0: ??? crt_memcpy *
3795774   cycles - ( 44) proc_1: AVX -   1 byte *
1363916   cycles - (242) proc_2: 386 -   4 byte *
992821    cycles - (245) proc_3: SSE -  16 byte *
1061304   cycles - (164) proc_4: SSE - 128 byte
--------- unaligned: 262159
1312767   cycles - (  0) proc_0: ??? crt_memcpy *
3794592   cycles - ( 44) proc_1: AVX -   1 byte *
1352670   cycles - (242) proc_2: 386 -   4 byte *
995327    cycles - (245) proc_3: SSE -  16 byte *
1326086   cycles - (164) proc_4: SSE - 128 byte

result: * memcpy = memmove
  3028304 cycles - proc_3: SSE -  16 byte *
  4625673 cycles - proc_0: ??? crt_memcpy *
  5055457 cycles - proc_2: 386 -   4 byte *
  5179057 cycles - proc_4: SSE - 128 byte
 13042386 cycles - proc_1: AVX -   1 byte *

rrr314159

Re: Faster Memcopy ...
« Reply #21 on: March 08, 2015, 11:40:47 AM »
Hi nidud,

it's too bad you didn't use the version of the 128-byte mover that also works for larger memory blocks (20,000 or so). Originally I just wanted to show the technique for small jobs around 2048, but hutch suggested I try it out at higher numbers; the result was the second version posted in reply #7 (see discussion). So, if I get around to it I'll plug it into your tester and see how it does. My guess is it will come in second, behind proc_3, judging by my previous numbers. Still, I don't know if I can summon the energy - timing algos is very dull work!

As for the short 31 bytes, simple byte copy is much better for such jobs. In fact pencil and paper would probably be better.

I feel obligated to check it out - but I'd promised myself I wouldn't touch timing again for at least a year ... so if you feel like doing it, go for it! The code is in Reply #7.

hutch--

Re: Faster Memcopy ...
« Reply #22 on: March 08, 2015, 12:23:48 PM »
> but I'd promised myself I wouldn't touch timing again for at least a year

Don't give up on timing; it's good for you, and as an aside, you tend to pop out faster algos if you do it regularly. You get faster at it over time, and with more practice at tuning the timing to the task, the algo will perform a lot better.

nidud

Re: Faster Memcopy ...
« Reply #23 on: March 09, 2015, 02:04:50 AM »
here is the test of the new version:  :t
Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)
----------------------------------------------
--------- short: 31
520885   cycles - (  0) proc_0: ??? crt_memcpy *
1043349   cycles - ( 44) proc_1: AVX - 1 byte *
620460   cycles - (242) proc_2: 386 - 4 byte *
380509   cycles - (245) proc_3: SSE -  16 byte *
1902620   cycles - (164) proc_4: SSE - 128 byte
1540727   cycles - (210) proc_5: SSE - 128 byte new
--------- short: 271
506894   cycles - (  0) proc_0: ??? crt_memcpy *
1101241   cycles - ( 44) proc_1: AVX - 1 byte *
455809   cycles - (242) proc_2: 386 - 4 byte *
214328   cycles - (245) proc_3: SSE -  16 byte *
401273   cycles - (164) proc_4: SSE - 128 byte
312955   cycles - (210) proc_5: SSE - 128 byte new
--------- short: 2014
942392   cycles - (  0) proc_0: ??? crt_memcpy *
3310465   cycles - ( 44) proc_1: AVX - 1 byte *
1261069   cycles - (242) proc_2: 386 - 4 byte *
450311   cycles - (245) proc_3: SSE -  16 byte *
575949   cycles - (164) proc_4: SSE - 128 byte
483387   cycles - (210) proc_5: SSE - 128 byte new
--------- aligned: 262159
1311687   cycles - (  0) proc_0: ??? crt_memcpy *
3794303   cycles - ( 44) proc_1: AVX - 1 byte *
1369066   cycles - (242) proc_2: 386 - 4 byte *
992980   cycles - (245) proc_3: SSE -  16 byte *
1020577   cycles - (164) proc_4: SSE - 128 byte
1010543   cycles - (210) proc_5: SSE - 128 byte new
--------- unaligned: 262159
1312919   cycles - (  0) proc_0: ??? crt_memcpy *
3797085   cycles - ( 44) proc_1: AVX - 1 byte *
1355184   cycles - (242) proc_2: 386 - 4 byte *
993036   cycles - (245) proc_3: SSE -  16 byte *
1427699   cycles - (164) proc_4: SSE - 128 byte
1029638   cycles - (210) proc_5: SSE - 128 byte new

result: * memcpy = memmove
  3031164 cycles - proc_3: SSE -  16 byte *
  4377250 cycles - proc_5: SSE - 128 byte new
  4594777 cycles - proc_0: ??? crt_memcpy *
  5061588 cycles - proc_2: 386 -   4 byte *
  5328118 cycles - proc_4: SSE - 128 byte
 13046443 cycles - proc_1: AVX -   1 byte *

However, with some modifications the algo may even be faster  :biggrin:
Code: [Select]
memcpy  proc dst, src, count

push esi
push edi
mov eax,[esp+12] ; keep return value in eax
mov esi,[esp+16]
mov ecx,[esp+20]
;
; skip byte copy of aligned and tail bytes
;
movdqu  xmm0,[esi] ; save head bytes
cmp ecx,32 ; need 32 bytes
jb copy_0_31

movdqu  xmm1,[esi+ecx-16] ; copy aligned and tail bytes
movdqu  [eax],xmm0 ;
movdqu  [eax+ecx-16],xmm1
;
; now only chunks of 16 left to copy
;
mov edi,eax
neg edi
and edi,16-1 ; align EDI 16
add esi,edi
sub ecx,edi
add edi,eax

mov edx,ecx
shr ecx,7 ; divide by 128 (chunk size)
jz endloop128
;
; make sure the loop label is aligned 16
;
align 16

loop128: ; do chunks of 128
;
; this generates two extra jumps in the loop
;
; jle endloop128
; dec ecx ; dec ecx long b4 using it
sub ecx,dword ptr 1
movdqu  xmm1,[esi] ; read all 8 xmm's
movdqu  xmm2,[esi+10h]
movdqu  xmm3,[esi+20h]
movdqu  xmm4,[esi+30h]
lea esi, [esi+40h]  ; seems to work best adding here
movdqu  xmm5,[esi]
movdqu  xmm6,[esi+10h]
movdqu  xmm7,[esi+20h]
movdqu  xmm0,[esi+30h]
lea esi, [esi+40h]  ; add to esi long b4 use
;------------------------------------
movdqa  [edi],xmm1 ; write all 8 xmm's
movdqa  [edi+10h],xmm2
movdqa  [edi+20h],xmm3
movdqa  [edi+30h],xmm4
lea edi,[edi+40h]
movdqa  [edi],xmm5
movdqa  [edi+10h],xmm6
movdqa  [edi+20h],xmm7
movdqa  [edi+30h],xmm0
lea edi,[edi+40h]
jg loop128
; jmp loop128

align 16
endloop128:

and edx,127 ; tail bytes in edx
shr edx,4 ; edx now holds chunks of 16
jz toend

align 8
loop16:
dec edx
movdqu  xmm1,[esi] ; just one xmm could use bigger
lea esi,[esi+10h] ; chunks (32, 48 etc) but why bother
movdqa  [edi],xmm1
lea edi,[edi+10h]
jg loop16
align 4
toend:
pop edi
pop esi
ret 12
;
; Copy 0..31 byte
;
align 4
copy_0_31:
test ecx,ecx
jz toend
test ecx,-2
jz copy_1
test ecx,-4
jz copy_2_3
test ecx,-8
jz copy_4_7
test ecx,-16
jz copy_8_15
movdqu  xmm1,[esi+ecx-16]
movdqu  [eax],xmm0
movdqu  [eax+ecx-16],xmm1
jmp toend
align 4
copy_8_15:
movq xmm1,[esi+ecx-8]
movq [eax],xmm0
movq [eax+ecx-8],xmm1
jmp toend
align 4
copy_4_7:
mov edi,[esi]
mov esi,[esi+ecx-4]
mov [eax],edi
mov [eax+ecx-4],esi
jmp toend
align 4
copy_2_3:
mov di,[esi]
mov si,[esi+ecx-2]
mov [eax+ecx-2],si
mov [eax],di
jmp toend
align 4
copy_1:
mov cl,[esi]
mov [eax],cl
jmp toend
memcpy  endp

Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)
----------------------------------------------
--------- short: 31
520271    cycles - (  0) proc_0: ??? crt_memcpy *
1040219   cycles - ( 44) proc_1: AVX -   1 byte *
622641    cycles - (242) proc_2: 386 -   4 byte *
380032    cycles - (245) proc_3: SSE -  16 byte *
1821193   cycles - (164) proc_4: SSE - 128 byte
1571898   cycles - (209) proc_5: SSE - 128 byte new
340027    cycles - (314) proc_6: SSE - 128 byte modified
--------- short: 271
506636    cycles - (  0) proc_0: ??? crt_memcpy *
1101511   cycles - ( 44) proc_1: AVX -   1 byte *
455635    cycles - (242) proc_2: 386 -   4 byte *
212339    cycles - (245) proc_3: SSE -  16 byte *
394498    cycles - (164) proc_4: SSE - 128 byte
312829    cycles - (209) proc_5: SSE - 128 byte new
183604    cycles - (314) proc_6: SSE - 128 byte modified
--------- short: 2014
931433    cycles - (  0) proc_0: ??? crt_memcpy *
3312946   cycles - ( 44) proc_1: AVX -   1 byte *
1264095   cycles - (242) proc_2: 386 -   4 byte *
446437    cycles - (245) proc_3: SSE -  16 byte *
574598    cycles - (164) proc_4: SSE - 128 byte
485746    cycles - (209) proc_5: SSE - 128 byte new
419227    cycles - (314) proc_6: SSE - 128 byte modified
--------- aligned: 262159
1308439   cycles - (  0) proc_0: ??? crt_memcpy *
3793440   cycles - ( 44) proc_1: AVX -   1 byte *
1364127   cycles - (242) proc_2: 386 -   4 byte *
991860    cycles - (245) proc_3: SSE -  16 byte *
1044200   cycles - (164) proc_4: SSE - 128 byte
1043795   cycles - (209) proc_5: SSE - 128 byte new
1044381   cycles - (314) proc_6: SSE - 128 byte modified
--------- unaligned: 262159
1311390   cycles - (  0) proc_0: ??? crt_memcpy *
3797150   cycles - ( 44) proc_1: AVX -   1 byte *
1355516   cycles - (242) proc_2: 386 -   4 byte *
993870    cycles - (245) proc_3: SSE -  16 byte *
1389335   cycles - (164) proc_4: SSE - 128 byte
1061247   cycles - (209) proc_5: SSE - 128 byte new
1058086   cycles - (314) proc_6: SSE - 128 byte modified

result: * memcpy = memmove
  3024538 cycles - proc_3: SSE -  16 byte *
  3045325 cycles - proc_6: SSE - 128 byte modified
  4475515 cycles - proc_5: SSE - 128 byte new
  4578169 cycles - proc_0: ??? crt_memcpy *
  5062014 cycles - proc_2: 386 -   4 byte *
  5223824 cycles - proc_4: SSE - 128 byte
 13045266 cycles - proc_1: AVX -   1 byte *

rrr314159

Re: Faster Memcopy ...
« Reply #24 on: March 09, 2015, 06:18:24 AM »
You understand, I was merely demonstrating the efficacy of using 8 xmm reg's and paid no attention at all to other details ... simply slapped it together. However, thanks for the mods. I'll look at it (now you've got me interested) and see if any further improvements are available.

[edit] also thanks for going ahead and testing the "original new" algo; I know what a PITA all this can be ;)

rrr314159

Re: Faster Memcopy ...
« Reply #25 on: March 09, 2015, 01:24:59 PM »
@nidud,
I finally managed to get my memory copy algo faster than yours, but it's a close shave. I used some of your suggestions. Doing the tail that way hadn't occurred to me until keepingrealbusy suggested it. Also an adaptation of your short-copy handling (similar to Agner Fog's). Wouldn't have bothered without your example, which proved these ideas better, so thanks. Code is below.

I couldn't get your test bed working right; for one thing the eax return value wouldn't co-operate, and I'm not familiar with that non-standard environment, so I put together this extremely standard masm32 test bed with MichaelW's timing macros. I hope it will suit. I also included shorter tests and, more importantly, made runs both aligned and misaligned.
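
For anyone reconstructing it, the general shape of such a test bed (a sketch only: it assumes MichaelW's timers.asm with its counter_begin / counter_end macros is on the include path, and that mem4 plus the srcbuf / dstbuf buffers and thecount are declared and prototyped elsewhere):
Code: [Select]
    include \masm32\include\masm32rt.inc
    .686                              ; RDTSC needs at least .586
    include timers.asm                ; MichaelW's counter_begin / counter_end (assumed present)

LOOP_COUNT equ 1000000

.code
start:
    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
        invoke mem4, ADDR dstbuf, ADDR srcbuf, thecount
    counter_end                       ; cycle count returned in EAX
    print ustr$(eax), " cycles", 13, 10
    inkey
    exit
end start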

Here's a run where my algo is best on every number, but that's unusual, had to make 6 runs to get it :biggrin: Generally, tho, it seems overall best. Shorter too.

mem2 = nidud's improved version of rrr's first algo (was called mem4)
mem3 = nidud's algo (equals mem3 in his test prog)
mem4 = rrr latest (hopefully last) attempt

Code: [Select]
test.asm: 506 lines, 3 passes, 3688 ms, 0 warnings, 0 errors

BUFFERS ALIGNED

thecount        mem2(314)       mem3(245)       mem4(300)

     8        2387            2359            2337

    31        2396            2291            2193

   271        7372            9300            6269

  2014       20854           25887           13538

262159     5229387         5238006         5165092

BUFFERS MISALIGNED src 11 dest 7

thecount        mem2(314)       mem3(245)       mem4(300)

     8        1198            1190            1180

    31        1248            1198            1147

   271        3952            5367            3714

  2014       14134           25988           14074

262159     5266919         5485524         5263508

Code: [Select]
; «««««««««««««««««««««««««««««««««««««««««««««««««««
mem4 proc Dest:PTR BYTE, Source:PTR BYTE, ln:DWORD
; «««««««««««««««««««««««««««««««««««««««««««««««««««
; written by rrr314159 2015 03/08

push esi
push edi
mov edi, [Dest]
mov esi, [Source]
mov ecx, [ln]

; if very small get it over with quick
cmp ecx, 16
jge bigger_than_15
test ecx, 8
        jz @f
        movq xmm1, [esi]
        movq xmm0, [esi+ecx-8]
        movq [edi], xmm1
        movq [edi+ecx-8], xmm0
        jmp end_memcopy
@@:
test ecx, 4
        jz @f
        mov eax, [esi]
        mov ebx, [esi+ecx-4]
        mov [edi], eax
        mov [edi+ecx-4], ebx
        jmp end_memcopy
@@:
test ecx, 2
        jz @f
        mov ax, [esi]
        mov [edi], ax
@@:
test ecx, 1
        jz end_memcopy
        mov al, [esi+ecx-1]
        mov [edi+ecx-1], al
        jmp end_memcopy

bigger_than_15:
    movdqu xmm1, [esi]              ; 16 or greater it's safe
    movdqu xmm0, [esi+ecx-16]
    movdqu [edi], xmm1              ; to do head
    movdqu [edi+ecx-16], xmm0       ; and tail here

cmp ecx, 32
jg @f

end_memcopy:                        ; put end close to short copies, quicker
    pop edi
    pop esi
    ret 12
@@:

; continue for more than 32 bytes - check dest alignment

test edi, 15
jz destisaligned
    mov ebx, edi
    and ebx, -16
    add ebx, 16                     ; ebx gets bytes to align edi (1<= ebx <=15)
    sub ebx, edi                    ; don't need to check < ecx
    sub ecx, ebx                    ; how many bytes will be left to do in ecx
    lea edi, [edi+ebx]              ; align dest
    lea esi, [esi+ebx]              ; make source match
destisaligned:                      ; esi, edi and ecx should all be right

mov ebx, ecx
and ecx, 127                        ; tail bytes (up to 127) in ecx
shr ebx, 7                          ; divide by 128 (chunk size)
jle endloop128

align 4
loop128:                            ; do chunks of 128
    dec ebx                         ; dec ebx long b4 using it
    movdqu xmm1, [esi]              ; read all 8 xmm's
    movdqu xmm2, [esi+10h]
    movdqu xmm3, [esi+20h]
    movdqu xmm4, [esi+30h]
    lea esi, [esi+40h]              ; seems to work best adding here
   
    movdqu xmm5, [esi]
    movdqu xmm6, [esi+10h]
    movdqu xmm7, [esi+20h]
    movdqu xmm0, [esi+30h]
    lea esi, [esi+40h]              ; add to esi long b4 use
   
    movdqa [edi], xmm1              ; write all 8 xmm's
    movdqa [edi+10h], xmm2
    movdqa [edi+20h], xmm3
    movdqa [edi+30h], xmm4
    lea edi, [edi+40h]
   
    movdqa [edi], xmm5
    movdqa [edi+10h], xmm6
    movdqa [edi+20h], xmm7
    movdqa [edi+30h], xmm0
    lea edi, [edi+40h]
   
    jg loop128
endloop128:

shr ecx, 4                          ; ecx has chunks of 16
jle end_memcopy

loop16:
    dec ecx
    movdqu xmm1, [esi]              ; just one xmm could use bigger
    lea esi, [esi+10h]              ; chunks (32, 48 etc) but why bother
    movdqa [edi], xmm1
    lea edi, [edi+10h]
    jg loop16

jmp end_memcopy

mem4 endp

The zip contains test.asm with all 3 algos, and the exe
« Last Edit: March 09, 2015, 02:38:57 PM by rrr314159 »

KeepingRealBusy

Re: Faster Memcopy ...
« Reply #26 on: March 10, 2015, 04:37:53 AM »
rrr,

Thanks for giving me the tail end credit. Most would grab the code and run.

Dave

hutch--

Re: Faster Memcopy ...
« Reply #27 on: March 10, 2015, 05:43:08 AM »
This was basically the last time I played with memory copy.

Code: [Select]
    mov esi, pad1
    mov edi, pad2

  ; --------------------------------------
  ; if blen < 16 jump to tail end trimmer
  ; --------------------------------------
    mov eax, blen
    cmp eax, 16
    jae nxt
    mov scnt, eax
    xor edx, edx
    jmp bypass

  nxt:
    shr eax, 4                ; int div by 16 (block size)
    mov lcnt, eax
    shl eax, 4                ; int mul by 16
    mov ecx, blen
    sub ecx, eax
    mov scnt, ecx

    mov ecx, lcnt
    mov edx, -16

  ; -------------------------------------------------
  ; perform the bulk of the process in 16 byte blocks
  ; -------------------------------------------------
  align 16
  lbl1:
    add edx, 16
    prefetcht1 BYTE PTR [esi+edx]
    movdqa xmm0, XMMWORD PTR [esi+edx]
    movntdq [edi+edx], xmm0

    sub ecx, 1
    jnz lbl1

  ; -------------------------------
  ; if no uneven tail, jump to exit
  ; -------------------------------
    cmp scnt, 0
    je bye

  ; -------------------------------------------
  ; process any uneven tail as a BYTE operation
  ; -------------------------------------------
    add edx, 16

  bypass:
    mov ecx, scnt
    test ecx, ecx
    jz bye
  lbl2:
    movzx eax, BYTE PTR [esi+edx]
    mov [edi+edx], al
    add edx, 1
    sub ecx, 1
    jnz lbl2

  bye:

It was slower than the REP MOVSD code.

Code: [Select]
    cld
    mov esi, src
    mov edi, dst
    mov ecx, ln

    shr ecx, 2
    rep movsd

    mov ecx, ln
    and ecx, 3
    rep movsb

nidud

Re: Faster Memcopy ...
« Reply #28 on: March 10, 2015, 06:03:04 AM »
Quote
I couldn't get your test bed working right; for one thing the eax return value wouldn't co-operate, also I'm not familiar with that non-standard environment,

One of the problems with these tests is that the timings are sensitive to code location. When duplicating the same code you may get this result, depending on the offset of the proc:
Code: [Select]
1232712 cycles - 2048..4096  (164) A memcpy SSE 16
877928  cycles - 2048..4096  (164) A memcpy SSE 16
1232112 cycles - 2048..4096  (164) A memcpy SSE 16
878896  cycles - 2048..4096  (164) A memcpy SSE 16
1233148 cycles - 2048..4096  (164) A memcpy SSE 16
879296  cycles - 2048..4096  (164) A memcpy SSE 16

To compensate for this, the algo to be tested is copied to a buffer and executed there:
Code: [Select]
.code

align 16
proc_x  db size_p dup(0)
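
In other words (a hypothetical sketch; the real test bed may do this differently), the loaded image is installed into that fixed, aligned slot and called through an ordinary near call. The names pBin, nBin, oldProt, srcbuf, dstbuf and thecount are assumptions for illustration:
Code: [Select]
; hypothetical sketch - pBin points at the .BIN image already read from disk,
; nBin is its length (nBin <= size_p)
        invoke VirtualProtect, OFFSET proc_x, size_p, PAGE_EXECUTE_READWRITE, ADDR oldProt
        cld
        mov  esi, pBin
        mov  edi, OFFSET proc_x
        mov  ecx, nBin
        rep  movsb                    ; install the routine in the aligned slot

        push thecount                 ; count
        push OFFSET srcbuf            ; src
        push OFFSET dstbuf            ; dst
        mov  eax, OFFSET proc_x
        call eax                      ; routine does "ret 12", so no stack cleanup here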

The algo to test is then handled as data, and in this case as a .BIN file created by JWASM -bin ?.asm. To add a function to the test you simply add a new file.
Code: [Select]
.386
.model flat
.code
push esi
push edi
mov eax,[esp+12]
mov esi,[esp+16]
mov ecx,[esp+20]
mov edi,eax
rep movsb
pop edi
pop esi
ret 12
END

In testit.asm you need to add some text and a number:
Code: [Select]
info_7  db "SSE - 128 byte mem4(rrr)",0
info_8  db "x",0
info_9  db "x",0
info_p  dd info_0,info_1,info_2,info_3,info_4
dd info_5,info_6,info_7,info_8,info_9

procs equ <for x,<7,7,7,7,7,7>> ; add procs to test here..

The result should (in theory) then be more consistent:
Code: [Select]
1395909   cycles - (298) proc_7: SSE - 128 byte mem4(rrr)
1392582   cycles - (298) proc_7: SSE - 128 byte mem4(rrr)
1396457   cycles - (298) proc_7: SSE - 128 byte mem4(rrr)
1394577   cycles - (298) proc_7: SSE - 128 byte mem4(rrr)
1396294   cycles - (298) proc_7: SSE - 128 byte mem4(rrr)
1392589   cycles - (298) proc_7: SSE - 128 byte mem4(rrr)

On copying short values I get this:
Code: [Select]
  1864337 cycles - proc_6: SSE - 128 byte modified
  1940998 cycles - proc_7: SSE - 128 byte mem4(rrr)
  1956522 cycles - proc_3: SSE -  16 byte *
  5412838 cycles - proc_5: SSE - 128 byte xmemcpy(rrr)

On copying long values I get this:
Code: [Select]
--------- unaligned: 262159
4217994   cycles - (269) proc_3: SSE -  16 byte *
4295528   cycles - (209) proc_5: SSE - 128 byte xmemcpy(rrr)
4300409   cycles - (314) proc_6: SSE - 128 byte modified
4302474   cycles - (298) proc_7: SSE - 128 byte mem4(rrr)
--------- aligned: 262159
4220315   cycles - (269) proc_3: SSE -  16 byte *
4295777   cycles - (209) proc_5: SSE - 128 byte xmemcpy(rrr)
4299777   cycles - (314) proc_6: SSE - 128 byte modified
4302524   cycles - (298) proc_7: SSE - 128 byte mem4(rrr)

result: * memcpy = memmove
  8438309 cycles - proc_3: SSE -  16 byte *
  8591305 cycles - proc_5: SSE - 128 byte xmemcpy(rrr)
  8600186 cycles - proc_6: SSE - 128 byte modified
  8604998 cycles - proc_7: SSE - 128 byte mem4(rrr)

Copying larger chunks doesn't seem to impress this CPU, but this may not be the same for others. I tried inserting a 64-byte block copy into the function:

Code: [Select]
OPTION PROLOGUE:NONE, EPILOGUE:NONE

memcpy  proc dst, src, count

mov eax,esp ; save one byte
push esi
push edi
push edx
mov ecx,[eax+12]
mov esi,[eax+8]
mov eax,[eax+4] ; to land at loop_64 aligned..

movdqu  xmm6,[esi] ; save aligned bytes
cmp ecx,32 ; need 32 bytes
jb copy_0_31

mov edi,eax
neg edi
and edi,16-1 ; align EDI 16
movdqu  xmm7,[esi+ecx-16] ; save tail bytes

mov edx,esi ; get direction of copy
sub edx,eax
cmp edx,ecx
mov edx,ecx
jbe overlapped ; move left if not overlapping

add esi,edi
sub ecx,edi
add edi,eax
and ecx,-16
cmp ecx,64
jb loop_16

align 16
loop_64:
movdqu  xmm0,[esi+ecx-10h] ; copy 64 byte
movdqu  xmm1,[esi+ecx-20h]
movdqu  xmm2,[esi+ecx-30h]
movdqu  xmm3,[esi+ecx-40h]
movdqa  [edi+ecx-10h],xmm0
movdqa  [edi+ecx-20h],xmm1
movdqa  [edi+ecx-30h],xmm2
movdqa  [edi+ecx-40h],xmm3
sub ecx,dword ptr 64
je done
test ecx,dword ptr -64
jnz loop_64
align 16
loop_16:
sub ecx,16
movdqu  xmm0,[esi+ecx] ; copy 16 byte
movdqa  [edi+ecx],xmm0
jnz loop_16
align 16
done:
movdqu  [eax],xmm6
movdqu  [eax+edx-16],xmm7
pop edx
pop edi
pop esi
ret 12
;----------------------------------------------------------------------
; Overlapping buffers
;----------------------------------------------------------------------
align 16
overlapped:
sub ecx,edi
and ecx,dword ptr -16
add edi,ecx
add esi,edi
add edi,eax
neg ecx
align 16
loop_R:
movdqu  xmm0,[esi+ecx]
movdqa  [edi+ecx],xmm0
add ecx,16
jnz loop_R
jmp done
;----------------------------------------------------------------------
; Copy 0..31 byte
;----------------------------------------------------------------------
align 4
copy_0_31:
test ecx,ecx
jz toend
test ecx,-2
jz copy_1
test ecx,-4
jz copy_2
test ecx,-8
jz copy_4
test ecx,-16
jz copy_8
movdqu  xmm1,[esi+ecx-16]
movdqu  [eax],xmm6
movdqu  [eax+ecx-16],xmm1
toend:
pop edx
pop edi
pop esi
ret 12
copy_8:
movq xmm1,[esi+ecx-8]
movq [eax],xmm6
movq [eax+ecx-8],xmm1
jmp toend
copy_4:
mov edi,[esi]
mov esi,[esi+ecx-4]
mov [eax],edi
mov [eax+ecx-4],esi
jmp toend
copy_2:
mov di,[esi]
mov si,[esi+ecx-2]
mov [eax+ecx-2],si
mov [eax],di
jmp toend
copy_1:
mov cl,[esi]
mov [eax],cl
jmp toend
memcpy  endp

end

  8362882 cycles - proc_4: SSE -  64 byte *
  8441756 cycles - proc_3: SSE -  16 byte *
  8592957 cycles - proc_5: SSE - 128 byte xmemcpy(rrr)
  8601138 cycles - proc_6: SSE - 128 byte modified
  8601695 cycles - proc_7: SSE - 128 byte mem4(rrr)

testing a combination of all lengths gives this result:
Code: [Select]
11418933 cycles - proc_4: SSE -  64 byte *
 11641683 cycles - proc_3: SSE -  16 byte *
 11962870 cycles - proc_6: SSE - 128 byte modified
 12057471 cycles - proc_7: SSE - 128 byte mem4(rrr)
 15605422 cycles - proc_5: SSE - 128 byte xmemcpy(rrr)

These results depend on CPU and OS used. On newer hardware, as demonstrated here, a simple MOVSB-copy will be the fastest.
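
As a side note, the usual way to detect that case at run time is the "enhanced REP MOVSB/STOSB" (ERMSB) flag: CPUID leaf 7, sub-leaf 0, EBX bit 9. A sketch (fragment only; a full check would first confirm that leaf 7 is supported, and remember that CPUID clobbers EBX, which is callee-saved):
Code: [Select]
; returns AL = 1 if the CPU reports enhanced REP MOVSB/STOSB
        mov  eax, 7
        xor  ecx, ecx
        cpuid                   ; leaf 7, sub-leaf 0
        xor  eax, eax
        bt   ebx, 9             ; EBX bit 9 = ERMSB
        setc al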

rrr314159

Re: Faster Memcopy ...
« Reply #29 on: March 10, 2015, 05:03:59 PM »
Quote from: dedndave
from my experience, code that aligns for a loop can be either "pretty" or "fast" - lol

- when I first saw that a couple days ago I thought my approach was both pretty and fast, but I forebore to disagree: I was taught not to argue with my elders :biggrin: But now I see you're right, it wasn't all that fast; and since it's not the fastest, it's not pretty either

Quote from: keepingrealbusy
Thanks for giving me the tail end credit. Most would grab the code and run.
I have a stock answer when I'm told that. You can't blame those ("most") people: if they didn't grab credit for others' work, they'd never get credit for anything.

Quote from: hutch
It was slower than the REP MOVSD code.
I've got to check out REP MOVS* instructions. Fact is I've never used them! When I came here in January I was still using 1980 era techniques (Z80); I've learned a bunch of modern stuff, like macros and SIMD, but to me REP MOVSD is still "new-fangled"

Quote from: nidud
One of the problems with these tests is that code location is sensitive to timings. ... The algo to test is then handled as data, and in this case as a .BIN file created by JWASM -bin ?.asm. To add a function to the test you simply add a new file.
Sure, that seems like an excellent approach.

I switched the order of the algos in my test bed and may have experienced the phenomenon you're talking about: on one category, ALIGNED 2014, the two algos pretty much reversed their timings, as though it ONLY depended on location! Other numbers weren't affected, although the numbers are not very stable anyway, so it's hard to say. So, I simply copy-pasted a dozen runs before the final three, for which I report the numbers (still with my algo in what may be the "worst" position, first). It takes longer but it's a lot more stable. Here are the first 4, and only, runs I made:

Code: [Select]
  mem2 = rrr/nidud modified; mem3 = nidud; mem4 = rrr latest

BUFFERS ALIGNED

thecount        mem2(314)       mem3(245)       mem4(294)


     8        2969            2759            2566

    31        2706            2651            2559

   271        6891            8023            6884

  2014       15540           39733           15315

262159     5258012         5218352         5255524

BUFFERS MISALIGNED src 11 dest 7

thecount        mem2(314)       mem3(245)       mem4(294)


     8        1441            1350            1386

    31        1402            1346            1499

   271        3655            5698            3788

  2014       13863           26579           14449

262159     5318643         5385829         5317655
Press any key to continue ...

C:\Users\r\Desktop\nidud2\test_memcopy_algos change order>..\doj32 test

***********
ASCII build
***********

test.asm: 531 lines, 3 passes, 3656 ms, 0 warnings, 0 errors

BUFFERS ALIGNED

thecount        mem2(314)       mem3(245)       mem4(294)


     8        2688            2730            2538

    31        1452            1377            1279

   271        3468            4053            3338

  2014       13658           25588           13532

262159     5259190         5255524         5259200

BUFFERS MISALIGNED src 11 dest 7

thecount        mem2(314)       mem3(245)       mem4(294)


     8        1465            1510            1352

    31        1388            1327            1286

   271        3577            5337            3571

  2014       13922           25698           14171

262159     5370304         5417275         5319313
Press any key to continue ...

C:\Users\r\Desktop\nidud2\test_memcopy_algos change order>..\doj32 test

***********
ASCII build
***********

test.asm: 531 lines, 3 passes, 3672 ms, 0 warnings, 0 errors

BUFFERS ALIGNED

thecount        mem2(314)       mem3(245)       mem4(294)


     8        2689            2730            2539

    31        1428            1437            1429

   271        3441            4143            3521

  2014       13636           25594           13482

262159     5211287         5219878         5276594

BUFFERS MISALIGNED src 11 dest 7

thecount        mem2(314)       mem3(245)       mem4(294)


     8        1409            1378            1335

    31        1449            1387            1281

   271        3603            5482            3889

  2014       13870           25605           14170

262159     5372493         5509407         5361170
Press any key to continue ...

C:\Users\r\Desktop\nidud2\test_memcopy_algos change order>..\doj32 test

***********
ASCII build
***********

test.asm: 531 lines, 3 passes, 3672 ms, 0 warnings, 0 errors

BUFFERS ALIGNED

thecount        mem2(314)       mem3(245)       mem4(294)


     8        2709            2754            2550

    31        1439            1407            1353

   271        3529            4226            3449

  2014       14126           25590           13531

262159     5209791         5210645         5211581

BUFFERS MISALIGNED src 11 dest 7

thecount        mem2(314)       mem3(245)       mem4(294)


     8        1416            1385            1358

    31        1397            1338            1291

   271        3483            5454            3570

  2014       13910           25713           14117

262159     5315510         5381937         5314749
Press any key to continue ...

My latest is still ahead (27 out of 40) but the differences are pretty negligible. Not surprising since the two algos are now using, with 4 or 5 exceptions, the same techniques. The only significant conclusion is: both are definitely better than your original mem3 in the middle range (271 and 2014). BTW this reminds me of a comment someone made in an old post: seems everybody's algo is best when run on their own machines. If anyone wants to check my results, the code is posted above, but reverse the order of the algos and copy/paste a dozen copies or so; may make a little difference.

Apparently your machine doesn't do 128-bit (XMM) moves in one go; many older machines do them as two separate 64-bit operations, completely negating any advantage. That's why your 64-byte block worked so well; I'm glad you checked that out. I think the basic point is demonstrated: larger copy blocks (the largest a given machine supports efficiently) do give a bit more speed. BTW my machine has a similar problem with AVX 256; it does them as 2 separate 128s, so it's actually slower for this task (a little better for some others, though).

... But none of it matters if hutch, and the ref you mention later, are right: that MOVS* is better on newer hardware.

Well I reckon we've beaten memory-moving to death! I'm working on some much more interesting fast-algos, hope to post something in a couple days. Thanks for the exercise, and various tips!