Faster Memcopy ...

rrr314159 · March 03, 2015, 02:40:50 PM

In this thread Habran, jj2007 and others submitted memory copy algorithms that were tested in jj's program MemCopyAlgos, which ran them on a block of 2048 bytes with all possible alignments. (BTW the thread is a lot of fun to read). I substituted a new algo for Habran's "Ferrari" (because I had trouble inserting it elsewhere in the program.)

This new algo is fastest on every measure: beat every other algo on every alignment. Here are typical runs on Intel and AMD machines (rrr314's Ferrari in last column):

Code Select

Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz (SSE4)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 rrr314's
                       dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size           ?       70      291      222      200      269       33      114
------------------------------------------------------------------------------------
2048, d0s0-0      159      199      242      239      239      218      201      144
2048, d1s1-0      255      231      263      270      258      249      257      208
2048, d7s7-0      256      234      265      272      260      250      257      208
2048, d7s8-1      260      244      657      473      245      248      257      216
2048, d7s9-2      259      245      657      473      245      248      257      208
2048, d8s7+1      252      239      641      471      246      234      257      208
2048, d8s8-0      256      234      288      290      283      268      257      208
2048, d8s9-1      252      243      649      474      245      248      257      208
2048, d9s7+2      257      239      634      471      246      237      257      209
2048, d9s8+1      257      239      634      471      246      237      257      208
2048, d9s9-0      256      234      275      272      268      253      257      208
2048, d15s15      256      235      274      274      272      254      257      208


--- ok ---

AMD A6-6310 APU with AMD Radeon R4 Graphics     (SSE4)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 rrr314's
                       dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size           ?       70      291      222      200      269       33      157
------------------------------------------------------------------------------------
2048, d0s0-0      170      250      262      326      327      274      278      162
2048, d1s1-0      419      385      384      389      494      348      796      238
2048, d7s7-0      375      477      350      436      407      357      758      245
2048, d7s8-1      628      682      699      402      250      303      617      224
2048, d7s9-2      598      665      757      420      250      296      621      223
2048, d8s7+1      603      643      702      409      249      257      617      219
2048, d8s8-0      356      345      315      367      386      318      298      228
2048, d8s9-1      609      651      701      399      249      298      591      217
2048, d9s7+2      587      661      690      405      274      244      596      225
2048, d9s8+1      587      660      707      403      249      250      604      225
2048, d9s9-0      356      376      347      405      386      331      667      224
2048, d15s15      356      374      334      398      376      331      680      224


--- ok ---

It simply uses all 8 xmm registers to move a large chunk of 128 bytes per loop ... nothing fancy at all. the algo is also shorter, at 157 bytes, than all but the most basic competitors.

It could be made somewhat faster, but I didn't bother - just wanted to prove the point that this basic idea works. BTW I have found the idea on the net since my last post; google "Mark Larson optimize assembler" for a discussion. Also see FASM's discussion board. I'm surprised the idea is so obscure. Still haven't found the "more advanced" techniques used in my "faster tokenizer" algo - which makes me wonder if there's something wrong with them.

It's written in 32-bits; learned my lesson from last time, there's nothing non-standard about it.

Unfortunately sometimes the display might not fit right on your command window; but I didn't change anything in jj's program more than necessary. Also note that the very first number generated (upper left hand corner) is, I believe, artificially low due to a start-up problem I intend to discuss in more detail elsewhere (my algo beat it anyway).

The zip contains jj's testing program, renamed MemCopyAlgosrrr; and the exe thereof. My algo is in it, called "xmemcpy".

[edit] a more complete version, faster for larger memory copies, is in the zip a few posts below

[edit] even more complete version is in reply 25, next page. But the one here is a lot simpler (half as many lines) and shows the basic idea: using all 8 registers to copy 128 bit chunks.

hutch-- · March 03, 2015, 02:55:16 PM

rrr,

Something we determined a while ago is that REP MOVS/B/D is slow under about 500 bytes. After that internal special case circuitry kicks in and it gets a lot faster. Have you had a paddle with different buffer sizes for the copy operation ?

rrr314159 · March 03, 2015, 03:51:52 PM

No; I didn't "optimize" it specifically for 2048 (the standard number used in that thread) but, didn't try it in any other region. As mentioned, just wanted to demonstrate feasibility of the idea. But I'll give it a shot in a while and let you know

rrr314159 · March 03, 2015, 05:17:16 PM

Ran it at 20480 and 100000. My algo still does alright, is best at some settings. But at 20480, MemCoP4 and MemCoC2 are overall ahead, my algo about third. At 100000 it's surprising how neck-and-neck they all are, except MemCo2 and MemCoC3 which are slower.

I'm using all the xmm reg's, all unaligned. So for 20480 I bet I can still beat them all, since the two current winners are using just one or two xmm's, but aligned (either source or dest). So I'll try aligning one or the other also, and (I *think*) the strategy of using all 8 will give me the advantage. At least, it's proved effective in the 2 cases I've tried so far. Tomorrow I'll check it out.

However dunno about the larger numbers (100,000). It's quite amazing how all the different algorithms converge to the same speed up there (with 2 exceptions) - did u know they'd do that? The 2 xmm-aligned algos are only a little better, less than 1%. BTW jj has a comment recommending not to exceed 100,000 with this program; someday I'll check out some bigger numbers with my own test prog.

Well, thanks for the suggestion! Here are the numbers I got:

Code Select


Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz (SSE4)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 rrr314's
                       dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size           ?       70      291      222      200      269       33      157
------------------------------------------------------------------------------------
20480, d0s0-     2796     2760     2795     2835     2803     2786     2761     2640
20480, d1s1-     2935     2782     2809     2815     2838     2811     3036     4244
20480, d7s7-     2899     2786     2816     2848     2848     2804     3038     4343
20480, d7s8-     9095     9091     5962     5716     3156     3125     9210     3114
20480, d7s9-     9092     9090     5968     5720     3129     3124     9357     3101
20480, d8s7+     3257     3163     5988     5270     3149     3159     3225     3240
20480, d8s8-     2923     2793     2888     2825     2840     2835     3040     4330
20480, d8s9-     9086     9091     5962     5721     3165     3126     9065     3093
20480, d9s7+     3253     3159     5977     5264     3156     3131     3237     3205
20480, d9s8+     3267     3166     5975     5271     3115     3156     3256     3212
20480, d9s9-     2880     2826     2808     2833     2870     2861     3043     4339
20480, d15s1     2912     2822     2867     2820     2835     2844     3029     4269

--- ok ---

Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz (SSE4)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 rrr314's
                       dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size           ?       70      291      222      200      269       33      157
------------------------------------------------------------------------------------
100000, d0s0    12981    12716    12930    12924    12930    12969    12717    12962
100000, d1s1    13047    12732    12978    12963    12992    12907    13614    13905
100000, d7s7    13046    12729    12946    12982    12944    12989    13624    13938
100000, d7s8    13655    13281    24727    19108    13681    13681    13639    13951
100000, d7s9    13641    13254    24695    19110    13702    13671    13599    13857
100000, d8s7    13588    13305    24930    20277    13651    13703    13622    13961
100000, d8s8    13075    12739    12986    12946    12973    12960    13650    13950
100000, d8s9    13619    13281    24846    19103    13681    13689    13648    13937
100000, d9s7    13649    13284    25049    20303    13680    13668    13656    13978
100000, d9s8    13643    13293    25028    20336    13583    13586    13620    13950
100000, d9s9    13036    12728    13005    12963    12962    12961    13636    13943
100000, d15s    13036    12708    12942    12965    12922    12931    13656    13950


--- ok ---

dedndave · March 03, 2015, 06:01:29 PM

Quote from: rrr314159 on March 03, 2015, 05:17:16 PM
At 100000 it's surprising how neck-and-neck they all are...

a sign that you may be hitting hardware limitations

jj2007 · March 03, 2015, 06:25:51 PM

It is certainly fast, and could be an option if you have a special need for repeatedly copying large blocks.

For MasmBasic, I chose rep movsd in the end, both on grounds of speed and convenience: It is a frequently used algo, and I wouldn't like a wrapper for preserving the xmm regs for each and every little memcopy.

hutch-- · March 03, 2015, 08:02:53 PM

Its been a long time since I did any benchmarking on memcopy but I found REP MOVS/B/D seemed to have all of the advantages on cached reads, non-cached writes and the optimal prefetch and I think Intel did the special case circuitry because its so commonly used over a long time. On sub 500 byte repeat copies from memory the incremented pointers (mov EAX/MOV AL) was faster on small buffers. I tend to look at the rest of the algo that does the memcopy to see what is worth the effort, there are enough times when a clunky fussy big mover is a PITA where a simple byte copy if faster than the algo calling it.

rrr314159 · March 04, 2015, 06:17:40 AM

Thanks all for your comments! Here's what I wrote b4 I saw them, and I think it agrees with your general attitude.

This new version has the destination aligned, and is fastest (vs. every algo at every alignment) at 20483 bytes. On AMD it's particularly clear but Intel also. At 100000 it's hard to say what's fastest, there are a few very close; varies between the two computers; you can look at the numbers. Certainly, my algo is no longer the "winner".

There are some obvious things to do to speed it up at the higher byte counts, but I'm sick of this - it's a pursuit well suited to masochists. First, it's very frustrating to get good timing numbers. Second, a change can be good in one circumstance (byte count, alignment, phase of the moon) but bad in another. It varies between computers, other progs running, etc. Third, the algo (which started out simple) is getting longer and longer, harder to understand/debug/deal with. Fourth, I noticed, reviewing old optimization pages (like Mark Larson's, a few years old) tricks that worked back then are obsolete now! Branch prediction is totally different than what he optimized for. Unaligned moves are pretty much just as fast as aligned. Non-temporal moves are completely useless on my machines. And so forth. So, all this effort will be meaningless soon. Fifth, when u get right down to it who cares about a few nanoseconds one way or the other?

The zip contains MemCopyAlgosrrr.asm and .exe, same as previous posting.

[edit] fixed 3 small things in zip, now it's cleaner; functionality unchanged, saves a couple nanoseconds, 2015/3/6

Code Select

--- ok ---
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz (SSE4)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 rrr314's
                       dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size           ?       70      291      222      200      269       33      206
------------------------------------------------------------------------------------
20483, d0s0-     2821     2750     2804     2834     2841     2834     2777     2717
20483, d1s1-     2919     2784     2859     2831     2816     2863     3045     2737
20483, d7s7-     2885     2794     2805     2787     2787     2829     3029     2678
20483, d7s8-     9093     9095     5935     5696     3139     3096     9212     2830
20483, d7s9-     9092     9092     5940     5700     3139     3136     9361     2887
20483, d8s7+     3244     3146     5951     5249     3099     3089     3223     2931
20483, d8s8-     2919     2783     2795     2821     2796     2785     3035     2683
20483, d8s9-     9087     9089     5935     5700     3145     3092     9067     2893
20483, d9s7+     3258     3148     5949     5249     3086     3109     3245     2868
20483, d9s8+     3247     3158     5946     5243     3108     3142     3231     2890
20483, d9s9-     2920     2787     2843     2848     2862     2866     3067     2721
20483, d15s1     2902     2801     2858     2810     2839     2831     3052     2668


--- ok ---

Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz (SSE4)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 rrr314's
                       dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size           ?       70      291      222      200      269       33      206
------------------------------------------------------------------------------------
100000, d0s0    13082    12709    12938    12972    12974    12942    12721    12956
100000, d1s1    13099    12702    12994    12958    12967    13022    13664    13020
100000, d7s7    13081    12739    13022    12983    12961    12991    13663    12999
100000, d7s8    13652    13312    24752    19100    13683    13696    13654    13537
100000, d7s9    13652    13267    24698    19113    13629    13704    13652    13624
100000, d8s7    13640    13315    24653    20291    13659    13684    13658    13570
100000, d8s8    13023    12695    12979    12983    12984    12966    13653    12951
100000, d8s9    13599    13287    24610    19086    13583    13645    13642    13647
100000, d9s7    13602    13305    25021    20289    13618    13634    13670    13578
100000, d9s8    13652    13317    24692    20279    13713    13681    13643    13533
100000, d9s9    13035    12718    12954    12993    12989    13064    13636    12974
100000, d15s    13071    12738    12989    13002    12948    13028    13658    12927

--- ok ---


AMD A6-6310 APU with AMD Radeon R4 Graphics     (SSE4)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 rrr314's
                       dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size           ?       70      291      222      200      269       33      206
------------------------------------------------------------------------------------
20483, d0s0-     5989     5914     5920     5634     5725     5723     5743     4819
20483, d1s1-     6143     5738     5653     5616     5600     5506     9796     4846
20483, d7s7-     6018     5656     5694     5601     5629     5602     9668     4869
20483, d7s8-    10563    10941     9734     9322     7603     7613     9132     6136
20483, d7s9-    10578    10961     9736     9324     7613     7637     9994     6116
20483, d8s7+    10728    10852     7924     8886     7223     7194    10801     6290
20483, d8s8-     5960     5617     5682     5621     5606     5624     5615     4847
20483, d8s9-    10624    10974     9740     9322     7714     7608    10651     6106
20483, d9s7+    10787    10809     7824     8919     7190     7189     9878     6206
20483, d9s8+    10810    10818     7842     8917     7253     7223     8821     6201
20483, d9s9-     6059     5650     5675     5594     5577     5614     9622     4863
20483, d15s1     5861     5594     5680     5631     5568     5648     9694     4765


--- ok ---

AMD A6-6310 APU with AMD Radeon R4 Graphics     (SSE4)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 rrr314's
                       dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size           ?       70      291      222      200      269       33      206
------------------------------------------------------------------------------------
100000, d0s0    24435    22509    22496    22415    22002    22048    22183    22259
100000, d1s1    22379    22313    22026    22110    21864    21799    43627    22193
100000, d7s7    22365    22331    22055    22006    21805    22071    43413    22149
100000, d7s8    48009    47571    31597    32397    25654    24978    39742    24342
100000, d7s9    48044    47574    31809    32465    25620    24905    43194    24367
100000, d8s7    49095    49177    31628    31530    25843    24612    49296    25211
100000, d8s8    22408    22264    22047    22044    21643    21890    22259    22186
100000, d8s9    47976    47620    31494    32418    25651    24999    47788    24421
100000, d9s7    49815    49884    31259    31463    25324    24670    44214    25206
100000, d9s8    49755    49189    31743    31498    25266    24655    39641    25244
100000, d9s9    22411    22149    21979    22138    21807    21903    43295    22222
100000, d15s    22365    22310    22133    22023    21826    21728    43418    22051

rrr314159 · March 04, 2015, 06:32:41 AM

ps jj2007, I hadn't thought of needing to preserve all those xmm regs, was just considering the problem in isolation from the real world, but you're right. Maybe that's why this idea is so obscure. Still, in special cases, when xmm regs are not being used for anything else, and the speed actually matters, it could be applicable. Also in 64-bit, with 16 xmm's, it makes more sense.

hutch-- · March 04, 2015, 09:15:18 AM

Here are the results on my aging i7.

Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz (SSE4)

Algo memcpy MemCo1 MemCo2 MemCoC3 MemCoP4 MemCoC2 MemCoL xmemcpy
Description CRT rep movs movdqa lps+hps movdqa movdqa Masm32 rrr314's
dest-al psllq CeleronM dest-al src-al library Ferrari
Code size ? 70 291 222 200 269 33 206
------------------------------------------------------------------------------------
20483, d0s0- 3530 3520 3502 3497 3499 3511 3655 3376
20483, d1s1- 3752 3554 3535 3538 3541 3540 3969 3341
20483, d7s7- 3758 3560 3534 3538 3541 3544 3921 3346
20483, d7s8- 8911 8889 7560 7840 4461 4465 9022 3770
20483, d7s9- 8909 8890 7560 7840 4463 4464 9414 3763
20483, d8s7+ 4489 4445 7352 7038 4417 4411 4496 3814
20483, d8s8- 3759 3558 3537 3543 3536 3544 3913 3344
20483, d8s9- 8901 8890 7554 7839 4466 4465 8867 3763
20483, d9s7+ 4490 4448 7345 7038 4413 4411 4499 3815
20483, d9s8+ 4489 4446 7345 7042 4405 4408 4468 3818
20483, d9s9- 3759 3560 3539 3538 3535 3538 3914 3346
20483, d15s1 3581 3559 3530 3528 3526 3535 3951 3311

--- ok ---

KeepingRealBusy · March 04, 2015, 10:23:39 AM

rrr

Mis-alignment can be a problem, especially if both the source and the destination buffers have some mis-alignment, especialy if they have different alignments. The usual case is that one or the other of the buffers can be aligned mod 16. The way to handle these is to move a block of 16 bytes from the start of the buffer and a block of 16 bytes at the end of the buffer with un-aligned XMM moves. Then calculate the size left to be moved (data size - 32) and round this up to the next mod 16, then start at the first mod 16 byte boundary of the un-aligned buffer and move 16 bytes at a time (there will be a small overlap at either end).

Dave.

jj2007 · March 04, 2015, 10:26:58 AM

Quote from: rrr314159 on March 04, 2015, 06:32:41 AM
ps jj2007, I hadn't thought of needing to preserve all those xmm regs, was just considering the problem in isolation from the real world, but you're right. Maybe that's why this idea is so obscure. Still, in special cases, when xmm regs are not being used for anything else, and the speed actually matters, it could be applicable. Also in 64-bit, with 16 xmm's, it makes more sense.

If speed matters, it looks like a really good alternative. As a general purpose algo, it would fail already when looking at the minimum chunk size...

rrr314159 · March 04, 2015, 07:19:34 PM

@KeepingRealBusy,

thx for the suggestion. The way I did it is more obvious; u can check it out in the zip, at beginning of "xmemcpy" proc. Hard to say which is better, probably varies with circumstances. Main advantage of my approach is, it fit with the existing algo; didn't have to change a single line, just added 20, whereas yours would require a total rewrite. It's interesting that these are 2 completely different ways to handle the same seemingly straightforward problem.

@jj2007: just in case of misunderstanding, altho my chunk size is 128, of course the routine handles any smaller number also; for < 128 it uses single xmm reg (chunk=16); less than that, moves byte by byte (chunk = 1 :) Could get more complex, use chunk sizes of 2, 4, 8, and 32, 64; I saw an Agner Fog algo that did that; but it's all, really, of academic interest only, IMHO

@hutch thx for checking it out; give your old machine a rest now, it's earned it ;)

jj2007 · March 04, 2015, 08:15:13 PM

Quote from: rrr314159 on March 04, 2015, 07:19:34 PM@jj2007: just in case of misunderstanding, altho my chunk size is 128, of course the routine handles any smaller number also

Sorry, had not seen that one :redface:

Code Select

    shr ecx, 7                  ; divide by 128 (chunk size)

    loop128:                    ; do chunks of 128
        jle endloop128

rrr314159 · March 05, 2015, 08:00:41 AM

Let me apologize for that sloppy code which aligns dest block. There are 3 places where instructions are unnecessarily executed. I knew about them when I posted it, but 1) makes no difference at all to the final results 2) didn't think anyone would read it, and 3) I was so tired of timing, timing, timing (to paraphrase Hamlet) that the extra 20 seconds to make the changes - and couple minutes to test them - was more than I could stand. If the changes had to be made I would have just chucked the whole thing.

But the original code is fine (I think) and compact enuff that u have to read carefully, as u found out! At one point, condition code is set, then not checked for another 21 lines (of course it's unchanged by those intervening lines). This gives the processor max time to predict the branch correctly, and makes the code harder to read - which, sometimes, I consider a plus :)

The MASM Forum

News:

Faster Memcopy ...

rrr314159

hutch--

rrr314159

rrr314159

dedndave

jj2007

hutch--

rrr314159

rrr314159

hutch--

KeepingRealBusy

jj2007

rrr314159

jj2007

rrr314159