from my experience, code that aligns for a loop can be either "pretty" or "fast" - lol
- When I first saw that a couple of days ago I thought my approach was both pretty and fast, but I forbore to disagree: I was taught not to argue with my elders.

But now I see you're right: it wasn't all that fast; and since it's not the fastest, it's not pretty either.
Thanks for giving me the tail-end credit. Most would grab the code and run.
I have a stock answer when I'm told that: you can't blame those ("most") people; if they didn't grab credit for others' work, they'd never get credit for anything.
It was slower than the REP MOVSD code.
I've got to check out the REP MOVS* instructions. Fact is, I've never used them! When I came here in January I was still using 1980-era techniques (Z80); I've learned a bunch of modern stuff, like macros and SIMD, but to me REP MOVSD is still "new-fangled".
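For my own notes (and for anyone else who's never touched the string instructions), here's the shape of the thing as I understand it. A minimal sketch only, not anyone's posted routine; the name mem_movsd, the (dest, src, thecount) argument order, and the tail handling are my own guesses:

        .686
        .model flat, stdcall
        .code

; Minimal sketch of a REP MOVSD copy (MASM/JWASM, 32-bit).
; Forward copy only: assumes DF is clear (as the Win32 ABI guarantees)
; and that the buffers don't overlap.
mem_movsd proc uses esi edi dest:DWORD, src:DWORD, thecount:DWORD
        mov     edi, dest           ; EDI -> destination
        mov     esi, src            ; ESI -> source
        mov     ecx, thecount
        mov     edx, ecx
        shr     ecx, 2              ; number of whole dwords
        rep     movsd               ; copy 4 bytes per iteration
        mov     ecx, edx
        and     ecx, 3              ; 0..3 leftover bytes
        rep     movsb               ; copy the tail
        ret
mem_movsd endp

        end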
One of the problems with these tests is that the timings are sensitive to code location. ... The algo to test is then handled as data, in this case as a .BIN file created by JWASM -bin ?.asm. To add a function to the test you simply add a new file.
Sure, that seems like an excellent approach.
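If I've pictured your harness right, something along these lines would pull one algo's image into executable memory and hand back a pointer to call. This is only a rough sketch with made-up names (load_algo, algofile) and the stock MASM32 include paths, error handling left out; it's not your actual test bed:

        .686
        .model flat, stdcall
        option casemap:none
        include \masm32\include\windows.inc
        include \masm32\include\kernel32.inc
        includelib \masm32\lib\kernel32.lib

        .data
algofile    db "mem3.bin",0     ; e.g. built with: jwasm -bin mem3.asm (placeholder name)

        .data?
hFile       dd ?
codesize    dd ?
pCode       dd ?
bytesread   dd ?

        .code
; Rough sketch: load one algo's .BIN image into executable memory so it
; can be timed as data, independent of where the test bed's own code sits.
load_algo proc
        invoke  CreateFile, ADDR algofile, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL
        mov     hFile, eax
        invoke  GetFileSize, hFile, NULL
        mov     codesize, eax
        invoke  VirtualAlloc, NULL, codesize, MEM_COMMIT, PAGE_EXECUTE_READWRITE
        mov     pCode, eax
        invoke  ReadFile, hFile, pCode, codesize, ADDR bytesread, NULL
        invoke  CloseHandle, hFile
        mov     eax, pCode          ; the timing loop can then "call eax"
        ret
load_algo endp
        end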
I switched the order of the algos in my test bed and may have experienced the phenomenon you're talking about: in one category, ALIGNED 2014, the two algos pretty much reversed their timings, as though it depended ONLY on location! The other numbers weren't affected, although the numbers aren't very stable anyway, so it's hard to say. So I simply copied and pasted a dozen runs into the test bed ahead of the final three, and those final three are the numbers I report (still with my algo in what may be the "worst" position, first). It takes longer, but it's a lot more stable. Here are the four runs I made (the first four, and the only four):
mem2 = rrr/nidud modified; mem3 = nidud; mem4 = rrr latest
BUFFERS ALIGNED
thecount mem2(314) mem3(245) mem4(294)
8 2969 2759 2566
31 2706 2651 2559
271 6891 8023 6884
2014 15540 39733 15315
262159 5258012 5218352 5255524
BUFFERS MISALIGNED src 11 dest 7
thecount mem2(314) mem3(245) mem4(294)
8 1441 1350 1386
31 1402 1346 1499
271 3655 5698 3788
2014 13863 26579 14449
262159 5318643 5385829 5317655
Press any key to continue ...
C:\Users\r\Desktop\nidud2\test_memcopy_algos change order>..\doj32 test
***********
ASCII build
***********
test.asm: 531 lines, 3 passes, 3656 ms, 0 warnings, 0 errors
BUFFERS ALIGNED
thecount mem2(314) mem3(245) mem4(294)
8 2688 2730 2538
31 1452 1377 1279
271 3468 4053 3338
2014 13658 25588 13532
262159 5259190 5255524 5259200
BUFFERS MISALIGNED src 11 dest 7
thecount mem2(314) mem3(245) mem4(294)
8 1465 1510 1352
31 1388 1327 1286
271 3577 5337 3571
2014 13922 25698 14171
262159 5370304 5417275 5319313
Press any key to continue ...
C:\Users\r\Desktop\nidud2\test_memcopy_algos change order>..\doj32 test
***********
ASCII build
***********
test.asm: 531 lines, 3 passes, 3672 ms, 0 warnings, 0 errors
BUFFERS ALIGNED
thecount mem2(314) mem3(245) mem4(294)
8 2689 2730 2539
31 1428 1437 1429
271 3441 4143 3521
2014 13636 25594 13482
262159 5211287 5219878 5276594
BUFFERS MISALIGNED src 11 dest 7
thecount mem2(314) mem3(245) mem4(294)
8 1409 1378 1335
31 1449 1387 1281
271 3603 5482 3889
2014 13870 25605 14170
262159 5372493 5509407 5361170
Press any key to continue ...
C:\Users\r\Desktop\nidud2\test_memcopy_algos change order>..\doj32 test
***********
ASCII build
***********
test.asm: 531 lines, 3 passes, 3672 ms, 0 warnings, 0 errors
BUFFERS ALIGNED
thecount mem2(314) mem3(245) mem4(294)
8 2709 2754 2550
31 1439 1407 1353
271 3529 4226 3449
2014 14126 25590 13531
262159 5209791 5210645 5211581
BUFFERS MISALIGNED src 11 dest 7
thecount mem2(314) mem3(245) mem4(294)
8 1416 1385 1358
31 1397 1338 1291
271 3483 5454 3570
2014 13910 25713 14117
262159 5315510 5381937 5314749
Press any key to continue ...
My latest is still ahead (27 out of 40), but the differences are pretty negligible. Not surprising, since the two algos now use, with 4 or 5 exceptions, the same techniques. The only significant conclusion is that both are definitely better than your original mem3 in the middle range (271 and 2014). BTW, this reminds me of a comment someone made in an old post: it seems everybody's algo is best when run on its author's own machine. If anyone wants to check my results, the code is posted above, but reverse the order of the algos and copy/paste a dozen copies of the run or so; it may make a little difference.
Apparently your machine doesn't do 128-bit (XMM) moves right; many older machines execute them as two separate 64-bit moves, completely negating any advantage. That's why your 64 block worked so well; I'm glad you checked that out. I think the basic point is demonstrated: larger copy blocks (the largest a given machine supports efficiently) do give a bit more speed. BTW, my machine has a similar problem with 256-bit AVX: it does those as two separate 128-bit moves, so it's actually slower for this task (a little better for some others, though).
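Just to pin down what I mean by "larger copy blocks": the inner loop I have in mind moves 64 bytes per pass through four XMM registers, roughly like the sketch below. The labels, register choices, and the assumption that thecount is an exact multiple of 64 (no head/tail handling) are purely illustrative:

        .686
        .xmm
        .model flat, stdcall
        .code

; Illustrative inner loop only: copies thecount bytes in 64-byte blocks
; using unaligned XMM moves; assumes thecount is a multiple of 64 and
; the buffers don't overlap.
xmm_block_copy proc uses esi edi dest:DWORD, src:DWORD, thecount:DWORD
        mov     esi, src
        mov     edi, dest
        mov     ecx, thecount
        shr     ecx, 6              ; number of 64-byte blocks
@@:
        movdqu  xmm0, [esi]
        movdqu  xmm1, [esi+16]
        movdqu  xmm2, [esi+32]
        movdqu  xmm3, [esi+48]
        movdqu  [edi],    xmm0
        movdqu  [edi+16], xmm1
        movdqu  [edi+32], xmm2
        movdqu  [edi+48], xmm3
        add     esi, 64
        add     edi, 64
        dec     ecx
        jnz     @B
        ret
xmm_block_copy endp
        end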
... But none of it matters if hutch, and the reference you mention later, are right that MOVS* is better on newer hardware.
Well, I reckon we've beaten memory-moving to death! I'm working on some much more interesting fast algos; I hope to post something in a couple of days. Thanks for the exercise, and for the various tips!