Thanks all for your comments! Here's what I wrote before I saw them, and I think it agrees with your general attitude.
This new version aligns the destination, and at 20483 bytes it is the fastest (vs. every algo at every alignment). It's particularly clear on AMD, but it holds on Intel too. At 100000 bytes it's hard to say what's fastest: a few are very close, and it varies between the two computers; you can look at the numbers. Certainly, my algo is no longer the "winner" at that size.
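For readers unfamiliar with the idea, "destination aligned" can be sketched in C as below. This is only an illustration of the general technique, not the xmemcpy code from the zip: the name copy_dest_aligned is made up, and the 16-byte block copy is written with memcpy, on the assumption that a compiler will turn it into whatever wide move suits the target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: byte-copy until the destination is 16-byte aligned,
   then copy 16-byte blocks, then finish the leftover tail. */
static void *copy_dest_aligned(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Head: number of bytes needed to reach the next 16-byte boundary (0..15). */
    size_t head = (size_t)(-(uintptr_t)d & 15u);
    if (head > n)
        head = n;
    for (size_t i = 0; i < head; i++)
        d[i] = s[i];
    d += head;
    s += head;
    n -= head;

    /* Invariant: the destination is now aligned unless nothing is left. */
    assert(n == 0 || ((uintptr_t)d & 15u) == 0);

    /* Body: aligned 16-byte stores; the source may still be unaligned. */
    while (n >= 16) {
        memcpy(d, s, 16);
        d += 16;
        s += 16;
        n -= 16;
    }

    /* Tail: fewer than 16 bytes remain. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

Only the stores are guaranteed aligned here; when source and destination offsets differ (the d7s8-style rows), the loads stay unaligned.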
There are some obvious things to do to speed it up at the higher byte counts, but I'm sick of this; it's a pursuit well suited to masochists. First, it's very frustrating to get good timing numbers. Second, a change can be good in one circumstance (byte count, alignment, phase of the moon) but bad in another; it also varies between computers, with other programs running, etc. Third, the algo (which started out simple) is getting longer and longer, and harder to understand, debug, and deal with. Fourth, I noticed while reviewing old optimization pages (like Mark Larson's, a few years old) that tricks that worked back then are obsolete now! Branch prediction is totally different from what he optimized for. Unaligned moves are pretty much just as fast as aligned. Non-temporal moves are completely useless on my machines. And so forth. So all this effort will be meaningless soon. Fifth, when you get right down to it, who cares about a few nanoseconds one way or the other?
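One common way to tame noisy timings is to run the same copy several times and keep the smallest reading. The C sketch below shows that idea only; it is not how the test program in the zip measures things. The names best_of, copy_job, and run_copy_job are made up for illustration, and clock() is used simply because it is standard C (its resolution is coarse, so a single 20483-byte copy may well read as zero).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef void (*bench_fn)(void *ctx);

/* Hypothetical helper: run fn(ctx) reps times and return the smallest
   CPU time of any single run, in seconds. */
static double best_of(bench_fn fn, void *ctx, int reps)
{
    assert(reps > 0);
    double best = -1.0;
    for (int i = 0; i < reps; i++) {
        clock_t t0 = clock();
        fn(ctx);
        clock_t t1 = clock();
        double dt = (double)(t1 - t0) / CLOCKS_PER_SEC;
        if (best < 0.0 || dt < best)
            best = dt;
    }
    return best;
}

/* Example workload (also hypothetical): one copy of n bytes per call. */
struct copy_job {
    unsigned char *dst;
    const unsigned char *src;
    size_t n;
    int calls;
};

static void run_copy_job(void *ctx)
{
    struct copy_job *job = ctx;
    memcpy(job->dst, job->src, job->n);
    job->calls++;
}
```

In practice the workload would loop over many copies per call so that each reading is well above the timer resolution.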
The zip contains MemCopyAlgosrrr.asm and .exe, same as previous posting.
[edit] fixed 3 small things in zip, now it's cleaner; functionality unchanged, saves a couple nanoseconds, 2015/3/6
--- ok ---
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz (SSE4)
Algo            memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description        CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 rrr314's
                        dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size            ?       70      291      222      200      269       33      206
-------------------------------------------------------------------------------------
20483, d0s0-      2821     2750     2804     2834     2841     2834     2777     2717
20483, d1s1-      2919     2784     2859     2831     2816     2863     3045     2737
20483, d7s7-      2885     2794     2805     2787     2787     2829     3029     2678
20483, d7s8-      9093     9095     5935     5696     3139     3096     9212     2830
20483, d7s9-      9092     9092     5940     5700     3139     3136     9361     2887
20483, d8s7+      3244     3146     5951     5249     3099     3089     3223     2931
20483, d8s8-      2919     2783     2795     2821     2796     2785     3035     2683
20483, d8s9-      9087     9089     5935     5700     3145     3092     9067     2893
20483, d9s7+      3258     3148     5949     5249     3086     3109     3245     2868
20483, d9s8+      3247     3158     5946     5243     3108     3142     3231     2890
20483, d9s9-      2920     2787     2843     2848     2862     2866     3067     2721
20483, d15s1      2902     2801     2858     2810     2839     2831     3052     2668
--- ok ---
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz (SSE4)
Algo            memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description        CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 rrr314's
                        dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size            ?       70      291      222      200      269       33      206
-------------------------------------------------------------------------------------
100000, d0s0     13082    12709    12938    12972    12974    12942    12721    12956
100000, d1s1     13099    12702    12994    12958    12967    13022    13664    13020
100000, d7s7     13081    12739    13022    12983    12961    12991    13663    12999
100000, d7s8     13652    13312    24752    19100    13683    13696    13654    13537
100000, d7s9     13652    13267    24698    19113    13629    13704    13652    13624
100000, d8s7     13640    13315    24653    20291    13659    13684    13658    13570
100000, d8s8     13023    12695    12979    12983    12984    12966    13653    12951
100000, d8s9     13599    13287    24610    19086    13583    13645    13642    13647
100000, d9s7     13602    13305    25021    20289    13618    13634    13670    13578
100000, d9s8     13652    13317    24692    20279    13713    13681    13643    13533
100000, d9s9     13035    12718    12954    12993    12989    13064    13636    12974
100000, d15s     13071    12738    12989    13002    12948    13028    13658    12927
--- ok ---
AMD A6-6310 APU with AMD Radeon R4 Graphics (SSE4)
Algo            memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description        CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 rrr314's
                        dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size            ?       70      291      222      200      269       33      206
-------------------------------------------------------------------------------------
20483, d0s0-      5989     5914     5920     5634     5725     5723     5743     4819
20483, d1s1-      6143     5738     5653     5616     5600     5506     9796     4846
20483, d7s7-      6018     5656     5694     5601     5629     5602     9668     4869
20483, d7s8-     10563    10941     9734     9322     7603     7613     9132     6136
20483, d7s9-     10578    10961     9736     9324     7613     7637     9994     6116
20483, d8s7+     10728    10852     7924     8886     7223     7194    10801     6290
20483, d8s8-      5960     5617     5682     5621     5606     5624     5615     4847
20483, d8s9-     10624    10974     9740     9322     7714     7608    10651     6106
20483, d9s7+     10787    10809     7824     8919     7190     7189     9878     6206
20483, d9s8+     10810    10818     7842     8917     7253     7223     8821     6201
20483, d9s9-      6059     5650     5675     5594     5577     5614     9622     4863
20483, d15s1      5861     5594     5680     5631     5568     5648     9694     4765
--- ok ---
AMD A6-6310 APU with AMD Radeon R4 Graphics (SSE4)
Algo            memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description        CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 rrr314's
                        dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size            ?       70      291      222      200      269       33      206
-------------------------------------------------------------------------------------
100000, d0s0     24435    22509    22496    22415    22002    22048    22183    22259
100000, d1s1     22379    22313    22026    22110    21864    21799    43627    22193
100000, d7s7     22365    22331    22055    22006    21805    22071    43413    22149
100000, d7s8     48009    47571    31597    32397    25654    24978    39742    24342
100000, d7s9     48044    47574    31809    32465    25620    24905    43194    24367
100000, d8s7     49095    49177    31628    31530    25843    24612    49296    25211
100000, d8s8     22408    22264    22047    22044    21643    21890    22259    22186
100000, d8s9     47976    47620    31494    32418    25651    24999    47788    24421
100000, d9s7     49815    49884    31259    31463    25324    24670    44214    25206
100000, d9s8     49755    49189    31743    31498    25266    24655    39641    25244
100000, d9s9     22411    22149    21979    22138    21807    21903    43295    22222
100000, d15s     22365    22310    22133    22023    21826    21728    43418    22051