Well, try it yourself... I have rearranged the columns so that you can see your algo side-by-side with the CRT memcpy and (to the right) good ol' rep movs.
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
Algo            memcpy memcpy_S   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description        CRT     Guga rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32   Habran
                                 dest-al    psllq CeleronM  dest-al   src-al  library
Code size            ?       88       59      291      222      200      269       33      104
----------------------------------------------------------------------------------------------
2048, d0s0-0       546      550      564      359      424      424      359      561      541
2048, d1s1-0       719      800      573      410      482      473      425     1060      797
2048, d7s7-0       723      799      574      412      474      474      411     1059      797
2048, d7s8-1       809      543      835     1017      563      566      567      802      543
2048, d7s9-2       809      799      832     1017      563      566      567     1058      797
2048, d8s7+1       802      544      808      872      563      572      565      804      605
2048, d8s8-0       728      541      550      404      479      465      402      553      541
2048, d7s3+4       561      798      573      901      576      577      573     1058      802
2048, d3s7-4       551      802      575     1040      563      566      567     1060      814
2048, d8s9-1       820      544      809      994      576      575      567      804      606
2048, d9s7+2       808      798      846      862      565      564      571     1066      798
2048, d9s8+1       808      543      846      862      563      566      565      803      543
2048, d9s9-0       721      808      574      411      474      487      409     1061      798
2048, d15s15       723      798      588      410      470      471      408     1059      814
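
In case the row labels are not obvious, here is a rough C sketch of how I would set up one such case. It assumes a label like "2048, d7s8-1" means: copy 2048 bytes, dest at a 16-byte aligned address plus 7, source at an aligned address plus 8, with the suffix being dest offset minus source offset (that reading fits every row above, but it is my reading, not something the table states). The names align_up, rel_misalign and copy_case are made up for this illustration, the copy uses the plain CRT memcpy, and the timing itself is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define ALIGN 16  /* assumed base alignment; movdqa wants 16-byte aligned addresses */

/* Hypothetical helper: round a pointer up to the next ALIGN boundary. */
static unsigned char *align_up(unsigned char *p)
{
    uintptr_t a = (uintptr_t)p;
    assert((ALIGN & (ALIGN - 1)) == 0);  /* the mask trick needs a power of two */
    return (unsigned char *)((a + ALIGN - 1) & ~(uintptr_t)(ALIGN - 1));
}

/* Relative misalignment as it appears in the labels: d7s8 -> -1, d8s7 -> +1. */
static int rel_misalign(int d_off, int s_off)
{
    return d_off - s_off;
}

/* Copy `count` bytes from (aligned source + s_off) to (aligned dest + d_off)
   and verify the result. Returns 1 if the copy is correct, 0 otherwise
   (including bad offsets or a failed allocation). */
static int copy_case(size_t count, int d_off, int s_off)
{
    size_t room = count + 2 * ALIGN;  /* slack for aligning plus the offset */
    unsigned char *src_raw, *dst_raw, *src, *dst;
    size_t i;
    int ok = 0;

    if (d_off < 0 || d_off >= ALIGN || s_off < 0 || s_off >= ALIGN)
        return 0;

    src_raw = malloc(room);
    dst_raw = malloc(room);
    if (src_raw && dst_raw) {
        src = align_up(src_raw) + s_off;
        dst = align_up(dst_raw) + d_off;
        for (i = 0; i < count; i++)
            src[i] = (unsigned char)(i * 31 + 7);  /* recognisable pattern */
        memset(dst, 0, count);
        memcpy(dst, src, count);                   /* algo under test goes here */
        ok = (memcmp(dst, src, count) == 0);
    }
    free(src_raw);
    free(dst_raw);
    return ok;
}
```

With that, copy_case(2048, 7, 8) would correspond to the "2048, d7s8-1" row, and swapping memcpy for another routine with the same arguments tests that routine on the same addresses.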