I'll be darned - this is exactly the idea we've been beating to death in the laboratory! I'd say William Chan stole it from me, except his version predates mine by 9 years, so it would be a hard sell. But this algo has major drawbacks.
Note, though, that you'll need to give it 16-byte-aligned memory, and that it copies in 128-byte blocks.
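To make those two preconditions concrete, here is a minimal C sketch of a check a caller could run before handing buffers to such a routine. The names block_copy_ok, round_up_to_block, COPY_ALIGN and COPY_BLOCK are made up for illustration and are not taken from the algo being discussed; only the numbers 16 and 128 come from the post above.

```c
#include <stddef.h>
#include <stdint.h>

#define COPY_ALIGN 16u   /* alignment the routine is said to require */
#define COPY_BLOCK 128u  /* block size the routine is said to copy in */

/* Hypothetical helper: returns 1 if both pointers are 16-byte aligned
   and the length is a whole number of 128-byte blocks, 0 otherwise. */
static int block_copy_ok(const void *dst, const void *src, size_t n)
{
    return ((uintptr_t)dst % COPY_ALIGN == 0)
        && ((uintptr_t)src % COPY_ALIGN == 0)
        && (n % COPY_BLOCK == 0);
}

/* Hypothetical helper: rounds a length up to the next multiple of 128,
   e.g. to size a buffer so a block-only copy never runs past its end. */
static size_t round_up_to_block(size_t n)
{
    return (n + COPY_BLOCK - 1) / COPY_BLOCK * COPY_BLOCK;
}
```

If block_copy_ok returns 0, the caller would have to fall back to some other copy, since the block routine as described cannot handle that case.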
Also, prefetchnta seems useless, and movntdq worse than useless, on my modern machine. Admittedly I only tried them once, but I also saw refs saying the same thing: that modern processors don't get much from them. (Of course, you can't trust refs.)
Yes, actually it was never as useful as it was claimed to be, even on not very modern hardware (you know, the "loud words" about "technology advances" are usually mostly just words with a little truth in them).
I find that incrementing edi and esi midway through the list of movs is better. It keeps the max offset down to 30h; there's no reason it should make a difference, but it seems to help. And of course you should dec ebx long before the jnz branch, which maximizes the processor's ability to predict the branch correctly in advance. Minor points, of course; see the laboratory thread for a couple dozen more if interested.
A not-too-big offset does influence the timing, yes; there is no obvious reason for it, but it does.
A much bigger point: the code doesn't support "precise copying". It copies only in 128-byte blocks and doesn't handle a tail shorter than 128 bytes. Very simple code.
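For illustration, "precise copying" could look roughly like the following C sketch: whole 128-byte blocks first, then the leftover bytes one at a time. This is not the code from the blog; copy_with_tail and BLOCK_BYTES are made-up names, and memcpy stands in for the unrolled SSE block body, so the sketch has no alignment requirement of its own.

```c
#include <stddef.h>
#include <string.h>

#define BLOCK_BYTES 128  /* block size mentioned in the thread */

/* Hypothetical helper: copies exactly n bytes from src to dst.
   The buffers are assumed not to overlap. */
static void copy_with_tail(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t blocks = n / BLOCK_BYTES;  /* number of full 128-byte blocks */
    size_t tail = n % BLOCK_BYTES;    /* leftover bytes, 0..127 */

    while (blocks--) {
        /* stand-in for the unrolled block copy of the original algo */
        memcpy(d, s, BLOCK_BYTES);
        d += BLOCK_BYTES;
        s += BLOCK_BYTES;
    }
    while (tail--) {
        *d++ = *s++;                  /* byte-by-byte tail copy */
    }
}
```

With n = 300, for example, this does two block copies (256 bytes) and then copies the remaining 44 bytes individually.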
Dunno what this is doing here, would be more relevant over in the laboratory, but it was such a surprise to see it I had to comment.
It was on the blog that was given as a reference in the Wikipedia article that Jochen pointed to; that should probably be clear from the posts above. And, being a "Real Lazy Coder" (TM), I did not bother to post that link in the memcopy thread, as it was not open in the browser. I had read that thread earlier too (I did not post there, as I tend to agree with Hutch's and Jochen's point of view on that subject), so I knew the topic was being discussed on the forum at the time, and that's why I posted the link to some "unknown memcopy" algo here.