Faster Memcopy ...

Started by rrr314159, March 03, 2015, 02:40:50 PM

dedndave

hello, Alex   :biggrin:

that's a lot of English - lol

Gunther

Hi Alex,

Quote from: dedndave on March 22, 2015, 12:11:12 AM
hello, Alex   :biggrin:

that's a lot of English - lol

but that's good news: Alex is back after a long break and he's back with high quality posts.  :t I missed you.

Gunther
You have to know the facts before you can distort them.

nidud

#62
deleted

Antariy

Quote from: dedndave on March 22, 2015, 12:11:12 AM
hello, Alex   :biggrin:

that's a lot of English - lol

Hello Dave :t

Is that more or less proper English? Sometimes it isn't, yeah :greensml:

Quote from: Gunther on March 22, 2015, 12:21:42 AM
Hi Alex,

Quote from: dedndave on March 22, 2015, 12:11:12 AM
hello, Alex   :biggrin:

that's a lot of English - lol

but that's good news: Alex is back after a long break and he's back with high quality posts.  :t I missed you.

Gunther

Thank you, Gunther! :biggrin: :redface:



Quote from: nidud on March 22, 2015, 02:03:27 AM
Quote from: Antariy on March 21, 2015, 11:08:43 PM

; db 81h,0e2h,0F0h,0FFh,0FFh,0FFh   ; with this db instruction
and edx,dword ptr -10 ; same same ?...


well, here is a similar version using 32 bytes

I really don't understand what that is in the quote attributed to "me"??? What do the words "same same" mean?
Also, what does "similar version" mean, similar to which? To AxStrLenSSE? If so, then it's actually not very similar. Most importantly, your version advances the pointer BEFORE the data fetch. That is the thing I described in the post commenting on rrr's "edition" of the algo, but I did not accept that edit as a replacement for the current AxStrLenSSE implementation; I thought it was clearly said that I consider the original version the best possible for that kind of code (simple, using a single XMM reg).
Also, storing to [esp-4] is more or less "legal" in user mode, though even there it's disputable (there are cases where putting data below ESP can make the code misbehave), but in kernel-mode code (in drivers) this string routine must not be used at all, especially on systems with DEP enabled.
Also, XMM is not preserved... so that is actually a strange "reproduction" of the algo with broken/not fully repeated logic. XMM isn't saved at all; instead of ECX it saves EDX, and it saves it the wrong way, below ESP. And the pointer is advanced before the data fetch, which slows the algo down.
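
To illustrate the [esp-4] concern, a minimal sketch (not taken from the posted code), contrasting the risky pattern with properly reserving the slot first:

; data stored below ESP has no guaranteed lifetime: anything dispatched on
; this thread's user-mode stack (an exception handler, an APC) may overwrite
; it before it is read back
mov     [esp-4], edx        ; "saved" below ESP
; ...
mov     edx, [esp-4]        ; not guaranteed to still hold the saved value

; the safe pattern reserves the slot first, so it sits at or above ESP
sub     esp, 4
mov     [esp], edx
; ...
mov     edx, [esp]
add     esp, 4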


Quote from: nidud on March 20, 2015, 08:22:02 AM

My point was that you seem to think in size of registers and not in chunk size  :P


That is a strange point; if you will, again, re-read the first post here, which maybe is not crystal clear, but it described everything that has been discussed here on the topic of StrCmp. Probably the thread should be split into two topics, because MemCopy has drifted into StrCmp ::)

That post talked about the page-size granularity of memory, about the possibility of safe algos working that way, and about power-of-two "data grabbing sizes" (later corrected to the more suitable word "chunk"). It was written as an answer to your post where you said that safe algos may only read byte by byte, and that reading data "ahead" is a "common technique", "even in Intel". Since I did not agree with that, I wrote that post. Algos may safely read more than a byte, the "read ahead" is actually not read-ahead in the proper sense, and that "common technique", used widely by many hobby programmers around the world, is buggy, because a real read past the end of a buffer is a way to crash the code.

Taking all that into account, it's strange that you draw such conclusions ::)
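
To make the page-granularity point concrete, a minimal sketch of the idea (only an illustration, using the usual x86 constants: 4096-byte pages, 16-byte XMM chunks):

; 4096 mod 16 = 0, so a 16-byte load from a 16-byte-aligned address can never
; cross a page boundary; if the first byte of the string is mapped, the whole
; aligned chunk containing it is mapped too, and reading it cannot fault
mov     edx, eax            ; eax = string pointer
and     edx, -16            ; align down to the start of its 16-byte chunk
movdqa  xmm0, [edx]         ; aligned read, stays inside one 4096-byte page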

Quote from: nidud on March 22, 2015, 02:03:27 AM
Tried padding and a lot of other things, but this seems to be the best method so far, at least on the CPU I'm now using. The single-file edit also simplifies the testing by having both the (now small) list file and source open in the editor at the same time. The list file then updates when you compile the source from the editor. Flipping back and forth (using F6) makes it easy to align the code.

Padding is not what I suggested. What I said was: just create a different segment for every code example, like this:

SEG1 SEGMENT PARA PUBLIC 'CODE'
    ; tested code 1 here
SEG1 ENDS

SEG2 SEGMENT PARA PUBLIC 'CODE'
    ; tested code 2 here
SEG2 ENDS

; ... etc ...


When you write your source this way, every code piece is placed in its own section of the PE file, and the sections are mapped with page granularity: every piece lands in its own page in memory, starting at an address aligned to 4096. This solves the code-location problems as well; you might try it on your CPU. It is also more convenient because it does not limit the testing to position-independent code only. I have actually suggested this way of fixing code-location problems in several places on the forum before.

hutch--

The only really reliable way to test algorithms that are subject to code location is to put each algo in a separate executable under identical test conditions. The problem seems to vary with the hardware, but if you are running a simple test between 2 or more algos in the same test piece, you can move the algos around to see if their timings change. Occasionally a simple "align ##" will stabilise the timings, but I have seen enough algorithms that do not respond to alignment.

There is another factor I used to see with some of Lingo's test pieces: one algo leaves the following algo with a mess to clean up, and to get a reliable timing for the affected algo you had to comment out the first one.

nidud

#65
deleted

rrr314159

Antariy,

Thanks very much for your comprehensive answer, but I hate to make you type so much! No need to spell everything out in such detail. I see you have strong feelings about "protected programming" and Lingo, so I'll leave those topics alone! As for "topic paster", it was just a joke, no worries.

The important thing is the code. I understand your reasoning (of course) but it offends my aesthetic sensibility to see all those bytes wasted! So, considering your suggestions, I made some modifications, and tested the shorter algo against yours as well as I could. Bottom line, it seems the shorter algo is, for the most part, faster.

First, to avoid the register stall in the pair mov edx, [esp+4] / mov eax, edx, I just moved the dependent mov down 2 instructions.
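
A sketch of what that reordering looks like (register names as in the routine below):

; before: the second mov depends on the load that is still in flight
mov     eax, [esp+4]
mov     edx, eax            ; stalls waiting for the load

; after: two independent instructions sit between the load and its use
mov     eax, [esp+4]
add     esp, -14h
movups  [esp], xmm7
mov     edx, eax            ; the load has completed by now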

I didn't know xorps could save a byte over pxor, thank you. I used those 2 bytes to put the jump back in. It was necessary for the AMD which is horribly slow on bsf. I jump into the loop, skipping the "add edx, 16", so the logic remains valid.

Still preserving XMM and ECX.

Yes, I know the purpose of the db instructions is to pad 9 extra bytes to align the beginning of the loop. I know that's better than nop's or lea eax, [eax], which waste time as well as space. But surely it's even better not to waste the bytes, as long as the routine is still as fast or faster.

CPU branch prediction: you know, behavior seems to change with every new release from Intel or AMD. Often we optimize a routine on our own machines, but on newer/older machines it may behave differently. I routinely optimize on my Intel i5, then try it on an AMD A6 and a Pentium 4; often the fastest on one machine is the slowest on another. So I'm leery of artificial coding techniques for speed.

Now, I had thought you were right: the pointer advance should go AFTER the data fetch. However, on the Intel my loop was faster; on AMD, a little slower. Apparently the order of the two instructions makes little difference. Anyway, there's a very simple correction available if needed: just "pcmpeqb xmm7, [edx+10h]" first, then "add edx, 16" - it uses one more byte.
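
A sketch of that fetch-before-advance loop body (just the loop; the entry code would have to compare the first chunk separately, or bias edx by -16, so the first pass still covers the right bytes):

@@:
    pcmpeqb  xmm7, [edx+10h]  ; compare the next chunk before edx is updated
    add      edx, 16          ; the displacement form costs the extra byte
    pmovmskb eax, xmm7
    test     eax, eax
    jz       @B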

By far, the hardest part of the whole exercise is not writing the routine, but getting semi-reliable timing! First, I used your suggestion and put all algos in separate segments. Then, I doubled both of them; put yours first and last, mine 2nd and 3rd. It appears the "last" position is slightly favorable. Then, in desperation, I copied/pasted the timing runs a number of times, using just the final numbers.
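
For reference, the arrangement looks roughly like this (segment names are illustrative; the actual test is in the attached testStrLen.asm):

SEG1 SEGMENT PARA PUBLIC 'CODE'
    ; StrLen_orig, copy 1 (first position)
SEG1 ENDS
SEG2 SEGMENT PARA PUBLIC 'CODE'
    ; StrLen_rrr2, copy 1
SEG2 ENDS
SEG3 SEGMENT PARA PUBLIC 'CODE'
    ; StrLen_rrr2, copy 2
SEG3 ENDS
SEG4 SEGMENT PARA PUBLIC 'CODE'
    ; StrLen_orig, copy 2 (last position)
SEG4 ENDS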

Here are the resulting runs:

Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz (SSE4)

BUFFER ALIGNED

thecount  StrLen_orig(104)  StrLen_rrr2(86)  StrLen_rrr2(86)  StrLen_orig(104)
       8               906              854              852               905
      31              1147             1020             1019              1074
     271              4024             4142             4020              3924
    2014             26115            24593            24595             25818
   62159            757816           747523           748235            757502

BUFFER MISALIGNED src 11

thecount  StrLen_orig(104)  StrLen_rrr2(86)  StrLen_rrr2(86)  StrLen_orig(104)
       8              1184             1157             1132              1241
      31              1399             1391             1395              1425
     271              4508             4368             4432              4522
    2014             25622            25036            25018             25604
   62159            757612           747909           746568            757986


AMD A6-6310 APU with AMD Radeon R4 Graphics (SSE4)

BUFFER ALIGNED

thecount  StrLen_orig(104)  StrLen_rrr2(86)  StrLen_rrr2(86)  StrLen_orig(104)
       8              2124             1551             1551              2319
      31              2526             1944             1926              2494
     271              6220             5679             5676              6416
    2014             29950            30171            30869             30104
   62159            872547           886154           887221            871530

BUFFER MISALIGNED src 11

thecount  StrLen_orig(104)  StrLen_rrr2(86)  StrLen_rrr2(86)  StrLen_orig(104)
       8              2776             2320             2319              2622
      31              2795             2420             2418              2797
     271              6016             5461             5476              6055
    2014             30809            31229            31080             30842
   62159            887148           887519           888207            889350


Your routine was a bit better on Intel ALIGNED 271; also slightly better on the longer strings on AMD. Everywhere else, the shorter routine is better. It's dramatically better on AMD short strings, who knows why; and better on Intel long strings. BTW all these numbers came out pretty much like this on multiple tests; I'm only showing one typical run from each machine.

Here is my modified routine:

; «««««««««««««««««««««««««««««««««««««««««««««««««««
algo2 proc SrcStr:PTR BYTE
; «««««««««««««««««««««««««««««««««««««««««««««««««««
; rrr modified version of Antariy algo - number 2
    mov eax,[esp+4]           ; eax = string pointer
    add esp,-14h              ; reserve 20 bytes: 16 for xmm7, 4 for ecx
    movups [esp],xmm7         ; preserve xmm7
    mov edx, eax
    mov [esp+16],ecx          ; preserve ecx
    and edx,-10h              ; edx = pointer aligned down to 16
    xorps xmm7,xmm7           ; xmm7 = zero bytes to compare against
    mov ecx,eax
    and ecx,0fh               ; cl = misalignment of the string start
    jz intoloop               ; already aligned: straight into the loop
    pcmpeqb xmm7,[edx]        ; compare the first (aligned) chunk
    pmovmskb eax,xmm7
    shr eax,cl                ; discard bytes before the string start
    bsf eax,eax               ; zero byte within the first chunk?
    jnz @ret                  ; yes: eax is already the length
    xorps xmm7,xmm7           ; re-zero xmm7 for the loop
    @@:                       ; naturally aligned to 16
        add edx,16
    intoloop:
        pcmpeqb xmm7,[edx]    ; compare 16 aligned bytes against zero
        pmovmskb eax,xmm7
        test eax,eax
        jz @B
    bsf eax,eax               ; index of the zero byte within the chunk
    sub edx,[esp+4+16+4]      ; chunk base minus original pointer (SrcStr)
    add eax, edx              ; = string length

    @ret:
    movups xmm7,[esp]         ; restore xmm7
    mov ecx,[esp+16]          ; restore ecx
    add esp,14h
    ret 4
algo2 endp
end_algo2:


Bottom line, I can't believe it's right to waste those 18 bytes.

Finally, of course I can write Russian! I'll do it again: "Russian". - very easy. OTOH I can't write in Cyrillic to save my life :)

Zip has "testStrLen.asm" test program.
I am NaN ;)

nidud

#67
deleted

Gunther

Hi nidud,

Quote from: nidud on March 22, 2015, 10:34:21 PM
Here is the Intel version of strlen from 2011:

interesting, but AT&T syntax. Not the best choice. Fortunately the conversion into Intel syntax isn't very hard.

Gunther
You have to know the facts before you can distort them.

rrr314159

Quote from: Gunther on March 23, 2015, 07:08:14 AM
interesting, but AT&T syntax. Not the best choice. Fortunately the conversion into Intel syntax isn't very hard.

Not hard, but tedious! If someone wants to do it and post the results, that would be appreciated. There are lots of defines like L(label) which (AFAIK) must be made into function MACROs, or else every occurrence replaced by hand. There are the "." labels to fix. There are the multi-line statements, separated by ;'s, to deal with; and of course the operand order of about 400 instructions to reverse. The #'s, %'s and the $'s are easy. Normally this routine would be about 20 or 30 lines (in fact that's the whole point, that he's unrolled everything to the max) and the job would be trivial. OTOH, if there were a few thousand lines, it would be rather fun to make .bat files and MACROs to automate the conversion. But for 700 lines it's not really worth the effort. Tedious. Oh, and don't forget to fix that egregious spelling mistake, "calee".
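
To show what the conversion amounts to, here are two made-up lines (not lifted from the Intel file):

; AT&T/GAS original (destination last, % register prefixes, L() label macro):
;     pcmpeqb (%edx), %xmm0
;     jnz     L(exit_less16)
; MASM equivalent (operand order reversed, prefixes dropped, plain label):
    pcmpeqb xmm0, [edx]
    jnz     exit_less16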

Brings up a question: why is it AT&T syntax? Obviously this routine is not for Windows. Note that he uses xmm0-xmm3 then skips to xmm6; in Windows you'd use 4 and 5, wouldn't you? The reason it might matter: what is the target (type of) processor? If it's for handhelds it may not be optimal for us; those processors might not have good branch prediction, for instance, or have other differences. I don't know, you understand, but he has some rather surprising techniques. Like, he avoids branches like the plague; avoids bsf; prefers pminub to pcmpeqb; other little things. These might be great techniques for us, or, as I wonder, they might be designed for different processors (say, Atom) or environments. It also seems optimized for very small strings.
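
For instance, the pminub trick folds several chunks into one test, roughly like this (my own sketch of the idea, not code from the Intel file):

; check 64 bytes for a zero with a single branch: the byte-wise minimum of
; four chunks contains a zero byte iff any of the chunks does
    pxor     xmm4, xmm4          ; all-zero reference
@@:
    movdqa   xmm0, [edx]
    movdqa   xmm1, [edx+16]
    movdqa   xmm2, [edx+32]
    movdqa   xmm3, [edx+48]
    add      edx, 64
    pminub   xmm0, xmm1
    pminub   xmm2, xmm3
    pminub   xmm0, xmm2          ; byte-wise min of all 64 bytes
    pcmpeqb  xmm0, xmm4
    pmovmskb eax, xmm0
    test     eax, eax
    jz       @B                  ; no zero byte in this 64-byte block
; a zero byte lies somewhere in [edx-64 .. edx-1]; re-test the four chunks
; individually to locate the chunk, then the byte within it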

And, it doesn't seem very professional. All these lines, and he still doesn't have a MAXLEN - if the string isn't terminated the routine will dive right off the deep end. And why is Intel making it public anyway? Is this really their best effort? I'll bet it's not what Windows is using.

Anyway, it would be nice if someone would translate it, I'll be happy to test it, wouldn't be amazed if it's not all that impressive speedwise - in our typical PC environment.

[edit] Of course I'm pretty ignorant about these issues, don't mean to pretend I know what I'm talking about. Just some thoughts/questions that occurred to me
I am NaN ;)

nidud

#70
deleted

Gunther

Hi nidud,

could you post the entire code as an attachment, please? It's much easier for me to do the conversion. Thank you.

Gunther
You have to know the facts before you can distort them.

nidud

#72
deleted

jj2007

I must confess I don't understand what your testbed is showing. Can you explain?

-- string length 0..40
862314    cycles - (  0) proc_0: crt_strlen
513683    cycles - (105) proc_1: SSE 32
-- string length 40..80
940951    cycles - (  0) proc_0: crt_strlen
331639    cycles - (105) proc_1: SSE 32

- what are the numbers in brackets?
- why does SSE 32 run faster with the longer string?

nidud

#74
deleted