Author Topic: Faster Memcopy ...  (Read 41209 times)

dedndave

  • Member
  • *****
  • Posts: 8827
  • Still using Abacus 2.0
    • DednDave
Re: Faster Memcopy ...
« Reply #60 on: March 22, 2015, 12:11:12 AM »
hello, Alex   :biggrin:

that's a lot of English - lol

Gunther

  • Member
  • *****
  • Posts: 3585
  • Forgive your enemies, but never forget their names
Re: Faster Memcopy ...
« Reply #61 on: March 22, 2015, 12:21:42 AM »
Hi Alex,

hello, Alex   :biggrin:

that's a lot of English - lol

but that's good news: Alex is back after a long break and he's back with high quality posts.  :t I missed you.

Gunther
Get your facts first, and then you can distort them.

nidud

  • Member
  • *****
  • Posts: 1980
    • https://github.com/nidud/asmc
Re: Faster Memcopy ...
« Reply #62 on: March 22, 2015, 02:03:27 AM »
Code: [Select]
; db 81h,0e2h,0F0h,0FFh,0FFh,0FFh   ; with this db instruction
and edx,dword ptr -10 ; same same ?...
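; (note: the six db bytes above encode "and edx, 0FFFFFFF0h", i.e. and edx,-16;
;  the dword ptr line is presumably meant to assemble to that same instruction,
;  so -10h rather than decimal -10)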

well, here is a similar version using 32 bytes
Code: [Select]
xorps xmm1,xmm1
mov eax,[esp+4]
mov [esp-4],edx
mov ecx,eax
and eax,-32
and ecx,32-1
or edx,-1
shl edx,cl
pcmpeqb xmm1,[eax]
pmovmskb ecx,xmm1
and ecx,edx
jnz done
pxor xmm1,xmm1
pcmpeqb xmm1,[eax+16]
pmovmskb ecx,xmm1
shl ecx,16
and ecx,edx
jnz done
@@:
add eax,32
pcmpeqb xmm1,[eax]
pmovmskb edx,xmm1
pcmpeqb xmm1,[eax+16]
pmovmskb ecx,xmm1
shl ecx,16
or ecx,edx
jz @B
done:
bsf ecx,ecx
sub eax,[esp+4]
mov edx,[esp-4]
add eax,ecx
ret 4
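
The entry code uses the usual aligned-read-plus-mask idea: round the pointer down to the chunk boundary, compare the whole aligned chunk (which can never cross a page boundary), and shift a mask so that match bits for bytes before the real start are ignored. A minimal 16-byte sketch of just that idea (not the 32-byte code above):
Code: [Select]
; sketch only - the 16-byte form of the entry mask
    mov      eax,[esp+4]      ; string pointer
    mov      ecx,eax
    and      eax,-16          ; round down to the chunk holding the first byte
    and      ecx,16-1         ; misalignment within that chunk
    or       edx,-1
    shl      edx,cl           ; mask: clear the bits for bytes before the start
    pxor     xmm1,xmm1
    pcmpeqb  xmm1,[eax]       ; compare the whole aligned chunk against zero
    pmovmskb eax,xmm1
    and      eax,edx          ; discard any "matches" before the real start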

Code: [Select]
result:
   972055 cycles - (105) proc_5: SSE 32 - safe
  1010521 cycles - ( 99) proc_9: AxStrLenSSE (nidud)
  1265207 cycles - (104) proc_6: AxStrLenSSE
  1289162 cycles - ( 86) proc_8: AxStrLenSSE (rrr)
  1565286 cycles - (  0) proc_2: Len()
  2415594 cycles - (  0) proc_0: crt_strlen
  3018226 cycles - (  0) proc_1: len()

Quote
... or just as an example of the fact that there are many implementations which do not care about the possibility of a crash (they do not align the fetch, and even read the data unaligned all the time)?
yes

Quote
Quote
It's in 4.asm, and it's the only algo which reads unaligned, so it crashed near the end of the page even though it read in 16-byte chunks; moreover, it does two 32-byte reads.

The same also applies to 4 and 8.

Yes, but I was talking about the SSE one and just did not mention the others.

My point was that you seem to think in terms of register size and not chunk size  :P

Quote
OK, good. Assembler is JWasm?

yes, modified version.

Quote
Quote
But another question: do you load the algos from binary files? I just did not find your safe 32-byte algo in the listing of timeit.exe.

Yes, for reasons explained here.

Ah, the well-known code location problem. But you might try a simpler solution - define a different segment for every tested algo. This also allows running the algos which are not "code location independent", with no need to relocate them manually in some way. Algos with jump/call tables are among those "relocation required" algos, for instance, as are algos which make non-near relative/direct calls/jumps.

I tried padding and a lot of other things, but this seems to be the best method so far, at least on the CPU I'm using now. The single-file edit also simplifies testing by having both the (now small) list file and the source open in the editor at the same time. The list file then updates when you compile the source from the editor. Flipping back and forth (using F6) makes it easy to align the code.

I also added a batch file to the BIN directory to build the project from the editor.
BIN\build.bat
Code: [Select]
make

and added a shortcut key to DZ.INI
Code: [Select]
[AltF9]
asm = build.bat
lst = build.bat
makefile= build.bat

The problem is that the editor passes the current file as an argument to external tools, so a batch file is needed. Well, this lets you edit, compile, and execute the test from the editor.


Antariy

  • Member
  • ****
  • Posts: 551
Re: Faster Memcopy ...
« Reply #63 on: March 22, 2015, 06:08:02 AM »
Quote
hello, Alex   :biggrin:

that's a lot of English - lol

Hello Dave :t

Is that more or less proper English? Sometimes it isn't, yeah :greensml:

Quote
Hi Alex,

hello, Alex   :biggrin:

that's a lot of English - lol

but that's good news: Alex is back after a long break and he's back with high quality posts.  :t I missed you.

Gunther

Thank you, Gunther! :biggrin: :redface:



Quote
Code: [Select]
; db 81h,0e2h,0F0h,0FFh,0FFh,0FFh   ; with this db instruction
and edx,dword ptr -10 ; same same ?...

well, here is a similar version using 32 bytes

I absolutely don't get what that is in the quote of "mine" - what do the words "same same" mean?
Also, what does "similar version" mean - similar to what? To AxStrLenSSE? Well, if so, then it's actually not very similar. Most importantly, your version advances the pointer BEFORE the data fetch. That is the thing I described in the post commenting on rrr's "edition" of the algo - but I did not accept that edit as a replacement for the current AxStrLenSSE implementation; I thought I had said clearly that I consider the original version the best possible for that kind of code (simple, using a single XMM reg).
Also, storing to [esp-4] is more or less "legal" in user mode, though even that is disputable (there are cases where placing data below ESP may cause the code to misbehave), but in kernel mode code (in drivers) this string routine could not be used at all, especially on systems with DEP enabled.
Also, XMM is not preserved... so it is actually a strange "reproduction" of the algo with broken/not fully repeated logic. XMM isn't saved at all; instead of ECX it saves EDX, but saves it the wrong way - below ESP. And the pointer is advanced before the data fetch, which slows the algo down.
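
To show what I mean about the stack (a sketch only, using the XMM7/ECX convention from AxStrLenSSE): reserve the space first, and then nothing that asynchronously borrows the stack - exception dispatch, an APC - can land on top of the saved data, which it can do when the data sits below ESP:
Code: [Select]
; sketch: reserve the space before storing, instead of writing below ESP
    sub      esp,20           ; 16 bytes for XMM7 + 4 bytes for ECX
    movups   [esp],xmm7       ; ESP is not guaranteed 16-aligned, so movups
    mov      [esp+16],ecx
    ; ... body of the routine ...
    movups   xmm7,[esp]
    mov      ecx,[esp+16]
    add      esp,20
    ret      4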



Quote
My point was that you seem to think in terms of register size and not chunk size  :P


That is a strange point to make if you just, again, re-read the first post here - which maybe is not crystal clear, but it described everything that has been discussed here on the topic of StrCmp. Probably the thread should be split into two topics, because MemCopy has drifted into StrCmp ::)

That post talked about memory being granular to the page size, about the possibility of safe algos, and about power-of-two "data grabbing sizes" (a term which was then corrected to the more suitable word "chunk"). That post was written as an answer to your post where you said that safe algos can only read byte by byte, and that it is a "common technique" to read data "ahead" - "even in Intel". Since I did not agree with all of that, I wrote that post. Algos can be safe while reading more than a byte at a time; the "read ahead" is actually not read-ahead in the proper sense; and that "common technique", which is widely used by many hobby programmers around the world, is buggy, because real read-ahead is a way to crash the code when it accesses buffers.

Taking all of that into account, it's strange that you draw such conclusions ::)

Quote
I tried padding and a lot of other things, but this seems to be the best method so far, at least on the CPU I'm using now. The single-file edit also simplifies testing by having both the (now small) list file and the source open in the editor at the same time. The list file then updates when you compile the source from the editor. Flipping back and forth (using F6) makes it easy to align the code.

Padding is not what I suggested. I said: just create a different segment for every code example. Like this:

SEG1 SEGMENT PARA PUBLIC 'CODE'
tested code 1 here
SEG1 ENDS

SEG2 SEGMENT PARA PUBLIC 'CODE'
tested code 2 here
SEG2 ENDS

... etc ...


When you write the source this way, every code piece will be placed in its own section in the PE file, and those sections are page-granular in memory - every piece lands in its own page, starting at an address aligned to 4096. This solves the code location problems as well - you might try it on your CPU. And it's more convenient because it does not limit the tested code to position-independent code only. I have actually suggested this way of fixing code location problems in several places on the forum before.
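
In buildable form (hypothetical segment and label names, just to show where the directives go) the test source would look something like this:
Code: [Select]
.686
.xmm
.model flat

ALGO1 segment para public 'CODE'
proc_1:
    ; tested code 1 here
    ret 4
ALGO1 ends

ALGO2 segment para public 'CODE'
proc_2:
    ; tested code 2 here
    ret 4
ALGO2 ends

end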

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 7542
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Re: Faster Memcopy ...
« Reply #64 on: March 22, 2015, 06:56:15 AM »
The only really reliable way to test algorithms that are subject to code location is to put each algo in a separate executable using identical test conditions. The problem seems to vary with the hardware, but if you are running a simple test between two or more algos in the same test piece, you can move the algos around to see if their timing changes. Occasionally a simple "align ##" will stabilise the timings, but I have seen enough algorithms that do not respond to alignment.
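
(For the "align ##" idea, that just means something like this in front of each algo under test - a sketch only, with a made-up name:)
Code: [Select]
align 16            ; or 32 / 64 - pads so the next algo starts on that boundary
algo_under_test:
    ; ...
    ret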

There is another factor that I used to see with some of Lingo's test pieces: one algo leaves the following algo with a mess to clean up, and to get a reliable timing for the affected algo you had to comment out the first one.
hutch at movsd dot com
http://www.masm32.com    :biggrin:  :skrewy:

nidud

  • Member
  • *****
  • Posts: 1980
    • https://github.com/nidud/asmc
Re: Faster Memcopy ...
« Reply #65 on: March 22, 2015, 07:24:48 AM »
Quote
I absolutely don't get what that is in the quote of "mine" - what do the words "same same" mean?

It's a type of English used in some parts of the world
"same same but different"  :P

What I meant was that you could extend the code by using DWORD PTR instead of DB ...

Quote
Also, what does "similar version" mean - similar to what? To AxStrLenSSE?

Well, given the pointer is aligned you may compare it directly, as you do, and thereby reduce code size, so in that sense it's a similar approach.

Quote
Also, storing to [esp-4] is more or less "legal" in user mode, though even that is disputable (there are cases where placing data below ESP may cause the code to misbehave), but in kernel mode code (in drivers) this string routine could not be used at all, especially on systems with DEP enabled.

True.

Quote
And the pointer is advanced before the data fetch, which slows the algo down.

Are you sure ?

Quote
That post was written as an answer to your post where you said that safe algos can only read byte by byte, and that it is a "common technique" to read data "ahead" - "even in Intel". Since I did not agree with all of that, I wrote that post.

True, that was a wrong assumption. On viewing the strlen function made by Agner Fog I assumed the alignment was only done for speed and not safety.

Quote
Padding is not what I suggested. I said: just create a different segment for every code example. Like this:

It may be that paging also has an impact on timings, but the test in this case was on small functions which all combined take less than a page. There is a thread for this somewhere, and the results are different on most machines.

Quote
And it's more convenient because it does not limit the tested code to position-independent code only.

The test also has this option. The functions you want to test may all be in the same file, or be externals like crt_strlen, or macros like Len() and len().
« Last Edit: March 24, 2015, 06:14:33 AM by nidud »

rrr314159

  • Member
  • *****
  • Posts: 1382
Re: Faster Memcopy ...
« Reply #66 on: March 22, 2015, 06:27:42 PM »
Antariy,

Thanks very much for your comprehensive answer but I hate to make you type so much! No need to spell everything out in such detail. I see you have strong feelings about "protected programming" and Lingo so will leave those topics alone! As for "topic paster", it was just a joke, no worries.

The important thing is the code. I understand your reasoning (of course) but it offends my aesthetic sensibility to see all those bytes wasted! So, considering your suggestions, I made some modifications, and tested the shorter algo against yours as well as I could. Bottom line, it seems the shorter algo is, for the most part, faster.

First, to avoid a register stall with the pair mov edx, [esp+4] / mov eax, edx, I just moved the dependent mov down two instructions.

I didn't know xorps could save a byte over pxor, thank you. I used those 2 bytes to put the jump back in. It was necessary for the AMD which is horribly slow on bsf. I jump into the loop, skipping the "add edx, 16", so the logic remains valid.

Still preserving XMM and ECX.

Yes, I know the purpose of the db instructions is to pad 9 extra bytes to align the beginning of the loop. I know that's better than nops or lea eax, [eax], which waste time as well as space. But surely it's even better not to waste the bytes, as long as the routine is still as fast or faster.

CPU branch prediction - you know, behavior seems to change with every new release from Intel or AMD. Often we optimize a routine on our own machines, but on newer/older machines it may behave differently. I routinely optimize on my Intel i5, then try it on an AMD A6 and a Pentium 4; often the fastest on one machine is the slowest on another. So I'm leery of artificial coding techniques for speed.

Now, I had thought you were right: the pointer advance should go AFTER the data fetch. However, on the Intel my loop was faster; on the AMD, a little slower. Apparently the order of the two instructions makes little difference. Anyway, there's a very simple correction available if needed: just "pcmpeqb xmm7, [edx+10h]" first, then "add edx, 16" - it uses one more byte.
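
In instruction terms the swapped ordering would look like this (a sketch of the ordering only, not a drop-in patch - the exit code has to account for where the pointer ends up):
Code: [Select]
@@:
    pcmpeqb  xmm7,[edx+10h]   ; fetch/compare the next chunk first ...
    add      edx,16           ; ... then advance the pointer
    pmovmskb eax,xmm7
    test     eax,eax
    jz       @B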

By far, the hardest part of the whole exercise is not writing the routine, but getting semi-reliable timing! First, I used your suggestion and put all algos in separate segments. Then, I doubled both of them; put yours first and last, mine 2nd and 3rd. It appears the "last" position is slightly favorable. Then, in desperation, I copied/pasted the timing runs a number of times, using just the final numbers.

Here are the resulting runs:

Code: [Select]
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz (SSE4)

BUFFER ALIGNED

thecount  StrLen_orig(104)  StrLen_rrr2(86)  StrLen_rrr2(86)  StrLen_orig(104)
       8        906              854              852              905
      31       1147             1020             1019             1074
     271       4024             4142             4020             3924
    2014      26115            24593            24595            25818
   62159     757816           747523           748235           757502

BUFFER MISALIGNED src 11

thecount  StrLen_orig(104)  StrLen_rrr2(86)  StrLen_rrr2(86)  StrLen_orig(104)
       8       1184             1157             1132             1241
      31       1399             1391             1395             1425
     271       4508             4368             4432             4522
    2014      25622            25036            25018            25604
   62159     757612           747909           746568           757986


AMD A6-6310 APU with AMD Radeon R4 Graphics (SSE4)

BUFFER ALIGNED

thecount  StrLen_orig(104)  StrLen_rrr2(86)  StrLen_rrr2(86)  StrLen_orig(104)
       8       2124             1551             1551             2319
      31       2526             1944             1926             2494
     271       6220             5679             5676             6416
    2014      29950            30171            30869            30104
   62159     872547           886154           887221           871530

BUFFER MISALIGNED src 11

thecount  StrLen_orig(104)  StrLen_rrr2(86)  StrLen_rrr2(86)  StrLen_orig(104)
       8       2776             2320             2319             2622
      31       2795             2420             2418             2797
     271       6016             5461             5476             6055
    2014      30809            31229            31080            30842
   62159     887148           887519           888207           889350

Your routine was a bit better on Intel ALIGNED 271; also slightly better on the longer strings on AMD. Everywhere else, the shorter routine is better. It's dramatically better on AMD short strings, who knows why; and better on Intel long strings. BTW all these numbers came out pretty much like this on multiple tests; I'm only showing one typical run from each machine.

Here is my modified routine:

Code: [Select]
; «««««««««««««««««««««««««««««««««««««««««««««««««««
algo2 proc SrcStr:PTR BYTE
; «««««««««««««««««««««««««««««««««««««««««««««««««««
; rrr modified version of Antariy algo - number 2
    mov eax,[esp+4]
    add esp,-14h
    movups [esp],xmm7
    mov edx, eax
    mov [esp+16],ecx
    and edx,-10h
    xorps xmm7,xmm7
    mov ecx,eax
    and ecx,0fh
    jz intoloop
    pcmpeqb xmm7,[edx]
    pmovmskb eax,xmm7
    shr eax,cl
    bsf eax,eax
    jnz @ret
    xorps xmm7,xmm7
    @@:                       ; naturally aligned to 16
        add edx,16
    intoloop:
        pcmpeqb xmm7,[edx]
        pmovmskb eax,xmm7
        test eax,eax
        jz @B
    bsf eax,eax
    sub edx,[esp+4+16+4]
    add eax, edx
   
    @ret:
    movups xmm7,[esp]
    mov ecx,[esp+16]
    add esp,14h
    ret 4
algo2 endp
end_algo2:

Bottom line, I can't believe it's right to waste those 18 bytes.

Finally, of course I can write Russian! I'll do it again: "Russian". - very easy. OTOH I can't write in Cyrillic to save my life :)

Zip has "testStrLen.asm" test program.
I am NaN ;)

nidud

  • Member
  • *****
  • Posts: 1980
    • https://github.com/nidud/asmc
Re: Faster Memcopy ...
« Reply #67 on: March 22, 2015, 10:34:21 PM »
Quote
I understand your reasoning (of course) but it offends my aesthetic sensibility to see all those bytes wasted!

  :biggrin:

The fast string functions used these days are normally huge in size, and they go a long way to remove branching to save a few cycles.
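
(The branch-trimming shows up most clearly in the main loop of the code below, which folds four 16-byte loads together with pminub so that a single test/branch covers 64 bytes. In MASM terms the idea is roughly this - a hand-translated sketch, assuming eax is already 64-byte aligned and xmm0 has been zeroed:)
Code: [Select]
@@:                            ; sketch of the 64-bytes-per-branch loop
    movaps   xmm4,[eax]        ; load 64 bytes as four 16-byte chunks
    pminub   xmm4,[eax+16]     ; unsigned byte minimum - a zero byte anywhere
    movaps   xmm5,[eax+32]     ;   in the 64 bytes survives as a zero lane
    pminub   xmm5,[eax+48]
    pminub   xmm5,xmm4
    pcmpeqb  xmm5,xmm0         ; xmm0 = 0: flag the zero lanes
    pmovmskb ecx,xmm5
    add      eax,64
    test     ecx,ecx           ; one branch per 64 bytes scanned
    jz       @B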

Here is the Intel version of strlen from 2011:
Code: [Select]
/*
Copyright (c) 2011, Intel Corporation
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

    * Redistributions of source code must retain the above copyright notice,
    * this list of conditions and the following disclaimer.

    * Redistributions in binary form must reproduce the above copyright notice,
    * this list of conditions and the following disclaimer in the documentation
    * and/or other materials provided with the distribution.

    * Neither the name of Intel Corporation nor the names of its contributors
    * may be used to endorse or promote products derived from this software
    * without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

#ifndef USE_AS_STRCAT

# ifndef STRLEN
#  define STRLEN strlen
# endif

# ifndef L
#  define L(label) .L##label
# endif

# ifndef cfi_startproc
#  define cfi_startproc .cfi_startproc
# endif

# ifndef cfi_endproc
#  define cfi_endproc .cfi_endproc
# endif

/* calee safe register only for strnlen is required */

# ifdef USE_AS_STRNLEN
#  ifndef cfi_rel_offset
#   define cfi_rel_offset(reg, off) .cfi_rel_offset reg, off
#  endif

#  ifndef cfi_restore
#   define cfi_restore(reg) .cfi_restore reg
#  endif

#  ifndef cfi_adjust_cfa_offset
#   define cfi_adjust_cfa_offset(off) .cfi_adjust_cfa_offset off
#  endif
# endif

# ifndef ENTRY
#  define ENTRY(name) \
.type name,  @function; \
.globl name; \
.p2align 4; \
name: \
cfi_startproc
# endif

# ifndef END
#  define END(name) \
cfi_endproc; \
.size name, .-name
# endif

# define PARMS 4
# define STR PARMS
# define RETURN ret

# ifdef USE_AS_STRNLEN
#  define LEN PARMS + 8
#  define CFI_PUSH(REG) \
cfi_adjust_cfa_offset (4); \
cfi_rel_offset (REG, 0)

#  define CFI_POP(REG) \
cfi_adjust_cfa_offset (-4); \
cfi_restore (REG)

#  define PUSH(REG) pushl REG; CFI_PUSH (REG)
#  define POP(REG) popl REG; CFI_POP (REG)
#  undef RETURN
#  define RETURN POP (%edi); ret; CFI_PUSH(%edi);
# endif

.text
ENTRY (STRLEN)
mov STR(%esp), %edx
# ifdef USE_AS_STRNLEN
PUSH (%edi)
movl LEN(%esp), %edi
sub $4, %edi
jbe L(len_less4_prolog)
# endif
#endif
xor %eax, %eax
cmpb $0, (%edx)
jz L(exit_tail0)
cmpb $0, 1(%edx)
jz L(exit_tail1)
cmpb $0, 2(%edx)
jz L(exit_tail2)
cmpb $0, 3(%edx)
jz L(exit_tail3)

#ifdef USE_AS_STRNLEN
sub $4, %edi
jbe L(len_less8_prolog)
#endif

cmpb $0, 4(%edx)
jz L(exit_tail4)
cmpb $0, 5(%edx)
jz L(exit_tail5)
cmpb $0, 6(%edx)
jz L(exit_tail6)
cmpb $0, 7(%edx)
jz L(exit_tail7)

#ifdef USE_AS_STRNLEN
sub $4, %edi
jbe L(len_less12_prolog)
#endif

cmpb $0, 8(%edx)
jz L(exit_tail8)
cmpb $0, 9(%edx)
jz L(exit_tail9)
cmpb $0, 10(%edx)
jz L(exit_tail10)
cmpb $0, 11(%edx)
jz L(exit_tail11)

#ifdef USE_AS_STRNLEN
sub $4, %edi
jbe L(len_less16_prolog)
#endif

cmpb $0, 12(%edx)
jz L(exit_tail12)
cmpb $0, 13(%edx)
jz L(exit_tail13)
cmpb $0, 14(%edx)
jz L(exit_tail14)
cmpb $0, 15(%edx)
jz L(exit_tail15)

pxor %xmm0, %xmm0
lea 16(%edx), %eax
mov %eax, %ecx
and $-16, %eax

#ifdef USE_AS_STRNLEN
and $15, %edx
add %edx, %edi
sub $64, %edi
jbe L(len_less64)
#endif

pcmpeqb (%eax), %xmm0
pmovmskb %xmm0, %edx
pxor %xmm1, %xmm1
lea 16(%eax), %eax
test %edx, %edx
jnz L(exit)

pcmpeqb (%eax), %xmm1
pmovmskb %xmm1, %edx
pxor %xmm2, %xmm2
lea 16(%eax), %eax
test %edx, %edx
jnz L(exit)

pcmpeqb (%eax), %xmm2
pmovmskb %xmm2, %edx
pxor %xmm3, %xmm3
lea 16(%eax), %eax
test %edx, %edx
jnz L(exit)

pcmpeqb (%eax), %xmm3
pmovmskb %xmm3, %edx
lea 16(%eax), %eax
test %edx, %edx
jnz L(exit)

#ifdef USE_AS_STRNLEN
sub $64, %edi
jbe L(len_less64)
#endif

pcmpeqb (%eax), %xmm0
pmovmskb %xmm0, %edx
lea 16(%eax), %eax
test %edx, %edx
jnz L(exit)

pcmpeqb (%eax), %xmm1
pmovmskb %xmm1, %edx
lea 16(%eax), %eax
test %edx, %edx
jnz L(exit)

pcmpeqb (%eax), %xmm2
pmovmskb %xmm2, %edx
lea 16(%eax), %eax
test %edx, %edx
jnz L(exit)

pcmpeqb (%eax), %xmm3
pmovmskb %xmm3, %edx
lea 16(%eax), %eax
test %edx, %edx
jnz L(exit)

#ifdef USE_AS_STRNLEN
sub $64, %edi
jbe L(len_less64)
#endif

pcmpeqb (%eax), %xmm0
pmovmskb %xmm0, %edx
lea 16(%eax), %eax
test %edx, %edx
jnz L(exit)

pcmpeqb (%eax), %xmm1
pmovmskb %xmm1, %edx
lea 16(%eax), %eax
test %edx, %edx
jnz L(exit)

pcmpeqb (%eax), %xmm2
pmovmskb %xmm2, %edx
lea 16(%eax), %eax
test %edx, %edx
jnz L(exit)

pcmpeqb (%eax), %xmm3
pmovmskb %xmm3, %edx
lea 16(%eax), %eax
test %edx, %edx
jnz L(exit)

#ifdef USE_AS_STRNLEN
sub $64, %edi
jbe L(len_less64)
#endif

pcmpeqb (%eax), %xmm0
pmovmskb %xmm0, %edx
lea 16(%eax), %eax
test %edx, %edx
jnz L(exit)

pcmpeqb (%eax), %xmm1
pmovmskb %xmm1, %edx
lea 16(%eax), %eax
test %edx, %edx
jnz L(exit)

pcmpeqb (%eax), %xmm2
pmovmskb %xmm2, %edx
lea 16(%eax), %eax
test %edx, %edx
jnz L(exit)

pcmpeqb (%eax), %xmm3
pmovmskb %xmm3, %edx
lea 16(%eax), %eax
test %edx, %edx
jnz L(exit)

#ifdef USE_AS_STRNLEN
mov %eax, %edx
and $63, %edx
add %edx, %edi
#endif

and $-0x40, %eax

.p2align 4
L(aligned_64_loop):
#ifdef USE_AS_STRNLEN
sub $64, %edi
jbe L(len_less64)
#endif
movaps (%eax), %xmm0
movaps 16(%eax), %xmm1
movaps 32(%eax), %xmm2
movaps 48(%eax), %xmm6
pminub %xmm1, %xmm0
pminub %xmm6, %xmm2
pminub %xmm0, %xmm2
pcmpeqb %xmm3, %xmm2
pmovmskb %xmm2, %edx
lea 64(%eax), %eax
test %edx, %edx
jz L(aligned_64_loop)

pcmpeqb -64(%eax), %xmm3
pmovmskb %xmm3, %edx
lea 48(%ecx), %ecx
test %edx, %edx
jnz L(exit)

pcmpeqb %xmm1, %xmm3
pmovmskb %xmm3, %edx
lea -16(%ecx), %ecx
test %edx, %edx
jnz L(exit)

pcmpeqb -32(%eax), %xmm3
pmovmskb %xmm3, %edx
lea -16(%ecx), %ecx
test %edx, %edx
jnz L(exit)

pcmpeqb %xmm6, %xmm3
pmovmskb %xmm3, %edx
lea -16(%ecx), %ecx
L(exit):
sub %ecx, %eax
test %dl, %dl
jz L(exit_high)

mov %dl, %cl
and $15, %cl
jz L(exit_8)
test $0x01, %dl
jnz L(exit_tail0)
test $0x02, %dl
jnz L(exit_tail1)
test $0x04, %dl
jnz L(exit_tail2)
add $3, %eax
RETURN

.p2align 4
L(exit_8):
test $0x10, %dl
jnz L(exit_tail4)
test $0x20, %dl
jnz L(exit_tail5)
test $0x40, %dl
jnz L(exit_tail6)
add $7, %eax
RETURN

.p2align 4
L(exit_high):
mov %dh, %ch
and $15, %ch
jz L(exit_high_8)
test $0x01, %dh
jnz L(exit_tail8)
test $0x02, %dh
jnz L(exit_tail9)
test $0x04, %dh
jnz L(exit_tail10)
add $11, %eax
RETURN

.p2align 4
L(exit_high_8):
test $0x10, %dh
jnz L(exit_tail12)
test $0x20, %dh
jnz L(exit_tail13)
test $0x40, %dh
jnz L(exit_tail14)
add $15, %eax
L(exit_tail0):
RETURN

#ifdef USE_AS_STRNLEN

.p2align 4
L(len_less64):
pxor %xmm0, %xmm0
add $64, %edi

pcmpeqb (%eax), %xmm0
pmovmskb %xmm0, %edx
pxor %xmm1, %xmm1
lea 16(%eax), %eax
test %edx, %edx
jnz L(strnlen_exit)

sub $16, %edi
jbe L(return_start_len)

pcmpeqb (%eax), %xmm1
pmovmskb %xmm1, %edx
lea 16(%eax), %eax
test %edx, %edx
jnz L(strnlen_exit)

sub $16, %edi
jbe L(return_start_len)

pcmpeqb (%eax), %xmm0
pmovmskb %xmm0, %edx
lea 16(%eax), %eax
test %edx, %edx
jnz L(strnlen_exit)

sub $16, %edi
jbe L(return_start_len)

pcmpeqb (%eax), %xmm1
pmovmskb %xmm1, %edx
lea 16(%eax), %eax
test %edx, %edx
jnz L(strnlen_exit)

#ifndef USE_AS_STRLCAT
movl LEN(%esp), %eax
RETURN
#else
jmp L(return_start_len)
#endif

.p2align 4
L(strnlen_exit):
sub %ecx, %eax

test %dl, %dl
jz L(strnlen_exit_high)
mov %dl, %cl
and $15, %cl
jz L(strnlen_exit_8)
test $0x01, %dl
jnz L(exit_tail0)
test $0x02, %dl
jnz L(strnlen_exit_tail1)
test $0x04, %dl
jnz L(strnlen_exit_tail2)
sub $4, %edi
jb L(return_start_len)
lea 3(%eax), %eax
RETURN

.p2align 4
L(strnlen_exit_8):
test $0x10, %dl
jnz L(strnlen_exit_tail4)
test $0x20, %dl
jnz L(strnlen_exit_tail5)
test $0x40, %dl
jnz L(strnlen_exit_tail6)
sub $8, %edi
jb L(return_start_len)
lea 7(%eax), %eax
RETURN

.p2align 4
L(strnlen_exit_high):
mov %dh, %ch
and $15, %ch
jz L(strnlen_exit_high_8)
test $0x01, %dh
jnz L(strnlen_exit_tail8)
test $0x02, %dh
jnz L(strnlen_exit_tail9)
test $0x04, %dh
jnz L(strnlen_exit_tail10)
sub $12, %edi
jb L(return_start_len)
lea 11(%eax), %eax
RETURN

.p2align 4
L(strnlen_exit_high_8):
test $0x10, %dh
jnz L(strnlen_exit_tail12)
test $0x20, %dh
jnz L(strnlen_exit_tail13)
test $0x40, %dh
jnz L(strnlen_exit_tail14)
sub $16, %edi
jb L(return_start_len)
lea 15(%eax), %eax
RETURN

.p2align 4
L(strnlen_exit_tail1):
sub $2, %edi
jb L(return_start_len)
lea 1(%eax), %eax
RETURN

.p2align 4
L(strnlen_exit_tail2):
sub $3, %edi
jb L(return_start_len)
lea 2(%eax), %eax
RETURN

.p2align 4
L(strnlen_exit_tail4):
sub $5, %edi
jb L(return_start_len)
lea 4(%eax), %eax
RETURN

.p2align 4
L(strnlen_exit_tail5):
sub $6, %edi
jb L(return_start_len)
lea 5(%eax), %eax
RETURN

.p2align 4
L(strnlen_exit_tail6):
sub $7, %edi
jb L(return_start_len)
lea 6(%eax), %eax
RETURN

.p2align 4
L(strnlen_exit_tail8):
sub $9, %edi
jb L(return_start_len)
lea 8(%eax), %eax
RETURN

.p2align 4
L(strnlen_exit_tail9):
sub $10, %edi
jb L(return_start_len)
lea 9(%eax), %eax
RETURN

.p2align 4
L(strnlen_exit_tail10):
sub $11, %edi
jb L(return_start_len)
lea 10(%eax), %eax
RETURN

.p2align 4
L(strnlen_exit_tail12):
sub $13, %edi
jb L(return_start_len)
lea 12(%eax), %eax
RETURN

.p2align 4
L(strnlen_exit_tail13):
sub $14, %edi
jb L(return_start_len)
lea 13(%eax), %eax
RETURN

.p2align 4
L(strnlen_exit_tail14):
sub $15, %edi
jb L(return_start_len)
lea 14(%eax), %eax
RETURN

#ifndef USE_AS_STRLCAT
.p2align 4
L(return_start_len):
movl LEN(%esp), %eax
RETURN
#endif

/* for prolog only */

.p2align 4
L(len_less4_prolog):
xor %eax, %eax

add $4, %edi
jz L(exit_tail0)

cmpb $0, (%edx)
jz L(exit_tail0)
cmp $1, %edi
je L(exit_tail1)

cmpb $0, 1(%edx)
jz L(exit_tail1)
cmp $2, %edi
je L(exit_tail2)

cmpb $0, 2(%edx)
jz L(exit_tail2)
cmp $3, %edi
je L(exit_tail3)

cmpb $0, 3(%edx)
jz L(exit_tail3)
mov %edi, %eax
RETURN

.p2align 4
L(len_less8_prolog):
add $4, %edi

cmpb $0, 4(%edx)
jz L(exit_tail4)
cmp $1, %edi
je L(exit_tail5)

cmpb $0, 5(%edx)
jz L(exit_tail5)
cmp $2, %edi
je L(exit_tail6)

cmpb $0, 6(%edx)
jz L(exit_tail6)
cmp $3, %edi
je L(exit_tail7)

cmpb $0, 7(%edx)
jz L(exit_tail7)
mov $8, %eax
RETURN


.p2align 4
L(len_less12_prolog):
add $4, %edi

cmpb $0, 8(%edx)
jz L(exit_tail8)
cmp $1, %edi
je L(exit_tail9)

cmpb $0, 9(%edx)
jz L(exit_tail9)
cmp $2, %edi
je L(exit_tail10)

cmpb $0, 10(%edx)
jz L(exit_tail10)
cmp $3, %edi
je L(exit_tail11)

cmpb $0, 11(%edx)
jz L(exit_tail11)
mov $12, %eax
RETURN

.p2align 4
L(len_less16_prolog):
add $4, %edi

cmpb $0, 12(%edx)
jz L(exit_tail12)
cmp $1, %edi
je L(exit_tail13)

cmpb $0, 13(%edx)
jz L(exit_tail13)
cmp $2, %edi
je L(exit_tail14)

cmpb $0, 14(%edx)
jz L(exit_tail14)
cmp $3, %edi
je L(exit_tail15)

cmpb $0, 15(%edx)
jz L(exit_tail15)
mov $16, %eax
RETURN
#endif

.p2align 4
L(exit_tail1):
add $1, %eax
RETURN

L(exit_tail2):
add $2, %eax
RETURN

L(exit_tail3):
add $3, %eax
RETURN

L(exit_tail4):
add $4, %eax
RETURN

L(exit_tail5):
add $5, %eax
RETURN

L(exit_tail6):
add $6, %eax
RETURN

L(exit_tail7):
add $7, %eax
RETURN

L(exit_tail8):
add $8, %eax
RETURN

L(exit_tail9):
add $9, %eax
RETURN

L(exit_tail10):
add $10, %eax
RETURN

L(exit_tail11):
add $11, %eax
RETURN

L(exit_tail12):
add $12, %eax
RETURN

L(exit_tail13):
add $13, %eax
RETURN

L(exit_tail14):
add $14, %eax
RETURN

L(exit_tail15):
add $15, %eax
#ifndef USE_AS_STRCAT
RETURN
END (STRLEN)
#endif

Gunther

  • Member
  • *****
  • Posts: 3585
  • Forgive your enemies, but never forget their names
Re: Faster Memcopy ...
« Reply #68 on: March 23, 2015, 07:08:14 AM »
Hi nidud,

Quote
Here is the Intel version of strlen from 2011:

interesting, but AT&T syntax. Not the best choice. Fortunately the conversion into Intel syntax isn't very hard.

Gunther
Get your facts first, and then you can distort them.

rrr314159

  • Member
  • *****
  • Posts: 1382
Re: Faster Memcopy ...
« Reply #69 on: March 23, 2015, 08:55:27 AM »
Quote
interesting, but AT&T syntax. Not the best choice. Fortunately the conversion into Intel syntax isn't very hard.

Not hard, but tedious! If someone wants to do it and post the results, that would be appreciated. There are lots of defines like L(label) which (AFAIK) must be made into function MACROs - or else you replace all the occurrences by hand. There are the "." labels to fix. There are the multi-line statements, with ;'s separating them, to deal with; and of course you have to reverse the operand order of about 400 instructions. The #'s, %'s and the $'s are easy. Normally this routine would be about 20 or 30 lines (in fact that's the whole point - he's unrolled everything to the max) and the job would be trivial. OTOH if there were a few thousand lines, it would be rather fun to make .bat files and MACROs to automate the conversion. But for 700 lines it's not really worth the effort. Tedious. Oh, and don't forget to fix that egregious spelling mistake "calee".
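
For anyone who takes it on, the mechanical part of the mapping is roughly this (a sketch only - operand order flips, the $ immediates and % register prefixes drop, memory operands move into brackets, and the L(label) macro becomes a plain label such as the made-up exit_label here):
Code: [Select]
    mov      edx,[esp+4]       ; AT&T: mov STR(%esp), %edx    (STR = 4)
    and      eax,-16           ; AT&T: and $-16, %eax
    pcmpeqb  xmm0,[eax]        ; AT&T: pcmpeqb (%eax), %xmm0
    pmovmskb edx,xmm0          ; AT&T: pmovmskb %xmm0, %edx
    lea      eax,[eax+16]      ; AT&T: lea 16(%eax), %eax
    jnz      exit_label        ; AT&T: jnz L(exit)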

This brings up a question: why is it in AT&T syntax? Obviously this routine is not for Windows. Note that he uses xmm0-xmm3 then skips to xmm6; in Windows you'd use 4 and 5, wouldn't you? The reason it might matter: what is the target (type of) processor? If it's for handhelds it may not be optimal for us - those processors might not have good branch prediction, for instance, or other differences. I don't know, you understand, but he has some rather surprising techniques. For example: he avoids branches like the plague, avoids bsf, prefers pminub to pcmpeqb, and other little things. These might be great techniques for us, or, as I wonder, they might be designed for different processors (say, Atom) or environments. It also seems optimized for very small strings.

And, it doesn't seem very professional. All these lines, and he still doesn't have a MAXLEN - if the string isn't terminated the routine will dive right off the deep end. And why is Intel making it public anyway? Is this really their best effort? I'll bet it's not what Windows is using.

Anyway, it would be nice if someone would translate it; I'll be happy to test it. I wouldn't be amazed if it's not all that impressive speed-wise in our typical PC environment.

[edit] Of course I'm pretty ignorant about these issues, don't mean to pretend I know what I'm talking about. Just some thoughts/questions that occurred to me
I am NaN ;)

nidud

  • Member
  • *****
  • Posts: 1980
    • https://github.com/nidud/asmc
Re: Faster Memcopy ...
« Reply #70 on: March 23, 2015, 11:04:49 PM »
Well, it was just to illustrate the point about code size, and yes this code is made for the Atom processor.

This one is made for the Silvermont in 2014. This is more compact and similar to the ones we worked with.

Code: [Select]
/*
Copyright (c) 2014, Intel Corporation
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

    * Redistributions of source code must retain the above copyright notice,
    * this list of conditions and the following disclaimer.

    * Redistributions in binary form must reproduce the above copyright notice,
    * this list of conditions and the following disclaimer in the documentation
    * and/or other materials provided with the distribution.

    * Neither the name of Intel Corporation nor the names of its contributors
    * may be used to endorse or promote products derived from this software
    * without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

#ifndef STRLEN
# define STRLEN strlen
#endif

#ifndef L
# define L(label) .L##label
#endif

#ifndef cfi_startproc
# define cfi_startproc .cfi_startproc
#endif

#ifndef cfi_endproc
# define cfi_endproc .cfi_endproc
#endif

#ifndef cfi_rel_offset
# define cfi_rel_offset(reg, off) .cfi_rel_offset reg, off
#endif

#ifndef cfi_restore
# define cfi_restore(reg) .cfi_restore reg
#endif

#ifndef cfi_adjust_cfa_offset
# define cfi_adjust_cfa_offset(off) .cfi_adjust_cfa_offset off
#endif

#ifndef ENTRY
# define ENTRY(name)             \
.type name,  @function;  \
.globl name;             \
.p2align 4;              \
name:                            \
cfi_startproc
#endif

#ifndef END
# define END(name)               \
cfi_endproc;             \
.size name, .-name
#endif

#define CFI_PUSH(REG)                   \
cfi_adjust_cfa_offset (4);      \
cfi_rel_offset (REG, 0)

#define CFI_POP(REG)                    \
cfi_adjust_cfa_offset (-4);     \
cfi_restore (REG)

#define PUSH(REG) pushl REG; CFI_PUSH (REG)
#define POP(REG) popl REG; CFI_POP (REG)

.section .text.sse2,"ax",@progbits
ENTRY (STRLEN)
mov 4(%esp), %edx
mov %edx, %ecx
and $0x3f, %ecx
pxor %xmm0, %xmm0
cmp $0x30, %ecx
ja L(next)
movdqu (%edx), %xmm1
pcmpeqb %xmm1, %xmm0
pmovmskb %xmm0, %ecx
test %ecx, %ecx
jnz L(exit_less16)
mov %edx, %eax
and $-16, %eax
jmp L(align16_start)
L(next):
mov %edx, %eax
and $-16, %eax
PUSH (%edi)
pcmpeqb (%eax), %xmm0
mov $-1, %edi
sub %eax, %ecx
shl %cl, %edi
pmovmskb %xmm0, %ecx
and %edi, %ecx
POP (%edi)
jnz L(exit_unaligned)
pxor %xmm0, %xmm0
L(align16_start):
pxor %xmm1, %xmm1
pxor %xmm2, %xmm2
pxor %xmm3, %xmm3
pcmpeqb 16(%eax), %xmm0
pmovmskb %xmm0, %ecx
test %ecx, %ecx
jnz L(exit16)

pcmpeqb 32(%eax), %xmm1
pmovmskb %xmm1, %ecx
test %ecx, %ecx
jnz L(exit32)

pcmpeqb 48(%eax), %xmm2
pmovmskb %xmm2, %ecx
test %ecx, %ecx
jnz L(exit48)

pcmpeqb 64(%eax), %xmm3
pmovmskb %xmm3, %ecx
test %ecx, %ecx
jnz L(exit64)

pcmpeqb 80(%eax), %xmm0
add $64, %eax
pmovmskb %xmm0, %ecx
test %ecx, %ecx
jnz L(exit16)

pcmpeqb 32(%eax), %xmm1
pmovmskb %xmm1, %ecx
test %ecx, %ecx
jnz L(exit32)

pcmpeqb 48(%eax), %xmm2
pmovmskb %xmm2, %ecx
test %ecx, %ecx
jnz L(exit48)

pcmpeqb 64(%eax), %xmm3
pmovmskb %xmm3, %ecx
test %ecx, %ecx
jnz L(exit64)

pcmpeqb 80(%eax), %xmm0
add $64, %eax
pmovmskb %xmm0, %ecx
test %ecx, %ecx
jnz L(exit16)

pcmpeqb 32(%eax), %xmm1
pmovmskb %xmm1, %ecx
test %ecx, %ecx
jnz L(exit32)

pcmpeqb 48(%eax), %xmm2
pmovmskb %xmm2, %ecx
test %ecx, %ecx
jnz L(exit48)

pcmpeqb 64(%eax), %xmm3
pmovmskb %xmm3, %ecx
test %ecx, %ecx
jnz L(exit64)

pcmpeqb 80(%eax), %xmm0
add $64, %eax
pmovmskb %xmm0, %ecx
test %ecx, %ecx
jnz L(exit16)

pcmpeqb 32(%eax), %xmm1
pmovmskb %xmm1, %ecx
test %ecx, %ecx
jnz L(exit32)

pcmpeqb 48(%eax), %xmm2
pmovmskb %xmm2, %ecx
test %ecx, %ecx
jnz L(exit48)

pcmpeqb 64(%eax), %xmm3
pmovmskb %xmm3, %ecx
test %ecx, %ecx
jnz L(exit64)


test $0x3f, %eax
jz L(align64_loop)

pcmpeqb 80(%eax), %xmm0
add $80, %eax
pmovmskb %xmm0, %ecx
test %ecx, %ecx
jnz L(exit)

test $0x3f, %eax
jz L(align64_loop)

pcmpeqb 16(%eax), %xmm1
add $16, %eax
pmovmskb %xmm1, %ecx
test %ecx, %ecx
jnz L(exit)

test $0x3f, %eax
jz L(align64_loop)

pcmpeqb 16(%eax), %xmm2
add $16, %eax
pmovmskb %xmm2, %ecx
test %ecx, %ecx
jnz L(exit)

test $0x3f, %eax
jz L(align64_loop)

pcmpeqb 16(%eax), %xmm3
add $16, %eax
pmovmskb %xmm3, %ecx
test %ecx, %ecx
jnz L(exit)

add $16, %eax
.p2align 4
L(align64_loop):
movaps (%eax), %xmm4
pminub 16(%eax), %xmm4
movaps 32(%eax), %xmm5
pminub 48(%eax), %xmm5
add $64, %eax
pminub %xmm4, %xmm5
pcmpeqb %xmm0, %xmm5
pmovmskb %xmm5, %ecx
test %ecx, %ecx
jz L(align64_loop)


pcmpeqb -64(%eax), %xmm0
sub $80, %eax
pmovmskb %xmm0, %ecx
test %ecx, %ecx
jnz L(exit16)

pcmpeqb 32(%eax), %xmm1
pmovmskb %xmm1, %ecx
test %ecx, %ecx
jnz L(exit32)

pcmpeqb 48(%eax), %xmm2
pmovmskb %xmm2, %ecx
test %ecx, %ecx
jnz L(exit48)

pcmpeqb 64(%eax), %xmm3
pmovmskb %xmm3, %ecx
sub %edx, %eax
bsf %ecx, %ecx
add %ecx, %eax
add $64, %eax
ret

.p2align 4
L(exit):
sub %edx, %eax
bsf %ecx, %ecx
add %ecx, %eax
ret

L(exit_less16):
bsf %ecx, %eax
ret

.p2align 4
L(exit_unaligned):
sub %edx, %eax
bsf %ecx, %ecx
add %ecx, %eax
ret

.p2align 4
L(exit16):
sub %edx, %eax
bsf %ecx, %ecx
add %ecx, %eax
add $16, %eax
ret

.p2align 4
L(exit32):
sub %edx, %eax
bsf %ecx, %ecx
add %ecx, %eax
add $32, %eax
ret

.p2align 4
L(exit48):
sub %edx, %eax
bsf %ecx, %ecx
add %ecx, %eax
add $48, %eax
ret

.p2align 4
L(exit64):
sub %edx, %eax
bsf %ecx, %ecx
add %ecx, %eax
add $64, %eax
ret

END (STRLEN)

Gunther

  • Member
  • *****
  • Posts: 3585
  • Forgive your enemies, but never forget their names
Re: Faster Memcopy ...
« Reply #71 on: March 24, 2015, 12:36:04 AM »
Hi nidud,

could you post the entire code as an attachment, please? That would make the conversion much easier for me. Thank you.

Gunther
Get your facts first, and then you can distort them.

nidud

  • Member
  • *****
  • Posts: 1980
    • https://github.com/nidud/asmc
Re: Faster Memcopy ...
« Reply #72 on: March 24, 2015, 01:13:21 AM »
I converted and tested both of them.
Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)
----------------------------------------------
-- string length 0..40
862314    cycles - (  0) proc_0: crt_strlen
513683    cycles - (105) proc_1: SSE 32
653387    cycles - (104) proc_2: AxStrLenSSE
641895    cycles - ( 86) proc_3: AxStrLenSSE (rrr)
561687    cycles - (673) proc_8: SSE Intel Silvermont
726553    cycles - (818) proc_9: SSE Intel Atom
-- string length 40..80
940951    cycles - (  0) proc_0: crt_strlen
331639    cycles - (105) proc_1: SSE 32
474770    cycles - (104) proc_2: AxStrLenSSE
478938    cycles - ( 86) proc_3: AxStrLenSSE (rrr)
383493    cycles - (673) proc_8: SSE Intel Silvermont
469439    cycles - (818) proc_9: SSE Intel Atom
-- string length 600..1000
464667    cycles - (  0) proc_0: crt_strlen
111167    cycles - (105) proc_1: SSE 32
162923    cycles - (104) proc_2: AxStrLenSSE
166028    cycles - ( 86) proc_3: AxStrLenSSE (rrr)
105110    cycles - (673) proc_8: SSE Intel Silvermont
120176    cycles - (818) proc_9: SSE Intel Atom

result:
   956489 cycles - (105) proc_1: SSE 32
  1050290 cycles - (673) proc_8: SSE Intel Silvermont
  1286861 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
  1291080 cycles - (104) proc_2: AxStrLenSSE
  1316168 cycles - (818) proc_9: SSE Intel Atom
  2267932 cycles - (  0) proc_0: crt_strlen

The Intel files are 8.asm and 9.asm in the archive

jj2007

  • Member
  • *****
  • Posts: 10545
  • Assembler is fun ;-)
    • MasmBasic
Re: Faster Memcopy ...
« Reply #73 on: March 24, 2015, 02:30:50 AM »
I must confess I don't understand what your testbed is showing. Can you explain?

Quote
-- string length 0..40
862314    cycles - (  0) proc_0: crt_strlen
513683    cycles - (105) proc_1: SSE 32
-- string length 40..80
940951    cycles - (  0) proc_0: crt_strlen
331639    cycles - (105) proc_1: SSE 32

- what are the numbers in brackets?
- why does SSE 32 run faster with the longer string?

nidud

  • Member
  • *****
  • Posts: 1980
    • https://github.com/nidud/asmc
Re: Faster Memcopy ...
« Reply #74 on: March 24, 2015, 02:47:14 AM »
Quote
what are the numbers in brackets?

The size of the proc, obtained by reading the .BIN file.
The size of crt_strlen is unknown.

Quote
why does SSE 32 run faster with the longer string?

It's a matter of priority for the end result: A 40 byte string is more important than a 1000 byte string in this case.