Author Topic: Faster Memcopy ...  (Read 32894 times)

nidud

  • Member
  • *****
  • Posts: 1799
    • https://github.com/nidud/asmc
Re: Faster Memcopy ...
« Reply #30 on: March 11, 2015, 12:55:50 AM »
Code: [Select]
1) test lower bits of address register for desired alignment
   branch if aligned

consider this:
Code: [Select]
opcode   size   AMD   Intel
JMP      [2]    30.5    8.1
JNZ      [2]    30.1   10.7 ; jump
JZ       [2]     1.0    8.0 ; no jump

jumping around costs, in the worst case, about 30 cycles

the jump situation:
Code: [Select]
mov edi,eax ; 1.0
test edi, 15 ; 1.0
jz aligned ; 1.0/30.0
neg edi ; 2.5
and edi,16-1; 2.5
add esi,edi ; 2.5
sub ecx,edi ; 2.5
add edi,eax ; 2.5
unaligned: ; 15.0
aligned: ; 32.0

the no jump situation:
Code: [Select]
mov edi,eax ; 1.0
neg edi ; 2.5
and edi,16-1; 2.5
add esi,edi ; 2.5
sub ecx,edi ; 2.5
add edi,eax ; 2.5
justdoit: ; 13.5

dedndave

  • Member
  • *****
  • Posts: 8823
  • Still using Abacus 2.0
    • DednDave
Re: Faster Memcopy ...
« Reply #31 on: March 11, 2015, 02:37:18 AM »
while that's true, to some extent

you can't avoid checking cases to get alignment   :P
that means some form of conditional branching

if you play with it, i think you'll find that JNZ is faster for reverse branches and JZ is faster for forward branches   :t
you can sometimes design your code to take advantage

you might also try a jump table - haven't tried that, but it might be faster
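For illustration, a sketch of what such a dispatch might look like (hypothetical, untested code - the labels and the 4-entry table are just for the example): index a table of labels by the 0..3 tail bytes left over after a dword copy loop.
Code: [Select]
; esi/edi assumed to point just past the dword-copied part
    mov     edx,ecx
    and     edx,3                   ; 0..3 bytes remaining
    jmp     dword ptr tail_tab[edx*4]
tail_tab dd tail_0, tail_1, tail_2, tail_3
tail_3: mov     al,[esi+2]          ; copy the third byte
        mov     [edi+2],al
tail_2: mov     ax,[esi]            ; copy the first two bytes
        mov     [edi],ax
        jmp     tail_0
tail_1: mov     al,[esi]            ; copy the single byte
        mov     [edi],al
tail_0: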

nidud

  • Member
  • *****
  • Posts: 1799
    • https://github.com/nidud/asmc
Re: Faster Memcopy ...
« Reply #32 on: March 11, 2015, 03:37:13 AM »
you can't avoid checking cases to get alignment   :P
that means some form of conditional branching

I think I just did  :lol:

If you take care of the aligned start and tail bytes, which is only 4 moves, it doesn't matter if the pointer is aligned or not. If it is aligned, EDI is 0 and nothing changes. However, you need a count of at least the alignment size to do this, so one conditional branch is needed.
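To make that concrete, a minimal sketch of the idea (a hypothetical proc, not the posted mem3/mem4; register use is illustrative, and a count of at least 16 is assumed - that is the one conditional branch left to the caller):
Code: [Select]
xcopy proc uses esi edi dst:dword, src:dword, count:dword
    mov     edi,dst
    mov     esi,src
    mov     ecx,count               ; assumed >= 16 here
    mov     eax,edi                 ; keep the original destination
    movdqu  xmm2,[esi]              ; head, 16 bytes (may be unaligned)
    movdqu  xmm3,[esi+ecx-16]       ; tail, 16 bytes (may overlap the head)
    mov     edx,edi
    neg     edx
    and     edx,16-1                ; bytes needed to align the destination
    add     esi,edx                 ; if EDI was already aligned EDX is 0
    add     edi,edx                 ; and nothing changes - no branch needed
    sub     ecx,edx
    and     ecx,-16                 ; whole 16-byte blocks for the main loop
    jz      tail                    ; nothing left for the aligned loop
    add     esi,ecx                 ; bias the pointers so the counter
    add     edi,ecx                 ; can run up to zero
    neg     ecx
copy_loop:
    movdqu  xmm0,[esi+ecx]
    movdqa  [edi+ecx],xmm0          ; the destination is now 16-byte aligned
    add     ecx,16
    jnz     copy_loop
tail:
    mov     ecx,count
    movdqu  [eax],xmm2              ; the 4 head/tail moves cover whatever
    movdqu  [eax+ecx-16],xmm3       ; the aligned loop left out at both ends
    ret
xcopy endp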

Quote
if you play with it, i think you'll find that JNZ is faster for reverse branches and JZ is faster for forward branches   :t
you can sometimes design your code to take advantage

Haven't thought about that, but that seems to be the case:  :P
Code: [Select]
align 16
loop_L:
sub ecx,16
movdqu  xmm0,[esi+ecx]
movdqa  [edi+ecx],xmm0
jnz loop_L
...
neg ecx
align 16
loop_R:
movdqu  xmm0,[esi+ecx]
movdqa  [edi+ecx],xmm0
add ecx,16
jnz loop_R

Quote
you might also try a jump table - haven't tried that, but it might be faster

it’s painfully slow - crt_memcpy uses that
but then again, no need to jump around  :biggrin:

rrr314159

  • Member
  • *****
  • Posts: 1382
Re: Faster Memcopy ...
« Reply #33 on: March 11, 2015, 06:11:08 AM »
Hi nidud,

This is a very interesting investigation, unfortunately I don't have time to get into it at the moment, but in a couple days I'd really like to do so. The following comments are (deliberately) w/o reference to anything but my organic memory subsystem, ...

First - since we're getting down to details - do a global search and replace on my algo, change "ebx" to "edx". Of course, it has no effect on timing.

Now, your jump timings are right in effect - that is, the conclusion drawn is right: it's better to skip that branch. I noticed you did that but didn't bother to change my code, thinking it was meaningless; but it seems it's more significant. I say "in effect" because in actuality, jmps take far fewer cycles than the numbers you give: 13.5 and, uh, (alright, I'll take a look ...) 30.1, 10.7 etc. Yet those are reasonable expected values because of branch prediction. The reason for avoiding jumps is not that they're inherently that long; rather, if a branch is incorrectly predicted, it costs many cycles while the pipeline is cleared and reloaded. So you can't come up with exact numbers without many more details. It's not only different for different companies (as you mention, AMD and Intel): it's different for the same processor, depending on recent history - even the very first time the branch is encountered - because other nearby branches can affect prediction for this branch. Thus a detailed analysis is inherently probabilistic. As I say, the numbers you give look like good expected values (presumably you got them from some reference that did such analysis, or a performance analyzer?) and they give the correct conclusion: the branch shouldn't be there.

I only glanced thru your algo (mem3), but here's what I noticed. There was a clever substitute for the initial pushes which saved either a byte or a cycle, I forget which. The comparison tests were different; mine, I felt, were more intuitive. Other small details amount either to nothing or not much; a couple of slightly better techniques which I didn't bother to borrow (like the push substitute - next time I'll use them); a couple of places where I think my technique is a little better. Half the algo I didn't read, but I presume there's nothing too important. But I removed almost all alignments (from your mod'ed version of my algo); my tests showed they were basically insignificant (maybe a cycle or two) and I wanted to reduce the code size, enhancing cache-fit probability by half a percent if that.

My point is this. The differences between the two algos are minimal (to my knowledge); but now you're showing that your branch elimination saves quite a few cycles, more than I had supposed. So why, then, is my algo - more accurately, my version of the same algo, since these diffs are virtually just a matter of style - why is it faster? (by a small, but significant, amount?) Since you apparently have gone thru them with a fine-tooth comb, presumably you must know. There's something about mem4, versus mem2, that's saving more than just a cycle or two - it must be quite noticeable - so, what is it?

I'm not kidding about a couple days: happy to investigate this further with you when I have more time.
I am NaN ;)

nidud

  • Member
  • *****
  • Posts: 1799
    • https://github.com/nidud/asmc
Re: Faster Memcopy ...
« Reply #34 on: March 12, 2015, 06:20:59 AM »
Quote
First - since we're getting down to details - do a global search and replace on my algo, change "ebx" to "edx". Of course, it has no effect on timing.

I notice that EBX pops up randomly in your code while EDX is seldom used  :biggrin:

Quote
Now, your jump timings are right in effect - that is, the conclusion drawn is right: it's better to skip that branch. I noticed you did that but didn't bother to change my code, thinking it was meaningless; but it seems it's more significant. I say "in effect" because in actuality, jmps take far fewer cycles than the numbers you give: 13.5 and, uh, (alright, I'll take a look ...) 30.1, 10.7 etc. Yet those are reasonable expected values because of branch prediction. The reason for avoiding jumps is not that they're inherently that long; rather, if a branch is incorrectly predicted, it costs many cycles while the pipeline is cleared and reloaded. So you can't come up with exact numbers without many more details. It's not only different for different companies (as you mention, AMD and Intel): it's different for the same processor, depending on recent history - even the very first time the branch is encountered - because other nearby branches can affect prediction for this branch. Thus a detailed analysis is inherently probabilistic. As I say, the numbers you give look like good expected values (presumably you got them from some reference that did such analysis, or a performance analyzer?) and they give the correct conclusion: the branch shouldn't be there.

True. The jump estimate is a worst-case value and depends on many things, as you say. The timings are based on a short jump of 2 bytes and calibrated against a MOV instruction (1.0). Using the current test bed to illustrate this by adjusting the loop count:
Code: [Select]
Intel(R) Core(TM) i3-2367M CPU @ 1.40GHz (AVX)
----------------------------------------------
    96030 cycles - proc_1: test eax,eax
   100605 cycles - proc_0: mov eax,1
   164682 cycles - proc_9: movdqu xmm0,[eax]
   271729 cycles - proc_8: dec eax
   272095 cycles - proc_7: neg eax
   285086 cycles - proc_5: jz  @F no jump
   570121 cycles - proc_6: jnz @F +7
   570438 cycles - proc_4: jmp @F +5
   574778 cycles - proc_3: jmp @F +3
   593173 cycles - proc_2: jmp @F +2

AMD Athlon(tm) II X2 245 Processor (SSE3)
----------------------------------------------
   100283 cycles - proc_1: test eax,eax
   100287 cycles - proc_0: mov eax,1
   154779 cycles - proc_5: jz  @F no jump
   176163 cycles - proc_9: movdqu xmm0,[eax]
   271020 cycles - proc_8: dec eax
   274000 cycles - proc_7: neg eax
   569227 cycles - proc_6: jnz @F +7
   928449 cycles - proc_4: jmp @F +5
  2570589 cycles - proc_3: jmp @F +3
  2941084 cycles - proc_2: jmp @F +2

The AMD/Intel timing will then be:
Code: [Select]
jz  <>  ;  1.6  2.9
jnz +7  ;  5.7  5.7
jmp +5  ;  9.3  5.7
jmp +3  ; 25.7  5.7
jmp +2  ; 29.4  5.9
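For reference, the sort of loop behind such numbers might look roughly like this (a hypothetical sketch, not the actual test bed - no serializing instruction, no averaging over runs):
Code: [Select]
    rdtsc
    mov     esi,eax                 ; start count (low dword is enough here)
    mov     ecx,100000              ; loop count
@@:
    mov     eax,1                   ; <- instruction under test (the MOV baseline)
    dec     ecx
    jnz     @B
    rdtsc
    sub     eax,esi                 ; ~cycles for the whole loop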

AMD has a short test (1.6) in this case but a long jump (25.7):
Code: [Select]
.if eax
    inc ecx
.endif ; 1.6+2.7/25.7 (AMD) cycles
...
.if eax
    add ecx,1
.endif ; 1.6+2.7/ 9.3 (AMD) cycles

Quote
I only glanced thru your algo (mem3), but here's what I noticed. There was a clever substitute for the initial pushes which saved - either a byte or a cycle, I forget which.

This uses EAX instead of ESP to save one byte. This has to do with code size and alignment. This is important for speed, but many systems are less picky in this regard. However, most new machines (and almost all Intel CPUs) work best on aligned loops. The ideal format would look something like this:
Code: [Select]
00000000:
; init code in para
00000040:
; copy loop
ret
align 16
00000080:
; handle short lengths and overlapped data
ret

There is also "real" and "fake" alignment in this case. The latter adds cycles in the form of code like NOP, LEA EAX,[EAX], LEA ESP,[ESP] and so on to pad bytes up to a specific offset. So it is better to keep the alignment as real as possible.

In the first algo I needed to extend the code and used DWORD PTR -16 to add 3 bytes:
Code: [Select]
00000038  03F8 add edi,eax
0000003A  81E1F0FFFFFF and ecx,dword ptr -16 ; align EIP and ECX 16
00000040 align 16
00000040 loop_L:

if 3 bytes are added above, this happens:
Code: [Select]
0000003B  03F8 add edi,eax
0000003D  81E1F0FFFFFF and ecx,dword ptr -16 ; align EIP and ECX 16
00000043  8DA424000000008D80 align 16
00000050 loop_L:

you may then compensate by removing the DWORD PTR extension:
Code: [Select]
0000003B  03F8 add edi,eax
0000003D  83E1F0 and ecx,-16 ; align EIP and ECX 16
00000040 align 16
00000040 loop_L:

Quote
The comparison tests were different; mine, I felt, were more intuitive. Other small details amount either to nothing or not much; a couple of slightly better techniques which I didn't bother to borrow (like the push substitute - next time I'll use them); a couple of places where I think my technique is a little better. Half the algo I didn't read, but I presume there's nothing too important. But I removed almost all alignments (from your mod'ed version of my algo); my tests showed they were basically insignificant (maybe a cycle or two) and I wanted to reduce the code size, enhancing cache-fit probability by half a percent if that.

If you think about it, the code above (and below) the loop label is peanuts compared to what happens in the main loop, so that's the main priority. The init and exit code will only hit you when multiple calls are made with small lengths.
Code: [Select]
--------- un/aligned: 262159

Intel(R) Core(TM) i3-2367M CPU @ 1.40GHz (AVX)
  1778875 cycles - proc_7: SSE - 128 byte mem2(rrr)
  1782810 cycles - proc_8: SSE - 128 byte *
  1783226 cycles - proc_6: SSE - 128 byte modified
  1787929 cycles - proc_3: SSE -  16 byte *
  1789860 cycles - proc_4: SSE -  64 byte *
  1793160 cycles - proc_5: SSE - 128 byte xmemcpy(rrr)
  1875893 cycles - proc_0: ??? crt_memcpy *
  3195546 cycles - proc_2: 386 -   4 byte *
 10433134 cycles - proc_1: AVX -   1 byte *

AMD Athlon(tm) II X2 245 Processor (SSE3)
  2562789 cycles - proc_8: SSE - 128 byte *
  2573278 cycles - proc_4: SSE -  64 byte *
  2647475 cycles - proc_3: SSE -  16 byte *
  2721439 cycles - proc_6: SSE - 128 byte modified
  2732438 cycles - proc_7: SSE - 128 byte mem2(rrr)
  2749617 cycles - proc_5: SSE - 128 byte xmemcpy(rrr)
  3495427 cycles - proc_0: ??? crt_memcpy *
  3610008 cycles - proc_2: 386 -   4 byte *
 10119189 cycles - proc_1: AVX -   1 byte *

Quote
My point is this. The differences between the two algos are minimal (to my knowledge); but now you're showing that your branch elimination saves quite a few cycles, more than I had supposed. So why, then, is my algo - more accurately, my version of the same algo, since these diffs are virtually just a matter of style - why is it faster? (by a small, but significant, amount?) Since you apparently have gone thru them with a fine-tooth comb, presumably you must know. There's something about mem4, versus mem2, that's saving more than just a cycle or two - it must be quite noticeable - so, what is it?

It’s difficult to say given they seem not so different in these tests.

Code: [Select]
--------- short: 2014

Intel(R) Core(TM) i3-2367M CPU @ 1.40GHz (AVX)
   141012 cycles - proc_8: SSE - 128 byte *
   141209 cycles - proc_4: SSE -  64 byte *
   165921 cycles - proc_6: SSE - 128 byte modified
   166576 cycles - proc_7: SSE - 128 byte mem2(rrr)
   194728 cycles - proc_5: SSE - 128 byte xmemcpy(rrr)
   278250 cycles - proc_3: SSE -  16 byte *
   480782 cycles - proc_0: ??? crt_memcpy *
   780798 cycles - proc_2: 386 -   4 byte *
  2246478 cycles - proc_1: AVX -   1 byte *

AMD Athlon(tm) II X2 245 Processor (SSE3)
   212251 cycles - proc_4: SSE -  64 byte *
   228560 cycles - proc_8: SSE - 128 byte *
   251544 cycles - proc_7: SSE - 128 byte mem2(rrr)
   253457 cycles - proc_6: SSE - 128 byte modified
   267874 cycles - proc_3: SSE -  16 byte *
   293842 cycles - proc_5: SSE - 128 byte xmemcpy(rrr)
   559291 cycles - proc_0: ??? crt_memcpy *
   756845 cycles - proc_2: 386 -   4 byte *
  1987521 cycles - proc_1: AVX -   1 byte *

This is the copy algo for 0..31 byte
Code: [Select]
--------- short: 1..15

Intel(R) Core(TM) i3-2367M CPU @ 1.40GHz (AVX)
  3889571 cycles - proc_5: SSE -  16 byte *
  4850792 cycles - proc_2: 386 -   4 byte *
  4887667 cycles - proc_4: SSE - 128 byte mem2(rrr)
  5897787 cycles - proc_0: ??? crt_memcpy *
  8191556 cycles - proc_3: SSE - 128 byte xmemcpy(rrr)
 21950067 cycles - proc_1: AVX -   1 byte *

AMD Athlon(tm) II X2 245 Processor (SSE3)
  4158723 cycles - proc_5: SSE -  16 byte *
  4572560 cycles - proc_4: SSE - 128 byte mem2(rrr)
  5055317 cycles - proc_2: 386 -   4 byte *
  6617426 cycles - proc_0: ??? crt_memcpy *
  9515503 cycles - proc_1: AVX -   1 byte *
 10004988 cycles - proc_3: SSE - 128 byte xmemcpy(rrr)

I think the copying of the short lengths is a bit faster, but that code is only used if the size is < 32. The main difference at larger sizes is the lack of head and tail-byte copying, which is done in 4 moves:

Code: [Select]
movdqu  xmm0,[esi]
movdqu  xmm1,[esi+ecx-16]
movdqu  [eax],xmm0
movdqu  [eax+ecx-16],xmm1

I included the tests in an archive if you're interested. I mostly use command-line tools and makefiles for the tests, so I included some binaries.

rrr314159

  • Member
  • *****
  • Posts: 1382
Re: Faster Memcopy ...
« Reply #35 on: March 12, 2015, 08:24:13 AM »
@nidud, Thanks for providing your test software, in future I'll use some of those (better) techniques.

But I think I've answered my own question of why my algo does a bit better when timed my way instead of yours. I'm using the standard approach: not loading .bin algos or FlushInstructionCache'ing. (These are good things to do, thanks for the tip.) But doing it the usual way, when the dest is aligned, the branch (around the aligning instructions) is hit many thousands of times and *always* taken. So the processor predicts the branch perfectly. That jump costs almost nothing, a couple of cycles (not sure exactly), whereas you're executing six instructions there (13.5 cycles). This matches my results: my algo is a bit ahead on ALIGNED data, but a bit behind on non-aligned, since not taking the jump costs a little. Conclusion:

- rrr's rule for not-very-careful timing test beds: Conditional branches (that depend in a simple way on the inputs) are almost free, so use as many as you want!

Reminds me of something hutch said, "I have always been an advocate of real time testing ...". In real life the environments will be very different, and variable, you don't know what you'll run into, so:

- rrr's timing statistical rule of thumb: ignore differences smaller than about 5%.

I think you should update your scoring routine to allow ties!

A more important thought re. branch prediction occurred to me. When a utility routine like memory copy is called in real life, there's a decent chance the branch to it wasn't predicted right, so possibly the pipeline is going to be empty (other factors may cause the same thing). So, always put a (necessary) branch right at the beginning if possible, where a misprediction might be almost free.
 
Quote from: nidud
However, most new machines (and almost all Intel CPUs) work best on aligned loops.

I'm sure you're right, but I saw approximately no timing difference with "align x". At least in one important case, it turned out the loop was (accidentally) already aligned. But even when not, no real difference: dunno why.

Quote from: nidud
If you think about it, the code above (and below) the loop label is peanuts compared to what happens in the main loop, so that's the main priority. The init and exit code will only hit you when multiple calls are made with small lengths.

Your first sentence is true when called with large blocks. But your second sentence explains why the code above (or below) the loop label *can be* important: your user may indeed be using it for lots of small lengths. That's why, if one can save a cycle or two (for instance, by the placement of the "toend" code), it should be saved in favor of small lengths rather than long lengths.

Finally, something you should know: if your routines are called to copy <=15 bytes, when PAGE_NOACCESS protected memory happens to be just beyond the source buffer, they'll blow up :P

c u later,
I am NaN ;)

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 6758
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Re: Faster Memcopy ...
« Reply #36 on: March 12, 2015, 08:45:05 AM »
I have also learnt a few lessons on code alignment. Regularly, while bashing the guts out of an algo to get its pace up, I have tried aligning labels that get hit hard and seen the timing take a lot longer; much the same with procedure alignment - align at 4 bytes and it works OK, change it to 8 or 16 bytes and it often slows right down. Then it's worth shifting the location of the algo around in the exe, to make sure you are not deluding yourself, and seeing what the results are.

I have tried spacing algos from each other with padding from under 1k to over 1 megabyte, but that was more or less useless, so I usually settle on a delay of about 500 ms between each algo and set the priority class to high, but not the one higher. As hardware gets smarter, timings wander even further.  :biggrin:
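In MASM32 terms that setup might look something like this (a rough sketch of just the settings described above):
Code: [Select]
    invoke  GetCurrentProcess
    invoke  SetPriorityClass,eax,HIGH_PRIORITY_CLASS  ; high, but not the one higher
    ; ... run and time one algo here ...
    invoke  Sleep,500                                 ; ~500 ms delay between algos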
hutch at movsd dot com
http://www.masm32.com    :biggrin:  :skrewy:

rrr314159

  • Member
  • *****
  • Posts: 1382
Re: Faster Memcopy ...
« Reply #37 on: March 12, 2015, 09:07:07 AM »
Quote from: hutch
As hardware gets smarter, timings wander even further.

Very true, meanwhile I'm getting slower ... but, I'm reminded of something hutch once said,

Quote from: hutch
Don't give up on timing, its good for you and as an aside you tend to pop faster algos if you do it regularly. You get faster at it over time and with more practice you tune the timing to the task the algo will perform a lot better.

Also true! One good thing: it gives the lie to the C programmers' favorite saying, "the compiler does it better than you". Unless the compiler writers are down in the trenches timing away - as opposed to just believing Intel/AMD/Agner, as I think they do - they can't match a real MASMologist.
I am NaN ;)

nidud

  • Member
  • *****
  • Posts: 1799
    • https://github.com/nidud/asmc
Re: Faster Memcopy ...
« Reply #38 on: March 12, 2015, 09:22:35 AM »
Finally, something you should know: if your routines are called to copy <=15 bytes, when PAGE_NOACCESS protected memory happens to be just beyond the source buffer, they'll blow up :P

 :biggrin:

I was thinking the opposite, since the first version also copies the last bytes.
Code: [Select]
movdqu  xmm2,[esi]
movdqu  xmm3,[esi+ecx-16]
I moved this below the jump without thinking ahead  :P

well, good catch  :t

nidud

  • Member
  • *****
  • Posts: 1799
    • https://github.com/nidud/asmc
Re: Faster Memcopy ...
« Reply #39 on: March 19, 2015, 02:39:37 AM »
Finally, something you should know: if your routines are called to copy <=15 bytes, when PAGE_NOACCESS protected memory happens to be just beyond the source buffer, they'll blow up :P

The read-ahead technique seems to be widely used. Here is the Intel version of STRLEN:

Code: [Select]
        add     eax,dword ptr 0         ; 5 byte nop to align label below
        align   16                      ; should be redundant
main_loop:
        mov     eax,dword ptr [ecx]     ; read 4 bytes

This version then reads up to 3 bytes past the end of the string, the 64-bit version up to 7, and the SIMD version up to 15 bytes. The version I'm currently using reads ahead 32 bytes:
Code: [Select]
@@:
movdqu  xmm1,[ecx+16] ; compare 32 byte
movdqu  xmm0,[ecx]
add ecx,32
pcmpeqb xmm0,xmm2
pcmpeqb xmm1,xmm2
pmovmskb edx,xmm0
pmovmskb eax,xmm1
shl eax,16
or eax,edx ; using 32 bit
jz @B

The only safe way is then a byte scan. In the MEM functions it is possible to avoid this issue since you have the actual size of the buffer, but in the STR functions you need to scan.

The benefit of using it:
Code: [Select]
 2415815 cycles - (  0) 0: crt_strlen
  851439 cycles - ( 58) 5: SSE

  330516 cycles - (  0) 0: crt_strchr
   43257 cycles - ( 85) 3: SSE

 1112057 cycles - (  0) 0: crt_strcat
  353820 cycles - ( 90) 3: SSE

  400591 cycles - (  0) 0: crt_strstr
   41779 cycles - (125) 3: SSE

There is however a potential problem with this method.

jj2007

  • Member
  • *****
  • Posts: 9795
  • Assembler is fun ;-)
    • MasmBasic
Re: Faster Memcopy ...
« Reply #40 on: March 19, 2015, 03:22:09 AM »

nidud

  • Member
  • *****
  • Posts: 1799
    • https://github.com/nidud/asmc
Re: Faster Memcopy ...
« Reply #41 on: March 19, 2015, 05:26:14 AM »
Yes, there you go.

I do control memory allocation where these functions are used, and there is a 16-byte header at the end of the heap. Reading 32 bytes may, however, overlap it.

Stack and Data shouldn't be a problem in this case, but then again, the source may be the clipboard or other externals so you don't really know.

Antariy

  • Member
  • ****
  • Posts: 551
Re: Faster Memcopy ...
« Reply #42 on: March 19, 2015, 06:21:32 AM »
You may safely read/write buffers in other than a byte-by-byte fashion.

There is no difference between the "MEM" and "STR" functions - just imagine your "MEM" function receives a wrong bytecount to scan/copy/etc.: if the buffer is actually smaller than the specified bytecount param and it lies near the end of a memory page, the "MEM" code will crash just like the "STR" code.

When you scan memory you may do this safely with larger-than-byte-sized reads/writes; you just have to pay attention to the details of how memory is allocated under x86 page-addressed protected mode (Windows, Linux, Mac and any other protected-mode operating system on x86).
Memory has a granularity of one page - 4096 bytes at least - so you may always assume that if you are given some address inside a memory page, the full page is available for read/write (depending on protection).
I.e., if you are given the address 23987234h - just for instance - you may take the base address of the page within which this address is located: just clear the 12 lowest bits of the address (AND EAX,-4096) and you have the page address, which for this example is 23987000h. Now you know the start of the page, and you know that the page size is 4096 bytes, so you know how far the buffer extends AT LEAST: 23987000h + 1000h = 23988000h. Within this range (from the given start of the buffer, 23987234h, to 23988000h) you may freely read data in any sized fashion - just take care that your reads do not go beyond that. I.e., you may freely start a single-reg XMM read (16 bytes) at 23987ff0h, but you obviously may not start the same read at 23987ff1h, because the final address of the data grabbed is 23987ff1h + 10h = 23988001h - beyond the page range - and you cannot be sure that the next page is reserved/committed in memory.
Just pay attention to the addresses at which you read/write and, depending on the size of the data grab, you have all the info needed to avoid reading beyond the buffer into inaccessible memory.
It is simpler to align the pointers to a power of two - then you may freely read and assume that the code will not crash - and it will not crash if you check every read against the algo's termination condition.
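A small sketch of that arithmetic (illustrative only - the label name is made up):
Code: [Select]
; EAX = some pointer, e.g. 23987234h
    mov     edx,eax
    and     edx,-4096               ; page base: clear the 12 lowest bits -> 23987000h
    add     edx,4096                ; end of the page -> 23988000h
    sub     edx,eax                 ; EDX = bytes guaranteed readable from EAX
    cmp     edx,16
    jb      crosses_page            ; fewer than 16 bytes left in the page
    movdqu  xmm0,[eax]              ; safe: [EAX..EAX+15] stays within one page
crosses_page: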

As an example, imagine that you use a byte-by-byte StrLen algo and an SSE algo which uses a single-reg XMM read. If you pass a properly formatted zero-terminated string to the algos, both will work correctly if they check the data read for the finish condition (the zero byte) before they read the next bytes. If both algos take a wrongly formatted string with no zero terminator right up to the very end of the page - and no accessible area after that - both will crash. No difference between the algos.

What is wrong with that example of SSE code which does reading with two XMM regs: that code does TWO reads and ONE check after both reads. So the code may actually read the proper string end (zero byte) in the first of the two reads, but it will, paying no attention to that, try to read the next 16 bytes - even though the proper zero-terminated string was already terminated in the previous 16 bytes (the first XMM read). So, when reading a PROPER zero-terminated string at a PROPER possible location near the end of the buffer with no accessible memory beyond it, this code will crash; that is why it is IMPROPER. We had a thread about a StrLen XMM algo on the old forum - I, Jochen and Lingo were the main posters AFAIK - and Lingo provided an algo of that kind, which did two XMM reads, combined the byte masks into one 32-bit GPR reg, and only after that verified the result. Of course that code was buggy, and it is funny and a shame (taking into account his megalomania) that Lingo doesn't know such simple and basic things as buffer/memory layouts. The loud cries of "but the algo is faster" are no excuse - the algo must be reliable first, and only then MAY one think about making it faster. So Jochen decided to use a one-read XMM algo in his lib, and that was the simplest and most reliable solution. It is probably possible to do a complex solution which reads more than one reg, but it would have to check every reg before it has the right to read the next reg from memory, so that complex code would probably not only not be faster than the one-reg-reading code, it would be much slower. So Lingo's code was just trash from the point of view of software production - a fancy thing but not a real working solution. Well, after such a long time I have described my point of view on the "Lingo coding style" and the real "value" of his over-valued "style", in proper English now :greensml:

Antariy

  • Member
  • ****
  • Posts: 551
Re: Faster Memcopy ...
« Reply #43 on: March 19, 2015, 06:41:04 AM »
The Intel "read ahead" is not actually a real read-ahead - just show us the full code and it is certain that they verify the finish condition (the zero byte) before they read the next data, which may be beyond the page range. Read-ahead in the full sense only occurs when the code reads non-dword-aligned strings, but taking into account that the code probably does the alignment first (which is not obvious from the small piece of code provided) and/or that every "advanced" compiler aligns strings on a dword boundary, this is not dangerous "read-aheading" at all - it has nothing in common with an algo which does two (or more) reads before it checks the termination condition. If an algo, using whatever technique, still takes care to stay within the page range, it will not crash. All other differences are details of implementation, not the idea itself.

nidud

  • Member
  • *****
  • Posts: 1799
    • https://github.com/nidud/asmc
Re: Faster Memcopy ...
« Reply #44 on: March 19, 2015, 09:15:28 AM »
There is no difference between the "MEM" and "STR" functions - just imagine your "MEM" function receives a wrong bytecount to scan/copy/etc.

None of these (or any other) functions are safe from that point of view.

The only valid test is then a zero at the end of the buffer. This means the pointer must first be byte-aligned at the top by testing each byte; the rest of the scan may then safely read aligned-size chunks.
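A minimal sketch of that approach (a hypothetical strlen, not the posted code): test byte by byte until the pointer is 16-byte aligned, then read aligned 16-byte blocks - an aligned 16-byte read can never cross a page boundary.
Code: [Select]
strlen_s proc string:dword
    mov     ecx,string
@@:
    test    cl,16-1                 ; aligned yet?
    jz      aligned
    cmp     byte ptr [ecx],0        ; byte scan up to the alignment boundary
    je      done
    inc     ecx
    jmp     @B
aligned:
    pxor    xmm2,xmm2
@@:
    movdqa  xmm0,[ecx]              ; aligned read stays inside the page
    pcmpeqb xmm0,xmm2
    pmovmskb edx,xmm0
    test    edx,edx
    jnz     found
    add     ecx,16
    jmp     @B
found:
    bsf     edx,edx                 ; offset of the zero byte within the block
    add     ecx,edx
done:
    mov     eax,ecx
    sub     eax,string              ; length = end - start
    ret
strlen_s endp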

Quote
As an example, imagine that you use a byte-by-byte StrLen algo and an SSE algo which uses a single-reg XMM read. If you pass a properly formatted zero-terminated string to the algos, both will work correctly if they check the data read for the finish condition (the zero byte) before they read the next bytes.

Given that the pointer points to the last byte, the size of the buffer is 1 byte, and in this case you read 16 bytes.

Quote
What is wrong with that example of SSE code which does reading with two XMM regs:

The same thing that is wrong with the 4-, 8-, and 16-byte overlapping reads: if the pointer is not aligned before the scan, it will fail regardless of the read size (16/32).