Antariy,
Thanks very much for your comprehensive answer but I hate to make you type so much! No need to spell everything out in such detail. I see you have strong feelings about "protected programming" and Lingo so will leave those topics alone! As for "topic paster", it was just a joke, no worries.
The important thing is the code. I understand your reasoning (of course) but it offends my aesthetic sensibility to see all those bytes wasted! So, considering your suggestions, I made some modifications, and tested the shorter algo against yours as well as I could. Bottom line, it seems the shorter algo is, for the most part, faster.
First, to avoid the register stall between these two instructions, "mov edx, [esp+4]" and "mov eax, edx", I just moved the second mov down 2 instructions.
I didn't know xorps could save a byte over pxor, thank you. I used those 2 bytes to put the jump back in. It was necessary for the AMD, which is horribly slow on bsf. I jump into the loop, skipping the "add edx, 16", so the logic remains valid.
Still preserving XMM7 and ECX.
Yes, I know the purpose of the db instructions is to pad 9 extra bytes to align the beginning of the loop. I know that's better than nops or lea eax, [eax], which waste time as well as space. But surely it's even better not to waste the bytes, as long as the routine is still as fast or faster.
CPU branch prediction - you know, behavior seems to change with every new release from Intel or AMD. Often, we optimize a routine on our own machines, but on newer/older machines it may behave differently. I routinely optimize on my Intel i5, then try it on AMD A6 and Pentium 4; often the fastest on one machine may be the slowest on another. So I'm leery of artificial coding techniques for speed.
Now, I had thought you were right: the pointer advance should go AFTER the data fetch. However, on the Intel my loop was faster. On AMD, a little slower. Apparently the order of the two instructions makes little difference. Anyway, there's a very simple correction available, if needed. Just "pcmpeqb xmm7, [edx+10h]" first, then "add edx, 16" - uses one more byte.
By far, the hardest part of the whole exercise is not writing the routine, but getting semi-reliable timing! First, I used your suggestion and put all algos in separate segments. Then, I doubled both of them; put yours first and last, mine 2nd and 3rd. It appears the "last" position is slightly favorable. Then, in desperation, I copied/pasted the timing runs a number of times, using just the final numbers.
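For comparison, here is a rough C sketch of one common way to tame position effects: rotate the order of the routines on every round and keep each routine's best (minimum) time. This is NOT how the numbers in this post were produced; the names bench_min, time_once and strlen_fn are made up for illustration, and clock() is only a coarse stand-in for a proper cycle counter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical signature shared by every routine under test. */
typedef size_t (*strlen_fn)(const char *);

/* Time `calls` invocations of fn on s; returns elapsed clock() ticks.
   The volatile sink keeps the result of each call alive. */
static clock_t time_once(strlen_fn fn, const char *s, long calls)
{
    volatile size_t sink = 0;
    clock_t t0 = clock();
    for (long i = 0; i < calls; i++)
        sink += fn(s);
    (void)sink;
    return clock() - t0;
}

/* Run `rounds` rounds over n routines. The starting routine rotates
   each round so that no routine always sits in the "last" slot.
   best[i] receives the smallest time seen for fns[i]. */
void bench_min(const strlen_fn *fns, size_t n, const char *s,
               long calls, int rounds, clock_t *best)
{
    assert(n > 0 && rounds > 0);
    for (int r = 0; r < rounds; r++) {
        for (size_t k = 0; k < n; k++) {
            size_t i = (k + (size_t)r) % n;   /* rotated position */
            clock_t t = time_once(fns[i], s, calls);
            if (r == 0 || t < best[i])
                best[i] = t;
        }
    }
}
```

Taking the minimum assumes that noise only ever adds time, which is usually a fair assumption for short, cache-resident loops like these.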
Here are the resulting runs:
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz (SSE4)
BUFFER ALIGNED
thecount  StrLen_orig(104)  StrLen_rrr2(86)  StrLen_rrr2(86)  StrLen_orig(104)
       8               906              854              852               905
      31              1147             1020             1019              1074
     271              4024             4142             4020              3924
    2014             26115            24593            24595             25818
   62159            757816           747523           748235            757502
BUFFER MISALIGNED src 11
thecount  StrLen_orig(104)  StrLen_rrr2(86)  StrLen_rrr2(86)  StrLen_orig(104)
       8              1184             1157             1132              1241
      31              1399             1391             1395              1425
     271              4508             4368             4432              4522
    2014             25622            25036            25018             25604
   62159            757612           747909           746568            757986
AMD A6-6310 APU with AMD Radeon R4 Graphics (SSE4)
BUFFER ALIGNED
thecount  StrLen_orig(104)  StrLen_rrr2(86)  StrLen_rrr2(86)  StrLen_orig(104)
       8              2124             1551             1551              2319
      31              2526             1944             1926              2494
     271              6220             5679             5676              6416
    2014             29950            30171            30869             30104
   62159            872547           886154           887221            871530
BUFFER MISALIGNED src 11
thecount  StrLen_orig(104)  StrLen_rrr2(86)  StrLen_rrr2(86)  StrLen_orig(104)
       8              2776             2320             2319              2622
      31              2795             2420             2418              2797
     271              6016             5461             5476              6055
    2014             30809            31229            31080             30842
   62159            887148           887519           888207            889350
Your routine was a bit better on Intel ALIGNED 271; also slightly better on the longer strings on AMD. Everywhere else, the shorter routine is better. It's dramatically better on AMD short strings, who knows why; and better on Intel long strings. BTW all these numbers came out pretty much like this on multiple tests; I'm only showing one typical run from each machine.
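To put rough numbers on "a bit better" and "dramatically better", the two runs of each routine in a row can be averaged and compared. A small C sketch of that arithmetic follows; pct_faster and avg2 are made-up helper names, not part of the test program.

```c
#include <assert.h>

/* Average the two runs of the same routine from one table row. */
double avg2(double a, double b)
{
    return (a + b) / 2.0;
}

/* Percent by which the shorter routine's timing is below the original's.
   Positive means StrLen_rrr2 wins, negative means StrLen_orig wins. */
double pct_faster(double orig, double rrr)
{
    assert(orig > 0.0);
    return (orig - rrr) / orig * 100.0;
}
```

For the AMD aligned row with thecount 8, pct_faster(avg2(2124, 2319), avg2(1551, 1551)) comes out at roughly 30. For the Intel aligned row with thecount 62159, pct_faster(avg2(757816, 757502), avg2(747523, 748235)) is roughly 1.3.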
Here is my modified routine:
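; «««««««««««««««««««««««««««««««««««««««««««««««««««
algo2 proc SrcStr:PTR BYTE
; «««««««««««««««««««««««««««««««««««««««««««««««««««
; rrr modified version of Antariy algo - number 2
    mov eax,[esp+4]         ; eax = SrcStr
    add esp,-14h            ; reserve 20 bytes: 16 for xmm7, 4 for ecx
    movups [esp],xmm7       ; save xmm7
    mov edx,eax
    mov [esp+16],ecx        ; save ecx
    and edx,-10h            ; edx = SrcStr rounded down to a multiple of 16
    xorps xmm7,xmm7         ; xmm7 = 0
    mov ecx,eax
    and ecx,0fh             ; ecx = misalignment (0..15)
    jz intoloop             ; already aligned: skip the head check
    pcmpeqb xmm7,[edx]      ; flag zero bytes in the first aligned block
    pmovmskb eax,xmm7
    shr eax,cl              ; drop flags for bytes before SrcStr
    bsf eax,eax             ; ZF=1 if no zero byte was flagged
    jnz @ret                ; found: eax = length
    xorps xmm7,xmm7         ; clear flags left by bytes before SrcStr
@@:                         ; naturally aligned to 16
    add edx,16
intoloop:
    pcmpeqb xmm7,[edx]
    pmovmskb eax,xmm7
    test eax,eax
    jz @B                   ; no zero byte: xmm7 is still 0, try next block
    bsf eax,eax             ; offset of the terminator within the block
    sub edx,[esp+4+16+4]    ; edx = block address - SrcStr
    add eax,edx             ; eax = length
@ret:
    movups xmm7,[esp]       ; restore xmm7
    mov ecx,[esp+16]        ; restore ecx
    add esp,14h
    ret 4
algo2 endp
end_algo2: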
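For anyone who finds the flow easier to follow in C, here is a rough sketch of the same logic with SSE2 intrinsics. It is only an illustration, not a replacement for the asm: the name strlen_sse2_sketch is made up, __builtin_ctz assumes GCC or Clang, and a 32-bit build may need -msse2. Like the asm, it reads whole aligned 16-byte blocks, so it touches bytes before the start and after the terminator; an aligned block never crosses a page, but strict C and memory sanitizers do not approve.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C rendering of the algo2 control flow. */
size_t strlen_sse2_sketch(const char *s)
{
    assert(s != NULL);
    const __m128i zero = _mm_setzero_si128();
    uintptr_t addr = (uintptr_t)s;
    unsigned misalign = (unsigned)(addr & 15);              /* and ecx,0fh  */
    const char *p = (const char *)(addr & ~(uintptr_t)15);  /* and edx,-10h */
    unsigned mask;

    if (misalign != 0) {
        /* Head block: compare all 16 bytes, then shift out the flags
           of the bytes that lie before the string (shr eax,cl). */
        mask = (unsigned)_mm_movemask_epi8(
                   _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
        mask >>= misalign;
        if (mask != 0)
            return (size_t)__builtin_ctz(mask);             /* bsf eax,eax  */
        p += 16;
    }
    /* Aligned strings arrive here directly, like "jz intoloop". */
    for (;;) {
        mask = (unsigned)_mm_movemask_epi8(
                   _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
        if (mask != 0)
            return (size_t)(p - s) + (size_t)__builtin_ctz(mask);
        p += 16;
    }
}
```

The C version needs no second xorps because each compare writes a fresh result; the asm reuses xmm7 as both the zero constant and the destination, which is why it has to be cleared again after the head block.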
Bottom line, I can't believe it's right to waste those 18 bytes.
Finally, of course I can write Russian! I'll do it again: "Russian". - very easy. OTOH I can't write in Cyrillic to save my life :)
Zip has "testStrLen.asm" test program.