Hi rrr,
I'm not ignoring your messages, I was just too tired to answer thoroughly.
One note for now: yes, AMDs look very slow with BSF. That instruction is not executed in the inner loop, so it has a noticeable influence only with short strings, but you can speed it up on AMD simply by using BSF AX,AX instead of BSF EAX,EAX. On Intel it makes little difference (it's a bit slower), but on AMD it makes sense. With the one-register algo, which produces a 16-bit mask, the 16-bit BSF is the right choice on AMD.
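To make the operand-size point concrete, here is a minimal sketch of the exit path of a one-register loop. The register choice and the label name are assumptions for illustration, not taken from your code. PMOVMSKB zero-extends its 16-bit mask into EAX, so the 16-bit BSF still leaves a usable index in the whole of EAX:

    pmovmskb eax,xmm0       ; 16-bit mask of zero bytes, bits 16..31 of EAX are 0
    test eax,eax
    jz next_block           ; hypothetical label: no terminator in this block
    bsf ax,ax               ; 16-bit form, only AX is written, upper half of EAX stays 0
                            ; EAX = index of the first zero byte (0..15)

The only cost is the operand-size prefix (66h) on the 16-bit form, one extra byte outside the inner loop.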
nidud, here is a tweak of your algo which, in this form, comes close on my machine to the algo posted in the "Research" thread:
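; Assumes no MASM-generated stack frame (OPTION PROLOGUE:NONE), since the argument is read via [esp+4].
strlen proc STDCALL lpszStr:DWORD
    mov ecx,[esp+4]         ; ECX = string pointer
    xorps xmm2,xmm2         ; XMM2 = 0, compared against every byte
    mov eax,ecx
    and ecx,32-1            ; ECX = offset of the string inside its 32-byte block
    and eax,-32             ; EAX = pointer aligned down to 32 bytes
    movdqa xmm0,[eax]
    movdqa xmm1,[eax+16]
    or edx,-1
    shl edx,cl              ; EDX = mask that clears the bits for bytes before the string start
    pcmpeqb xmm0,xmm2
    pmovmskb ecx,xmm0       ; low 16 bytes: bit set where a byte is zero
    and ecx,edx
    jnz done
    pcmpeqb xmm1,xmm2
    pmovmskb ecx,xmm1       ; high 16 bytes
    shl ecx,16              ; move them to bits 16..31
    and ecx,edx
    jnz done
loop_32:
    lea eax,[eax+32]        ; next 32-byte block
    movdqa xmm0,[eax]
    movdqa xmm1,[eax+16]
    pcmpeqb xmm0,xmm2
    pcmpeqb xmm1,xmm2
    pmovmskb edx,xmm0
    pmovmskb ecx,xmm1
    add edx,ecx             ; ZF = 1 only if both masks are zero
    jbe loop_32             ; the add cannot carry here, so JBE acts as JZ
    shl ecx,16              ; rebuild the 32-bit mask: high half from XMM1
    pmovmskb edx,xmm0       ; low half from XMM0 again
    add ecx,edx
done:
    bsf ecx,ecx             ; index of the first zero byte within the block
    lea eax,[ecx+eax]       ; address of the terminator
    sub eax,[esp+4]         ; length = terminator address - string start
    ret 4
strlen endp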
Actually, I think that if you use an entry like the one in my StrLen (just shift the overhead bits out to the right), this inner loop (strange, but the movdqa/pcmpeqb sequence seems faster than a direct pcmpeqb on memory), and the loop-exit check I suggested, this code will probably be one of the best solutions for StrLen on 32-bit machines. Intel's crazily unrolled solution gives no gain in the real world.
Probably rrr's unrolling is the maximum that makes sense, but it's strange to look at algos that are 600+ or even 800+ bytes long just to check a string's length on, after all, a real-world machine in a real-world app.
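For reference, a shift-right entry could look roughly like the sketch below. This is written from scratch for illustration and is not a copy of my StrLen. The name StrLenShr and the labels are made up, and the plain 16-byte loop is only a stand-in for the 32-byte inner loop above. Because the mask is shifted right by the misalignment, bit 0 corresponds to the first byte of the string, so BSF gives the length directly for short strings:

    OPTION PROLOGUE:NONE        ; no stack frame, the argument is read via [esp+4]
    OPTION EPILOGUE:NONE
    StrLenShr proc STDCALL lpszStr:DWORD
        mov ecx,[esp+4]         ; ECX = string pointer
        pxor xmm2,xmm2          ; XMM2 = 0
        mov eax,ecx
        and eax,-16             ; EAX = pointer aligned down to 16 bytes
        and ecx,16-1            ; ECX = number of bytes before the string start
        movdqa xmm0,[eax]
        pcmpeqb xmm0,xmm2
        pmovmskb edx,xmm0
        shr edx,cl              ; shift out the bits for the bytes before the string
        test edx,edx            ; SHR leaves the flags untouched when CL = 0
        jnz short_done
    scan_16:
        add eax,16              ; next aligned 16-byte block
        movdqa xmm0,[eax]
        pcmpeqb xmm0,xmm2
        pmovmskb edx,xmm0
        test edx,edx
        jz scan_16
        bsf edx,edx             ; index of the terminator inside the block
        add eax,edx
        sub eax,[esp+4]         ; length = terminator address - string start
        ret 4
    short_done:
        bsf eax,edx             ; bit 0 = first string byte, so this is the length
        ret 4
    StrLenShr endp

The short-string exit then needs no pointer arithmetic at all, which is where the entry matters most.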