rrr,
Antariy,
Thanks very much for your comprehensive answer but I hate to make you type so much! No need to spell everything out in such detail. I see you have strong feelings about "protected programming" and Lingo so will leave those topics alone! As for "topic paster", it was just a joke, no worries.
Well, my feelings about those topics are just IMO. "Protected programming" is not a panacea, since the program after all runs on an imperfect OS, which in its turn has its own bugs and problems. As for Lingo - I do not have any feelings about that person at all; that is, after all, what he deserved in my opinion: the "name" not of a "great coder" but of an immature and uncultured person with an exaggerated opinion of himself.
As for "topic paster" - I just wanted to clear that up so you didn't think it was meant as a slighting word.
The important thing is the code. I understand your reasoning (of course) but it offends my aesthetic sensibility to see all those bytes wasted! So, considering your suggestions, I made some modifications, and tested the shorter algo against yours as well as I could. Bottom line, it seems the shorter algo is, for the most part, faster.
Well, those bytes may be "wasted" on some machines and "useful" on others - the only difference is which machine the code happens to run on. This is similar to the other code which was actively discussed here - the hex2dword algo. My initial code there was sloppy, but it implemented the idea of unified processing of the hex string - with no special handling of the letters - using one path, without checking and correcting the letters to some "hardcoded" case (upper or lower). It worked well, but in static tests (just like the ones we used with StrLen here) it was slow, and I was drawn into a discussion which took some days and many tries to make the algo run "faster" on modern CPUs in the "static" test. That was a hard thing to do, because none of the "optimizations" I made ran any faster on my own CPU, so I was optimizing it just "by touch" and by analysis.

In the end the algo became "the fastest" algo on every (!) CPU it was tested on there (it was one cycle faster than the previous "winner" on Core2+ CPUs, and ~2 times faster on my CPU - though it was also faster than the previous "winner" before the optimisations, so that part was done just for the "sake of the challenge" ::) ). But that cost a lot of time and tries with, actually, no reason, because dumb unrolled code will always be "faster". I was testing that here probably for the first time since registering on the forum, and after finding that unrolled algos will for sure be faster even in real-world conditions, I chose to use one instead of a short looped algo.
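Just to illustrate the idea, here is a generic sketch of the "one path, no case checks" principle (an illustration only - not my actual hex2dword code):

    ; convert one ASCII hex digit to its value 0-15, with no branch
    ; and no forcing of the letter case:
    movzx eax, byte ptr [esi]   ; '0'-'9', 'A'-'F' or 'a'-'f'
    mov edx,eax
    and eax,0Fh                 ; low nibble: 0-9 for digits, 1-6 for letters
    shr edx,6                   ; 0 for digits, 1 for letters of either case
    lea eax,[eax+edx*8]         ; add 8 if it is a letter...
    add eax,edx                 ; ...and 1 more: +9 total gives 10-15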
But why am I telling you this? Because the situation is very similar now. And because that algo of mine which was the "winner" among the looped algos in the hex2dword challenge also used a lot of those "wasted bytes" to do its task - and it did it. The algo was the fastest only when used in a "bloated" manner: if you make the instructions smaller and, instead of pre-checking the starting condition before the loop, just jump into the loop, the algo was not the "winner" - it was slower by a couple of cycles, and that mattered there, because the whole discussion was held in the "who is the fastest?" manner. So the code was probably 30% bigger than it had to be, just because of the "padding" long instructions.

So, if I want to get the most from some piece of code - especially from the short-cuts like the code entry and the prechecks before the inner loop - I do not hesitate to use longer instructions, because this may save some cycles when the code takes a short-cut, and the timings are small in any case with any other algo, so those saved cycles may be useful to "show" better timings ::) Well, yes, as you may notice, the timings have become an "idol" in these tests. People look at them, respect them, compare and align their code with them - so aligning code to the timings is probably not the best thing, because in the end it leads to patterned code design.

The interesting point is: my hex2dword looped code was not pattern-like in its idea, and though initially the idea was implemented "slowly", after the "contest" it was the fastest - though it was still the same code with the same idea - and, moreover, it works on any 32-bit CPU starting from the i386, while the previous "winner" was P6+ only code (due to CMOV). The better idea, which avoids conditional processing (jumps or conditional moves) altogether, simply beat the patterned idea both on speed and on the range of CPUs it can run on. But how much time did it take to crazily over-optimize the code, to make it longer and more bloated (still, we are talking about less than 80 bytes), to win the "contest"? It's a useless thing to do in an everyday manner - only if you feel that mood might you want to do it.

That's why I just used the longer instructions to align the loop, without too many other tests with other instruction lengths - because on every CPU it will run differently, a bit faster on one and a bit slower on another, and in the end it will run in real-world conditions where all algos go "head to head", even the crazily complex and sophisticated ones. I just wrote the code with a good idea in it - you may notice that my 16-byte alignment of the code entry is even smaller and simpler than Intel's in the Silvermont code. That was the first step of the algo, and it was simple and elegant; the next step was simple processing in the aligned inner loop, with aligned data. Though the loop itself may, in some opinions, be aligned not very elegantly with all those bytes, the data alignment is elegant, so the code is elegant in itself. I agree that it may not look "aesthetic", but look at it as simple code which does its job well. Yes, it can be modified and tweaked in a bunch of ways to run better on some specific CPU, but, as I said, it was written to perform "well" on every CPU - not to perform "the best" on some family or even model of CPU. And it's simple; yes, maybe a bit "sloppy"; yes, it could be made smaller/faster, but that was not the aim, for the reasons written in the "Research" thread.
First, to avoid the register stall with these instructions - mov edx, [esp+4] / mov eax, edx - I just moved the mov down 2 instructions.
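To show what I mean, the two orderings (the instructions are from my routine below):

    ; back-to-back dependency - the copy reads edx right after it is written:
    mov edx,[esp+4]
    mov eax,edx

    ; reordered - two unrelated instructions now separate the load from the copy:
    mov eax,[esp+4]
    add esp,-14h
    movups [esp],xmm7
    mov edx,eax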
I didn't know xorps could save a byte over pxor - thank you. I used those 2 bytes to put the jump back in; it was necessary for the AMD, which is horribly slow on bsf. I jump into the loop, skipping the "add edx, 16", so the logic remains valid.
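For reference, the encodings show where the saved byte comes from:

    pxor xmm7,xmm7      ; 66 0F EF FF - 4 bytes (needs the 66h prefix)
    xorps xmm7,xmm7     ; 0F 57 FF    - 3 bytes, zeroes the register just the same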
Still preserving XMM and ECX.
Yes, I know the purpose of the db instructions is to pad 9 extra bytes to align the beginning of the loop, and I know that's better than nop's or lea eax, [eax], which waste time as well as space. But surely it's even better not to waste the bytes at all, as long as the routine is still as fast or faster.
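For completeness, the trick the db padding relies on: the same instruction can be encoded one byte longer with a redundant zero displacement, so the pad costs no extra work at run time (sketched with a movaps like the ones in these routines):

    movaps xmm6,[edx]   ; 0F 28 32    - the natural 3-byte form
    db 0Fh,28h,72h,0    ; 0F 28 72 00 - movaps xmm6,[edx+0], 4 bytes,
                        ; identical behavior, one "free" padding byte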
CPU branch prediction - you know, the behavior seems to change with every new release from Intel or AMD. Often we optimize a routine on our own machines, but on newer/older machines it may behave differently. I routinely optimize on my Intel i5, then try it on an AMD A6 and a Pentium 4; often the fastest on one machine is the slowest on another. So I'm leery of artificial coding techniques for speed.
Now, I had thought you were right: the pointer advance should go AFTER the data fetch. However, on the Intel my loop was faster, and on the AMD a little slower - apparently the order of the two instructions makes little difference. Anyway, there's a very simple correction available if needed: just "pcmpeqb xmm7, [edx+10h]" first, then "add edx, 16" - it uses one more byte.
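In loop form the two orderings look like this (the second variant is the one-byte-longer correction):

    ; advance first, then fetch (what I have now):
@@:
    add edx,16
    pcmpeqb xmm7,[edx]
    ...
    jz @B

    ; fetch first, then advance (the disp8 makes it one byte longer):
@@:
    pcmpeqb xmm7,[edx+10h]
    add edx,16
    ...
    jz @B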
By far, the hardest part of the whole exercise is not writing the routine, but getting semi-reliable timing! First, I used your suggestion and put all the algos in separate segments. Then I doubled both of them: put yours first and last, mine 2nd and 3rd. It appears the "last" position is slightly favorable. Then, in desperation, I copied/pasted the timing runs a number of times, using just the final numbers.
Well, that will probably avoid the stall, yes - depending on how many execution units the CPU core has and how far the independence of the following instructions goes. I.e., by moving the mov down (the pun), you moved it closer to the next use of the reg you moved into, so, removing the stall above, you may be provoking one below - since the entry code is very compact and simply has no additional space to spread the instructions over.
Also, you may use movaps/movups instead of movdqa/movdqu - the same thing, 1 byte shorter. Just a note, because the code already uses movups.
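Just the encodings, for reference:

    movdqu xmm7,[esp]   ; F3 0F 6F 3C 24  - 5 bytes
    movups xmm7,[esp]   ; 0F 10 3C 24     - 4 bytes, the same unaligned load
    movdqa xmm6,[edx]   ; 66 0F 6F 32     - 4 bytes
    movaps xmm6,[edx]   ; 0F 28 32        - 3 bytes, the same aligned load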
Yes, I agree that if the routine is smaller and stably faster than the previous version on every CPU, then it's better to use the shorter version. As I said above, I just wasn't playing with that much.
CPU branches were always statically predicted "not taken" for a forward jump and "taken" for a backward jump, and this logic will probably never change in any CPU, because it is the simple logic of loops: you do the work, then you check the condition, then you jump back to the start of the loop. This is more efficient for speed, though less efficient for code size - you first need to check the condition before entering the loop - but that was probably accepted as the price of a speedier exit from loops and faster loop execution. The CPU "sees" that the jump (especially a conditional one) goes backward, so it "knows" this is a loop, and that the jump is almost always more likely to be taken than not - loops usually execute many times, maybe thousands or millions of iterations, so if the CPU mispredicted the loop jump on every iteration, it would cost a great many cycles. So the CPU treats a backward conditional jump as an "almost unconditional" jump, and only one misprediction occurs when the loop ends - one small stall after thousands of properly predicted jumps (the pun again). A backward conditional jump, when taken as the logic expects, costs almost nothing in the timings; only on the final iteration, when it is not taken, does it cost extra clocks for the misprediction.

The same holds for forward jumps: the CPU assumes they will not be taken, so if a forward jump actually is taken, it costs some clocks as a misprediction; if it is not taken, the Jcc instruction costs almost nothing to execute. And you may notice this technique is widely used - for instance, in the highly unrolled Intel Silvermont algo, where the code has many forward conditional jumps. Those jumps cost almost nothing, and the one that finally fires costs some cycles, but after that jump the code finishes (or nearly finishes) its work, so it doesn't matter for the deal: by executing several nearly free untaken jumps, the code has already "prepaid" the stall of the single taken one, and that one misprediction is nothing to care about.
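Schematically, with jumps like the ones in the routines here (the comments describe that static "backward taken / forward not taken" heuristic):

    ...
    jnz @ret                ; forward Jcc: assumed not taken - costs almost
                            ; nothing unless the early-out actually fires
@@:
    add edx,16
    pcmpeqb xmm7,[edx]
    pmovmskb eax,xmm7
    test eax,eax
    jz @B                   ; backward Jcc: assumed taken - almost free on
                            ; every iteration, mispredicted only once, at
                            ; the end of the loop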
I think that pointer advancement before or after the fetch is a very disputable thing, due to every particular CPU family's and even model's implementation - on some there is no (at least noticeable) slowdown when you put the advance before the fetch, on others there is a very noticeable slowdown. So it's probably a matter of choice.
Yes, it's hard to get reliable timings - that's why we use the testbed in the lab with millions of repeats, to get more or less stable timings.
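For what it's worth, a minimal sketch of that kind of measurement loop (the repeat count, the labels and the teststring are just placeholders for illustration, not the actual testbed macros):

    mov esi, 1000000            ; big repeat count, to smooth out the noise
    rdtsc
    mov edi, eax                ; low dword of the starting timestamp
@@:
    push offset teststring
    call algo2                  ; ret 4 inside cleans the stack
    dec esi
    jnz @B
    rdtsc
    sub eax, edi                ; elapsed ticks (low dword; assumes no wrap),
                                ; divide by the repeat count for cycles/call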
Your routine was a bit better on Intel ALIGNED 271; also slightly better on the longer strings on AMD. Everywhere else, the shorter routine is better. It's dramatically better on AMD short strings, who knows why; and better on Intel long strings. BTW all these numbers came out pretty much like this on multiple tests; I'm only showing one typical run from each machine.
Here is my modified routine:
; «««««««««««««««««««««««««««««««««««««««««««««««««««
algo2 proc SrcStr:PTR BYTE
; «««««««««««««««««««««««««««««««««««««««««««««««««««
; rrr modified version of Antariy algo - number 2
    mov eax,[esp+4]         ; string address
    add esp,-14h            ; room for xmm7 (16 bytes) + ecx (4 bytes)
    movups [esp],xmm7       ; preserve xmm7
    mov edx,eax             ; copy of the pointer (moved down 2 instructions
                            ; to avoid the stall)
    mov [esp+16],ecx        ; preserve ecx
    and edx,-10h            ; edx = aligned block containing the string start
    xorps xmm7,xmm7         ; zero pattern to compare against
    mov ecx,eax
    and ecx,0fh             ; ecx = misalignment of the start, 0..15
    jz intoloop             ; start is aligned - go straight into the loop
    pcmpeqb xmm7,[edx]      ; check the first (partial) aligned block
    pmovmskb eax,xmm7       ; bit mask of the zero bytes
    shr eax,cl              ; drop the bits before the string start
    bsf eax,eax             ; index of the first zero byte; ZF=1 if none
    jnz @ret                ; terminator found in the head block
    xorps xmm7,xmm7         ; re-zero xmm7 (pcmpeqb destroyed it)
@@: ; naturally aligned to 16
    add edx,16
intoloop:
    pcmpeqb xmm7,[edx]
    pmovmskb eax,xmm7
    test eax,eax
    jz @B                   ; no zero byte in this block - keep looping
    bsf eax,eax             ; byte offset of the terminator within the block
    sub edx,[esp+4+16+4]    ; block address minus the original pointer
                            ; (the arg sits at esp+24 after the add esp,-14h)
    add eax,edx             ; length = block offset + offset within block
@ret:
    movups xmm7,[esp]       ; restore xmm7
    mov ecx,[esp+16]        ; restore ecx
    add esp,14h
    ret 4
algo2 endp
end_algo2:
Bottom line, I can't believe it's right to waste those 18 bytes.
Well, as I said, if you can make it smaller and you see that it's reliably faster, then that is the proper thing to do. I did not argue with that, and I'm not trying to prove that all those bytes MUST be there.

Also, you may avoid saving XMM and ECX; this will save you 3-5 cycles too, which will be especially important with short strings - the case where the speed of an algo is a matter of the speed of its entry and exit, not of the inner loop. Here is another tweak of the algo - I did not test it, though, but in the real world it should perform pretty well. There are some rearrangements in the data fetch and the entry code - since the XMM and ECX saving is avoided, there is no need to reuse the same XMM reg.
algo2 proc SrcStr:PTR BYTE
; «««««««««««««««««««««««««««««««««««««««««««««««««««
; rrr modified version of Antariy algo and then again modified by Antariy :) The chain of modifications :)
    mov edx,[esp+4]         ; string address
    xorps xmm7,xmm7         ; zero pattern (also separates the load from the copy)
    mov ecx,edx
    and edx,-10h            ; edx = aligned block containing the string start
    sub ecx,edx             ; ecx = misalignment of the start, 0..15
    ;movaps xmm6,[edx]
    db 0Fh,28h,72h,0        ; the movaps above, encoded with a redundant
                            ; disp8 = 0: one byte longer, to align the loop
    pcmpeqb xmm6,xmm7       ; check the head block
    pmovmskb eax,xmm6       ; bit mask of the zero bytes
    shr eax,cl              ; drop the bits before the string start
    test eax,eax            ; shr with cl=0 leaves the flags unchanged - test explicitly
    jnz @ret                ; terminator already in the head block
@@:                         ; 16-aligned: the entry above is exactly 32 bytes
    movaps xmm6,[edx+16]    ; the head block is done - fetch the next one
    pcmpeqb xmm6,xmm7
    add edx,16
    pmovmskb eax,xmm6
    test eax,eax
    jz @B
    sub edx,[esp+4]         ; edx = matching block minus the original pointer
    bsf ax,ax               ; index of the zero byte within that block
    add eax,edx             ; length = block offset + index
    ret 4
@ret:
    bsf ax,ax               ; index in the shifted head mask = the length
    ret 4
algo2 endp
It comes out at around 70 bytes, and the loop should still be aligned at 16, with only one byte of padding, hidden in a longer-than-required instruction encoding.
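A quick byte count of the entry (encodings from the standard IA-32 opcode tables) shows how the loop label lands:

    mov edx,[esp+4]     ; 8B 54 24 04    4 bytes
    xorps xmm7,xmm7     ; 0F 57 FF       3
    mov ecx,edx         ; 8B CA          2
    and edx,-10h        ; 83 E2 F0       3
    sub ecx,edx         ; 2B CA          2
    movaps xmm6,[edx+0] ; 0F 28 72 00    4  (3 + the 1 padding byte)
    pcmpeqb xmm6,xmm7   ; 66 0F 74 F7    4
    pmovmskb eax,xmm6   ; 66 0F D7 C6    4
    shr eax,cl          ; D3 E8          2
    test eax,eax        ; 85 C0          2
    jnz @ret            ; 75 xx          2
                        ; total = 32, so the @@ label is 16-aligned
                        ; whenever the proc itself starts on a 16-byte boundary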

Finally, of course I can write Russian! I'll do it again: "Russian" - very easy. OTOH I can't write in Cyrillic to save my life :)
