OK. I added an old algo of mine above, a bit bloated but the timings look competitive ;-)
EDIT: The attached new version yields some surprising results... ::)
Inter alia, in round 2 ("contains TXFS_") all algos return the correct result, but InString and the two Boyer-Moore variants need a lot of time for doing so... roughly a factor of 25 for Masm32 InString vs CRT strstr (about 9125 vs 372 kCycles)!
I have no clue why that is the case. Any ideas, or different results?
The source string shown as [comment * -=...] is Windows.inc and WinExtra.inc combined, more than 2 MB, to mitigate cache effects.
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
++++++++++++++++++++
Testing if [comment * -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=...] contains TXFS_RM_FLAG_DO_NOT_RESET_RM_AT_NEXT_START
31716 kCycles for 3 * MB Instr_
46678 kCycles for 3 * InString
45015 kCycles for 3 * crt_strstr
15563 kCycles for 3 * BMBinSearch
15980 kCycles for 3 * BMHBinsearch
12553 kCycles for 3 * InstrJJ
31322 kCycles for 3 * MB Instr_
46642 kCycles for 3 * InString
45696 kCycles for 3 * crt_strstr
15170 kCycles for 3 * BMBinSearch
15942 kCycles for 3 * BMHBinsearch
12535 kCycles for 3 * InstrJJ
31228 kCycles for 3 * MB Instr_
46267 kCycles for 3 * InString
45374 kCycles for 3 * crt_strstr
15143 kCycles for 3 * BMBinSearch
16418 kCycles for 3 * BMHBinsearch
12569 kCycles for 3 * InstrJJ
2026935 = eax MB Instr_
2026935 = eax InString
2026935 = eax crt_strstr
2026935 = eax BMBinSearch
2026935 = eax BMHBinsearch
2026935 = eax InstrJJ
Testing if [comment * -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=...] contains TXFS_
251 kCycles for 3 * MB Instr_
9125 kCycles for 3 * InString
372 kCycles for 3 * crt_strstr
6391 kCycles for 3 * BMBinSearch
6416 kCycles for 3 * BMHBinsearch
74 kCycles for 3 * InstrJJ
251 kCycles for 3 * MB Instr_
8678 kCycles for 3 * InString
776 kCycles for 3 * crt_strstr
6404 kCycles for 3 * BMBinSearch
6408 kCycles for 3 * BMHBinsearch
74 kCycles for 3 * InstrJJ
252 kCycles for 3 * MB Instr_
8648 kCycles for 3 * InString
373 kCycles for 3 * crt_strstr
6412 kCycles for 3 * BMBinSearch
6483 kCycles for 3 * BMHBinsearch
76 kCycles for 3 * InstrJJ
18057 = eax MB Instr_
18057 = eax InString
18057 = eax crt_strstr
18057 = eax BMBinSearch
18057 = eax BMHBinsearch
18057 = eax InstrJJ
Testing if [This is a simple string which has toward...] contains Dupli
3497 kCycles for 10000 * MB Instr_
4322 kCycles for 10000 * InString
3980 kCycles for 10000 * crt_strstr
4953 kCycles for 10000 * BMBinSearch
5665 kCycles for 10000 * BMHBinsearch
1030 kCycles for 10000 * InstrJJ
3500 kCycles for 10000 * MB Instr_
4328 kCycles for 10000 * InString
3986 kCycles for 10000 * crt_strstr
5050 kCycles for 10000 * BMBinSearch
5379 kCycles for 10000 * BMHBinsearch
1033 kCycles for 10000 * InstrJJ
3501 kCycles for 10000 * MB Instr_
4323 kCycles for 10000 * InString
3980 kCycles for 10000 * crt_strstr
4949 kCycles for 10000 * BMBinSearch
5529 kCycles for 10000 * BMHBinsearch
1036 kCycles for 10000 * InstrJJ
75 = eax MB Instr_
75 = eax InString
75 = eax crt_strstr
0 = eax BMBinSearch
75 = eax BMHBinsearch
75 = eax InstrJJ
Same phenomenon on my Celeron M:
Testing if [comment * -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=...] contains TXFS_
245 kCycles for 3 * MB Instr_
8814 kCycles for 3 * InString
237 kCycles for 3 * crt_strstr
4764 kCycles for 3 * BMBinSearch
4783 kCycles for 3 * BMHBinsearch
58 kCycles for 3 * InstrJJ
EDIT(2): Here is the culprit for InString (\Masm32\m32lib\instring.asm, line 45):
if 0
mov sLen, Len(lpSource) ; 4830 cycles for the TXFS_ case (MasmBasic Len is a bit faster than StrLen...)
else
invoke StrLen,lpSource ; 8690 cycles for the TXFS_ case
mov sLen, eax ; source length
endif
The other algos don't measure the source length beforehand, and are therefore much faster when the source string is long and the match comes early.
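To illustrate the effect of measuring the source up front, here is a minimal C sketch (not the Masm32 or MasmBasic code). The names find_len_first and find_scan_only, the 1-based return value and the "touched" byte counter are all assumptions made for this example. Both routines are naive searches; the only difference is that the first one runs strlen over the whole source before it starts looking.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical variant A: measures the whole source first, similar in spirit
   to the StrLen call at the top of InString. Returns the 1-based match
   position, or 0 if not found. *touched counts source bytes visited:
   the full length scan plus one per search position tried. */
size_t find_len_first(const char *src, const char *pat, size_t *touched)
{
    size_t slen = strlen(src);          /* walks the entire source up front */
    size_t plen = strlen(pat);
    *touched = slen;
    if (plen == 0 || plen > slen) return 0;
    for (size_t i = 0; i + plen <= slen; i++) {
        *touched += 1;
        if (memcmp(src + i, pat, plen) == 0) return i + 1;
    }
    return 0;
}

/* Hypothetical variant B: no length scan; walks forward until a match or the
   terminating zero. strncmp stops at the zero byte, so it never reads past
   the end of the source. */
size_t find_scan_only(const char *src, const char *pat, size_t *touched)
{
    size_t plen = strlen(pat);
    *touched = 0;
    if (plen == 0) return 0;
    for (size_t i = 0; src[i] != '\0'; i++) {
        *touched += 1;
        if (strncmp(src + i, pat, plen) == 0) return i + 1;
    }
    return 0;
}
```

With a 1000-byte source and the pattern at offset 10, both return position 11, but variant A visits 1011 bytes and variant B only 11. On a 2 MB source with a match near the start, that up-front scan would dominate the run time in the same way.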