Author Topic: Research on the algos design  (Read 11507 times)

nidud

  • Member
  • *****
  • Posts: 1408
    • https://github.com/nidud/asmc
Re: Research on the algos design
« Reply #45 on: March 29, 2015, 06:58:20 AM »
If you read this thread, you saw the written point which claims that, in real-world conditions, the data used by the algos which fall under point 3) of this research's conclusion is never in the cache.
Quote
3) The algos which do not do more than one iteration over all the data passed to them, taking into account what is said in the two points above,
Quote
1) In the real world the data passed to the algos is almost always not in the cache, especially if it's data which doesn't get touched very frequently (many times per second).
2) The data may not even be in RAM but in the paging file, which makes access to the data even slower.

This I have already answered, so it’s difficult to understand what you really mean here. Given there is no direct physical connection between the CPU and memory, all data will be loaded into the cache. Using large buffers in the hope of flushing it may not work out as you think.

Since you can’t bypass the cache, my argument was to play along with it. If you are paranoid about the same string being in the cache on repeated calls, you may try to flip pointers:
Code: [Select]
A db "string A",0
db 32/64/128 dup(0)
B db "string B",0

However, if you look into your own test you will see that even the test macro touches memory during each iteration, so my guess is that the mantra of "million calls with the same…" may not be that much of a problem.

Nevertheless, I was not going to make any arguments about how the cache works; my argument was against your claim that the tests done in the forum used "million calls with the same…", which is simply not true.

Quote
So, "shortcuts" in the CPU circuit which do non-temporal direct move not messing up the cache is the reason why movsd is faster with higher byte counts. The movsd/stosd with high bytecounts is working bypassing the cache, so you may notice that it works in the conditions which this research forced the algos to work in so they show their drawbacks on too overcomplicated design.

When the conditions which "the research forced the algos into" exceed the magic number, you only measure the technology impact of the inner loop.

This is, as I was saying, interesting, but I'm not sure how relevant it is to testing in general. The research is fine, but I disagree with some of your conclusions, especially the part about the tests done here in this forum being void as a result of it.

Antariy

  • Member
  • ****
  • Posts: 541
Re: Research on the algos design
« Reply #46 on: March 30, 2015, 08:19:03 AM »
If you read this thread, you saw the written point which claims that, in real-world conditions, the data used by the algos which fall under point 3) of this research's conclusion is never in the cache.
Quote
3) The algos which do not do more than one iteration over all the data passed to them, taking into account what is said in the two points above,
Quote
1) In the real world the data passed to the algos is almost always not in the cache, especially if it's data which doesn't get touched very frequently (many times per second).
2) The data may not even be in RAM but in the paging file, which makes access to the data even slower.

This I have already answered, so it’s difficult to understand what you really mean here. Given there is no direct physical connection between the CPU and memory, all data will be loaded into the cache. Using large buffers in the hope of flushing it may not work out as you think.

You have given an answer, yes, but it's still technically incorrect. You do not understand how the CPU cache and the cache-memory subsystem work and look at all of it as a "black magic box". "All data" will never be loaded into the cache, otherwise there would be no reason to use RAM, if the cache were big enough to contain the entire data of all programs (including the OS itself). The cache is there to speed up access to very frequently accessed data, but not more than that. If, after this entire thread and its discussion, you still don't see what is wrong in your assumptions and beliefs, how can it be proven to you? If you prefer to decide that the tests used on the forum are the right thing, and if you prefer to optimize your code based on those timings - well, that is your right to do your optimization and testing as you want. But remember, you will never be able to disprove the things said in this research - nobody will be able to disprove them, and this is not pride speaking; if you don't believe it, disprove it or find someone who feels able to disprove it, but you will search for a very long time and in the end not find such a person. So it's better to spend that time optimizing your code in the ways you think are right. After all, optimizing your code based on the testbeds used on the forum will give you a vague, but more or less right, way to optimize code, and the results you get from that will be good. The thing this research said was that you may get the same good final results without too-much-fanatic optimization attempts.


Since you can’t bypass the cache, my argument was to play along with it. If you are paranoid about the same string being in the cache on repeated calls, you may try to flip pointers:
Code: [Select]
A db "string A",0
db 32/64/128 dup(0)
B db "string B",0

However, if you look into your own test you will see that even the test macro touches memory during each iteration, so my guess is that the mantra of "million calls with the same…" may not be that much of a problem.

Nevertheless, I was not going to make any arguments about how the cache works; my argument was against your claim that the tests done in the forum used "million calls with the same…", which is simply not true.

This is not me being paranoid - this is you being stubborn and thus acting ignorantly. The example of swapping two pointers to very short strings, which will obviously still fit into the cache, and being executed, well, ok, not a million, but half a million times over each string - you will execute the test in conditions which will never happen to the code in the real world.

I don't understand what you meant about the "test macro"; still, if you look at the tests not too carelessly, you will see that the second test, which showed the similarity of the timings for every algo tested, has only 10 iterations over the same string with a size bigger than the cache. And you may decrease that number even further. Still, you continue to use citations and phrases without understanding what was said there - but do you really think this is a valid argument for claiming that the claims made in the research are not true? That's ignorance.

Well, you were not going to make any arguments (doesn't this sound funny in a discussion which is based only on arguments? If not on arguments, then on what basis will you construct your dispute?). Well, ok, you make no arguments, but you dismiss the claim of the research just because you "don't agree". Don't you see that this behaviour is the behaviour of fanatic belief in something? You will not make arguments, but you'll believe what you think, and that's all for you.

If you don't understand what is said here - whether from ignorance, inattention, or just lack of understanding - then it is even more ignorant to try to disprove it or to make your own claims about the "trueness" of the research and of the citation about "millions of calls". If you don't understand what was at the base of the research and if you don't believe it - then it is your right to do your work as you want. But it looks like it is you who keeps using citations like mantras without understanding them or the underlying things on which they are based, with a paranoid desire to (dis)prove something about points you did not understand. If you think that your example with two "swapped" pointers gets any closer to the real world - well, continue to believe that. But a test with millions of different (even short) strings passed to the algo consecutively is the only test which will show the real performance of your algo - that's what was said in the research, that's what was said by Hutch too, and that's what was implemented by Jochen in code using 200 MB of strings to test strlen, and it showed the claims here to be true. But, still, these are not arguments which you, in your fanatic beliefs, will take notice of.

Quote
So, "shortcuts" in the CPU circuit which do non-temporal direct move not messing up the cache is the reason why movsd is faster with higher byte counts. The movsd/stosd with high bytecounts is working bypassing the cache, so you may notice that it works in the conditions which this research forced the algos to work in so they show their drawbacks on too overcomplicated design.

When the conditions which "the research forced the algos into" exceed the magic number, you only measure the technology impact of the inner loop.

This is, as I was saying, interesting, but I'm not sure how relevant it is to testing in general. The research is fine, but I disagree with some of your conclusions, especially the part about the tests done here in this forum being void as a result of it.

Those "magic numbers" are nothing real with small buffers - you just measure the speed of the CPU-cache interconnection, those "magic numbers" are real only with larger strings and are not something "black boxed" but just a shortcuts in the CPU design which come to work only after some, specific for the design, byte count, and take notice - only after big bytecount, and the bigger the bytecount is, the more performant the code (MOVSD) is - the bigger the difference between SSE and MOVSD code, but, still, using some techniques which avoid cache trashing with moves, you may get the same performance with SSE code. Though, again, real world code then has no any gain from using complex and long SSE code, to get the same performance after all, as the rep movsd will provide. SSE is for number crunching - and there it might has advantages of cache help over the small data sets, and even without that help using big data sets the SSE still have gains just of more wider ALU arithmetic. But you may notice then, then for StrLen code its speed is only 2 times higher than regular non-scasb GPRs code in real world conditions. And being even 10 times overcomplicated and quasi-sophisticated, the code will still not be any faster than that in the real world. But in the testbeds used on the forum the code wil be faster even in 10 times, and the code writer will think that the development time he put into the code is not useless, even though the code in real world will not be faster more than 2 times than GPRs code. This is what you believe in - the idol of forums testbed's "millions of loops over the same short data". This idol is false, and in the real world and in any real app the code posted page earlier, being long at 68 bytes, will perform the same speed as the Intel's 800+ bytes code, you, maybe, find this "interesting", but you still continue to believe in your idol of testbed's timings. Well, there you might see the "magic promises" from your belief, but in the real application you will see the speed up which will be limited to the things you do not want to know of (instead of that you want to "disprove" them).

jj2007

  • Member
  • *****
  • Posts: 7728
  • Assembler is fun ;-)
    • MasmBasic
Re: Research on the algos design
« Reply #47 on: March 30, 2015, 10:30:18 AM »
Given there is no direct physical connection between the CPU and memory, all data will be loaded into the cache. Using large buffers in the hope of flushing it may not work out as you think.

Since you can’t bypass the cache, my argument was to play along with it. If you are paranoid about the same string being in the cache on repeated calls, you may try to flip pointers:
Code: [Select]
A db "string A",0
db 32/64/128 dup(0)
B db "string B",0

I am afraid that Alex is right here. There is a reason why I am reading multiple copies of bible.txt in the fat files thread; and the reason is precisely that modern CPUs have several data caches - see Multi-level caches. Level 1, the fastest, was 8k in a P4; today, the Haswell architecture features three levels:
Quote
L1 cache    64 KB per core
L2 cache    256 KB per core
L3 cache    2 MB to 20 MB shared

It is obvious that even if you swap pointers, your complete data set is in L1. If you use instead a 200MB file, you may still have some minor influence of L3 (typically 2MB nowadays, 1% of 200MB), but the influence of L1 and L2 becomes negligible. This is why timings differ dramatically for a) multiple accesses to a small memory area in the cache vs b) few accesses to a big memory area that must be almost completely shuffled into the cache before it can be used by the CPU. (For scale: 200 MB is roughly 3 million 64-byte cache lines, while a 2 MB L3 holds only about 32 thousand of them.)

nidud

  • Member
  • *****
  • Posts: 1408
    • https://github.com/nidud/asmc
Re: Research on the algos design
« Reply #48 on: March 30, 2015, 11:57:18 AM »
You have given an answer, yes, but it's still technically incorrect. You do not understand how the CPU cache and the cache-memory subsystem work and look at all of it as a "black magic box".

The cache system was developed by the same people who designed the standard IO streams used in most (if not all) software today. The cache has been improved over time at both ends, where faster file and CPU caches speed up the process of streaming.

The problem with these streams is usage. They are in all textbooks and samples, and in all standard libraries. They use a small buffer (the default is one page) by design and normally fetch bytes in sequence. The file cache then pays the cost of filling the buffer, and the size of the current working set fits very well in the CPU cache.

This isn't necessarily logical compared to allocating a few hundred megs to fit the whole file in memory instead of reading small chunks, but given that both software and hardware development favour this type of behaviour, the streaming continues.

This means that most binaries that swallow text files use this method, and those will be the applications that need a fast string library. This is the reason why most of these functions are written in assembly and, given the variation in the cache systems, often custom made for each CPU. The aim of the testing will then be, as Hutch said, to find one that works OK in most cases.
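
For illustration, the page-sized streaming pattern described above might look roughly like this under Win32 (the file handle is assumed to be already open; error handling is minimal):
Code: [Select]
.data?
hFile       dd ?                 ; assumed: an already-opened file handle
buffer      db 4096 dup(?)       ; one page, as in the default stream buffer
bytesRead   dd ?
.code
readloop:
    invoke  ReadFile, hFile, addr buffer, 4096, addr bytesRead, NULL
    test    eax, eax
    jz      done                 ; read error
    cmp     bytesRead, 0
    je      done                 ; end of file
    ; ... scan the 4096-byte chunk here - it fits comfortably in L1 ...
    jmp     readloop
done: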

Quote
"All data" will never be loaded into cache otherwise there is no reason to use RAM, if the cache will be so big to contain entire "all data" of all programs (including the OS itself).

Yes, all data must be loaded into the cache as explained above. However, the cache system doesn't know the size of your buffer, so there is a lot of guesswork done there. Same with the cache line fed to the CPU, where only the first chunk of data is real and the rest is guessed.

Quote
The cache is there to speed up access to very frequently accessed data, but not more than that. If, after this entire thread and its discussion, you still don't see what is wrong in your assumptions and beliefs, how can it be proven to you?

My assumptions are explained above, so I disagree with your conclusion regarding the definition of a "real world" application's use of the cache. My take is that applications with heavy use of string functions, as explained above, will most likely (by design) always work on cached data.

Quote
If you prefer to decide that the tests used on the forum are the right thing, and if you prefer to optimize your code based on those timings - well, that is your right to do your optimization and testing as you want. But remember, you will never be able to disprove the things said in this research - nobody will be able to disprove them, and this is not pride speaking; if you don't believe it, disprove it or find someone who feels able to disprove it, but you will search for a very long time and in the end not find such a person. So it's better to spend that time optimizing your code in the ways you think are right. After all, optimizing your code based on the testbeds used on the forum will give you a vague, but more or less right, way to optimize code, and the results you get from that will be good. The thing this research said was that you may get the same good final results without too-much-fanatic optimization attempts.

As I said before I don't disagree with the research but with some of the conclusions you made from the result.

Quote
This is not me being paranoid - this is you being stubborn and thus acting ignorantly. The example of swapping two pointers to very short strings, which will obviously still fit into the cache

The cache line fed to the CPU contains "guessed" data, and it may be 32/64/128 bytes long. Two small strings will then fit in there given they are close. The only way to disable the cache (in this case, from using this data) is to flush the line. This is done by putting some garbage between the strings.

Quote
and being executed, well, ok, not a million, but half a million times over each string - you will execute the test in conditions which will never happen to the code in the real world.

Well, if you search a million files for the word "million", the same string is touched millions of times in the process. However, the point is that the dataset you work with has a fixed address and (small) size and is thereby more predictable and stable.

Quote
I don't understand what you meant about the "test macro",

The cache has many layers, but the one closest to the CPU is the one holding the data currently being worked on. This is loaded when memory is touched, and if it has a different address than the previous one, the current cache line is flushed. If the same (or a spatially close) address is loaded, there may be a copy of this chunk in the current line which will be reused.

Quote
still, if you look at the tests not too carelessly, you will see that the second test, which showed the similarity of the timings for every algo tested, has only 10 iterations over the same string with a size bigger than the cache.

As said before, it will all be loaded into the cache regardless of how big it is.

Quote
If you think that your example with two "swapped" pointers gets any closer to the real world - well, continue to believe that.

If you have any other method of disabling the cache, I'm all ears.  :P

Quote
Those "magic numbers" are nothing real with small buffers - you just measure the speed of the CPU-cache interconnection, those "magic numbers" are real only with larger strings and are not something "black boxed" but just a shortcuts in the CPU design which come to work only after some, specific for the design, byte count, and take notice - only after big bytecount, and the bigger the bytecount is, the more performant the code (MOVSD) is - the bigger the difference between SSE and MOVSD code, but, still, using some techniques which avoid cache trashing with moves, you may get the same performance with SSE code.

I think you missed the point. It's MOVSB which is the fastest in this case. If you ran the test on your CPU (or mine) the result would be the opposite, so this has to do with new technology. If you are going to use real math to calculate the behaviour of the cache, you need to use a specific CPU. The behaviour differs between each model and brand.
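
For what it's worth, the line size of the specific CPU can be read at run time; a minimal sketch using CPUID leaf 1 (assumes a CPU that supports it):
Code: [Select]
    mov     eax, 1
    cpuid                        ; EBX[15:8] = CLFLUSH line size in 8-byte units
    movzx   eax, bh
    shl     eax, 3               ; eax = cache line size in bytes (typically 64)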

nidud

  • Member
  • *****
  • Posts: 1408
    • https://github.com/nidud/asmc
Re: Research on the algos design
« Reply #49 on: March 31, 2015, 05:56:36 AM »
There is a reason why I am reading multiple copies of bible.txt

  :biggrin:

Quote
modern CPUs have several data caches

So what?

The transfer speed between memory and the cache is interesting, but I made a point of the fact that this has no bearing on test results with regard to testing algos.

Slow transfer speed: no problem - same result.
Fast transfer speed: no problem - same result.

Given the testbed is testing software and not hardware, the only problem would be an unstable transfer speed.

Quote
It is obvious that even if you swap pointers, your complete data are in L1.

When data is transferred from memory to the cache, a chunk of data corresponding to the chipset is copied. This is a fast way of copying "a section of the chip" if you will, or a "cache line".

The size of this chunk is, I think, related to the size of the bus, so a 64-bit bus would have a chunk size of 64 bytes.

Code: [Select]
movdqa  xmm0,[eax]     ; load 16 bytes - pulls a full 64-byte line in from memory
movdqa  xmm1,[eax+16]  ; next 16 bytes - served from the already-loaded cache line
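
Extending that sketch to a whole 64-byte line (the register choice is arbitrary):
Code: [Select]
movdqa  xmm0,[eax]      ; first 16 bytes - triggers the 64-byte line fill
movdqa  xmm1,[eax+16]   ; the next three loads hit the already-filled line
movdqa  xmm2,[eax+32]
movdqa  xmm3,[eax+48]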

Quote
If you use instead a 200MB file, you may still have some minor influence of L3 (typically 2MB nowadays, 1% of 200MB), but the influence of L1 and L2 becomes negligible. This is why timings differ dramatically for a) multiple accesses to a small memory area in the cache vs b) few accesses to a big memory area that must be almost completely shuffled into the cache before it can be used by the CPU.

True, using the cache is faster. However, in one call (assuming a 200M-long string) none of the fetched lines will be reused, but they are nevertheless pushed into the cache.

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 4922
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Re: Research on the algos design
« Reply #50 on: March 31, 2015, 10:44:02 AM »
I wonder at some of the comments in this enduring debate. The vast majority of relevant information is contained in data and code is only a factor if it is big enough to break the code cache, a situation that is best avoided for maximum performance. Now depending on the age and type of hardware, there have been different strategies to load data from main storage into the processing component of hardware and we know that in any of the more recent processors that you have a sequence of caches that increase in speed and decrease in size as they get closer to the data processing component of the hardware.

Now if its data that is specific to the algorithm (lookup table or similar) you need to keep its size down and have it accessed so that it remains as close as possible to the processing component of the hardware. The rest is data that you are going to process. You shift from data storage to a succession of data caches that become progressively closer to the processing component of the hardware. This was done historically to improve the throughput of data rather than have to keep accessing the data in main storage or as a single block of allocated memory.

Now apply that to testing algorithms and you have some guidelines as to what you need to do to create a test for different algorithms that are being applied to the same task. At its simplest if you need to test combinations of mnemonics without regard to the data being processed then short loops in level 1 cache can do the job OK as long as you understand the limitation of this testing technique.
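
A minimal sketch of such an L1-resident loop using the forum's counter macros (the tested StrLen proc, the test string, and the iteration count are assumptions):
Code: [Select]
    counter_begin 1000000, HIGH_PRIORITY_CLASS
    push    offset shortstring   ; the same small, already-cached string every pass
    call    StrLen               ; hypothetical stdcall strlen under test
    counter_end
    print   ustr$(eax), " cycles", 13, 10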

At the other end of the scale, if you have to process very large data in a linear manner, you need to construct a test that emulates the conditions you will use the algorithm for. A gigabyte sized data file does the job here as it emulates the cache load and in some cases the write back of data to either the same memory or an allocated block at another address. Another consideration here is the option in later SSE instructions to make non temporal writes to avoid two way cache pollution. There are contexts where cached reads and non temporal writes give big performance gains if executed properly.
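
A rough sketch of the cached-read / non-temporal-write pattern mentioned above (register usage and the 16-byte alignment are assumptions):
Code: [Select]
; esi = source, edi = destination, ecx = byte count / 16; both buffers 16-byte aligned
copyloop:
    movdqa  xmm0, [esi]          ; normal cached read
    movntdq [edi], xmm0          ; streaming store - does not pollute the cache
    add     esi, 16
    add     edi, 16
    dec     ecx
    jnz     copyloop
    sfence                       ; order the non-temporal stores before continuing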

Between these two ends is a massive range of different tasks that differ in terms of data access and iteration rate. A reasonable amount of code is called at an incidental level rather than high frequency hammering, and the test here is attack, not necessarily linear read time. Designing a test for this type of condition is a lot harder than it looks, because while you may be able to design an algorithm that is a super star in the test bed, unless you match the algorithm to the task being performed it runs the risk of getting its ass kicked by another algorithm that better matches the task.
hutch at movsd dot com
http://www.masm32.com    :biggrin:  :biggrin:

jj2007

  • Member
  • *****
  • Posts: 7728
  • Assembler is fun ;-)
    • MasmBasic
Re: Research on the algos design
« Reply #51 on: March 31, 2015, 09:50:56 PM »
True, using the cache is faster. However, in one call (assuming a 200M-long string) none of the fetched lines will be reused, but they are nevertheless pushed into the cache.

And this is the exact reason why processing already cached data (e.g. two little strings) is different from processing 200MB that must be pushed into the cache before they become available to the CPU.

nidud

  • Member
  • *****
  • Posts: 1408
    • https://github.com/nidud/asmc
Re: Research on the algos design
« Reply #52 on: April 01, 2015, 02:04:49 AM »
And this is the exact reason why processing already cached data (e.g. two little strings) is different from processing 200MB that must be pushed into the cache before they become available to the CPU.

According to your philosophy of "two little strings" they may each be two million bytes long and still fit in a 4M cache, but we speak of different things here.

The pushing in this case happens after you finish processing the data you already have copied from memory. It shows that if you don’t play along with the system you will be punished.

Quote
If you use instead a 200MB file, you may still have some minor influence of L3 (typically 2MB nowadays, 1% of 200MB), but the influence of L1 and L2 becomes negligible.

I think you'll find it's the other way around. All chunks from the 200M buffer will be copied and processed in the fast lane. It's only the waste product that will populate (and overwrite) the rest of the cache.

When you stream the data the same thing happens. Data will be copied from the buffer (memory), processed in the fast lane, and will occupy 4096 bytes of L1. The difference is that when the stream refills the buffer, the content of this memory region changes, and the corresponding region in L1 will be invalidated and not propagated into the rest of the cache.

Quote
Another consideration here is the option in later SSE instructions to make non temporal writes to avoid two way cache pollution. There are contexts where cached reads and non temporal writes give big performance gains if executed properly.

If the working set of data in the application fits well in the cache, there may be a performance gain in just not overwriting it. There are some library functions that do this, but I have no idea how they work. Maybe they do a block copy and trash the source as they go along.

jj2007

  • Member
  • *****
  • Posts: 7728
  • Assembler is fun ;-)
    • MasmBasic
Re: Research on the algos design
« Reply #53 on: April 01, 2015, 05:52:49 AM »
The pushing in this case happens after you finish processing the data you already have copied from memory.

Oh really?

include \masm32\MasmBasic\MasmBasic.inc      ; download
.data
S1      db "Hello, this is a simple string intended for testing string algos. It has 132 characters without zero, as the text file loaded below.", 0
S2      db "Hello, this is a simple string intended for testing string algos. It has 132 characters without zero, as the text file loaded below.", 0
  Init
  Recall "Bible200.txt", L$()
  mov ebx, eax      ; #of strings
  PrintCpu 0
  Print "Little exercise showing the influence of the cache:"
  REPEAT 3
      xor ecx, ecx      ;string counter
      xor edi, edi      ; len counter
      NanoTimer()
      .Repeat
            add edi, Len(L$(ecx))
            inc ecx
      .Until ecx>=ebx
      Print Str$("\n%i ms to get the length of ", NanoTimer(ms)), Str$("%i strings", ecx), Str$(" with %i total bytes", edi)
     
      xor ecx, ecx      ;string counter
      xor edi, edi      ; len counter
      NanoTimer()
      .Repeat
            add edi, Len(offset S1)
            inc ecx
      .Until ecx>=ebx
      Print Str$("\n%i ms ", NanoTimer(ms)), Str$("to get %i times the length of 1 string", ecx), Str$(" with %i total bytes", edi)
     
      xor ecx, ecx      ;string counter
      xor edi, edi      ; len counter
      NanoTimer()
      .Repeat
            add edi, Len(offset S1)
            inc ecx
            add edi, Len(offset S2)
            inc ecx
      .Until ecx>=ebx
      Print Str$("\n%i ms ", NanoTimer(ms)), Str$("to get %i times the length of 2 strings", ecx/2), Str$(" with %i total bytes\n", edi)
  ENDM
  Inkey "And now??"
  Exit
end start


Output:
Code: [Select]
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
Little exercise showing the influence of the cache:
234 ms to get the length of 6076600 strings with 803401800 total bytes
67 ms to get 6076600 times the length of 1 string with 802111200 total bytes
69 ms to get 3038300 times the length of 2 strings with 802111200 total bytes

236 ms to get the length of 6076600 strings with 803401800 total bytes
67 ms to get 6076600 times the length of 1 string with 802111200 total bytes
68 ms to get 3038300 times the length of 2 strings with 802111200 total bytes

237 ms to get the length of 6076600 strings with 803401800 total bytes
66 ms to get 6076600 times the length of 1 string with 802111200 total bytes
68 ms to get 3038300 times the length of 2 strings with 802111200 total bytes

nidud

  • Member
  • *****
  • Posts: 1408
    • https://github.com/nidud/asmc
Re: Research on the algos design
« Reply #54 on: April 01, 2015, 07:17:10 AM »
The pushing in this case happens after you finish processing the data you already have copied from memory.

Oh really?

 :biggrin:

Yes, really.

In addition to the slow copy from memory to cache you are also hit by a massive reshuffle of the cache in the same process. You’re being punished.

jj2007

  • Member
  • *****
  • Posts: 7728
  • Assembler is fun ;-)
    • MasmBasic
Re: Research on the algos design
« Reply #55 on: April 01, 2015, 08:48:40 AM »
Believe what you want to believe. Why am I wasting my time here? ::)

rrr314159

  • Member
  • *****
  • Posts: 1383
Re: Research on the algos design
« Reply #56 on: April 01, 2015, 09:41:58 AM »
Why am I wasting my time here? ::)

... when I could be wasting it somewhere else? :biggrin:
I am NaN ;)

nidud

  • Member
  • *****
  • Posts: 1408
    • https://github.com/nidud/asmc
Re: Research on the algos design
« Reply #57 on: April 01, 2015, 09:51:32 AM »
How about believing your own test results?

An 8M buffer will overflow a 4M cache in a loop. The last cache lines in the loop flush out the lines from the first part.
Code: [Select]
.data?
long db 800000h dup(?)
endlong label byte   ; assumed end-of-buffer marker, referenced by the loops below

Using the same lengths as before
Code: [Select]
.data
length_table label dword
dd 1,5,9,16,19,22,28,36,41,35,27,21,18,15,8,4,2
; dd 42,45,48,49,56,59,62,68,76,81,75,67,61,58,55,50,47,44
; dd 600,601,605,609,716,719,822,828,836,999,905,827,821,718,715,608,604,602
dd 0

Make strings from table
Code: [Select]
  lea     edi,long
  mov     ecx,size_d
  mov     eax,'x'
  rep     stosb                  ; fill the whole buffer with 'x'
  mov     byte ptr [edi-1],0     ; terminate the buffer
  lea     edi,long
l1:
  lea     esi,length_table       ; restart the table when its 0 end-marker is hit
l2:
  lodsd
  test    eax,eax
  jz      l1
  add     edi,eax                ; step forward by the next length from the table
  cmp     edi,offset endlong
  jnb     l3
  mov     [edi],cl               ; ecx is 0 after rep stosb: drop a terminator here
  jmp     l2
l3:

Loop through the 8M buffer x times
Code: [Select]
  counter_begin 1000, HIGH_PRIORITY_CLASS
  mov     edi,count              ; loop counter (count defined elsewhere)
  mov     ebx,esp                ; save esp - restored after each call to drop the argument
l1:
  lea     esi,long               ; (re)start at the beginning of the 8M buffer
l2:
  push    esi
  call    ProcAddress            ; strlen variant under test, length returned in eax
  mov     esp,ebx                ; restore the stack
  lea     esi,[esi+eax+1]        ; advance to the next string (skip the terminator)
  cmp     esi,offset endlong
  jnb     l1                     ; wrap around at the end of the buffer
  dec     edi
  jnz     l2
  counter_end

The result for each of the lengths
Code: [Select]
result: string length 2..40 - no cache
   424283 cycles - (105) proc_1: SSE 32
   433102 cycles - (104) proc_2: AxStrLenSSE
   445574 cycles - (673) proc_8: SSE Intel Silvermont
   474455 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
   535386 cycles - (  0) proc_0: crt_strlen
   563018 cycles - (818) proc_9: SSE Intel Atom

result: string length 40..80 - no cache
   493896 cycles - (105) proc_1: SSE 32
   497500 cycles - (673) proc_8: SSE Intel Silvermont
   552300 cycles - (104) proc_2: AxStrLenSSE
   580440 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
   740700 cycles - (818) proc_9: SSE Intel Atom
  1002596 cycles - (  0) proc_0: crt_strlen

result: string length 600..1000 - no cache
  3715463 cycles - (105) proc_1: SSE 32
  4055950 cycles - (673) proc_8: SSE Intel Silvermont
  4063407 cycles - (104) proc_2: AxStrLenSSE
  4132161 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
  4496775 cycles - (818) proc_9: SSE Intel Atom
  7245457 cycles - (  0) proc_0: crt_strlen

Using the whole buffer
Code: [Select]
result: string length 8M - no cache
  6236043 cycles - (673) proc_8: SSE Intel Silvermont
  6317136 cycles - (818) proc_9: SSE Intel Atom
  7438441 cycles - (105) proc_1: SSE 32
  8087361 cycles - (104) proc_2: AxStrLenSSE
  8191742 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
 14104841 cycles - (  0) proc_0: crt_strlen

Here is the same test using small strings (cached). Note that the difference in real time between these tests will be large - this one is obviously much faster. The result will differ depending on how the whole loop takes advantage of the cache, so this will tilt the algos a bit.

Code: [Select]
result: string length 2..40 - using cache
   263056 cycles - (673) proc_8: SSE Intel Silvermont
   282322 cycles - (105) proc_1: SSE 32
   327211 cycles - (104) proc_2: AxStrLenSSE
   330122 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
   373559 cycles - (818) proc_9: SSE Intel Atom
   424289 cycles - (  0) proc_0: crt_strlen

result: string length 40..80 - using cache
   330421 cycles - (105) proc_1: SSE 32
   389739 cycles - (673) proc_8: SSE Intel Silvermont
   456273 cycles - (818) proc_9: SSE Intel Atom
   469557 cycles - (104) proc_2: AxStrLenSSE
   482406 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
   939124 cycles - (  0) proc_0: crt_strlen

result: string length 600..1000 - using cache
  1327363 cycles - (673) proc_8: SSE Intel Silvermont
  1479271 cycles - (105) proc_1: SSE 32
  1522073 cycles - (818) proc_9: SSE Intel Atom
  2040309 cycles - (104) proc_2: AxStrLenSSE
  2084028 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
  5785043 cycles - (  0) proc_0: crt_strlen

The question was which of these two conditions was the most realistic one. I assumed the latter.