Do you realize the last 12 posts in this thread are yours? Figured I'd break your streak :)
But it's also interesting that you're using a float instead of an integer in the for-next loop, because I've just done something similar with timing macros. I'll soon post a "faster" 32-bit algo in the laboratory, this time using MichaelW's timers and doing everything "by the book", but with "massively parallel" use of the xmm registers. I'd gotten used to getting 64 bits back from the rdtsc call and couldn't stand being confined to 32 bits again (a range of roughly 1 second vs. 100 years), so I wrote a 32-bit version of the timers that uses a REAL8 to hold the rdtsc result. It works fine and is fast enough, and this way you can easily add, average, print out, etc. the timing figures.
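To show the REAL8 idea in a few lines, here's a rough sketch in C rather than MASM. It is only an illustration, not the actual timer code: the helper names `tsc_to_real8` and `tsc_average` are made up, and it assumes the two 32-bit halves that rdtsc leaves in EDX:EAX have already been read into `hi` and `lo`. A double holds integers exactly up to 2^53, so the cycle count stays exact until the counter gets past that.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: merge the high (EDX) and low (EAX) halves of a
   64-bit cycle count into one double (REAL8).
   The result is exact as long as the count is below 2^53. */
static double tsc_to_real8(uint32_t hi, uint32_t lo)
{
    return (double)hi * 4294967296.0 + (double)lo;   /* hi * 2^32 + lo */
}

/* Hypothetical helper: average n timing samples held as doubles. */
static double tsc_average(const double *samples, size_t n)
{
    assert(n > 0);                 /* averaging zero samples is undefined */
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    return sum / (double)n;
}
```

With the readings stored like this, the elapsed count is just `stop - start`, even when the low 32 bits have wrapped in between, and adding or averaging runs is plain floating-point arithmetic.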
Anyway, see you later!