Hi Steve!
Things were set up better than I expected, so it took less time
than I thought. Tried to run on four systems. Did not work well
on Windows 8.1. Either Windows UAC or the antivirus hated it.
Cut and paste results from run.bat in the error message boxes.
Ended up locking up the Command Prompt once. And the second
program did not seem to run.
Thank you for the comprehensive tests. As usual, it is interesting how differently the CPUs behave.
The PIII is clearly affected by a patch close to the running code, but the second program did not determine the queue depth in every test. So either the queue is too short or, more likely, the CPU just knows that the patched part isn't the code being executed.
The PMMX (you mean the Win98 machine there?) is also affected by a patch close to the running code, and the stall is higher than on the PIII, but still only ~3 cycles per patched dword. That is probably due to short prefetching and not due to advanced prefetching/decoding.
The Pentium M is affected too, and both it and the PIII show the effect whenever the executable page is changed, even when the patching code is not near the patched place. So the stalls are probably caused mostly by the CPU checking what was actually patched rather than by a prefetcher refill. Is this model of the Pentium M based on the PIII core?
The i7 is affected too, but only very slightly; probably these CPUs more or less just know that the patched place isn't the code being executed.
As a side note: it seems that the PIV, particularly the Prescotts, are the slowest CPUs, with TOOOOO deep pipelines, and the prefetch logic was really crude for such long pipelines. The CPU doesn't actually know what was patched; it just brutally refills the queue using a fixed-size length check, and the refill is very slow: more than 50 cycles for one patch.
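To make the "cycles per patch" figures above comparable between machines, here is a rough sketch of how such a number can be derived from two timed runs. It assumes the stall is simply the difference between a patched and an unpatched run divided by the number of patches; the function name and parameters are made up for illustration and are not taken from the test programs.

```c
#include <stdint.h>

/* Hypothetical helper (not from the test apps): average extra cycles
   per patch, assuming
     baseline_cycles = cycle count of the loop without any patching,
     patched_cycles  = cycle count of the same loop with patching,
     patches         = number of dwords patched during the run. */
double stall_per_patch(uint64_t baseline_cycles,
                       uint64_t patched_cycles,
                       uint64_t patches)
{
    /* Guard against division by zero and against noisy runs where the
       patched loop happened to be faster than the baseline. */
    if (patches == 0 || patched_cycles < baseline_cycles)
        return 0.0;
    return (double)(patched_cycles - baseline_cycles) / (double)patches;
}
```

With made-up totals of 1000 cycles unpatched and 1300 cycles for 100 patches, this gives 3 cycles per patch.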
Also, the second app probably needs to be rewritten a bit so that it patches the code that is actually executing, to see the stalls more precisely on every CPU.
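Just to illustrate the idea, here is a minimal C sketch, not the real test. It assumes Linux on x86 and a kernel that allows a writable and executable page; everything in it (the function name, the stub bytes) is made up for this example. It builds a tiny "mov eax, imm32 / ret" stub, runs it, patches the immediate dword, and runs it again. Note that it patches a stub between two calls, not bytes just ahead of the instruction pointer inside the same routine; the real test would have to do the latter in asm and put the timer around it.

```c
#include <stdint.h>
#include <string.h>

#if defined(__linux__) && (defined(__x86_64__) || defined(__i386__))
#include <sys/mman.h>
#endif

/* Hypothetical demo: returns the value produced by the stub after it
   has been patched, or -1 if the platform is unsupported or the
   writable+executable mapping is refused. */
int64_t run_patched(uint32_t before, uint32_t after)
{
#if defined(__linux__) && (defined(__x86_64__) || defined(__i386__)) \
    && defined(MAP_ANONYMOUS)
    /* B8 imm32 = mov eax, imm32 ; C3 = ret */
    unsigned char stub[6] = { 0xB8, 0, 0, 0, 0, 0xC3 };
    unsigned char *code = mmap(NULL, 4096,
                               PROT_READ | PROT_WRITE | PROT_EXEC,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (code == MAP_FAILED)
        return -1;

    memcpy(stub + 1, &before, sizeof before);
    memcpy(code, stub, sizeof stub);

    uint32_t (*fn)(void) = (uint32_t (*)(void))(uintptr_t)code;

    uint32_t first = fn();                   /* executes the original code */
    memcpy(code + 1, &after, sizeof after);  /* patch one dword in place   */
    uint32_t second = fn();                  /* executes the patched code  */

    munmap(code, 4096);
    return (first == before) ? (int64_t)second : -1;
#else
    (void)before;
    (void)after;
    return -1;
#endif
}
```

On x86 no explicit cache flush is needed between the patch and the second call, which is exactly why the CPU has to do the checks discussed above.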