These are tests for checking patch stalls and the prefetch length. Results from different machines are appreciated.
The code is self-explanatory. If you want to rebuild the programs, you need to pass the /SECTION:SEC1,rwe argument on the linker's command line.
Patch_Stalls_Test.exe measures how long the stalls are in code that gets modified. For simplicity, the modifying code and the code being timed are the same code, located in the same page. You can also run the modifying code from another page to see whether patching slows down the patching code itself: to do this, specify 1 as the command line argument to the EXE. It is also possible to redirect the modification from the page where the code runs to another page, in the data section: to do this, specify 4 as the command line argument. Comparing that run with the run where 1 is specified shows that patching an executable page does not by itself slow the code down, unless the running code lies within the prefetch length of the patched place.
To make things simpler, just run Run.bat (that is nearly too many "run" words for one sentence), but the second executable (read below) has to be run manually from the command line, so that its output can be redirected.
The results:
Test with code self-patch the page where it runs
Press a key after number appears
cycles for SEC1: 5922186048
Test with the code from the same page but patching the data in the other section
Press a key after number appears
cycles for SEC1: 104518976
Test with call to the external page which patches the other page with the code
Press a key after number appears
cycles for SEC2: 105681304
Test with call to the external page which patches the data from the other page
Press a key after number appears to EXIT
cycles for SEC2: 104268832
How to read the results: In the first test, the running code is located in the same page as the patched place and close to it, so it is in the prefetch queue. The patching code is the same code, for simplicity (the program was intended to run on old machines as well, so real multithreading is not required), but this makes no difference. You can see that the code runs very slowly in comparison with the second test.
The second test runs the same code, but with the patch redirected to a page in the data section. It runs more than 50 times faster (5922186048 vs. 104518976 cycles), so the stall caused by a patch can be roughly estimated at 50+ cycles for a full or partial refill of the queue (which of the two depends on the implementation and is not known without the CPU's design documentation).
The third test patches the executable page, but the patching code is located in another executable page. The results are the same as in the second test, which shows that the EXECUTABILITY of the section's pages is not what makes the CPU do this: it refills the code not because of the executable flag, but only when the patch lands in the prefetch queue. So, if the patched place (code or data) is located outside the queue, there will be no stalls when patching. To check how long the queue is, run the second executable from the archive, described a bit below.
The fourth test is just a mix of the second and third tests: code running from a page other than the one holding the main test piece patches a byte in a page of the data section, which is not marked executable. The speed is the same as in tests 2 and 3, so this is one more example showing that the executable mark is not the reason for the refill. The reason is the place the CPU is currently executing and the queue for that place; the CPU watches only this place, so there are no stalls when patching outside the queue.
So, if the CPU refills the prefetch queue after a patch, this test will show it, since it runs in protected mode with paging, under Windows. To check, it is enough to look at the first two tests: if the numbers are drastically different, the CPU does refill (even if it doesn't in real mode); if the numbers are nearly equal, it doesn't.
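For anyone who wants to compare their own numbers quickly, here is a small Python sketch of the check described above. It only does arithmetic on the cycle counts printed by the test. The 10x cut-off for "drastically different" and the function names are my own assumptions for illustration, not part of the test programs.

```python
# Cycle counts copied from the results posted above.
RESULTS = {
    "self-patch, same page": 5922186048,
    "same page, patching data": 104518976,
    "external page, patching code": 105681304,
    "external page, patching data": 104268832,
}

# Hypothetical cut-off for "drastically different"; adjust to taste.
REFILL_THRESHOLD = 10.0


def slowdown(cycles, baseline_cycles):
    """How many times slower a run is than the baseline run."""
    return cycles / baseline_cycles


def cpu_refills(test1_cycles, test2_cycles, threshold=REFILL_THRESHOLD):
    """True if test 1 is at least `threshold` times slower than test 2."""
    return slowdown(test1_cycles, test2_cycles) >= threshold


baseline = RESULTS["same page, patching data"]
for name, cycles in RESULTS.items():
    print(f"{name}: {slowdown(cycles, baseline):.2f}x")
print("refill detected:", cpu_refills(RESULTS["self-patch, same page"], baseline))
```

With the numbers above, the first test comes out between 56 and 57 times slower than the second, while tests 3 and 4 stay within about 1% of it.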
To check how long the queue is, I wrote the second program in the archive, Patch_Stalls_Test_Prefetch_Length.exe.
It runs the same test as the first program, but sequentially increments the byte pointer at which it patches the page. The idea is that once the pointer moves out of the range of the prefetch queue, the CPU will no longer refill it after every patch, so there will be no slowdown, and this will show up in the results as a drastic difference in the timings of the running code.
The output of the program is huge, so please redirect it to a file, open the file in a text editor, and find where the notable difference in the timings occurs. It is very easy to spot: first there is a bunch of similar numbers after "cycles:", and then another bunch of numbers with much smaller values. Please copy the lines around the "leap" and paste them to the forum. You may also save your text file under the name of your CPU, zip it and attach it to your post. If you see some interesting pattern in the difference (maybe some CPUs have some "interleaving" in the queue, for instance), the entire file will be interesting.
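If you don't want to hunt for the leap by eye, a short Python sketch like the one below can find it in the redirected output. It assumes every line has the form "Address: XXXXXXXX, cycles: XXXXXXXX" with both fields in hex (judging by the digits in the posted lines), and that a drop of 10x or more counts as the leap. The function names and the factor are made up for this example; the sample lines are the ones from the results cut in this post.

```python
import re

# Assumed line format: "Address: 004023FE, cycles: 035976F0" (both fields hex).
LINE_RE = re.compile(r"Address:\s*([0-9A-Fa-f]+),\s*cycles:\s*([0-9A-Fa-f]+)")


def parse_log(text):
    """Return a list of (address, cycles) pairs parsed from the output text."""
    return [(int(addr, 16), int(cyc, 16)) for addr, cyc in LINE_RE.findall(text)]


def find_leap(samples, factor=10):
    """Return the first address whose cycle count is at least `factor`
    times lower than the previous line's count, or None if there is none."""
    for (_, prev_cycles), (addr, cycles) in zip(samples, samples[1:]):
        if cycles * factor <= prev_cycles:
            return addr
    return None


SAMPLE = """\
Address: 004023FE, cycles: 035976F0
Address: 004023FF, cycles: 0355BF18
Address: 00402400, cycles: 000F4680
Address: 00402401, cycles: 000FDDC0
"""

leap = find_leap(parse_log(SAMPLE))
print("leap at:", f"{leap:08X}" if leap is not None else "not found")
```

For a real run, replace SAMPLE with the contents of your redirected file, for example `open("results.txt").read()` (the file name is just a placeholder).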
The results cut:
Address: 004023FE, cycles: 035976F0
Address: 004023FF, cycles: 0355BF18
Address: 00402400, cycles: 000F4680
Address: 00402401, cycles: 000FDDC0
The page starts at 402000, so for the Celeron D310 the prefetch length is 400h (1024) bytes, and the CPU counts it not from the EIP value but from the start of the page. We can see this because patching at 400h and beyond (402400 and above) does not cause stalls. The testing code is in the same page and takes up some space in it, so the running code sits at a non-zero offset from the start of the page, yet the boundary falls at a "round" value of 400h. So the CPU prefetches not relative to the offset of the currently running code, but relative to the start of the page.
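The arithmetic behind that boundary can be written out as a tiny Python sketch. It assumes the usual 4 KB x86 page size; the helper names are hypothetical, and the address is the first fast one from the results cut above.

```python
PAGE_SIZE = 0x1000  # assumed 4 KB x86 page


def page_base(addr):
    """Start address of the page that contains addr."""
    return addr & ~(PAGE_SIZE - 1)


def page_offset(addr):
    """Offset of addr from the start of its page."""
    return addr & (PAGE_SIZE - 1)


first_fast = 0x00402400  # first patched address that no longer stalls

print(f"page base: {page_base(first_fast):08X}")
print(f"boundary offset: {page_offset(first_fast):X}h = {page_offset(first_fast)} bytes")
```

Feeding in the leap address from another CPU gives its boundary offset the same way.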
Also some side notes: knowing the queue length makes it possible to solve the "code placement" problem. Given the queue length and the address of the tested code, you can check what lies near it in the page.
But the proposal I suggested, using a separate code section for every tested algo in the sources, solves this problem without any computation: every algo will simply sit in its own place, and the prefetch queue will be filled with its own code.
As for an unconditional jump over data placed in the code: it doesn't help to avoid stalls. The CPU still refills the prefetch queue whenever the patch lands in the current queue range.