The MASM Forum

General => The Laboratory => Topic started by: Antariy on January 16, 2015, 06:47:37 AM

Title: Prefetch stalls and queue length tests
Post by: Antariy on January 16, 2015, 06:47:37 AM
These are tests that check for patch stalls and measure the prefetch length. Results from different machines are appreciated :t

The code is self-explanatory. If you want to rebuild the programs, you need to pass /SECTION:SEC1,rwe on the linker's command line.

Patch_Stalls_Test.exe measures how long the stalls are in code that gets modified. For simplicity, the modifying code and the code being timed are the same code, located in the same page. You can also run the modifying code from another page and see whether patching the code slows the patching code itself down: to do this, specify 1 as the command line argument to the EXE. It is also possible to redirect the modification away from the page where the code runs to another page in the data section. This shows that patching the executable page by itself does not slow the code down unless the running code lies within the prefetch length of the patched place; just compare the results of this run with the run where 1 is specified on the command line. To redirect the modification to the data section, specify 4 as the command line argument.

To simplify things, run Run.bat (so many "run" words close together :biggrin:). The second executable (described below) has to be run manually from the command line so that its output can be redirected.

The results:

Test with code self-patch the page where it runs
Press a key after number appears
cycles for SEC1: 5922186048

Test with the code from the same page but patching the data in the other section

Press a key after number appears
cycles for SEC1: 104518976

Test with call to the external page which patches the other page with the code
Press a key after number appears
cycles for SEC2: 105681304

Test with call to the external page which patches the data from the other page
Press a key after number appears to EXIT
cycles for SEC2: 104268832


How to read the results: in the first test the running code is located in the same page as the patched place and close to it, so the patched place is in the prefetch queue. The patching code is the same code for simplicity (the program was intended to run on old machines as well, so real multithreading is not required), but this makes no difference. You can see that the code runs very slowly in comparison with the second test.

The second test runs the same code, but with the patch redirected to a page in the data section. It runs more than 50 times faster, so the patch stall can be roughly estimated at 50+ cycles for a full or partial refill of the queue (which one depends on the implementation and cannot be known without the CPU design documentation).

The third test patches the executable page, but the patching code is located in another executable page. The results are the same as in the second test, which shows that the EXECUTABILITY of the section's pages is not what triggers the stall: the CPU refills not because of the executable flag, but only when the patch lands in the prefetch queue. So, if the patched place (code or data) is located outside the queue, there are no stalls when patching. To check how long the queue is, run the second executable from the archive, described a bit below.

The fourth test is just a mix of the second and third tests: code running from a different page than the main test code patches a byte in a page of the data section, which has no executable flag. The speed is the same as in tests 2 and 3, so this is one more example that the executable flag is not the reason for the refill. The reason is the place the CPU is currently executing and the queue for that place; the CPU watches only this place, so there are no stalls when patching outside the queue.


So, if the CPU refills the prefetch queue after a patch, this test will show it when running in protected mode with paging, in the Windows environment. To check, it is enough to look at the first two tests: if the numbers are drastically different, the CPU does refill (even if it doesn't in real mode); if the numbers are nearly equal, the CPU doesn't refill.
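
The comparison of the first two tests can be written down as a small calculation. The snippet below is only an illustrative sketch in Python and is not part of the archive: the function name `refill_slowdown` and the cut-off `DRASTIC_RATIO` are assumptions made for the example, while the cycle counts are the Celeron D310 numbers posted above.

```python
def refill_slowdown(same_page_cycles, data_page_cycles):
    """Ratio of test 1 (self-patch in the running page) to test 2 (patch redirected to data)."""
    return same_page_cycles / data_page_cycles

# Hypothetical cut-off: anything above it is treated as "drastically different".
DRASTIC_RATIO = 10.0

# Cycle counts of the first two tests from the results above (Celeron D310).
celeron_d310 = refill_slowdown(5922186048, 104518976)

print(f"slowdown: {celeron_d310:.1f}x")
if celeron_d310 > DRASTIC_RATIO:
    print("the CPU refills the queue after a patch")
else:
    print("no visible refill")
```

Where exactly "drastically different" starts is a judgment call; a ratio close to 1 means no visible refill, and the value chosen here only separates the two obvious cases.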


To check how long the queue is, I wrote the second program in the archive, Patch_Stalls_Test_Prefetch_Length.exe.

It runs the same test as the first program, but sequentially increments the byte pointer at which it patches the page. The idea is that once the pointer moves out of the range of the prefetch queue, the CPU will no longer refill the queue after every patch, so there will be no slowdown, and the results will show this as a drastic difference in the timings of the running code.

The output of the program is huge, so please redirect it to a file, open the file in a text editor, and find where the notable difference in the timings occurs. It is easy to spot: first there is a run of similar numbers after "cycles:", and then another run of much smaller numbers. Please copy the lines around this "leap" and paste them to the forum. You may also save the text file under the name of your CPU, zip it and attach it to your post if you see some interesting patterns; maybe some CPUs have some "interleaving" in the queue, for instance, in which case the entire file will be interesting.

The results cut:

Address: 004023FE, cycles: 035976F0
Address: 004023FF, cycles: 0355BF18
Address: 00402400, cycles: 000F4680
Address: 00402401, cycles: 000FDDC0


The page starts at 402000, so for the Celeron D310 the prefetch length is 400h bytes, and the CPU measures it not from the EIP value but from the start of the page. We can see this because patching at offset 400h and beyond (402400 and above) does not cause stalls. The test code is in the same page and takes up some of it, so the patched area does not start at offset zero from the page start, yet the boundary falls on a "round" value of 400h. So the CPU prefetches relative to the page start, not relative to the offset of the currently running code.
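
Finding the "leap" in a redirected dump can also be automated. The following is a minimal Python sketch, not part of the archive; the helper names `parse_dump` and `find_leap` are made up for this example, and it assumes that both fields of each line are 8-digit hexadecimal numbers, as in the cut above.

```python
import re

# One dump line looks like: "Address: 004023FE, cycles: 035976F0" (both fields hex).
LINE_RE = re.compile(r"Address:\s*([0-9A-Fa-f]{8}),\s*cycles:\s*([0-9A-Fa-f]{8})")

def parse_dump(text):
    """Return a list of (address, cycles) pairs; incomplete lines are skipped."""
    return [(int(a, 16), int(c, 16)) for a, c in LINE_RE.findall(text)]

def find_leap(samples):
    """Return (address, ratio) for the line with the largest drop from the previous line."""
    best_addr, best_ratio = None, 0.0
    for (_, prev_cycles), (addr, cycles) in zip(samples, samples[1:]):
        ratio = prev_cycles / max(cycles, 1)
        if ratio > best_ratio:
            best_addr, best_ratio = addr, ratio
    return best_addr, best_ratio

PAGE_START = 0x402000  # page start as stated in the post

dump = """\
Address: 004023FE, cycles: 035976F0
Address: 004023FF, cycles: 0355BF18
Address: 00402400, cycles: 000F4680
Address: 00402401, cycles: 000FDDC0
"""

leap_addr, leap_ratio = find_leap(parse_dump(dump))
print(f"leap at {leap_addr:08X}, offset {leap_addr - PAGE_START:X}h from the page start")
```

For a real run, the `dump` string would be replaced with the contents of the redirected output file. Picking the single largest drop is a simplification; a dump with several steps or an "interleaving" pattern would need a closer look by eye.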

Some side notes: knowing the queue length makes it possible to solve the "code placement" problem. Given the queue length and the address of the tested code, you can check what lies near it in the page.

But my proposal is to use a separate code section for every tested algo in the sources, which solves this problem without any computation: every algo will sit in its own place, and the prefetch queue will be filled with its code only.

As for an unconditional jump over data placed in the code: it doesn't help to avoid stalls. The CPU still refills the prefetch queue whenever the patch lands in the current queue range.
Title: Re: Prefetch stalls and queue length tests
Post by: dedndave on January 16, 2015, 07:48:49 AM
P-4 Prescott w/htt @ 3.00 GHz
Test with code self-patch the page where it runs
Press a key after number appears
cycles for SEC1: 6016809188

Test with the code from the same page but patching the data in the other section

Press a key after number appears
cycles for SEC1: 101633866

Test with call to the external page which patches the other page with the code
Press a key after number appears
cycles for SEC2: 101628263

Test with call to the external page which patches the data from the other page
Press a key after number appears to EXIT
cycles for SEC2: 101085968
Title: Re: Prefetch stalls and queue length tests
Post by: Antariy on January 16, 2015, 08:07:29 AM
Thank you, Dave! :t

How long is the queue on your CPU? The second exe from the archive, Patch_Stalls_Test_Prefetch_Length.exe, reports that. Just redirect its output to a file and attach it to your post, if you would like to test that.
Title: Re: Prefetch stalls and queue length tests
Post by: dedndave on January 16, 2015, 08:13:42 AM
that EXE runs forever - lol
Title: Re: Prefetch stalls and queue length tests
Post by: hutch-- on January 16, 2015, 08:16:08 AM
Alex,


Test with code self-patch the page where it runs
Press a key after number appears
cycles for SEC1: 116407773
Test with the code from the same page but patching the data in the other section
Press a key after number appears
cycles for SEC1: 114670101
Test with call to the external page which patches the other page with the code
Press a key after number appears
cycles for SEC2: 115656843
Test with call to the external page which patches the data from the other page
Press a key after number appears to EXIT
cycles for SEC2: 115897284


Hint: when redirecting output, use STDERR for messages that should be seen on the screen while the program is running. StdOut is what gets redirected to the file.
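
The split between the two streams can be illustrated with a short sketch. This is Python rather than the MASM/MSVCRT code of the test programs, and the function names `report_progress` and `report_result` are hypothetical; the point is only which stream each kind of message goes to.

```python
import sys

def report_progress(message):
    """Status text goes to stderr, so it stays on the console when stdout is redirected."""
    print(message, file=sys.stderr, flush=True)

def report_result(address, cycles):
    """Result lines go to stdout, which is what `> results.txt` captures."""
    print(f"Address: {address:08X}, cycles: {cycles:08X}")

report_progress("Running, this can take a few minutes...")
report_result(0x00402400, 0x000F4680)
sys.stdout.flush()  # make sure buffered results reach the file before waiting or exiting
```

Run as `python sketch.py > results.txt` (the script name is arbitrary), the progress line still shows in the console while the result line lands in the file.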
Title: Re: Prefetch stalls and queue length tests
Post by: dedndave on January 16, 2015, 08:21:58 AM
i ran that program - it runs 50% CPU for a while - then drops to 0 and seems to hang
the resulting text file is 148 kb

Address: 00402041, cycles: 041C75E9
Address: 00402042, cycles: 0421D890
Address: 00402043, cycles: 042B5EDA
Address: 00402044, cycles: 04265AC6
Address: 00402045, cycles: 043B63BE
Address: 00402046, cycles: 04210816
Address: 00402047, cycles: 04221CF9
.
.
Address: 004023FE, cycles: 0428C2DB
Address: 004023FF, cycles: 041E7964
Address: 00402400, cycles: 000F45EE
Address: 00402401, cycles: 000FBC59
.
.
Address: 00402FC9, cycles: 000F450E
Address: 00402FCA, cycles: 000FBB4C
Address: 00402FCB, cycles: 000F43FF
Address: 00402FCC, cycles: 000F44D9
Address: 00402FCD, cycles: 000FBBF8
Address: 00402FCE, cycles: 000F441E
Address: 00402FCF, cycles: 000F44BB
Address: 00402FD0, cycles: 000FEB5F
Address: 00402FD1, cycles: 000F43D2
Address: 00   <------------------------------------ hung
Title: Re: Prefetch stalls and queue length tests
Post by: Antariy on January 16, 2015, 08:24:22 AM
Quote from: dedndave on January 16, 2015, 08:13:42 AM
that EXE runs forever - lol

No, it just runs for a long time :biggrin: Just wait a couple of minutes (5 minutes is the maximum for your CPU, I think), if that's not too boring.
Could you change the TEST_COUNT variable to a smaller value, for instance 10 times smaller than it is now, and rebuild the program? Then the test will run faster. Don't forget to pass the /SECTION argument to the linker, as mentioned in the first post.

The point is that the code checks every byte of the page, so there are ~4000 checks and the same number of lines in the output file. It doesn't hang, it just runs slowly.
Title: Re: Prefetch stalls and queue length tests
Post by: dedndave on January 16, 2015, 08:26:08 AM
ok - started at 2:25.....
Title: Re: Prefetch stalls and queue length tests
Post by: Antariy on January 16, 2015, 08:29:47 AM
Quote from: hutch-- on January 16, 2015, 08:16:08 AM
Alex,


Test with code self-patch the page where it runs
Press a key after number appears
cycles for SEC1: 116407773
Test with the code from the same page but patching the data in the other section
Press a key after number appears
cycles for SEC1: 114670101
Test with call to the external page which patches the other page with the code
Press a key after number appears
cycles for SEC2: 115656843
Test with call to the external page which patches the data from the other page
Press a key after number appears to EXIT
cycles for SEC2: 115897284


Hint: when redirecting output, use STDERR for messages that should be seen on the screen while the program is running. StdOut is what gets redirected to the file.

Interesting - your CPU doesn't stall at all after patching a location near the running code. Maybe that's an advance in the CPU logic, which clearly determines that the patched location is not part of the code.

The program doesn't print anything except the results dump, so there was nothing to report on screen, but I should have mentioned that the code runs for quite a long time and might seem hung.
Title: Re: Prefetch stalls and queue length tests
Post by: Antariy on January 16, 2015, 08:35:37 AM
Quote from: dedndave on January 16, 2015, 08:21:58 AM
i ran that program - it runs 50% CPU for a while - then drops to 0 and seems to hang
the resulting text file is 148 kb

Address: 00   <------------------------------------ hung

Oh, it looks like part of the stdout buffer was not flushed to the file. That may be an MSVCRT problem, or maybe you terminated the program with Ctrl+C/Break? After the CPU usage drops to zero, just press any key in the console window; the program will then finish and MSVCRT will flush the buffer to disk properly.
I should have mentioned that in the first post ::) Or maybe I should remove the wait for a key and just make a beep, and maybe also take the loop count from user input to adapt to older CPUs. Well, the first release was a "beta" :greensml: Joking aside, it works, it just lacks some things needed to be comfortable to use.
Title: Re: Prefetch stalls and queue length tests
Post by: dedndave on January 16, 2015, 08:38:55 AM
invoke crt__getch           ;<------- it's waiting for a keypress - lol
invoke crt_exit,edi


Title: Re: Prefetch stalls and queue length tests
Post by: Antariy on January 16, 2015, 08:39:42 AM
Dave, you've done it - even with the buffer not fully flushed, the results are there:


Address: 004023FE, cycles: 0428C2DB
Address: 004023FF, cycles: 041E7964
Address: 00402400, cycles: 000F45EE
Address: 00402401, cycles: 000FBC59


So with this method your CPU shows the same queue length as mine. On modern CPUs the method will probably not work, because the CPU detects that the changed part isn't code, but it's still interesting to see.
Title: Re: Prefetch stalls and queue length tests
Post by: dedndave on January 16, 2015, 08:44:32 AM
(http://markosun.files.wordpress.com/2010/09/christopherwalken.png)

you're supposed to press a key, Chris !!!!
Title: Re: Prefetch stalls and queue length tests
Post by: Antariy on January 16, 2015, 08:46:40 AM
Quote from: dedndave on January 16, 2015, 08:38:55 AM
invoke crt__getch           ;<------- it's waiting for a keypress - lol
invoke crt_exit,edi


That's why I said in the main post that "the code is self-explanatory" :biggrin: Sorry for not saying it explicitly in the post.
So the first time you ended it with Ctrl+C? Then the buffer simply wasn't flushed to the file.

So a note to Steve (FORTRANS): if you run the tests, the second one may take some time. You may rebuild the app, as described in the main post, with a smaller loop count value so that it runs faster. I just selected values that are small enough but still give stable timings on my CPU, and I don't know which value will be better for your CPU(s).
Title: Re: Prefetch stalls and queue length tests
Post by: Antariy on January 16, 2015, 08:53:14 AM
Quote from: dedndave on January 16, 2015, 08:44:32 AM
you're supposed to press a key, Chris !!!!

LOOOL Dave! :greensml:
Title: Re: Prefetch stalls and queue length tests
Post by: FORTRANS on January 16, 2015, 09:19:54 AM
Hi Alex,

   Things were set up better than I thought.  So it took less time
than I thought.  Tried to run on four systems.  Did not work well
on Windows 8.1.  Either Windows UAC or the antivirus hated it.
Cut and paste results from run.bat in the error message boxes.
Ended up locking up the Command Prompt once.  And the second
program did not seem to run.

P-III, Windows 2000

Test with code self-patch the page where it runs
Press a key after number appears
cycles for SEC1: 222648314


Test with the code from the same page but patching the data in the other section
Press a key after number appears
cycles for SEC1: 124562973
Test with call to the external page which patches the other page with the code
Press a key after number appears
cycles for SEC2: 232035935
Test with call to the external page which patches the data from the other page
Press a key after number appears to EXIT
cycles for SEC2: 124891490

P-MMX, Windows 98

Test with code self-patch the page where it runs
Press a key after number appears
cycles for SEC1: 309385257

Test with the code from the same page but patching the data in the other section

Press a key after number appears
cycles for SEC1: 116441193

Test with call to the external page which patches the other page with the code
Press a key after number appears
cycles for SEC2: 242751843

Test with call to the external page which patches the data from the other page
Press a key after number appears to EXIT
cycles for SEC2: 120369630

Pentium(R) M, Windows XP

Test with code self-patch the page where it runs
Press a key after number appears
cycles for SEC1: 239035281

Test with the code from the same page but patching the data in the other section

Press a key after number appears
cycles for SEC1: 129453196

Test with call to the external page which patches the other page with the code
Press a key after number appears
cycles for SEC2: 226025484

Test with call to the external page which patches the data from the other page
Press a key after number appears to EXIT
cycles for SEC2: 128404990

i3, Windows 8.1 (Results may be bad)

Test with code self-patch the page where it runs
Press a key after number appears
cycles for SEC1: 157071075

Test with the code from the same page but patching the data in the other section
Press a key after number appears
cycles for SEC1: 138518480

Test with call to the external page which patches the other page with the code
Press a key after number appears
cycles for SEC2: 140490952

Test with call to the external page which patches the data from the other page
Press a key after number appears to EXIT
cycles for SEC2: 138764775


HTH,

Steve N.

Edit:  Fixed label for P-MMX to Windows 98.

SRN
Title: Re: Prefetch stalls and queue length tests
Post by: jj2007 on January 16, 2015, 04:59:10 PM
Tests on i5:
cycles for SEC1: 140379273
cycles for SEC1: 109848725
cycles for SEC2: 138419402
cycles for SEC2: 134287359

Thanks to Dave for revealing the secret of the hanging application :bgrin:
Title: Re: Prefetch stalls and queue length tests
Post by: Antariy on January 17, 2015, 06:54:16 AM
Hi Steve!

Quote from: FORTRANS on January 16, 2015, 09:19:54 AM
   Things were set up better than I thought.  So it took less time
than I thought.  Tried to run on four systems.  Did not work well
on Windows 8.1.  Either Windows UAC or the antivirus hated it.
Cut and paste results from run.bat in the error message boxes.
Ended up locking up the Command Prompt once.  And the second
program did not seem to run.

Thank you for the comprehensive tests :t As usual, it is interesting how differently the CPUs behave.

PIII is obviously affected by a patch near the running code, but the second program did not determine the queue depth in any test, so either the queue is too short or, which is probably the case, the CPU just knows that the patched part isn't the running code.
PMMX (you mean Win98 there?) is also affected by a patch near the running code, and the stall is higher than on the PIII, but still only ~3 cycles per patched dword. This is probably due to a short prefetch rather than to advanced prefetching/decoding.
Pentium M is affected too, and both it and the PIII are affected whenever the executable page is changed, even when the patching code is not near the patched place. So the stalls are probably mostly caused by the CPU checking what was actually patched, not by a prefetcher refill. This model of Pentium M based on PIII code?
i3 is affected too, but only very slightly; probably the more or less modern CPUs just know that the patched place isn't the running code.

As a side note: it seems that the PIV, particularly the Prescotts, are the slowest CPUs here, with TOOOOO deep pipelines, and the prefetch logic is really crude for such long pipelines. The CPU doesn't actually know what was patched; it just brutally refills the queue based on a fixed-length check, and the refill is very slow - more than 50 cycles per patch.
The second app probably also needs a bit of a rewrite, so that it patches the code that is actually executing; that would show the stalls more precisely on every CPU.
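
The per-CPU comparison above can be summarized with two ratios per machine. The snippet below is an illustrative Python sketch only; the names `RESULTS` and `ratios` are made up for the example, and the cycle counts are copied from the posts in this thread (hutch-- did not state his CPU model).

```python
# (test 1, test 2, test 3, test 4) cycle counts as posted in this thread.
RESULTS = {
    "Celeron D310 (Antariy)":   (5922186048, 104518976, 105681304, 104268832),
    "P4 Prescott (dedndave)":   (6016809188, 101633866, 101628263, 101085968),
    "CPU not stated (hutch--)": (116407773, 114670101, 115656843, 115897284),
    "P-III (FORTRANS)":         (222648314, 124562973, 232035935, 124891490),
    "P-MMX (FORTRANS)":         (309385257, 116441193, 242751843, 120369630),
    "Pentium M (FORTRANS)":     (239035281, 129453196, 226025484, 128404990),
    "i3 (FORTRANS)":            (157071075, 138518480, 140490952, 138764775),
    "i5 (jj2007)":              (140379273, 109848725, 138419402, 134287359),
}

def ratios(t1, t2, t3, t4):
    """Return (near, far).

    near: self-patch next to the running code vs. the same code patching data.
    far:  code page patched from another page vs. data patched from another page.
    """
    return t1 / t2, t3 / t4

for cpu, counts in RESULTS.items():
    near, far = ratios(*counts)
    print(f"{cpu:26s} near patch x{near:6.2f}   patch from other page x{far:5.2f}")
```

A large first ratio with a second ratio near 1 matches the Celeron D and Prescott pattern; both ratios well above 1 match the PIII, P-MMX and Pentium M numbers, where patching an executable page costs something even from another page.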
Title: Re: Prefetch stalls and queue length tests
Post by: Antariy on January 17, 2015, 06:56:57 AM
Hi Jochen!

Quote from: jj2007 on January 16, 2015, 04:59:10 PM
Tests on i5:
cycles for SEC1: 140379273
cycles for SEC1: 109848725
cycles for SEC2: 138419402
cycles for SEC2: 134287359

Thank you for the test! :t

And probably no differences in timings dump with the second program, as with other modern CPUs?
Title: Re: Prefetch stalls and queue length tests
Post by: jj2007 on January 17, 2015, 08:36:14 AM
Quote from: Antariy on January 17, 2015, 06:56:57 AM
And probably no differences in timings dump with the second program, as with other modern CPUs?

Hi Alex,

Attached i5 and Celeron M results for the other program.
Title: Re: Prefetch stalls and queue length tests
Post by: FORTRANS on January 17, 2015, 09:28:07 AM
Hi Alex,

Quote from: Antariy on January 17, 2015, 06:54:16 AM
Thank you for the comprehensive tests :t As usual, it is interesting how differently the CPUs behave.

   You're welcome.  Glad you got some useful information.

Quote
PMMX (you mean Win98 there?)

   Yes.  P-MMX is running Windows 98.  Thanks for pointing out
the error.

Quote
This model of Pentium M based on PIII code?

   I believe both are family 6 processors.  Here are some trimmed
results for the first three reported above.  That program won't
run on Win 8.1 64-bit.

[Pentium III]

This processor: GenuineIntel
Processor Signature: 00000683h
Family Data: 006h
Model Data : 08h
Stepping : 3h

Maximum CPUID Standard and Extended Functions:
CPUID.(EAX=00h):EAX: 02h
CPUID.(EAX=80000000h):EAX: 03020101h

[Pentium M]

This processor: GenuineIntel
Brand String: Intel(R) Pentium(R) M processor 1.70GHz
Processor Signature: 000006D6h
Family Data: 006h
Model Data : 0Dh
Stepping : 6h

Maximum CPUID Standard and Extended Functions:
CPUID.(EAX=00h):EAX: 02h
CPUID.(EAX=80000000h):EAX: 80000004h

[Pentium MMX]

This processor: GenuineIntel Pentium(R)
Processor Signature: 00000543h
Family Data: 005h
Model Data : 04h
Stepping : 3h

Maximum CPUID Standard and Extended Functions:
CPUID.(EAX=00h):EAX: 01h
CPUID.(EAX=80000000h):EAX: 00000000h
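
The family, model and stepping fields in these dumps can be read directly from the signature. The sketch below is an illustration in Python; the function name `decode_signature` is made up, and it assumes the "Processor Signature" values above are the EAX value returned by CPUID function 01h, decoded with the usual Intel bit layout.

```python
def decode_signature(eax):
    """Split a CPUID.(EAX=01h):EAX value into (family, model, stepping)."""
    stepping = eax & 0xF              # bits 3:0
    base_model = (eax >> 4) & 0xF     # bits 7:4
    base_family = (eax >> 8) & 0xF    # bits 11:8
    ext_model = (eax >> 16) & 0xF     # bits 19:16
    ext_family = (eax >> 20) & 0xFF   # bits 27:20

    # The extended fields only take part for family 0Fh (family) and 06h/0Fh (model).
    family = base_family + ext_family if base_family == 0xF else base_family
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return family, model, stepping

for name, signature in (("Pentium III", 0x00000683),
                        ("Pentium M", 0x000006D6),
                        ("Pentium MMX", 0x00000543)):
    family, model, stepping = decode_signature(signature)
    print(f"{name:12s} family {family:X}h  model {model:02X}h  stepping {stepping:X}h")
```

Both the Pentium III and the Pentium M decode to family 6, which agrees with the dumps above.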


Regards,

Steve N.
Title: Re: Prefetch stalls and queue length tests
Post by: Antariy on January 18, 2015, 07:18:56 AM
Quote from: jj2007 on January 17, 2015, 08:36:14 AM
Quote from: Antariy on January 17, 2015, 06:56:57 AM
And probably no differences in timings dump with the second program, as with other modern CPUs?
Attached i5 and Celeron M results for the other program.

Thank you, Jochen! :t Yes, the second program shows no difference in timings depending on the place of the patch. The Celeron M, though, seems to dislike code patching a bit, regardless of where the patch is.
Title: Re: Prefetch stalls and queue length tests
Post by: Antariy on January 18, 2015, 07:27:48 AM
Hi Steve!

Quote from: FORTRANS on January 17, 2015, 09:28:07 AM
Quote
This model of Pentium M based on PIII code?

   I believe both are family 6 processors.  Here are some trimmed
results for the first three reported above.  That program won't
run on Win 8.1 64-bit.

I mistyped the word "code" - I meant PIII core, but I think you understood that correctly :t Yes, I thought this Pentium M model is based on the PIII core simply because the two CPUs behave very similarly. The Pentium M (and Celeron M) were "derivatives" of the desktop cores (of different families), further optimized for performance and power saving, and this similarity gave a hint about the core.