The MASM Forum

General => The Soap Box => Topic started by: Raistlin on September 05, 2017, 06:49:46 PM

Title: Cache again....
Post by: Raistlin on September 05, 2017, 06:49:46 PM
Hiya,

Does anyone know how to interrogate the CPU cache (via control registers or some other mechanism) or the OS scheduler
to get the cache hit vs. miss ratios? Sorry if this has been asked and answered before.

Thanks
Raistlin
Title: Re: Cache again....
Post by: hutch-- on September 05, 2017, 07:15:43 PM
I could be wrong, but I don't think there is such an animal. I think you have to do this the hard way: design the code according to the best theory you have, then put it to the only test that matters, how FAST it is.
Title: Re: Cache again....
Post by: Raistlin on September 05, 2017, 07:24:22 PM
@Hutch, agreed, that would be the intent - however, we know profiler applications exist that show cache miss/hit ratios for an application (re: Intel, AMD, Microsoft),
so arguably you must be able to read the cache hit vs. miss data programmatically? We also know there are APIs to read, for example, process page faults. Thus my
investigation. Sorry if my logic is flawed and I'm blowing bubbles.
Title: Re: Cache again....
Post by: jj2007 on September 05, 2017, 07:31:42 PM
Quote
there's API's to read for example process page faults.

There is, there is:

include \masm32\MasmBasic\MasmBasic.inc         ; download (http://masm32.com/board/index.php?topic=94.0)
  Init
  Inkey "PageFaultCount: 0x", Hex$(MemState(PageFaultCount))
EndOfCode


Output: PageFaultCount: 0x00000572 (note the process hasn't done anything useful so far...)

Google for PROCESS_MEMORY_COUNTERS. Or check this (https://stackoverflow.com/questions/5447907/programmatically-counting-cache-faults):
Quote
Most recent CPUs (both AMD and Intel) have performance monitor registers that can be used for this kind of job. For Intel, they're covered in the programmer's reference manual, volume 3B, chapter 30. For AMD, it's in the BIOS and Kernel Developer's Guide.
Title: Re: Cache again....
Post by: Raistlin on September 05, 2017, 08:08:22 PM
@JJ2007 - Thanks man !

OK.....

Seems my investigations should move to the Model Specific Registers - re: RDMSR and WRMSR - let me read and try to make sense of it

EDIT :-> Correction - this is EXACTLY what I was looking for <-

Applies to INTEL
--------------------
 
http://datasheets.chipdb.org/Intel/x86/Pentium/Embedded%20Pentium%AE%20Processor/MDELREGS.PDF



Applies to AMD
---------------------

https://support.amd.com/TechDocs/25481.pdf
Title: Re: Cache again....
Post by: aw27 on September 05, 2017, 08:55:24 PM
Quote
RDMSR and WRMSR - let me read and try to make sense of it

They are privileged instructions: they only execute at ring 0, so user-mode code under Windows cannot use them.
Title: Re: Cache again....
Post by: Raistlin on September 05, 2017, 09:39:04 PM
Damnit all to hell - you are right aw27 - I need a kernel driver - damnit again
Title: Re: Cache again....
Post by: jj2007 on September 05, 2017, 10:00:46 PM
What really surprises me is that apparently nobody has written a little tool that does the job. Btw if that task is delegated to a driver, you'll get the cache misses & hits of the driver, I suppose... pretty useless :(
Title: Re: Cache again....
Post by: Raistlin on September 05, 2017, 10:39:35 PM
@jj2007 - as I understand it - yes and no - you should get the performance counters for the entire mixed workload over a set number of clock ticks.
Together with the API for page faults this would be useful, I guess. Interesting prospect though.

Could it be as simple as loading the driver - taking the system pulse - executing the app - taking the system pulse again - and then looking at the difference?

The application profilers I spoke of previously use mechanisms like "pseudo external debugger", "data-access stride checks"
or "simulated cache - re: testing of MSR registers" to make determinations (function/probabilistic statistical observations) for a specific application.
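
The "take the pulse before and after, then look at the difference" idea can be sketched in C roughly as follows. This is only a sketch under stated assumptions: `pulse_t`, `pulse_delta` and `miss_ratio` are made-up names, and how the raw counter values are actually read (driver, profiler API or otherwise) is deliberately left out. As noted above, the difference covers the whole mixed workload during the interval, not just one application.

```c
#include <stdint.h>

/* Hypothetical snapshot of two free-running event counters. How the raw
   values are obtained is outside the scope of this sketch. */
typedef struct {
    uint64_t cache_refs;    /* cache requests counted so far */
    uint64_t cache_misses;  /* cache misses counted so far   */
} pulse_t;

/* Events that occurred between two snapshots. Unsigned subtraction stays
   correct even if a full 64-bit counter wrapped once in between. */
static pulse_t pulse_delta(pulse_t before, pulse_t after)
{
    pulse_t d;
    d.cache_refs   = after.cache_refs   - before.cache_refs;
    d.cache_misses = after.cache_misses - before.cache_misses;
    return d;
}

/* Miss ratio over the measured interval, or 0.0 if nothing was counted. */
static double miss_ratio(pulse_t d)
{
    if (d.cache_refs == 0)
        return 0.0;
    return (double)d.cache_misses / (double)d.cache_refs;
}
```

Usage would be: take one snapshot, run the application, take a second snapshot, then call `miss_ratio(pulse_delta(before, after))`. Counters narrower than 64 bits would need their deltas masked to the counter width, which this sketch does not do.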
Title: Re: Cache again....
Post by: hutch-- on September 06, 2017, 12:00:30 AM
Sad to say, you will always have the problem of OS-specific (privileged) mnemonics being ring0 only, which means you cannot use them from ring3. The lack of access is also a security feature to prevent very dangerous hacks, so I think you are stuck with indirect derivations to get data like cache hits and misses. I don't particularly bother with the OS-specific instructions as I am not interested in drivers, but it may be worth having a crawl around the Intel manuals to see if you can derive any data that is useful to you. CPUID is a strange mnemonic, it's really badly written, like the very old hardware guys used to write, but it and a number of others like "rdtsc" and "pause" may shine a light on what you are after.
Title: Re: Cache again....
Post by: aw27 on September 06, 2017, 03:17:46 AM
@raistlin
This is source code sponsored by Intel:
https://github.com/opcm/pcm
It may be an answer to what you want.  :idea:
Title: Re: Cache again....
Post by: nidud on September 06, 2017, 04:13:02 AM
Here's some JOB stuff.

;
; asmc -pe -D_WIN64 -D__PE__ $*.asm
;
include stdio.inc
include stdlib.inc
include winnt.inc
include winbase.inc

.code

main proc

    local ReturnLength:dword
    ; https://msdn.microsoft.com/en-us/library/windows/desktop/ms684143(v=vs.85).aspx
    local JobObjectInfo:JOBOBJECT_BASIC_ACCOUNTING_INFORMATION

    ; QueryInformationJobObject: https://msdn.microsoft.com/en-us/library/windows/desktop/ms684925(v=vs.85).aspx
    ; JobObjectBasicAccountingInformation: https://github.com/nidud/asmc/blob/master/include/winnt.inc#L6202
    ; A NULL job handle queries the job associated with the calling process.
    .if QueryInformationJobObject(
            NULL,
            JobObjectBasicAccountingInformation,
            &JobObjectInfo,
            sizeof(JobObjectInfo),
            &ReturnLength)
            printf(
                "TotalUserTime:             %lld\n"
                "TotalKernelTime:           %lld\n"
                "ThisPeriodTotalUserTime:   %lld\n"
                "ThisPeriodTotalKernelTime: %lld\n"
                "TotalPageFaultCount:       %d\n"
                "TotalProcesses:            %d\n"
                "ActiveProcesses:           %d\n"
                "TotalTerminatedProcesses:  %d\n",
                JobObjectInfo.TotalUserTime.QuadPart,
                JobObjectInfo.TotalKernelTime.QuadPart,
                JobObjectInfo.ThisPeriodTotalUserTime.QuadPart,
                JobObjectInfo.ThisPeriodTotalKernelTime.QuadPart,
                JobObjectInfo.TotalPageFaultCount,
                JobObjectInfo.TotalProcesses,
                JobObjectInfo.ActiveProcesses,
                JobObjectInfo.TotalTerminatedProcesses)
    .endif
    exit(0)

main endp

    end main


GetLogicalProcessorInformation() sample to get some cache information (https://github.com/nidud/asmc/blob/master/include/winnt.inc#L6249): https://msdn.microsoft.com/en-us/library/windows/desktop/ms683194(v=vs.85).aspx

Code: [Select]
;
; asmc -pe -D_WIN64 -D__PE__ -Zp8 $*.asm
;
include windows.inc
include stdio.inc
include alloc.inc

LPFN_GLPI_T typedef proto WINAPI :PSYSTEM_LOGICAL_PROCESSOR_INFORMATION, :LPDWORD
LPFN_GLPI   typedef ptr LPFN_GLPI_T

.code

;; Helper function to count set bits in the processor mask.
CountSetBits proc bitMask:ULONG_PTR

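    LSHIFT = sizeof(ULONG_PTR)*8 - 1    ; index of the top bit (63 on Win64)

    mov r8,1 shl LSHIFT                 ; r8 = test mask, starting at the top bit
    xor eax,eax                         ; eax = running count of set bits

    .for (edx = 0: edx <= LSHIFT: ++edx)

        .if (rcx & r8)                  ; rcx holds bitMask (first argument)
            inc eax
        .endif
        shr r8,1                        ; move the test mask down one bit
    .endf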
    ret

CountSetBits endp

main proc

    local glpi:LPFN_GLPI
    local done:BOOL
    local buffer:PSYSTEM_LOGICAL_PROCESSOR_INFORMATION
    local p:PSYSTEM_LOGICAL_PROCESSOR_INFORMATION
    local returnLength:DWORD
    local logicalProcessorCount:DWORD
    local numaNodeCount:DWORD
    local processorCoreCount:DWORD
    local processorL1CacheCount:DWORD
    local processorL2CacheCount:DWORD
    local processorL3CacheCount:DWORD
    local processorPackageCount:DWORD
    local byteOffset:DWORD
    local Cache:PCACHE_DESCRIPTOR
    local rc:DWORD
    local LineSize:DWORD

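    ;; Bulk-clear the locals: 14 dwords starting at byteOffset.
    lea rdi,byteOffset
    xor eax,eax
    mov ecx,14
    rep stosd
    ;; Clear done explicitly so the retry loop does not rely on the stack layout.
    mov done,FALSE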

    .if !GetProcAddress(
            GetModuleHandle("kernel32"), "GetLogicalProcessorInformation")
        printf("\nGetLogicalProcessorInformation is not supported.\n")
        exit(1)
    .endif
    mov glpi,rax

    .while (!done)

        .if !glpi(buffer, &returnLength)

            .if (GetLastError() == ERROR_INSUFFICIENT_BUFFER)

                .if (buffer)
                    free(buffer)
                .endif
                mov buffer,malloc(returnLength)

                .if (buffer == NULL)

                    printf("\nError: Allocation failure\n")
                    exit(2)
                .endif

            .else

                printf("\nError %d\n", GetLastError())
                exit(3)
            .endif

        .else

            mov done,TRUE
        .endif
    .endw

    mov rdi,buffer
    mov ebx,sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION)

    assume rdi:ptr SYSTEM_LOGICAL_PROCESSOR_INFORMATION

    .while (ebx <= returnLength)

        .switch ([rdi].Relationship)

        .case RelationNumaNode
            ;; Non-NUMA systems report a single record of this type.
            inc numaNodeCount
            .endc

        .case RelationProcessorCore
            inc processorCoreCount

            ;; A hyperthreaded core supplies more than one logical processor.
            CountSetBits([rdi].ProcessorMask)
            add logicalProcessorCount,eax
            .endc

        .case RelationCache
            ;; Cache data is in ptr->Cache, one CACHE_DESCRIPTOR structure for each cache.
            mov al,[rdi].Cache.Level
            .if (al == 1)
                inc processorL1CacheCount
            .elseif (al == 2)
                inc processorL2CacheCount
            .elseif (al == 3)
                inc processorL3CacheCount
            .endif
            movzx eax,[rdi].Cache.LineSize
            mov LineSize,eax
            .endc

        .case RelationProcessorPackage
            ;; Logical processors share a physical package.
            inc processorPackageCount
            .endc

        .default
            printf("\nError: Unsupported LOGICAL_PROCESSOR_RELATIONSHIP value.\n")
            .endc
        .endsw
        add ebx,sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION)
        add rdi,sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION)
    .endw

    printf("\nGetLogicalProcessorInformation results:\n")
    printf("Number of NUMA nodes: %d\n", numaNodeCount)
    printf("Number of physical processor packages: %d\n",
             processorPackageCount)
    printf("Number of processor cores: %d\n",
             processorCoreCount)
    printf("Number of logical processors: %d\n",
             logicalProcessorCount)
    printf("Number of processor L1/L2/L3 caches: %d/%d/%d\n",
             processorL1CacheCount,
             processorL2CacheCount,
             processorL3CacheCount)
    printf("Cache LineSize: %d\n", LineSize)
    free(buffer)
    exit(0)

main endp

    end main
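
For readers who do not use asmc, the CountSetBits helper in the sample above corresponds roughly to the C below. It is a sketch only: the name `count_set_bits` is made up here, and a 64-bit mask is assumed (ULONG_PTR on Win64).

```c
#include <stdint.h>

/* Count the 1 bits in a processor mask by testing each bit position,
   from the top bit down, the same way the asmc helper does. */
static unsigned count_set_bits(uint64_t mask)
{
    unsigned count = 0;
    uint64_t test = (uint64_t)1 << 63;   /* start at the top bit */

    while (test != 0) {
        if (mask & test)
            count++;
        test >>= 1;                      /* move to the next lower bit */
    }
    return count;
}
```

In the sample this count is taken for each RelationProcessorCore record, so a hyperthreaded core with two bits set in its mask adds two logical processors to the total.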
Title: Re: Cache again....
Post by: Siekmanski on September 06, 2017, 05:44:16 AM
It can be done, just found this tool on the net:

Perfmonitor 2 Processor performance and monitoring tool. https://www.cpuid.com/softwares/perfmonitor-2.html

Code: [Select]
Caches request rate and hit ratio

The cache requests rate is the ratio between the number of requests to that cache and the total number of instructions. The cache hit ratio is the ratio between the number of requests to the cache that resulted in a success (the required data was found in the cache) and the total number of requests to the cache.

Branch Instructions rate and branch hit ratio

Branch instructions rate is the ratio between the number of branch instructions (x86 jz/jnz/jg …) and the total number of instructions. The hit ratio reflects the performance of the branch prediction mechanism.

IPC/CPI

IPC stands for Instructions per clock, and refers to the ratio between the number of instructions retired and the total number of cycles, in other words the average number of instructions retired at every clock cycle. CPI, or Cycles per Instruction, is the invert of IPC.

MIPS/GIPS

Million instructions Per Second (MIPS) and Giga (billion) Instructions Per Second (GIPS) reflect the rate at which a CPU executes instructions.

Stalled cycles ratio

The stalled cycles refer to the clock cycles where no instruction was retired from the CPU pipeline. The ratio between the stalled cycles and the total cycles provides the stalled cycles ratio.

Unhalted clock cycles

The unhalted clock cycles count the cycles when the CPU is not in the halted state. When the processor is not in activity, it spends a lot of time in HLT state, and the unhalted clock cycles are very low. When the processor is at 100% load, the unhalted clock cycles show the current frequency, unless the CPU is throttling.

Usage

CPU usage reflects the CPU activity, as reported by Windows task manager. It can be assimilated to the ratio between unhalted clock cycles and the maximum processor frequency.
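
The ratios described above are simple divisions once the raw event counts are available. A minimal C sketch, assuming the counts have already been obtained somehow (the function names are made up for illustration and are not part of Perfmonitor or any other tool):

```c
#include <stdint.h>

/* num/den as a double, or 0.0 when the denominator is zero. */
static double ratio(uint64_t num, uint64_t den)
{
    return den ? (double)num / (double)den : 0.0;
}

/* Cache request rate: requests to the cache / total instructions. */
static double cache_request_rate(uint64_t requests, uint64_t instructions)
{
    return ratio(requests, instructions);
}

/* Cache hit ratio: successful requests / total requests to the cache. */
static double cache_hit_ratio(uint64_t hits, uint64_t requests)
{
    return ratio(hits, requests);
}

/* IPC: instructions retired / total cycles. */
static double ipc(uint64_t instructions, uint64_t cycles)
{
    return ratio(instructions, cycles);
}

/* CPI: total cycles / instructions retired (the inverse of IPC). */
static double cpi(uint64_t instructions, uint64_t cycles)
{
    return ratio(cycles, instructions);
}

/* Stalled cycles ratio: cycles with no instruction retired / total cycles. */
static double stalled_cycles_ratio(uint64_t stalled, uint64_t cycles)
{
    return ratio(stalled, cycles);
}
```

For example, 90 hits out of 120 requests give a hit ratio of 0.75, and 3000 instructions retired in 2000 cycles give an IPC of 1.5.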
Title: Re: Cache again....
Post by: jj2007 on September 06, 2017, 08:28:07 AM
Quote
It can be done, just found this tool on the net:

Perfmonitor 2 Processor performance and monitoring tool. https://www.cpuid.com/softwares/perfmonitor-2.html

Looks very interesting, thanks. No pricing info, so let's assume it is freeware...
Title: Re: Cache again....
Post by: Siekmanski on September 06, 2017, 01:52:30 PM
Intel® VTune™ Amplifier 2017 $899  :biggrin:
https://software.intel.com/en-us/intel-vtune-amplifier-xe/try-buy#buynow
Title: Re: Cache again....
Post by: Raistlin on September 06, 2017, 03:26:29 PM
@jj2007 = thanks again, you rock (re: Intel, OpenLibSys.org source)

@Nidud = you nailed the OS scheduler bit - thanks! - I'm unsure, however, what they mean by page fault...
            - it seems Microsoft uses the term loosely to describe issues with virtual memory, re: hard/soft/cache faults combined?
            - But I could be wrong, will investigate.

@Everyone_else = ...or we could just write a kernel driver and get it over with? I found the kernel driver tutorials on the old masmforum site,
                            hope they are still relevant. Supposedly there's a trick to include the driver inside your exe, thereby not having to install it.
                            CrystalCpuID and others use this trick. Just for interest's sake -> licensing ($$$) is required for all the other suggestions. Besides,
                            it goes against my moral fiber to pay for something I can do myself. Then there's also religion: we can write it better, faster, smoother
                            and more flexible than anyone else using assembly :bgrin:
Title: Re: Cache again....
Post by: aw27 on September 06, 2017, 06:05:24 PM
Quote
Supposedly there's a trick to include the driver inside your exe

That is not the most difficult part of the project.  :biggrin:
It is just placed inside the file as a resource, or as an included file, then extracted and placed where needed. Many applications do it, notably well-known Sysinternals applications like DbgView.exe.
Title: Re: Cache again....
Post by: Raistlin on September 06, 2017, 06:17:31 PM
Quote
That is not the most difficult part of the project.  :biggrin:

per AW27 = I'm getting the feeling I don't know what I'm letting myself in for.

But then again, it wouldn't be the first time... you have to be a little crazy to undertake large
projects when it's unclear whether the outcome will make a difference. The belief is - it's worth it to find out.
Title: Re: Cache again....
Post by: jj2007 on September 06, 2017, 06:20:36 PM
Quote
or we could just write a kernel driver

I vaguely remember that recent Windows versions install only signed drivers; where "signed" translates to a lot of money :(

Kernel-Mode Code Signing Requirements (https://docs.microsoft.com/en-us/windows-hardware/drivers/install/kernel-mode-code-signing-requirements--windows-vista-and-later-)

For developers:
Quote
A kernel-mode driver that is not a boot-start driver must have either a test-signed catalog file (https://docs.microsoft.com/en-us/windows-hardware/drivers/install/catalog-files) or the driver file must include an embedded test signature. This applies to any type of PnP or non-PnP kernel-mode driver.
Title: Re: Cache again....
Post by: aw27 on September 06, 2017, 06:28:01 PM
Quote
or we could just write a kernel driver

I vaguely remember that recent Windows versions install only signed drivers; where "signed" translates to a lot of money :(

Kernel-Mode Code Signing Requirements (https://docs.microsoft.com/en-us/windows-hardware/drivers/install/kernel-mode-code-signing-requirements--windows-vista-and-later-)

Not just signed drivers: Class 3 signed drivers, and only from a restricted number of authorities accredited by Microsoft.
I actually have one of those certificates, but did not pay a lot of money because there are opportunities for big discounts.
Title: Re: Cache again....
Post by: Raistlin on September 07, 2017, 03:46:42 PM
After doing some serious reading - I'm abandoning the kernel mode driver idea  :(
It's just too convoluted and takes the fun right out of it. What a bunch of B/S just
to get to one privileged performance-monitoring instruction. What a shame....
For those interested in the number of hoops you need to jump through, here's a taste:
http://www.davidegrayson.com/signing/

Any ideas regarding an API that exposes the performance metrics - documented or not?
The following looks promising: https://msdn.microsoft.com/en-us/library/windows/desktop/ms724509%28v=vs.85%29.aspx?f=255&MSPPError=-2147217396
Title: Re: Cache again....
Post by: aw27 on September 07, 2017, 04:44:30 PM
Quote
http://www.davidegrayson.com/signing/

:biggrin:

Well, the article about driver signing contains lots of bullshit, and Windows 7 actually supports SHA-2 signing.

Quote
Any idea in regards to an API that exposes the performance metrics - documented or not ?

NtQuerySystemInformation has many undocumented information classes, but you can find some details on the net, for example here (https://www.geoffchappell.com/studies/windows/km/ntoskrnl/api/ex/sysinfo/secureboot.htm).


Title: Re: Cache again....
Post by: Raistlin on September 07, 2017, 04:56:39 PM
Thanks AW27 - you've been a real help :t and diplomatic in your responses