Author Topic: Cache again....  (Read 333 times)

Raistlin

  • Member
  • **
  • Posts: 238
Cache again....
« on: September 05, 2017, 06:49:46 PM »
Hiya,

Does anyone know how to interrogate the CPU cache (via control registers or something else) or the OS scheduler to
reveal the specific cache hit vs. miss ratios? Sorry if this has been asked and answered before.

Thanks
Raistlin

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 4809
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Re: Cache again....
« Reply #1 on: September 05, 2017, 07:15:43 PM »
I could be wrong, but I don't think there is such an animal. I think you have to do this the hard way: design the code according to the best theory you have, then put it to the only test that matters - how FAST it is.
hutch at movsd dot com
http://www.masm32.com    :biggrin:  :biggrin:

Raistlin

  • Member
  • **
  • Posts: 238
Re: Cache again....
« Reply #2 on: September 05, 2017, 07:24:22 PM »
@Hutch, agreed, that would be the intent - however, we know profiler applications exist that show cache miss/hit ratios for an application (re: Intel, AMD, Microsoft),
so arguably it must be possible to read the cache hit vs. miss data programmatically. Also, we know there are APIs to read, for example, a process's page faults. Hence my
investigation. Sorry if my logic is flawed and I'm blowing bubbles.

jj2007

  • Member
  • *****
  • Posts: 7543
  • Assembler is fun ;-)
    • MasmBasic
Re: Cache again....
« Reply #3 on: September 05, 2017, 07:31:42 PM »
Quote
there are APIs to read, for example, a process's page faults.

There is, there is:

include \masm32\MasmBasic\MasmBasic.inc         ; download
  Init
  Inkey "PageFaultCount: 0x", Hex$(MemState(PageFaultCount))
EndOfCode


Output: PageFaultCount: 0x00000572 (note the process hasn't done anything useful so far...)

Google for PROCESS_MEMORY_COUNTERS. Or check this:
Quote
Most recent CPUs (both AMD and Intel) have performance monitor registers that can be used for this kind of job. For Intel, they're covered in the programmer's reference manual, volume 3B, chapter 30. For AMD, it's in the BIOS and Kernel Developer's Guide.
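
And for the page-fault side without MasmBasic, the plain Win32 route goes through GetProcessMemoryInfo and that PROCESS_MEMORY_COUNTERS structure. Rough, untested sketch - it assumes the SDK's psapi.inc/psapi.lib are present and that PROCESS_MEMORY_COUNTERS is declared in the standard includes:

Code:
include \masm32\include\masm32rt.inc
include \masm32\include\psapi.inc          ; GetProcessMemoryInfo lives in psapi
includelib \masm32\lib\psapi.lib

.data?
  pmc PROCESS_MEMORY_COUNTERS <>

.code
start:
    mov pmc.cb, sizeof PROCESS_MEMORY_COUNTERS   ; the struct carries its own size
    invoke GetCurrentProcess
    invoke GetProcessMemoryInfo, eax, addr pmc, sizeof PROCESS_MEMORY_COUNTERS
    .if eax
        print chr$("PageFaultCount: ")
        print str$(pmc.PageFaultCount), 13, 10
    .endif
    exit
end start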

Raistlin

  • Member
  • **
  • Posts: 238
Re: Cache again....
« Reply #4 on: September 05, 2017, 08:08:22 PM »
@JJ2007 - Thanks man !

OK.....

Seems my investigations should move to the Model Specific Registers - re: RDMSR and WRMSR - let me read and try to make sense of it

EDIT :-> Correction - this is EXACTLY what I was looking for <-

Applies to INTEL
----------------
http://datasheets.chipdb.org/Intel/x86/Pentium/Embedded%20Pentium%AE%20Processor/MDELREGS.PDF

Applies to AMD
--------------
https://support.amd.com/TechDocs/25481.pdf

aw27

  • Member
  • ****
  • Posts: 699
Re: Cache again....
« Reply #5 on: September 05, 2017, 08:55:24 PM »
Quote
RDMSR and WRMSR - let me read and try to make sense of it

They are privileged instructions in Windows.

Raistlin

  • Member
  • **
  • Posts: 238
Re: Cache again....
« Reply #6 on: September 05, 2017, 09:39:04 PM »
Damnit all to hell - you are right aw27 - I need a kernel driver - damnit again

jj2007

  • Member
  • *****
  • Posts: 7543
  • Assembler is fun ;-)
    • MasmBasic
Re: Cache again....
« Reply #7 on: September 05, 2017, 10:00:46 PM »
What really surprises me is that apparently nobody has written a little tool that does the job. Btw if that task is delegated to a driver, you'll get the cache misses & hits of the driver, I suppose... pretty useless :(

Raistlin

  • Member
  • **
  • Posts: 238
Re: Cache again....
« Reply #8 on: September 05, 2017, 10:39:35 PM »
@jj2007 - as I understand it, yes and no - you should get the performance counters for the entire mixed workload over a set number of clock ticks.
Together with the API for page faults this would be useful, I guess. Interesting prospect though.

Could be as simple as loading the driver, taking the system pulse, executing the app, checking the system pulse, and then looking at the difference?

The application profilers I spoke of previously use mechanisms like a "pseudo external debugger", "data-access stride checks"
or a "simulated cache - re: testing of MSR registers" to make their determinations (probabilistic/statistical observations) for a specific application.

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 4809
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Re: Cache again....
« Reply #9 on: September 06, 2017, 12:00:30 AM »
Sad to say, you will always have the problem of OS purpose-specific mnemonics being ring-0 only, which means you cannot use them from ring 3. Lack of access is also a security feature to prevent very dangerous hacks, so I think you are stuck with indirect derivations to get data like cache hits and misses. I don't particularly bother with the OS-specific instructions as I am not interested in drivers, but it may be worth having a crawl around the Intel manuals to see if you can derive any data that is useful to you. CPUID is a strange mnemonic - it's really badly written, like the very old hardware guys used to write - but it, and a number of others like "rdtsc" and "pause", may shine a light on what you are after.
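
If all you need is relative timings from ring 3, the old cpuid/rdtsc sandwich is still worth something. A minimal sketch (rdtsc counts time-stamp ticks, which on recent CPUs run at a constant rate rather than the core clock, and it measures everything that happens on that core, so average over many runs):

Code:
    xor eax, eax
    cpuid                   ; serialising instruction - stop earlier work leaking into the timed region
    rdtsc                   ; EDX:EAX = time-stamp counter
    mov esi, eax            ; keep the low dword (fine for short runs; cpuid does not touch esi)

    ; ... code under test ...

    xor eax, eax
    cpuid                   ; serialise again so the tested code has retired
    rdtsc
    sub eax, esi            ; EAX = elapsed ticks (includes the overhead of the second cpuid)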
hutch at movsd dot com
http://www.masm32.com    :biggrin:  :biggrin:

aw27

  • Member
  • ****
  • Posts: 699
Re: Cache again....
« Reply #10 on: September 06, 2017, 03:17:46 AM »
@raistlin
This is source code sponsored by Intel:
https://github.com/opcm/pcm
It may be an answer to what you want.  :idea:

nidud

  • Member
  • *****
  • Posts: 1370
    • https://github.com/nidud/asmc
Re: Cache again....
« Reply #11 on: September 06, 2017, 04:13:02 AM »
Here's some JOB stuff.

;
; asmc -pe -D_WIN64 -D__PE__ $*.asm
;
include stdio.inc
include stdlib.inc
include winnt.inc
include winbase.inc

.code

main proc

    local ReturnLength:dword
    local JobObjectInfo:JOBOBJECT_BASIC_ACCOUNTING_INFORMATION

    .if QueryInformationJobObject(
            NULL,
            JobObjectBasicAccountingInformation,
            &JobObjectInfo,
            sizeof(JobObjectInfo),
            &ReturnLength)
            printf(
                "TotalUserTime:             %lld\n"
                "TotalKernelTime:           %lld\n"
                "ThisPeriodTotalUserTime:   %lld\n"
                "ThisPeriodTotalKernelTime: %lld\n"
                "TotalPageFaultCount:       %d\n"
                "TotalProcesses:            %d\n"
                "ActiveProcesses:           %d\n"
                "TotalTerminatedProcesses:  %d\n",
                JobObjectInfo.TotalUserTime.QuadPart,
                JobObjectInfo.TotalKernelTime.QuadPart,
                JobObjectInfo.ThisPeriodTotalUserTime.QuadPart,
                JobObjectInfo.ThisPeriodTotalKernelTime.QuadPart,
                JobObjectInfo.TotalPageFaultCount,
                JobObjectInfo.TotalProcesses,
                JobObjectInfo.ActiveProcesses,
                JobObjectInfo.TotalTerminatedProcesses)
    .endif
    exit(0)

main endp

    end main


GetLogicalProcessorInformation() sample to get some cache information: https://msdn.microsoft.com/en-us/library/windows/desktop/ms683194(v=vs.85).aspx

Code:
;
; asmc -pe -D_WIN64 -D__PE__ -Zp8 $*.asm
;
include windows.inc
include stdio.inc
include alloc.inc

LPFN_GLPI_T typedef proto WINAPI :PSYSTEM_LOGICAL_PROCESSOR_INFORMATION, :LPDWORD
LPFN_GLPI   typedef ptr LPFN_GLPI_T

.code

;; Helper function to count set bits in the processor mask.
CountSetBits proc bitMask:ULONG_PTR

    LSHIFT = sizeof(ULONG_PTR)*8 - 1

    mov r8,1 shl LSHIFT
    xor eax,eax

    .for (edx = 0: edx <= LSHIFT: ++edx)

        .if (rcx & r8)
            inc eax
        .endif
        shr r8,1
    .endf
    ret

CountSetBits endp

main proc

    local glpi:LPFN_GLPI
    local done:BOOL
    local buffer:PSYSTEM_LOGICAL_PROCESSOR_INFORMATION
    local p:PSYSTEM_LOGICAL_PROCESSOR_INFORMATION
    local returnLength:DWORD
    local logicalProcessorCount:DWORD
    local numaNodeCount:DWORD
    local processorCoreCount:DWORD
    local processorL1CacheCount:DWORD
    local processorL2CacheCount:DWORD
    local processorL3CacheCount:DWORD
    local processorPackageCount:DWORD
    local byteOffset:DWORD
    local Cache:PCACHE_DESCRIPTOR
    local rc:DWORD
    local LineSize:DWORD

    lea rdi,byteOffset
    xor eax,eax
    mov ecx,14
    rep stosd

    .if !GetProcAddress(
            GetModuleHandle("kernel32"), "GetLogicalProcessorInformation")
        printf("\nGetLogicalProcessorInformation is not supported.\n")
        exit(1)
    .endif
    mov glpi,rax

    .while (!done)

        .if !glpi(buffer, &returnLength)

            .if (GetLastError() == ERROR_INSUFFICIENT_BUFFER)

                .if (buffer)
                    free(buffer)
                .endif
                mov buffer,malloc(returnLength)

                .if (buffer == NULL)

                    printf("\nError: Allocation failure\n")
                    exit(2)
                .endif

            .else

                printf("\nError %d\n", GetLastError())
                exit(3)
            .endif

        .else

            mov done,TRUE
        .endif
    .endw

    mov rdi,buffer
    mov ebx,sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION)

    assume rdi:ptr SYSTEM_LOGICAL_PROCESSOR_INFORMATION

    .while (ebx <= returnLength)

        .switch ([rdi].Relationship)

        .case RelationNumaNode
            ;; Non-NUMA systems report a single record of this type.
            inc numaNodeCount
            .endc

        .case RelationProcessorCore
            inc processorCoreCount

            ;; A hyperthreaded core supplies more than one logical processor.
            CountSetBits([rdi].ProcessorMask)
            add logicalProcessorCount,eax
            .endc

        .case RelationCache
            ;; Cache data is in ptr->Cache, one CACHE_DESCRIPTOR structure for each cache.
            mov al,[rdi].Cache.Level
            .if (al == 1)
                inc processorL1CacheCount
            .elseif (al == 2)
                inc processorL2CacheCount
            .elseif (al == 3)
                inc processorL3CacheCount
            .endif
            movzx eax,[rdi].Cache.LineSize
            mov LineSize,eax
            .endc

        .case RelationProcessorPackage
            ;; Logical processors share a physical package.
            inc processorPackageCount
            .endc

        .default
            printf("\nError: Unsupported LOGICAL_PROCESSOR_RELATIONSHIP value.\n")
            .endc
        .endsw
        add ebx,sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION)
        add rdi,sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION)
    .endw

    printf("\nGetLogicalProcessorInformation results:\n")
    printf("Number of NUMA nodes: %d\n", numaNodeCount)
    printf("Number of physical processor packages: %d\n",
             processorPackageCount)
    printf("Number of processor cores: %d\n",
             processorCoreCount)
    printf("Number of logical processors: %d\n",
             logicalProcessorCount)
    printf("Number of processor L1/L2/L3 caches: %d/%d/%d\n",
             processorL1CacheCount,
             processorL2CacheCount,
             processorL3CacheCount)
    printf("Cache LineSize: %d\n", LineSize)
    free(buffer)
    exit(0)

main endp

    end main

Siekmanski

  • Member
  • *****
  • Posts: 1089
Re: Cache again....
« Reply #12 on: September 06, 2017, 05:44:16 AM »
It can be done, just found this tool on the net:

Perfmonitor 2 Processor performance and monitoring tool. https://www.cpuid.com/softwares/perfmonitor-2.html

Code:
Caches request rate and hit ratio

The cache requests rate is the ratio between the number of requests to that cache and the total number of instructions. The cache hit ratio is the ratio between the number of requests to the cache that resulted in a success (the required data was found in the cache) and the total number of requests to the cache.

Branch Instructions rate and branch hit ratio

Branch instructions rate is the ratio between the number of branch instructions (x86 jz/jnz/jg …) and the total number of instructions. The hit ratio reflects the performance of the branch prediction mechanism.

IPC/CPI

IPC stands for Instructions Per Clock, and refers to the ratio between the number of instructions retired and the total number of cycles - in other words, the average number of instructions retired at every clock cycle. CPI, or Cycles Per Instruction, is the inverse of IPC.

MIPS/GIPS

Million Instructions Per Second (MIPS) and Giga (billion) Instructions Per Second (GIPS) reflect the rate at which a CPU executes instructions.

Stalled cycles ratio

The stalled cycles refer to the clock cycles where no instruction was retired from the CPU pipeline. The ratio between the stalled cycles and the total cycles provides the stalled cycles ratio.

Unhalted clock cycles

The unhalted clock cycles count the cycles when the CPU is not in the halted state. When the processor is idle, it spends a lot of time in the HLT state, and the unhalted clock cycle count is very low. When the processor is at 100% load, the unhalted clock cycles reflect the current frequency, unless the CPU is throttling.

Usage

CPU usage reflects the CPU activity, as reported by the Windows task manager. It can be approximated as the ratio between the unhalted clock cycles and the maximum processor frequency.
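
To put rough numbers on the IPC/CPI entries above: a run that retires 3,000 million instructions in 2,000 million unhalted cycles has IPC = 3000/2000 = 1.5 and CPI = 1/1.5 ≈ 0.67; at a 2 GHz clock that same run works out to about 3,000 MIPS (3 GIPS).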

jj2007

  • Member
  • *****
  • Posts: 7543
  • Assembler is fun ;-)
    • MasmBasic
Re: Cache again....
« Reply #13 on: September 06, 2017, 08:28:07 AM »
Quote
It can be done, just found this tool on the net:

Perfmonitor 2 Processor performance and monitoring tool. https://www.cpuid.com/softwares/perfmonitor-2.html

Looks very interesting, thanks. No pricing info, so let's assume it is freeware...

Siekmanski

  • Member
  • *****
  • Posts: 1089
Re: Cache again....
« Reply #14 on: September 06, 2017, 01:52:30 PM »