The MASM Forum

Projects => ObjAsm => Topic started by: HSE on May 30, 2022, 12:42:59 AM

Title: How to Benchmark Code Execution Times
Post by: HSE on May 30, 2022, 12:42:59 AM
Hi all!

Following Gabriele Paoloni, 2010: How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures. (http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf (http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf)) I adapted to UEFI the measurement core:

Code: (The glass) [Select]
            mov r10, pBootServices
            invoke [r10].EFI_BOOT_SERVICES.RaiseTPL,TPL_HIGH_LEVEL
            mov Old_TPL, rax
            CPUID
            RDTSC
            mov cycles_high , rdx
            mov cycles_low  , rax
;            invoke measured_loop, j
            RDTSCP
            mov cycles_high1 , rdx
            mov cycles_low1  , rax
            CPUID
            mov r10, pBootServices
            invoke [r10].EFI_BOOT_SERVICES.RestoreTPL, Old_TPL

One Paoloni trick is to call (or not) a loop function then you can make estimations from minimum values using lineal regression:
Code: [Select]
measured_loop proc loops:qword
    local i : qword

    ForLp i, 0, loops
        function_under_glass1
    Next i
    ret
measured_loop endp

The target code for the test is that used by Gunther in Count cycles: test reports needed (http://masm32.com/board/index.php?topic=10087.0):
Code: [Select]
function_under_glass1 macro
    fld      qword ptr [numerator]
    fdiv     qword ptr [denominator]
    fdiv     qword ptr [denominator]
    fdiv     qword ptr [denominator]
    fdiv     qword ptr [denominator]
    fstp     qword ptr [result]
endm

To run the test, PC booted from a FAT32 formated USB. *.efi files must be renamed to BOOTX64.efi and  copied to /efi/boot/ directory in USB.

For first step the program is assembled with invocation to measured_loop commented (BenchmarkUEFI0.efi):
Code: [Select]
      total number of spurious min values =       240
      total variance             = 0.5871
      absolute max deviation     = 8.000
      variance of variances      = 0.0002

            lineal equation:      total cycles = 30.8239 - 0.001 * repetitions

For second step the program is assembled making the invocation to measured_loop (BenchmarkUEFI2.efi):
Code: [Select]
      total number of spurious min values =       1
      total variance             = 187.5673
      absolute max deviation     = 82.000
      variance of variances      = 25560.41

            lineal equation:      total cycles = 75.6068 + 18.0053 * repetitions

From these equations and rounding values you expect in this machine:

            31 cycles           overhead of measurement
            45 cycles           overhead of calling the loop function
            18 cycles           target code execution
            -----------           ------------------------------------------
            94 cycles           total measurement of single execution

Regards, HSE.
Title: Re: How to Benchmark Code Execution Times
Post by: Biterider on May 30, 2022, 01:51:14 AM
Hi HSE
That's a really interesting idea.  :thumbsup:
I'm sure that replacing openSUSE with UEFI is a good thing, with potentially even better results.
I'm still trying to convince VirtualBox to play along, but as soon as it works I'll check.

Biterider
Title: Re: How to Benchmark Code Execution Times
Post by: HSE on May 30, 2022, 02:57:54 AM
Hi Biterider

I'm sure that replacing openSUSE with UEFI is a good thing, with potentially even better results.
My idea was to test applications running from UEFI, then is the only option  :biggrin:

I'm still trying to convince VirtualBox to play along, but as soon as it works I'll check.
:thumbsup:
This kind of timing in VirtualBox from USB it's a disaster. Using the clock for long processes look good.
To obtain reliable results this programs must run rebooting the PC from the USB. 

HSE
Title: Re: How to Benchmark Code Execution Times
Post by: Biterider on June 14, 2022, 05:29:19 AM
Hi HSE
I'm trying to build your Benchmark UEFI application, but the PrintC macro seems to be missing.


Regards, Biterider
Title: Re: How to Benchmark Code Execution Times
Post by: HSE on June 14, 2022, 07:21:36 AM
Hi Biterider!!

PrintC is in efiutil.inc (https://github.com/ASMHSE/ObjAsm-C.1uefi/blob/f5f39ebb55a1c1ebf92feeab7b7b9d52e51ff7c0/Code/Inc/UEFI/efiUtil.inc)

Regards, HSE
Title: Re: How to Benchmark Code Execution Times
Post by: Biterider on July 01, 2022, 03:29:56 AM
Hi
I've made some progress on this topic and discovered some very useful things.
The Paolini approach is pure gold (if done right). HSE explained it very well in the first post. Reading Paolini's paper is highly recommended.

I created some tests with known results to track down some bugs and after fixing them I switched back to code timing. Comparing the result with the UEFI approach, I have a very close match with Agner Fog timing tables for my CPU model (+/-0.2 cycles max).

In order to debug the code with my usual debugger, I adapted the code to run on Windows. To my surprise, although the variance has increased noticeably, the results are also pretty good. This is really good news. This means that for non-ultra precise measurements (+/-1 cycle), this implementation can also be used on Windows.  :thumbsup:

I'll refine the code and post it in the next days. At the moment it outputs the results to the DebugCenter.

Biterider
Title: Re: How to Benchmark Code Execution Times
Post by: HSE on July 01, 2022, 06:32:48 AM
Congratulations  :thumbsup:
Title: Re: How to Benchmark Code Execution Times
Post by: jj2007 on July 01, 2022, 06:40:12 AM
Code: (The glass) [Select]
            mov r10, pBootServices
            invoke [r10].EFI_BOOT_SERVICES.RaiseTPL,TPL_HIGH_LEVEL
            mov Old_TPL, rax
            CPUID
            RDTSC
            mov cycles_high , rdx
            mov cycles_low  , rax

That looks somewhat similar to my CyCt* macros:
Code: [Select]
push ebx
cpuid ; serialize; trashes ecx+edi
pop ebx
rdtsc
mov CyCtrdtsc, eax ; these cost a few cycles
mov CyCtrdtsc[4], edx ; that we will subtract later

I doubt, however, that you'll get stable timings from that - it's more complicated.

Btw...
Code: [Select]
            mov cycles_high , edx
            mov cycles_low  , eax
...would be more consistent. The upper DWORDs aren't affected by rdtsc.
Title: Re: How to Benchmark Code Execution Times
Post by: Biterider on July 01, 2022, 06:52:07 AM

Hi HSE
Congratulations  :thumbsup:
You deserve it. You found it and it was your idea to combine UEFI & Paolini.  :cool:


Biterider
Title: Re: How to Benchmark Code Execution Times
Post by: HSE on July 01, 2022, 07:17:06 AM
Hi JJ!

CyCt macros differ from Paoloni code after target code, wich is:
Code: [Select]
;            invoke measured_loop, j
            RDTSCP
            mov cycles_high1 , rdx
            mov cycles_low1  , rax
            CPUID
 
The upper DWORDs aren't affected by rdtsc.

In 64 bits, upper DWORD is cleared.

I think that kind of address in CyCt is not going to work with largeaddressaware.
Title: Re: How to Benchmark Code Execution Times
Post by: HSE on July 01, 2022, 07:21:56 AM
Hi Biterider!

You deserve it.

I failed into find differences between my code and what you posted in repository, and I can't maked that work  :biggrin:
Title: Re: How to Benchmark Code Execution Times
Post by: jj2007 on July 01, 2022, 07:36:34 AM
In 64 bits, upper DWORD is cleared.

Yes, that's why you should ideally use
            RDTSCP
            shl rdx, 32
            or rax, rdx
            mov cycles, rax   ; a QWORD
            CPUID

Now one question is "why cpuid after rdtsc?" I've seen this also in a SOF answer:
Quote
I suggest you read this whitepaper from intel if you are considering using this for benchmarking etc: download.intel.com/embedded/software/IA/324264.pdf (it actually shows that you need both rdtsc + cpuid and rdtscp + cpuid for correct measurements) –
Necrolis
Sep 28, 2012 at 7:25

But it doesn't make sense to me: why use the serialising instruction after the measurement?

Btw this is not to criticise your code, which is very good - I am trying to find the one correct solution for my 32-bit CyCt* macros. Even at SOF they are struggling (https://stackoverflow.com/questions/12631856/difference-between-rdtscp-rdtsc-memory-and-cpuid-rdtsc) :cool:
Title: Re: How to Benchmark Code Execution Times
Post by: HSE on July 01, 2022, 07:49:50 AM
But it doesn't make sense to me: why use the serialising instruction after the measurement?

Because in the end your are also serialising RDTSCP. Still can happen that RDTSCP is executed after "mov cycles_high1 , rdx; mov cycles_low1  , rax" and you obtain an spurius value, but probability is lower. That was Paoloni work, less spurious values were obtained in that way.

            RDTSCP
            shl rdx, 32
            or rax, rdx
            mov cycles, rax   ; a QWORD
            CPUID

That is one more instruction inside the critical code (I called that The glass). (for magnification glass, you know  :biggrin:)
Title: Re: How to Benchmark Code Execution Times
Post by: jj2007 on July 01, 2022, 08:58:41 AM
That is one more instruction inside the critical code

But one less memory move...
Title: Re: How to Benchmark Code Execution Times
Post by: HSE on July 01, 2022, 09:12:34 AM
But one less memory move...

This is technology, not science. It's not much about logic than about measured deviations. You can test if that result in less deviations, perhaps you are wright  :thumbsup: