Print Page - How to Benchmark Code Execution Times

Title: How to Benchmark Code Execution Times
Post by: HSE on May 30, 2022, 12:42:59 AM

Hi all!

Following Gabriele Paoloni, 2010: How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures. (http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf (http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf)) I adapted to UEFI the measurement core:

Code (The glass) Select

            mov r10, pBootServices
            invoke [r10].EFI_BOOT_SERVICES.RaiseTPL,TPL_HIGH_LEVEL
            mov Old_TPL, rax
            CPUID
            RDTSC
            mov cycles_high , rdx
            mov cycles_low  , rax
;            invoke measured_loop, j
            RDTSCP
            mov cycles_high1 , rdx
            mov cycles_low1  , rax
            CPUID
            mov r10, pBootServices
            invoke [r10].EFI_BOOT_SERVICES.RestoreTPL, Old_TPL

One Paoloni trick is to call (or not) a loop function then you can make estimations from minimum values using lineal regression:

Code Select

measured_loop proc loops:qword
    local i : qword

    ForLp i, 0, loops
        function_under_glass1
    Next i
    ret
measured_loop endp

The target code for the test is that used by Gunther in Count cycles: test reports needed (http://masm32.com/board/index.php?topic=10087.0):

Code Select

function_under_glass1 macro
    fld      qword ptr [numerator]
    fdiv     qword ptr [denominator]
    fdiv     qword ptr [denominator]
    fdiv     qword ptr [denominator]
    fdiv     qword ptr [denominator]
    fstp     qword ptr [result]
endm

To run the test, PC booted from a FAT32 formated USB. *.efi files must be renamed to BOOTX64.efi and copied to /efi/boot/ directory in USB.

For first step the program is assembled with invocation to measured_loop commented (BenchmarkUEFI0.efi):

Code Select

      total number of spurious min values =       240
      total variance             = 0.5871
      absolute max deviation     = 8.000
      variance of variances      = 0.0002

lineal equation: total cycles = 30.8239 - 0.001 * repetitions

For second step the program is assembled making the invocation to measured_loop (BenchmarkUEFI2.efi):

Code Select

      total number of spurious min values =       1
      total variance             = 187.5673
      absolute max deviation     = 82.000
      variance of variances      = 25560.41

lineal equation: total cycles = 75.6068 + 18.0053 * repetitions

From these equations and rounding values you expect in this machine:

31 cycles overhead of measurement
45 cycles overhead of calling the loop function
18 cycles target code execution
----------- ------------------------------------------
94 cycles total measurement of single execution

Regards, HSE.

Title: Re: How to Benchmark Code Execution Times
Post by: Biterider on May 30, 2022, 01:51:14 AM

Hi HSE
That's a really interesting idea. :thumbsup:
I'm sure that replacing openSUSE with UEFI is a good thing, with potentially even better results.
I'm still trying to convince VirtualBox to play along, but as soon as it works I'll check.

Biterider

Title: Re: How to Benchmark Code Execution Times
Post by: HSE on May 30, 2022, 02:57:54 AM

Hi Biterider

Quote from: Biterider on May 30, 2022, 01:51:14 AM
I'm sure that replacing openSUSE with UEFI is a good thing, with potentially even better results.

My idea was to test applications running from UEFI, then is the only option :biggrin:

Quote from: Biterider on May 30, 2022, 01:51:14 AM
I'm still trying to convince VirtualBox to play along, but as soon as it works I'll check.

:thumbsup:
This kind of timing in VirtualBox from USB it's a disaster. Using the clock for long processes look good.
To obtain reliable results this programs must run rebooting the PC from the USB.

HSE

Title: Re: How to Benchmark Code Execution Times
Post by: Biterider on June 14, 2022, 05:29:19 AM

Hi HSE
I'm trying to build your Benchmark UEFI application, but the PrintC macro seems to be missing.

Regards, Biterider

Title: Re: How to Benchmark Code Execution Times
Post by: HSE on June 14, 2022, 07:21:36 AM

Hi Biterider!!

PrintC is in efiutil.inc (https://github.com/ASMHSE/ObjAsm-C.1uefi/blob/f5f39ebb55a1c1ebf92feeab7b7b9d52e51ff7c0/Code/Inc/UEFI/efiUtil.inc)

Regards, HSE

Title: Re: How to Benchmark Code Execution Times
Post by: Biterider on July 01, 2022, 03:29:56 AM

Hi
I've made some progress on this topic and discovered some very useful things.
The Paoloni approach is pure gold (if done right). HSE explained it very well in the first post. Reading Paoloni's paper is highly recommended.

I created some tests with known results to track down some bugs and after fixing them I switched back to code timing. Comparing the result with the UEFI approach, I have a very close match with Agner Fog timing tables for my CPU model (+/-0.2 cycles max).

In order to debug the code with my usual debugger, I adapted the code to run on Windows. To my surprise, although the variance has increased noticeably, the results are also pretty good. This is really good news. This means that for non-ultra precise measurements (+/-1 cycle), this implementation can also be used on Windows. :thumbsup:

I'll refine the code and post it in the next days. At the moment it outputs the results to the DebugCenter.

Biterider

Title: Re: How to Benchmark Code Execution Times
Post by: HSE on July 01, 2022, 06:32:48 AM

Congratulations :thumbsup:

Title: Re: How to Benchmark Code Execution Times
Post by: jj2007 on July 01, 2022, 06:40:12 AM

Quote from: HSE on May 30, 2022, 12:42:59 AM
Code (The glass) Select Expand
mov r10, pBootServices invoke [r10].EFI_BOOT_SERVICES.RaiseTPL,TPL_HIGH_LEVEL mov Old_TPL, rax CPUID RDTSC mov cycles_high , rdx mov cycles_low , rax

That looks somewhat similar to my CyCt* macros:

Code Select

	push ebx
	cpuid		; serialize; trashes ecx+edi
	pop ebx
	rdtsc
	mov CyCtrdtsc, eax		; these cost a few cycles
	mov CyCtrdtsc[4], edx		; that we will subtract later

I doubt, however, that you'll get stable timings from that - it's more complicated.

Btw...

Code Select

            mov cycles_high , edx
            mov cycles_low  , eax

...would be more consistent. The upper DWORDs aren't affected by rdtsc.

Title: Re: How to Benchmark Code Execution Times
Post by: Biterider on July 01, 2022, 06:52:07 AM

Hi HSE

Quote from: HSE on July 01, 2022, 06:32:48 AM
Congratulations :thumbsup:

You deserve it. You found it and it was your idea to combine UEFI & Paoloni. :cool:

Biterider

Title: Re: How to Benchmark Code Execution Times
Post by: HSE on July 01, 2022, 07:17:06 AM

Hi JJ!

CyCt macros differ from Paoloni code after target code, wich is:

Code Select

;            invoke measured_loop, j
            RDTSCP
            mov cycles_high1 , rdx
            mov cycles_low1  , rax
            CPUID

Quote from: jj2007 on July 01, 2022, 06:40:12 AM
The upper DWORDs aren't affected by rdtsc.

In 64 bits, upper DWORD is cleared.

I think that kind of address in CyCt is not going to work with largeaddressaware.

Title: Re: How to Benchmark Code Execution Times
Post by: HSE on July 01, 2022, 07:21:56 AM

Hi Biterider!

Quote from: Biterider on July 01, 2022, 06:52:07 AM
You deserve it.

I failed into find differences between my code and what you posted in repository, and I can't maked that work :biggrin:

Title: Re: How to Benchmark Code Execution Times
Post by: jj2007 on July 01, 2022, 07:36:34 AM

Quote from: HSE on July 01, 2022, 07:17:06 AMIn 64 bits, upper DWORD is cleared.

Yes, that's why you should ideally use
RDTSCP
shl rdx, 32
or rax, rdx
mov cycles, rax ; a QWORD
CPUID

Now one question is "why cpuid after rdtsc?" I've seen this also in a SOF answer:

QuoteI suggest you read this whitepaper from intel if you are considering using this for benchmarking etc: download.intel.com/embedded/software/IA/324264.pdf (it actually shows that you need both rdtsc + cpuid and rdtscp + cpuid for correct measurements) –
Necrolis
Sep 28, 2012 at 7:25

But it doesn't make sense to me: why use the serialising instruction after the measurement?

Btw this is not to criticise your code, which is very good - I am trying to find the one correct solution for my 32-bit CyCt* macros. Even at SOF they are struggling (https://stackoverflow.com/questions/12631856/difference-between-rdtscp-rdtsc-memory-and-cpuid-rdtsc) :cool:

Title: Re: How to Benchmark Code Execution Times
Post by: HSE on July 01, 2022, 07:49:50 AM

Quote from: jj2007 on July 01, 2022, 07:36:34 AM
But it doesn't make sense to me: why use the serialising instruction after the measurement?

Because in the end your are also serialising RDTSCP. Still can happen that RDTSCP is executed after "mov cycles_high1 , rdx; mov cycles_low1 , rax" and you obtain an spurius value, but probability is lower. That was Paoloni work, less spurious values were obtained in that way.

Quote from: jj2007 on July 01, 2022, 07:36:34 AM
RDTSCP
shl rdx, 32
or rax, rdx
mov cycles, rax ; a QWORD
CPUID

That is one more instruction inside the critical code (I called that The glass). (for magnification glass, you know :biggrin:)

Title: Re: How to Benchmark Code Execution Times
Post by: jj2007 on July 01, 2022, 08:58:41 AM

Quote from: HSE on July 01, 2022, 07:49:50 AMThat is one more instruction inside the critical code

But one less memory move...

Title: Re: How to Benchmark Code Execution Times
Post by: HSE on July 01, 2022, 09:12:34 AM

Quote from: jj2007 on July 01, 2022, 08:58:41 AM
But one less memory move...

This is technology, not science. It's not much about logic than about measured deviations. You can test if that result in less deviations, perhaps you are wright :thumbsup:

The MASM Forum

Projects => ObjAsm => Topic started by: HSE on May 30, 2022, 12:42:59 AM