News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

How to Benchmark Code Execution Times

Started by HSE, May 30, 2022, 12:42:59 AM

Previous topic - Next topic

HSE

Hi all!

Following Gabriele Paoloni, 2010: How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures. (http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf) I adapted to UEFI the measurement core:

Code (The glass) Select
            mov r10, pBootServices
            invoke [r10].EFI_BOOT_SERVICES.RaiseTPL,TPL_HIGH_LEVEL
            mov Old_TPL, rax
            CPUID
            RDTSC
            mov cycles_high , rdx
            mov cycles_low  , rax
;            invoke measured_loop, j
            RDTSCP
            mov cycles_high1 , rdx
            mov cycles_low1  , rax
            CPUID
            mov r10, pBootServices
            invoke [r10].EFI_BOOT_SERVICES.RestoreTPL, Old_TPL


One Paoloni trick is to call (or not) a loop function then you can make estimations from minimum values using lineal regression:
measured_loop proc loops:qword
    local i : qword

    ForLp i, 0, loops
        function_under_glass1
    Next i
    ret
measured_loop endp


The target code for the test is that used by Gunther in Count cycles: test reports needed:function_under_glass1 macro
    fld      qword ptr [numerator]
    fdiv     qword ptr [denominator]
    fdiv     qword ptr [denominator]
    fdiv     qword ptr [denominator]
    fdiv     qword ptr [denominator]
    fstp     qword ptr [result]
endm


To run the test, PC booted from a FAT32 formated USB. *.efi files must be renamed to BOOTX64.efi and  copied to /efi/boot/ directory in USB.

For first step the program is assembled with invocation to measured_loop commented (BenchmarkUEFI0.efi):
      total number of spurious min values =       240
      total variance             = 0.5871
      absolute max deviation     = 8.000
      variance of variances      = 0.0002


            lineal equation:      total cycles = 30.8239 - 0.001 * repetitions

For second step the program is assembled making the invocation to measured_loop (BenchmarkUEFI2.efi):
      total number of spurious min values =       1
      total variance             = 187.5673
      absolute max deviation     = 82.000
      variance of variances      = 25560.41


            lineal equation:      total cycles = 75.6068 + 18.0053 * repetitions

From these equations and rounding values you expect in this machine:

            31 cycles           overhead of measurement
            45 cycles           overhead of calling the loop function
            18 cycles           target code execution
            -----------           ------------------------------------------
            94 cycles           total measurement of single execution

Regards, HSE.
Equations in Assembly: SmplMath

Biterider

Hi HSE
That's a really interesting idea.  :thumbsup:
I'm sure that replacing openSUSE with UEFI is a good thing, with potentially even better results.
I'm still trying to convince VirtualBox to play along, but as soon as it works I'll check.

Biterider

HSE

Hi Biterider

Quote from: Biterider on May 30, 2022, 01:51:14 AM
I'm sure that replacing openSUSE with UEFI is a good thing, with potentially even better results.
My idea was to test applications running from UEFI, then is the only option  :biggrin:

Quote from: Biterider on May 30, 2022, 01:51:14 AM
I'm still trying to convince VirtualBox to play along, but as soon as it works I'll check.
:thumbsup:
This kind of timing in VirtualBox from USB it's a disaster. Using the clock for long processes look good.
To obtain reliable results this programs must run rebooting the PC from the USB. 

HSE
Equations in Assembly: SmplMath

Biterider

Hi HSE
I'm trying to build your Benchmark UEFI application, but the PrintC macro seems to be missing.


Regards, Biterider

HSE

Equations in Assembly: SmplMath

Biterider

#5
Hi
I've made some progress on this topic and discovered some very useful things.
The Paoloni approach is pure gold (if done right). HSE explained it very well in the first post. Reading Paoloni's paper is highly recommended.

I created some tests with known results to track down some bugs and after fixing them I switched back to code timing. Comparing the result with the UEFI approach, I have a very close match with Agner Fog timing tables for my CPU model (+/-0.2 cycles max).

In order to debug the code with my usual debugger, I adapted the code to run on Windows. To my surprise, although the variance has increased noticeably, the results are also pretty good. This is really good news. This means that for non-ultra precise measurements (+/-1 cycle), this implementation can also be used on Windows.  :thumbsup:

I'll refine the code and post it in the next days. At the moment it outputs the results to the DebugCenter.

Biterider

HSE

Equations in Assembly: SmplMath

jj2007

Quote from: HSE on May 30, 2022, 12:42:59 AM
Code (The glass) Select
            mov r10, pBootServices
            invoke [r10].EFI_BOOT_SERVICES.RaiseTPL,TPL_HIGH_LEVEL
            mov Old_TPL, rax
            CPUID
            RDTSC
            mov cycles_high , rdx
            mov cycles_low  , rax

That looks somewhat similar to my CyCt* macros:
push ebx
cpuid ; serialize; trashes ecx+edi
pop ebx
rdtsc
mov CyCtrdtsc, eax ; these cost a few cycles
mov CyCtrdtsc[4], edx ; that we will subtract later


I doubt, however, that you'll get stable timings from that - it's more complicated.

Btw...
            mov cycles_high , edx
            mov cycles_low  , eax

...would be more consistent. The upper DWORDs aren't affected by rdtsc.

Biterider

#8

Hi HSE
Quote from: HSE on July 01, 2022, 06:32:48 AM
Congratulations  :thumbsup:
You deserve it. You found it and it was your idea to combine UEFI & Paoloni.  :cool:


Biterider

HSE

Hi JJ!

CyCt macros differ from Paoloni code after target code, wich is:;            invoke measured_loop, j
            RDTSCP
            mov cycles_high1 , rdx
            mov cycles_low1  , rax
            CPUID

 
Quote from: jj2007 on July 01, 2022, 06:40:12 AM
The upper DWORDs aren't affected by rdtsc.

In 64 bits, upper DWORD is cleared.

I think that kind of address in CyCt is not going to work with largeaddressaware.
Equations in Assembly: SmplMath

HSE

Hi Biterider!

Quote from: Biterider on July 01, 2022, 06:52:07 AM
You deserve it.

I failed into find differences between my code and what you posted in repository, and I can't maked that work  :biggrin:
Equations in Assembly: SmplMath

jj2007

Quote from: HSE on July 01, 2022, 07:17:06 AMIn 64 bits, upper DWORD is cleared.

Yes, that's why you should ideally use
            RDTSCP
            shl rdx, 32
            or rax, rdx
            mov cycles, rax   ; a QWORD
            CPUID

Now one question is "why cpuid after rdtsc?" I've seen this also in a SOF answer:
QuoteI suggest you read this whitepaper from intel if you are considering using this for benchmarking etc: download.intel.com/embedded/software/IA/324264.pdf (it actually shows that you need both rdtsc + cpuid and rdtscp + cpuid for correct measurements) –
Necrolis
Sep 28, 2012 at 7:25

But it doesn't make sense to me: why use the serialising instruction after the measurement?

Btw this is not to criticise your code, which is very good - I am trying to find the one correct solution for my 32-bit CyCt* macros. Even at SOF they are struggling :cool:

HSE

Quote from: jj2007 on July 01, 2022, 07:36:34 AM
But it doesn't make sense to me: why use the serialising instruction after the measurement?

Because in the end your are also serialising RDTSCP. Still can happen that RDTSCP is executed after "mov cycles_high1 , rdx; mov cycles_low1  , rax" and you obtain an spurius value, but probability is lower. That was Paoloni work, less spurious values were obtained in that way.

Quote from: jj2007 on July 01, 2022, 07:36:34 AM
            RDTSCP
            shl rdx, 32
            or rax, rdx
            mov cycles, rax   ; a QWORD
            CPUID

That is one more instruction inside the critical code (I called that The glass). (for magnification glass, you know  :biggrin:)
Equations in Assembly: SmplMath

jj2007

Quote from: HSE on July 01, 2022, 07:49:50 AMThat is one more instruction inside the critical code

But one less memory move...

HSE

Quote from: jj2007 on July 01, 2022, 08:58:41 AM
But one less memory move...

This is technology, not science. It's not much about logic than about measured deviations. You can test if that result in less deviations, perhaps you are wright  :thumbsup: 
Equations in Assembly: SmplMath