Author Topic: Slight modification to StrLen in the masm32 library - 9% speed increase  (Read 13115 times)

dedndave

Re: Slight modification to StrLen in the masm32 library - 9% speed increase
« Reply #15 on: September 13, 2015, 09:37:30 PM »
if you look at the template i posted earlier, i added a Sleep call at the beginning of the program
that's the only Sleep call you should have to add

examine the macro code written by Michael Webster
if i remember correctly, there is a Sleep call in there, at the appropriate location to start a new time-slice

zedd151

« Reply #16 on: September 13, 2015, 09:54:00 PM »
Yes, I know, dave. Here it is with Sleep:

Code: [Select]
Genuine Intel(R) CPU           T2060  @ 1.60GHz (SSE3)
2463 cycles - sslen
2425 cycles - sslen
2429 cycles - sslen
2425 cycles - sslen
2428 cycles - sslen
2431 cycles - sslen
2430 cycles - sslen
2426 cycles - sslen
2427 cycles - sslen
2428 cycles - sslen
1200 - bytes length

1072 cycles - StrLen
1070 cycles - StrLen
1072 cycles - StrLen
1072 cycles - StrLen
1074 cycles - StrLen
1073 cycles - StrLen
1074 cycles - StrLen
1070 cycles - StrLen
1072 cycles - StrLen
1072 cycles - StrLen
1200 - bytes length

1520 µs - sslen
1523 µs - sslen
1520 µs - sslen
1531 µs - sslen
1526 µs - sslen
1528 µs - sslen
1525 µs - sslen
1528 µs - sslen
1521 µs - sslen
1521 µs - sslen
1200 - bytes length

644 µs - StrLen
644 µs - StrLen
644 µs - StrLen
649 µs - StrLen
650 µs - StrLen
648 µs - StrLen
645 µs - StrLen
649 µs - StrLen
646 µs - StrLen
644 µs - StrLen
1200 - bytes length
« Last Edit: July 20, 2018, 04:06:47 AM by zedd151 »
I'm not always the sharpest knife in the drawer, but I have my moments.  :P

zedd151

« Reply #17 on: September 13, 2015, 10:03:16 PM »
As predicted, the results after I removed Sleep again were ambiguous at best,
erratic at worst. Very touchy, these timers/counters.

Here is one that isn't real bad (no sleep)
Code: [Select]
Genuine Intel(R) CPU           T2060  @ 1.60GHz (SSE3)
2461 cycles - sslen
2446 cycles - sslen
2439 cycles - sslen
2429 cycles - sslen
2433 cycles - sslen
2425 cycles - sslen
2433 cycles - sslen
2426 cycles - sslen
2427 cycles - sslen
2426 cycles - sslen
1200 - bytes length

1073 cycles - StrLen
1073 cycles - StrLen
1075 cycles - StrLen
1072 cycles - StrLen
1074 cycles - StrLen
1072 cycles - StrLen
1071 cycles - StrLen
1070 cycles - StrLen
1070 cycles - StrLen
1072 cycles - StrLen
1200 - bytes length

1520 µs - sslen
1522 µs - sslen
1521 µs - sslen
1520 µs - sslen
1521 µs - sslen
1520 µs - sslen
1521 µs - sslen
1520 µs - sslen
1523 µs - sslen
1520 µs - sslen
1200 - bytes length

852 µs - StrLen
824 µs - StrLen
644 µs - StrLen
643 µs - StrLen
674 µs - StrLen
968 µs - StrLen
644 µs - StrLen
643 µs - StrLen
644 µs - StrLen
644 µs - StrLen
1200 - bytes length

But the next one I ran, the numbers were all over the place. Maybe my laptop is going bonkers?

At one point it seems to be working fine; make a change, then all of a sudden it's back to square one.

:dazzled:

Before I had added the 'sslen' proc to the mix it was working well without Sleep
all I did was add two more tests (for sslen) and everything changed.

M$ has cast a spell on the game. lol

I guess this is one program that needs to sleep as much as I do.  ZZZzzz

I think that what we need is a surefire way to adjust for different algos, sizes, etc.
The settings that work for one may or may not work for another; it's very frustrating
to say the least.

I am just about to throw in the towel at this juncture. Probably will revisit it in the
future though, if I am ever in the need to time something that is 'mission-critical'.

Otherwise I have more pressing issues to deal with IRL.

Another note: it makes a big difference whether the function under test normally runs fast or slow.
The size of the function and/or the data it processes also seems to have a big impact on the
performance of these counter/timer macros. More experimenting will be necessary..  :idea:

dedndave

« Reply #18 on: September 13, 2015, 10:24:40 PM »
you could try this:

find the largest routine to test
change the attributes of that memory page to PAGE_EXECUTE_READWRITE with VirtualProtect
run that test
for each successive test, copy the code to that address and run it from there

that way, all code is executed from the same address

zedd151

« Reply #19 on: September 13, 2015, 10:55:28 PM »
Quote
that way, all code is executed from the same address

I had actually thought of something similar.
Sounds feasible enough. But it also would add unwanted complications into the mix I would think.
I'll have to think on it for a bit...

dedndave

« Reply #20 on: September 13, 2015, 11:57:34 PM »
ok - i examined Michael's code
the one that has Sleep in it is in counter2.asm (ctr_begin and ctr_end macros)

http://masm32.com/board/index.php?topic=49.0

attached is a little test program that uses VirtualProtect to execute 2 routines from the same address

oddly, i can't get the macro to accept a variable for loop count
i thought i had done that before   :redface:

zedd151

« Reply #21 on: September 14, 2015, 12:50:42 AM »
Quote
attached is a little test program that uses VirtualProtect to execute 2 routines from the same address...

Ok good.
Code: [Select]
186 212 212 212 212
116 116 116 116 116

Okay, I have it here in front of me, what am I looking at?
Or is this just an example of how to use VirtualProtect?

Quote
oddly, i can't get the macro to accept a variable for loop count
LoopCount   DD    0
LoopCount = 10  ;)

begin_counter LoopCount  :biggrin:

This snippet from counter2.asm looks interesting:
Code: [Select]
    ;;
    ;; Capture lowest cycle count that occurs in a single loop.
    ;;
    cmp edx, DWORD PTR __ctr__qword__overhead__ + 4
    jne @F
    cmp eax, DWORD PTR __ctr__qword__overhead__
  @@:
    jnb @F
    mov DWORD PTR __ctr__qword__overhead__, eax
    mov DWORD PTR __ctr__qword__overhead__ + 4, edx
  @@:


I did something similar when I was running recursive loops saving only the lowest count.

But there were problems with this version: for short or empty algos, it sometimes returned a negative value.
It should at least have had some sort of check so that no result < 0 is given.
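
A minimal C sketch of that kind of check, assuming the raw readings are already available as 64-bit values. The names `lowest_count` and `net_cycles` are made up for illustration and are not part of counter2.asm. The idea is just to keep the lowest reading and to report 0 when the overhead reading comes out larger than the test reading.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest reading in samples[0..n-1]; the caller must pass n > 0. */
uint64_t lowest_count(const uint64_t *samples, size_t n)
{
    assert(samples != NULL && n > 0);
    uint64_t best = samples[0];
    for (size_t i = 1; i < n; i++) {
        if (samples[i] < best)
            best = samples[i];
    }
    return best;
}

/* Subtract the measured overhead, returning 0 instead of a negative
   value when jitter makes the overhead reading the larger of the two. */
uint64_t net_cycles(uint64_t measured, uint64_t overhead)
{
    return measured > overhead ? measured - overhead : 0;
}
```

For example, `net_cycles(60, 72)` gives 0 where a plain subtraction would print -12.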

This snippet is from the test program that came with counter2.asm

Code: [Select]
HIGH_PRIORITY_CLASS
-12 cycles, empty
0 cycles, mov eax,1
0 cycles, mov eax,1 mov eax,2
-12 cycles, nops 4
0 cycles, mul ecx
-12 cycles, rol ecx,32
0 cycles, rcr ecx,31
72 cycles, div ecx
72 cycles, StrLen

REALTIME_PRIORITY_CLASS
-12 cycles, empty
-36 cycles, mov eax,1
0 cycles, mov eax,1 mov eax,2
0 cycles, nops 4
0 cycles, mul ecx
-12 cycles, rol ecx,32
-24 cycles, rcr ecx,31
72 cycles, div ecx
72 cycles, StrLen

Of course further work was done, and I think those problems were addressed.

zedd151

« Reply #22 on: September 14, 2015, 01:45:05 AM »
Another test, dave. This time I took one of my original testbeds,
complete with the Sleep call intact. This is the result:

Code: [Select]
4783 cycles
3364 cycles
3376 cycles
3360 cycles
3366 cycles
3367 cycles
3365 cycles
3352 cycles
3349 cycles
3357 cycles

2109 us.
2105 us.
2102 us.
2104 us.
2103 us.
2100 us.
2103 us.
2105 us.
2102 us.
2103 us.

What I did next, I added a 'dummy test' just before the loop, thus absorbing
the first result - which otherwise would have been as above. Sort of what you might
call spin up. (It calls the same function that the real tests call - just doesn't display the
results - plus it is outside of the loop)
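
The dummy pass could be sketched in C like this. `sample_after_warmup` and `replay_run` are hypothetical names. `replay_run` is only a stand-in that replays the first four cycle counts from the listing above, so the sketch can run without the timing macros.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t (*measure_fn)(void);

/* Stand-in for a real timed run (assumption): replays the first readings
   from the first listing, then keeps returning the last one. */
uint64_t replay_run(void)
{
    static const uint64_t readings[] = {4783, 3364, 3376, 3360};
    static size_t next = 0;
    const size_t last = sizeof readings / sizeof readings[0] - 1;
    uint64_t r = readings[next];
    if (next < last)
        next++;
    return r;
}

/* Run `measure` once and throw the reading away, then store n readings. */
void sample_after_warmup(measure_fn measure, uint64_t *out, size_t n)
{
    assert(measure != NULL && out != NULL);
    (void)measure();               /* warm-up pass, reading discarded */
    for (size_t i = 0; i < n; i++)
        out[i] = measure();
}
```

With the replayed data, the 4783 outlier is absorbed by the warm-up pass and the stored readings start at 3364.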

Code: [Select]
3384 cycles
3379 cycles
3382 cycles
3393 cycles
3382 cycles
3380 cycles
3384 cycles
3382 cycles
3380 cycles
3383 cycles

2117 us.
2117 us.
2118 us.
2119 us.
2117 us.
2117 us.
2118 us.
2119 us.
2119 us.
2117 us.

The timer seems pretty stable. The counter, on the other hand, is sometimes hard to
get a good reading from. RDTSC is rather unruly. I have read some
complaints about it during my travels on the net. So it appears we asm
coders are not alone.

btw, the text under test was the asm source, that is why this second set has slightly larger return values.

I think that looks much better than the first...

source:
Code: [Select]
    .nolist
        include \masm32\include\masm32rt.inc
    .686
    .XMM
    .MMX
        include \masm32\macros\timers.asm
        LoadFilex   proto :dword
        StrLenz     proto :dword
        szLenz      proto :dword
    .data
        align 16
        testtext    dd 0
       
        LOOP_COUNT  = 200000
        rct         = 10
   
    .code
    start:
        invoke GetCurrentProcess
        invoke SetProcessAffinityMask,eax,1
        fn LoadFilex, "Testbed.asm"
        mov testtext, esi
        push 1
        call ShowCpu
    ; --------------------- test 1 ------------------------
        invoke Sleep, 1
        ; dummy pass: same call as the real test, result not printed
        counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
        invoke StrLen, testtext
        counter_end
        repeat rct
        counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
        invoke StrLen, testtext
        counter_end
        print str$(eax), 20h, "cycles - StrLen", 13, 10
        endm
        print chr$(13, 10)
    ; --------------------- test 2 ------------------------
        repeat rct
        invoke Sleep, 1
        timer_begin 1000000, HIGH_PRIORITY_CLASS ; set for microsecond timing
        invoke StrLen, testtext
        timer_end
        print str$(eax), 20h, "us. - StrLen", 13, 10
        endm
;        print chr$(13, 10)
    ; --------------------- test 3 ------------------------
;        align 16
;        repeat rct
;        invoke Sleep, 250
;        counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS

;        counter_end
;        print str$(eax), 20h, "cycles - test 3", 13, 10
;        endm
;        print chr$(13, 10)
    ; --------------------- test 4 ------------------------
;        align 16
;        repeat rct
;        invoke Sleep, 250
;        counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS

;        counter_end
;        print str$(eax), 20h, "cycles - test 4", 13, 10
;        endm
        print chr$(13, 10)
    ; -------------------- end tests ----------------------
   
    invoke GlobalFree, testtext
    inkey chr$("--- ok ---", 13)

    exit
   

    ShowCpu proc
        pushad
        sub esp, 80
        mov edi, esp
        xor ebp, ebp
   
        .Repeat
            lea eax, [ebp+80000002h]
            db 0Fh, 0A2h
            stosd
            mov eax, ebx
            stosd
            mov eax, ecx
            stosd
            mov eax, edx
            stosd
            inc ebp
        .Until ebp>=3
   
        push 1
        pop eax
        db 0Fh, 0A2h
        xor ebx, ebx
        xor esi, esi
        bt edx, 25
        adc ebx, esi
        bt edx, 26
        adc ebx, esi
        bt ecx, esi
        adc ebx, esi
        bt ecx, 9
        adc ebx, esi
        dec dword ptr [esp+4+32+80]
   
        .if Zero?
            mov edi, esp
           
            .Repeat
            .Break .if byte ptr [edi]!=32
                inc edi
            .Until 0
       
            .if byte ptr [edi]<32
                print chr$("pre-P4")
            .else
                print edi
            .endif
           
            .if ebx
                print chr$(32, 40, "SSE")
                print str$(ebx), 41, 13, 10
            .endif
        .endif
   
        add esp, 80
        mov [esp+32-4], ebx
        ifdef MbBufferInit
        call MbBufferInit
        endif
        popad
        ret 4
    ShowCpu endp
   
    LoadFilex proc lpName:dword
    local hFile :dword, fl :dword, bRead :dword, hMem$ :dword
        invoke CreateFile, lpName, 80000000h, 0, 0, 3, 80h, 0
        mov hFile, eax
        invoke GetFileSize, hFile, 0
        inc eax
        mov fl, eax
        invoke GlobalAlloc, GPTR, fl
        mov hMem$, eax
        invoke ReadFile, hFile, hMem$, fl, addr bRead, 0
        invoke CloseHandle, hFile
        mov esi, hMem$
        mov ecx, fl
        ret
    LoadFilex endp

   
    end start

All good.
Now all we need is a better way to set the LoopCount.
I was thinking of trying to automate that process.
Run several cycles, until the returns are consistently low and
within a certain range. Don't know how easy it would be to implement though.

But it would make life easier, esp. when going from an algo that takes 20-30
cycles to one that is well into the thousands of cycles within the same set
of tests. That doesn't happen often but would be a good benchmark for
these counters. Has to handle a wide range, without much user intervention.

I think what would be a good idea, since it is known that the LoopCount needed
is inversely proportional to the algo size+data used by the algo, is to try and
make a set of rules for setting the LoopCount.

Easier said than done, I know. I wish I had the means to graph many
algos, their byte count, size, data size, versus the LoopCount needed for accurate
and repeatable results. Then we will have a clearer picture what would be needed
in a range of different tests.

dedndave

« Reply #23 on: September 14, 2015, 04:44:45 AM »
well - i tried a few things
i wanted to use a different loop count for different routines
so, i added an argument to the TestIt function (with prototype)

Code: [Select]
TestIt PROC dwLoopCount:DWORD

        mov     ecx,5

Loop00: push    ecx

        ctr_begin dwLoopCount,HIGH_PRIORITY_CLASS

.
.
.
        pop     ecx
        dec     ecx
        jnz     Loop00
.
.
.

gave me an assembler error   :(

the next 2 attempts are a bit mystifying
i am no macro expert, so i don't know why one or both didn't work
but, different results using EAX or EDX   :redface:

Code: [Select]
TestIt PROC dwLoopCount:DWORD

        mov     eax,dwLoopCount
        mov     ecx,5

Loop00: push    eax
        push    ecx

        ctr_begin eax,HIGH_PRIORITY_CLASS

.
.
.
        pop     ecx
        pop     eax
        dec     ecx
        jnz     Loop00
.
.
.

Code: [Select]
TestIt PROC dwLoopCount:DWORD

        mov     edx,dwLoopCount
        mov     ecx,5

Loop00: push    edx
        push    ecx

        ctr_begin edx,HIGH_PRIORITY_CLASS

.
.
.
        pop     ecx
        pop     edx
        dec     ecx
        jnz     Loop00
.
.
.

at any rate, i did just as outlined in Reply #18

rrr314159

« Reply #24 on: September 14, 2015, 08:11:44 AM »
@dedndave,

in the ctr_begin macro, the symbol loopcount is used like this:

Code: [Select]
mov __ctr__loop__count__, loopcount
mov __ctr__loop__counter__, loopcount

so loopcount must be a symbol that will work in these mov statements. It should be one of these:

Code: [Select]
ctr_begin 10000,HIGH_PRIORITY_CLASS

; or

LOOPCOUNT = 10000
ctr_begin LOOPCOUNT,HIGH_PRIORITY_CLASS

These won't give an error:

Code: [Select]
ctr_begin eax,HIGH_PRIORITY_CLASS

; or

ctr_begin edx,HIGH_PRIORITY_CLASS

because the macro will produce these statements:

Code: [Select]
mov __ctr__loop__count__, eax
mov __ctr__loop__counter__, eax

Unfortunately, it's no good to use eax or edx because later in the ctr_begin macro, after eax and edx get used (trashed), loopcount is used again:

Code: [Select]
; ....
jnz   @B                  ;; End of reference loop
mov __ctr__loop__counter__, loopcount
; ...

This statement of course becomes

Code: [Select]
mov __ctr__loop__counter__, eax

which is bad because eax (or edx) no longer has the original value. They each have different values, that's why they give different results, both wrong.

Finally, loopcount can't be a memory reference: the destination is already a memory variable, and x86 has no mov from memory to memory, so this doesn't work:

Code: [Select]
TestIt PROC dwLoopCount:DWORD
; ...
ctr_begin dwLoopCount,HIGH_PRIORITY_CLASS ; no good

Since you want to have an adjustable loopcount, you need to change the ctr_begin code.

One way: in ctr_begin, loopcount is stored to __ctr__loop__count__ and __ctr__loop__counter__. __ctr__loop__count__ is not modified; only __ctr__loop__counter__ is. So, later you could restore __ctr__loop__counter__'s value from __ctr__loop__count__ instead of loopcount (which has been trashed, if it's eax or edx). Change the line I quoted above like this:

Code: [Select]
; ....
jnz   @B                  ;; End of reference loop
m2m __ctr__loop__counter__, __ctr__loop__count__
; ...

Another alternative is to change the mov's involved to m2m's, and use dwLoopCount, as you tried at first. I haven't tested either technique, but I think they'll work, and you get the idea.
I am NaN ;)

dedndave

« Reply #25 on: September 14, 2015, 10:58:29 AM »
i got it to work by using EDI   :biggrin:
CPUID does not use ESI, EDI, or EBP

rrr314159

« Reply #26 on: September 14, 2015, 12:01:25 PM »
 :icon14:

zedd151

« Reply #27 on: September 14, 2015, 12:06:54 PM »
Hi dave. I will take a look at that.

I too have been working on something.

I completely stripped the counter macros down to the bare minimum.
Process priority is set within the macro, any looping is done outside of the macro.

This is the first edition.

"Bare bones - non-looping cycle count macro"

Code: [Select]

    include \masm32\include\masm32rt.inc
    .586
   

; Bare bones cycle count macros
; Process Priority is internally defined
; Looping should be done externally

ctr_beginz MACRO
    IFNDEF __ctr__stuff__defined__
        __ctr__stuff__defined__ equ <1>
        .data
            ALIGN 16
            __ctr__qword__count__     dq 0
            __ctr__qword__overhead__  dq 0
            __ctr__loop__count__      dd 0
            __ctr__loop__counter__    dd 0
        .code
    ENDIF
    invoke GetCurrentProcess
    invoke SetPriorityClass, eax, HIGH_PRIORITY_CLASS
    xor eax, eax              ;; Warm up the CPUID instruction
    cpuid
    xor eax, eax
    cpuid
    xor eax, eax
    cpuid
    invoke Sleep, 1           ;; Start a new time slice
    push ebx                  ;; Preserve EBX around CPUID
    xor eax, eax              ;; Use same CPUID input value for each call
    cpuid                     ;; Flush pipe & wait for pending ops to finish
    rdtsc                     ;; Read Time Stamp Counter
    pop ebx
ENDM

ctr_endz MACRO
    push ebx                  ;; Preserve EBX around CPUID
    xor eax, eax              ;; Use same CPUID input value for each call
    cpuid                     ;; Flush pipe & wait for pending ops to finish
    rdtsc                     ;; Read Time Stamp Counter
    pop ebx
    push eax                  ; preserve eax
    push edx                  ; preserve edx
    invoke GetCurrentProcess
    invoke SetPriorityClass, eax, NORMAL_PRIORITY_CLASS
    pop edx
    pop eax
ENDM

    .data
    ctrloop dd 0
    bcount  dd 0
    ecount  dd 0
   
    .code

; the following has no algo between start and end (Except for saving eax)
; therefore the result should be the cumulative overhead


start:

    mov ctrloop, 20
    top:
   
ctr_beginz
    push eax       ; save eax 1 byte
   
ctr_endz
    mov ecount, eax
    pop eax
    mov bcount, eax

    mov eax, ecount
    sub eax, bcount
    print str$(eax)," overhead cycle count",13,10

    dec ctrloop
    jnz top

    inkey
    exit

end start

As it is, this test procedure basically returns the overhead of the entire cycle counting process.
I will have to figure out a reliable way to subtract the overhead at each pass of the macros, i.e. every time
the process loops.


zedd151

« Reply #28 on: September 14, 2015, 01:09:56 PM »
Here is a little test piece derived from the stripped down macros.
It attempts to remove the overhead cycle count leaving only the 'true' cycle count for the test.
Since the test is empty, it should return zero.

I noticed that RDTSC always seems to return a value that is divisible by 4, btw.
I remember reading somewhere about this, I forget where.
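
A quick way to check what step the readings share is to take the greatest common divisor of the differences. This is only a C sketch, and `common_step` is a made-up helper, not something from the macros.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Greatest common divisor of the absolute values in deltas[0..n-1].
   Zero entries are ignored; the result is 0 if every entry is zero. */
uint64_t common_step(const int64_t *deltas, size_t n)
{
    assert(deltas != NULL || n == 0);
    uint64_t g = 0;
    for (size_t i = 0; i < n; i++) {
        /* Absolute value without overflowing on INT64_MIN. */
        uint64_t a = deltas[i] < 0 ? 0 - (uint64_t)deltas[i]
                                   : (uint64_t)deltas[i];
        while (a != 0) {           /* Euclid: g = gcd(g, a) */
            uint64_t t = g % a;
            g = a;
            a = t;
        }
    }
    return g;
}
```

Fed the cycle counts -12, 0, 72, -36 and -24 reported by the counter2.asm test program earlier in this thread, it returns 12, so on this machine the readings look like multiples of 12 (and therefore of 4).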

Results:

Code: [Select]
-12 should be cycle count minus overhead  since function is empty should be zero or close to it
0 should be cycle count minus overhead  since function is empty should be zero or close to it
0 should be cycle count minus overhead  since function is empty should be zero or close to it
0 should be cycle count minus overhead  since function is empty should be zero or close to it
0 should be cycle count minus overhead  since function is empty should be zero or close to it
0 should be cycle count minus overhead  since function is empty should be zero or close to it
12 should be cycle count minus overhead  since function is empty should be zero or close to it
0 should be cycle count minus overhead  since function is empty should be zero or close to it
0 should be cycle count minus overhead  since function is empty should be zero or close to it
12 should be cycle count minus overhead  since function is empty should be zero or close to it
0 should be cycle count minus overhead  since function is empty should be zero or close to it
60 should be cycle count minus overhead  since function is empty should be zero or close to it
0 should be cycle count minus overhead  since function is empty should be zero or close to it
0 should be cycle count minus overhead  since function is empty should be zero or close to it
0 should be cycle count minus overhead  since function is empty should be zero or close to it
0 should be cycle count minus overhead  since function is empty should be zero or close to it
0 should be cycle count minus overhead  since function is empty should be zero or close to it
0 should be cycle count minus overhead  since function is empty should be zero or close to it
12 should be cycle count minus overhead  since function is empty should be zero or close to it
0 should be cycle count minus overhead  since function is empty should be zero or close to it
« Last Edit: July 20, 2018, 04:07:57 AM by zedd151 »

zedd151

« Reply #29 on: September 14, 2015, 01:20:31 PM »
I found some interesting reading regarding RDTSC..

Using the RDTSC Instruction for Performance Monitoring

I am looking through it now....


Later:

Yup, the methods used in the Intel document closely follow what I have read elsewhere,
including in Michael's macros.

Still reading....
Looking for reason why it seems to return a value always divisible by 4,
unless there is a flaw in my shortened macros.

Here is an interesting snippet from the Intel document:
Code: [Select]
A.1. Using the CPUID Instruction for Serialization

#define cpuid __asm __emit 0fh __asm __emit 0a2h
#define rdtsc __asm __emit 0fh __asm __emit 031h
#include <stdio.h>
void main () {
        int time, subtime;
        float x = 5.0f;
        __asm {
                //  Make three warm-up passes through the timing routine to make
                //  sure that the CPUID and RDTSC instruction are ready
                cpuid
                rdtsc
                mov     subtime, eax
                cpuid
                rdtsc
                sub     eax, subtime
                mov     subtime, eax
                cpuid
                rdtsc
                mov     subtime, eax
                cpuid
                rdtsc
                sub     eax, subtime
                mov     subtime, eax
                cpuid
                rdtsc
                mov     subtime, eax
                cpuid
                rdtsc
                sub     eax, subtime
                mov     subtime, eax    // Only the last value of subtime is kept
                // subtime should now represent the overhead cost of the
                // MOV and CPUID instructions
                fld     x
                fld     x
                cpuid                   // Serialize execution
                rdtsc                   // Read time stamp to EAX
                mov     time, eax
                fdiv                    // Perform division
                cpuid                   // Serialize again for time-stamp read
                rdtsc                           
                sub     eax, time       // Find the difference
                mov     time, eax
        }
        time = time - subtime;  // Subtract the overhead
        printf ("%d\n", time);  // Print total time of divide to screen
}
The above snippet is intended to be used as inline asm in C code.
As you can see, it differs slightly from Michael's macros in that it calculates the overhead during the 'warmup period'.
I have to look into this..