News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Min And Max function?

Started by Farabi, June 13, 2012, 01:25:00 PM

Previous topic - Next topic

RuiLoureiro

MichaelW,
          I installed masm32 v10 and now i can run the example
          With
          fld4  FLT_MAX
          fld4 -FLT_MAX
          i get 63         
          Without it i get only 47
          But with DEC ECX only 46         
          Ok 47 to 46 is not too much
          but my logic says we dont need to compare the same
          values 2 times.
          I have P4 3GHz

KeepingRealBusy

Quote from: MichaelW on June 14, 2012, 02:36:02 PM
The only advantage I can see of setting min/max to the first element (or any other particular element) instead of setting it to +/- FLT_MAX would be if there is a significant error in the value of FLT_MAX, and the array contains values that are close to +/- FLT_MAX.

Michael,

If the array held the values 1,2,3,4,5, then setting the max to +FLT_MAX would end up with a max value of +FLT_MAX when it should be 5. Always seed max and min with the first value, predecrement the index,  then loop with jnz (since the first element was the basis of min/max) instead of jns. Jns is required if you are summing an array from the end to the start and need to include the first element.

Dave.

MichaelW

I preset max to -FLT_MAX and min to +FLT_MAX. Since FLT_MAX came from the Microsoft header files, I'm assuming that it is correct.

I will concede that decrementing the count, seeding with the first value, and looping with jnz will have a significant speed advantage on very small arrays.


Well Microsoft, here's another nice mess you've gotten us into.

KeepingRealBusy

Quote from: MichaelW on June 15, 2012, 11:50:20 AM
I preset max to -FLT_MAX and min to +FLT_MAX. Since FLT_MAX came from the Microsoft header files, I'm assuming that it is correct.

Michael,

I think the opposite should be true: preset min to -FLT_MAX and max to +FLT_MIN, that way the first entry compared will be saved as either a max or min. You will  notice, though, that you will have two useless compares and conditional jumps over the number of instructions required to just set max and min to the first entry.

Another speed up, especially for huge arrays of large numbers, would be to just save the index (or pointer to the value) of the max and/or min value, and then only save the value when the end of the array was found (or first entry if scanning in reverse).

Dave.

dedndave

Quote from: KeepingRealBusy on June 15, 2012, 01:16:42 PMI think the opposite should be true: preset min to -FLT_MAX and max to +FLT_MIN...

i think you have that backwards
if you set the "current min" to +FLT_MAX, any value will replace it
but, pre-loading the first value to initialize min and max seems like a better way to go

KeepingRealBusy

I just knew that I would screw it up. But saving the first as min and max is correct.

Dave.

jj2007

Quote from: KeepingRealBusy on June 15, 2012, 01:16:42 PM
Another speed up .. would be to just save the index (or pointer to the value) of the max and/or min value, and then only save the value when the end of the array was found (or first entry if scanning in reverse).

Dave.

Against what would you compare each element then?

MichaelW

I converted qword's code to a procedure and applied Dave's modifications (except the last) to my code, and did a cycle count comparison.

;==============================================================================
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm
;==============================================================================
.data
      array real4 -8.8, -3.9, 111.5, 0.5, 3.6, 1.2, 4.9, 9.9, -98.2, 0.0
      r4    real4 ?
      r8    real8 ?
      r8min real8 ?
      r8max real8 ?
.code
;==============================================================================

fltmax proc p:dword, n:dword
    local _max:real4
    mov ecx, n
    dec ecx
    mov edx, p
    fld real4 ptr [edx+ecx*4]
    fstp _max
  L0:
    fld real4 ptr [edx+ecx*4]
    fld _max
    fcomip st, st(1)
    ja  L1
    fst _max
  L1:
    fstp st
    sub ecx, 1
    jnz L0
    fld _max
    ret
fltmax endp

;==============================================================================

fltmin proc p:dword, n:dword
    local _min:real4
    mov ecx, n
    dec ecx
    mov edx, p
    fld real4 ptr [edx+ecx*4]
    fstp _min
  L0:
    fld real4 ptr [edx+ecx*4]
    fld _min
    fcomip st, st(1)
    jb  L1
    fst _min
  L1:
    fstp st
    sub ecx, 1
    jnz L0
    fld _min
    ret
fltmin endp

;==============================================================================

fltminmax proc p:dword, n:dword
    mov edx, p
    mov ecx, 1
    mov eax, n
    dec eax
    fld REAL4 ptr [edx]
    fld st
    .while ecx < eax
        fld REAL4 ptr [edx+ecx*8]
        fxch st(1)
        fcomi st,st(1)
        fcmovnbe st,st(1)
        fxch st(2)
        fcomi st,st(1)
        fcmovb st,st(1)
        fstp st(1)
        fxch
        lea ecx,[ecx+1]
    .endw
    ret
fltminmax endp
;fstp r8Min
;fstp r8Max

;==============================================================================
start:
;==============================================================================
    invoke GetCurrentProcess
    invoke SetProcessAffinityMask, eax, 1

    invoke fltmin, addr array, lengthof array
    fstp r8
    printf("%.1f\t",r8)
    invoke fltmax, addr array, lengthof array
    fstp r8
    printf("%.1f\n",r8)
    invoke fltminmax, addr array, lengthof array
    fstp r8min
    fstp r8max
    printf("%.1f\t%.1f\n\n", r8min, r8max)

    invoke Sleep, 4000

    counter_begin 10000000, REALTIME_PRIORITY_CLASS
    counter_end
    printf("%d cycles\n", eax)

    counter_begin 10000000, REALTIME_PRIORITY_CLASS
        invoke fltmin, addr array, lengthof array
        fstp r8
    counter_end
    printf("%d cycles\n", eax)

    counter_begin 10000000, REALTIME_PRIORITY_CLASS
        invoke fltmin, addr array, lengthof array
        fstp r8
    counter_end
    printf("%d cycles\n", eax)

    counter_begin 10000000, REALTIME_PRIORITY_CLASS
        invoke fltminmax, addr array, lengthof array
        fstp r8
        fstp r8
    counter_end
    printf("%d cycles\n\n", eax)

    inkey
    exit
;==============================================================================
end start

Typical on a P3:

0 cycles
90 cycles
90 cycles
90 cycles

Typical on a P4 (Northwood):

0 cycles
53 cycles
49 cycles
152 cycles


For my code I suspect that retaining the working min and max values in the FPU and would be faster, but I can't as yet see any way to do that without a substantial increase in the number of FPU instructions.

Another possibility that I played with, that I think should be much faster, is to do the comparisons entirely with integer instructions using the methods described here:

http://www.cygnus-software.com/papers/comparingfloats/Comparing%20floating%20point%20numbers.htm



Well Microsoft, here's another nice mess you've gotten us into.

RuiLoureiro

MichaelW,
          See your code. You dont test fltmax but
          fltmin 2 times
          Meanwhile
          Here is the results on my P4 3GHz

-98.2   111.5
-98.2   111.5

0 cycles
106 cycles
105 cycles
247 cycles

jj2007

Attached See reply #42 for a version including MichaelW's fltminmax (with a tiny correction: fld REAL4 ptr [edx+ecx*4]).

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
Getting min & max for 10000000 REAL8 values, version B:
66868 µs for FPU
50010 µs for ArrayMinMax, REAL4
56519 µs for ArrayMinMax, REAL8
29226 µs for SSE2, REAL8
59826 µs for fltminmax

66922 µs for FPU
50006 µs for ArrayMinMax, REAL4
56466 µs for ArrayMinMax, REAL8
29395 µs for SSE2, REAL8
59925 µs for fltminmax

66974 µs for FPU
50069 µs for ArrayMinMax, REAL4
56522 µs for ArrayMinMax, REAL8
29286 µs for SSE2, REAL8
60750 µs for fltminmax

Results:
ArrayMinMax=    -888.887715/999.998946
r4MinMax=       -888.887756/999.998779
r8MinMax=       -888.887786/999.998822
SSE2Min=        -888.887910/999.998720

dedndave

prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
Getting min & max for 10000000 REAL8 values, version B:
103484 µs for FPU
28553 µs for ArrayMinMax, REAL4
46333 µs for ArrayMinMax, REAL8
27718 µs for SSE2, REAL8
102134 µs for fltminmax

104330 µs for FPU
25788 µs for ArrayMinMax, REAL4
44979 µs for ArrayMinMax, REAL8
27742 µs for SSE2, REAL8
101988 µs for fltminmax

106032 µs for FPU
28141 µs for ArrayMinMax, REAL4
47526 µs for ArrayMinMax, REAL8
27644 µs for SSE2, REAL8
101950 µs for fltminmax

RuiLoureiro

Here is Mymin and Mymax. I explained what i did

Here the results:
-198.4  421.7
-198.4  421.7
-198.4  302223021437139660000000000000000.0

0 cycles
658 cycles      ; Mymin
673 cycles      ; Mymax
854 cycles      ; fltmin
751 cycles      ; fltmax
5957 cycles

Press any key to continue ...
Quote
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
Mymin   proc    p:dword, n:dword

        mov     ecx, [esp+8]    ;n
        sub     ecx, 1
        mov     edx, [esp+4]    ;p
        fld     real4 ptr [edx+ecx*4]       ; set st(0) to MIN value
        sub     ecx, 1                      ; point to the next value
  L0:
        fld     real4 ptr [edx+ecx*4]
        fcomi   st, st(1)                   ; compare st(1)=MIN with st(0)
        jae     L1
        fstp    st(1)                       ; remove the MIM, st(0) is the MIN
        sub     ecx, 1
        jns     L0                          ; if ecx>0 or ecx=0 loop to L0
        ret     8
  L1:   fstp    st
        sub     ecx, 1
        jns     L0                          ; if ecx>0 or ecx=0 loop to L0
        ret     8
Mymin   endp
;...***---
Mymax   proc    p:dword, n:dword

        mov     ecx, [esp+8]    ;n
        sub     ecx, 1
        mov     edx, [esp+4]    ;p
        fld     real4 ptr [edx+ecx*4]       ; set st(0) to MAX value
        sub     ecx, 1                      ; point to the next value
       
  L0:
        fld     real4 ptr [edx+ecx*4]       ; point to the next value
        fcomi   st, st(1)                   ; compare st(1)=MAX with st(0)
        jbe     L1
       
        fstp    st(1)                       ; remove the MAx, st(0) is the MAX
        sub     ecx, 1
        jns     L0                          ; if ecx>0 or ecx=0 loop to L0
        ret     8               
  L1:
        fstp    st
        sub     ecx, 1
        jns     L0                          ; if ecx>0 or ecx=0 loop to L0
        ret     8
Mymax   endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef


;==============================================================================
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm
;==============================================================================
lenarray    equ 90
.data
align 4
      array  real4 -8.8, -3.9, 111.5, 0.5, 3.6, 1.2, 4.9, 9.9, -98.2, 0.0
      array1 real4 -8.7, -3.5, 121.5, 1.5, 8.6, 2.3, 5.9, 19.9, -198.3, 1.0
      array2 real4 -6.7, -2.5, 221.5, 4.5, 7.6, 1.3, 6.9, 0.9, -97.4, 2.0
      array3 real4 -8.9, -3.5, 121.5, 3.5, -2.6, 1.5, 5.9, 1.9, -98.4, 1.1
      array4 real4 -5.7, -1.5, 421.7, 1.6, 2.6, 4.3, 15.9, 7.9,-198.4, 1.2
      array5 real4 -8.7, -3.5, 121.5, 1.5, 8.6, 2.3, 5.9, 19.9, -198.3, 1.0
      array6 real4 -6.7, -2.5, 221.5, 4.5, 7.6, 1.3, 6.9, 0.9, -97.4, 2.0
      array7 real4 -8.9, -3.5, 121.5, 3.5, -2.6, 1.5, 5.9, 1.9, -98.4, 1.1
      array8 real4 -5.7, -1.5, 421.7, 1.6, 2.6, 4.3, 15.9, 7.9,-198.4, 1.2
     
      r4    real4 ?
      r8    real8 ?
      r8min real8 ?
      r8max real8 ?
.code
;==============================================================================
;==============================================================================
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
Mymin   proc    p:dword, n:dword

        mov     ecx, [esp+8]    ;n
        sub     ecx, 1
        mov     edx, [esp+4]    ;p
        fld     real4 ptr [edx+ecx*4]       ; set st(0) to MIN value
        sub     ecx, 1                      ; point to the next value
  L0:
        fld     real4 ptr [edx+ecx*4]
        fcomi   st, st(1)                   ; compare st(1)=MIN with st(0)
        jae     L1
        fstp    st(1)                       ; remove the MIM, st(0) is the MIN
        sub     ecx, 1
        jns     L0                          ; if ecx>0 or ecx=0 loop to L0
        ret     8
  L1:   fstp    st
        sub     ecx, 1
        jns     L0                          ; if ecx>0 or ecx=0 loop to L0
        ret     8
Mymin   endp

Mymax   proc    p:dword, n:dword

        mov     ecx, [esp+8]    ;n
        sub     ecx, 1
        mov     edx, [esp+4]    ;p

        fld     real4 ptr [edx+ecx*4]       ; set st(0) to MAX value
        sub     ecx, 1                      ; point to the next value
       
  L0:
        fld     real4 ptr [edx+ecx*4]       ; point to the next value
        fcomi   st, st(1)                   ; compare st(1)=MAX with st(0)
        jbe     L1
       
        fstp    st(1)                       ; remove the MAx, st(0) is the MAX
        sub     ecx, 1
        jns     L0                          ; if ecx>0 or ecx=0 loop to L0
        ret     8               
  L1:
        fstp    st
        sub     ecx, 1
        jns     L0                          ; if ecx>0 or ecx=0 loop to L0
        ret     8
Mymax   endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

fltmax proc p:dword, n:dword
    local _max:real4
    mov ecx, n
    dec ecx
    mov edx, p
    fld real4 ptr [edx+ecx*4]
    fstp _max
    ;dec ecx
  L0:
    fld real4 ptr [edx+ecx*4]
    fld _max
    fcomip st, st(1)
    ja   L1
    ;jae  L1
    fst _max
  L1:
    fstp st
    sub ecx, 1
    jnz L0                          ; jnz ? IF ecx = 0 we must compare !
    fld _max
    ret
fltmax endp

;==============================================================================

fltmin proc p:dword, n:dword
    local _min:real4
    mov ecx, n
    dec ecx
    mov edx, p
    fld real4 ptr [edx+ecx*4]
    fstp _min
    ;dec ecx
  L0:
    fld real4 ptr [edx+ecx*4]
    fld _min
    fcomip st, st(1)
    jb  L1
    ;jbe L1
    fst _min
  L1:
    fstp st
    sub ecx, 1
    jnz L0                          ; jnz ? IF ecx = 0 we must compare !
    fld _min
    ret
fltmin endp

;==============================================================================

fltminmax proc p:dword, n:dword
    mov edx, p
    mov ecx, 1
    mov eax, n
    dec eax
    fld REAL4 ptr [edx]
    fld st
    .while ecx < eax
        fld REAL4 ptr [edx+ecx*8]
        fxch st(1)
        fcomi st,st(1)
        fcmovnbe st,st(1)
        fxch st(2)
        fcomi st,st(1)
        fcmovb st,st(1)
        fstp st(1)
        fxch
        lea ecx,[ecx+1]
    .endw
    ret
fltminmax endp
;fstp r8Min
;fstp r8Max

;==============================================================================
start:
;==============================================================================
    invoke GetCurrentProcess
    invoke SetProcessAffinityMask, eax, 1

    invoke Mymin, addr array, lenarray     ;lengthof array
    fstp r8
    printf("%.1f\t",r8)
    invoke Mymax, addr array, lenarray     ;lengthof array
    fstp r8
    printf("%.1f\n",r8)
;----------------------------------------
    invoke fltmin, addr array, lenarray     ;lengthof array
    fstp r8
    printf("%.1f\t",r8)
    invoke fltmax, addr array, lenarray     ;lengthof array
    fstp r8
    printf("%.1f\n",r8)
    invoke fltminmax, addr array, lenarray  ;lengthof array
    fstp r8min
    fstp r8max
    printf("%.1f\t%.1f\n\n", r8min, r8max)

    invoke Sleep, 4000

    counter_begin 10000000, REALTIME_PRIORITY_CLASS
    counter_end
    printf("%d cycles\n", eax)
;------------------------------
    counter_begin 10000000, REALTIME_PRIORITY_CLASS
        invoke Mymin, addr array, lenarray     ;lengthof array
        fstp r8
    counter_end
    printf("%d cycles\n", eax)

    counter_begin 10000000, REALTIME_PRIORITY_CLASS
        invoke Mymax, addr array, lenarray     ;lengthof array
        fstp r8
    counter_end
    printf("%d cycles\n", eax)
;--------------------------------
    counter_begin 10000000, REALTIME_PRIORITY_CLASS
        invoke fltmin, addr array, lenarray     ;lengthof array
        fstp r8
    counter_end
    printf("%d cycles\n", eax)

    counter_begin 10000000, REALTIME_PRIORITY_CLASS
        invoke fltmin, addr array, lenarray     ;lengthof array
        fstp r8
    counter_end
    printf("%d cycles\n", eax)

    counter_begin 10000000, REALTIME_PRIORITY_CLASS
        invoke fltminmax, addr array, lenarray  ;lengthof array
        fstp r8
        fstp r8
    counter_end
    printf("%d cycles\n\n", eax)


    inkey
    exit
;==============================================================================
end start


jj2007

Quote from: RuiLoureiro on June 15, 2012, 11:24:48 PM
Here is Mymin and Mymax.

Thanks, added to testbed:
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
Getting min & max for 10000000 REAL8 values, version B:
67004 µs for FPU
49927 µs for ArrayMinMax, REAL4
56748 µs for ArrayMinMax, REAL8
29678 µs for SSE2, REAL8
59847 µs for fltminmax
20046 µs for Mymin
28936 µs for Mymin
50009 µs for Mymin+Mymax

66939 µs for FPU
50074 µs for ArrayMinMax, REAL4
56960 µs for ArrayMinMax, REAL8
29261 µs for SSE2, REAL8
60256 µs for fltminmax
20170 µs for Mymin
29049 µs for Mymin
48802 µs for Mymin+Mymax

Results (numbers are always a bit different because the pseudo random generator produces a new set every time):
ArrayMinMax=    -888.887786/999.998822
r4MinMax=       -888.887695/999.998962
r4MinMax=       -888.887634/999.998535
r8MinMax=       -888.887485/999.998558
SSE2Min=        -888.887614/999.998955

RuiLoureiro

 :biggrin:
Jochen,
        Thanks for Mymin and Mymax  ;)
        You are too fast !
       
       
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
Getting min & max for 10000000 REAL8 values, version B:
121636 µs for FPU
43742 µs for ArrayMinMax, REAL4
61251 µs for ArrayMinMax, REAL8
27210 µs for SSE2, REAL8
117418 µs for fltminmax
28273 µs for Mymin
28194 µs for Mymin
52216 µs for Mymin+Mymax

116571 µs for FPU
60052 µs for ArrayMinMax, REAL4
93684 µs for ArrayMinMax, REAL8
38531 µs for SSE2, REAL8
113058 µs for fltminmax
27492 µs for Mymin
24843 µs for Mymin
50303 µs for Mymin+Mymax

Results:
ArrayMinMax=    -888.887786/999.998822
r4MinMax=       -888.887695/999.998962
r4MinMax=       -888.887634/999.998535
r8MinMax=       -888.887485/999.998558
SSE2Min=        -888.887614/999.998955

dedndave

Quote from: jj2007 on June 15, 2012, 11:36:13 PMnumbers are always a bit different because the pseudo random generator produces a new set every time

seems like they ought to test the same array   :redface: