FputoString format

NoCforMe · May 28, 2025, 11:19:08 AM

Quote from: guga on May 28, 2025, 10:06:06 AM
Take a look at Raymond's FpuFLtoA here. I don't know of any faster way to convert these values (which are packed), stored in a TenByte, than his method (which I adapted to work with SSE2 only on this specific part of the code).

So again, at the risk of sounding like a broken record^[1]:
Who cares how fast the conversions are?

[1] Probably not meaningful to those who've never heard a 78 rpm record with a crack ...

guga · May 28, 2025, 11:51:52 AM

No problem, but...Why would I take Raymond's function and make it slower? What's the point of using the function he created using a specific opcode for this type of conversion (fbstp) and making the result dozens of times slower? If you don't use fbstp, the alternative would be to convert each byte to decimal with opcodes like div that would loop up to 10 times to transform each byte into decimal ascii. Do you see how much slower this would make the function? Why would I ruin his code like this, if there is already a specific opcode for this that allows you to do the conversion all at once?

The function I created (adapted from Raymond's) is for general use. I can't assume that someone will use it solely and exclusively to put text (Float ascii, in fact) in an edit box. If the function is for general use, it has to be effective for other uses, especially if the user uses it for large databases, for example. Otherwise, it would be easier to just use printf to generate the float ascii.

The core of the function is already done. Why would I change it and make it slower? There is no harm in making the function faster for general uses.
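
The trade-off described above can be sketched in C. This is only an illustration, not Raymond's or guga's code: the function names are made up for the example, and the ten-byte layout assumed is the packed BCD format that fbstp stores (18 digits in bytes 0 to 8, two per byte, least significant byte first, sign in bit 7 of byte 9). The first function does one division per digit; the second only unpacks nibbles.

```c
#include <stddef.h>
#include <stdint.h>

/* Per-digit approach: one division by 10 for every decimal digit.
   out must have room for 21 bytes. Returns the string length. */
static size_t u64_to_ascii_div(uint64_t value, char *out)
{
    char tmp[20];                       /* a uint64_t has at most 20 digits */
    size_t n = 0;
    do {
        tmp[n++] = (char)('0' + value % 10);
        value /= 10;
    } while (value != 0);
    for (size_t i = 0; i < n; i++)      /* digits were produced in reverse */
        out[i] = tmp[n - 1 - i];
    out[n] = '\0';
    return n;
}

/* Packed BCD approach: the digits already exist, two per byte, so the
   conversion is only shifts and masks. Leading zeros are kept.
   out must have room for 20 bytes. Returns the string length. */
static size_t bcd_to_ascii(const uint8_t bcd[10], char *out)
{
    size_t n = 0;
    if (bcd[9] & 0x80)                  /* sign bit of the ten-byte value */
        out[n++] = '-';
    for (int i = 8; i >= 0; i--) {      /* most significant byte first */
        out[n++] = (char)('0' + (bcd[i] >> 4));
        out[n++] = (char)('0' + (bcd[i] & 0x0F));
    }
    out[n] = '\0';
    return n;
}
```

With the bytes 34h, 12h followed by eight zero bytes, bcd_to_ascii writes the 18 digits ending in 1234, while u64_to_ascii_div(1234, ...) needs four divisions to write "1234".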

NoCforMe · May 28, 2025, 12:21:35 PM

I'm not saying, obviously, that you should make it slower. That would be exceedingly stupid.

I'm just saying you shouldn't obsess over how fast it is. Of course you're going to code it in an efficient way that'll probably be at least as fast as Raymond's code (which BTW isn't particularly fast) or faster.

Besides, with your 0.1% of the share of the assembler "market", I'm not sure your efforts make all that much difference anyway.

fearless · May 28, 2025, 06:49:14 PM

https://www.youtube.com/watch?v=kw-U6smcLzk

Siekmanski · May 28, 2025, 11:25:44 PM

Hi all,

I wrote this routine 7 years ago because I needed 256 real4 values at once at 60 Hz in real time on my screen.
This SIMD routine is 28 times faster than sprintf for real4 values on my old PC.

Edit: latest version


align 4
Real4_2_ASCII proc Real4string:DWORD,floatnumber:REAL4

    mov         ecx,Real4string
    mov         eax,floatnumber
    test        eax,eax
    je          message_PosZero
    cmp         eax,080000000h
    je          message_NegZero
    
    ; check floating-point exceptions
    cmp         eax,07F7FFFFFh
    ja          TestExceptions

ProcessReal4:    
    mov         byte ptr [ecx],020h     ; Write " " to the string
    test        eax,80000000h           ; Check the sign bit
    jz          No_SignBit
    mov         byte ptr [ecx],02dh     ; write "-" character to the string
    and         floatnumber,7FFFFFFFh   ; Make it an absolute value
    and         eax,7FFFFFFFh           ; Remove the sign bit
No_SignBit:

    ; Fast Log10(x)-1 routine to calculate the number of digits
    shr         eax,23                  ; Get the 8bit exponent
    sub         eax,127                 ; Adjust for the exponent bias
    cvtsi2ss    xmm0,eax                ; Convert int32 to real4
    mulss       xmm0,Log10_2            ; Approximate Log10(x) == Log10(2) * exponent bits ==  0.30102999566398119 * exponent bits
    addss       xmm0,PowersOfTen[37*4]  ; Add one to get the approximated number of digits from the floating point value
    cvtss2si    eax,xmm0                ; Convert real4 to int32
    mov         ecx,eax                 ; Save approximated number of digits
    mov         edx,38+1                ; Highest possible number of digits + 1
    add         eax,edx                 ; Get the Power Of Ten offset for the digits rounding check

    ; Now do the check to get the exact rounded number of digits from the floating point value
    ; We can do this by comparing it to the closest Power Of Ten below the floating point value  
    movss       xmm0,floatnumber
    comiss      xmm0,PowersOfTen[eax*4-4]
    jc          ExactLog10xMin1         ; Is it below the closest Power Of Ten?
    cmp         ecx,edx                 ; It is above, also check the approximated number of digits
    je          ExactLog10xMin1         ; Is it not above the highest possible number of digits skip adjustment
    dec         edx                     ; Adjust the number of digits by subtracting one
ExactLog10xMin1:                        ; Now we are almost done getting the exact number of digits
                                        ; There is one exception, the lowest Power Of Ten check value is out of range ( 1.0E+39 )
                                        ; See the last added value in the PowersOfTen table, it's used for the out of range check
    sub         edx,ecx                 ; edx holds the offsets for the PowersOfTen table and the scientific notation string table
    mulss       xmm0,PowersOfTen[edx*4] ; Get the calculated Power Of Ten value and multiply it with the floating point value
    comiss      xmm0,PowersOfTen[38*4]  ; Compare to 10.0
    jnc         ExactNumDigits          ; Is it below 10.0?
    inc         edx                     ; It is below, adjust the offset for the scientific notation string
    mulss       xmm0,PowersOfTen[38*4]  ; Adjust decimal position ( it also solves the out of range issue )
ExactNumDigits:                         ; At this point we have the exact number of digits from the floating point value
    mulss       xmm0,PowersOfTen[42*4]  ; Get the 7 significant digits from the range -1.175494E-38 to 3.402823E+38
    cvtss2si    eax,xmm0                ; We want a Natural Number
    cvtsi2ss    xmm0,eax                ; So, remove the digits after the decimal point

    shufps      xmm0,xmm0,0             ; Splat...., make 4 copies from the real4 number
    movaps      xmm1,xmm0               ; Copy to a total of 8 copies
    mulps       xmm0,dividers           ; Produce base10 numbers
    mulps       xmm1,dividers+16        ; Produce base10 numbers
    movaps      xmm2,xmm0               ; Copy them
    movaps      xmm3,xmm1               ; Copy them
    mulps       xmm2,div10              ; Nullify least significant base10 numbers
    mulps       xmm3,div10              ; Nullify least significant base10 numbers
    cvttps2dq   xmm0,xmm0               ; Truncate remaining fractions
    cvttps2dq   xmm1,xmm1               ; Truncate remaining fractions
    cvtdq2ps    xmm0,xmm0               ; Convert back to real4
    cvtdq2ps    xmm1,xmm1               ; Convert back to real4
    cvttps2dq   xmm2,xmm2               ; Truncate remaining fractions
    cvttps2dq   xmm3,xmm3               ; Truncate remaining fractions
    cvtdq2ps    xmm2,xmm2               ; Convert back to real4
    cvtdq2ps    xmm3,xmm3               ; Convert back to real4
    mulps       xmm2,mul10              ; Move them back in the correct base10 position 
    mulps       xmm3,mul10              ; Move them back in the correct base10 position 
    subps       xmm0,xmm2               ; Subtract to get the extracted digits
    subps       xmm1,xmm3               ; Subtract to get the extracted digits
    cvttps2dq   xmm0,xmm0               ; Convert back to int32
    cvttps2dq   xmm1,xmm1               ; Convert back to int32

    shufps      xmm0,xmm0,11100001b     ; Swap the 2 first digits, the first digit is always zero
                                        ; Now we can write the decimal point for free ( no more memory swaps )
                                        ; Using a prepared ASCIIconverter constant

    packssdw    xmm0,xmm1               ; Pack 8 x 32bit to 8 x 16bit ( signed but, we are within the limit )
    packuswb    xmm0,xmm0               ; Pack 8 x 16bit to 16 x 8bit unsigned
    movq        xmm1,ASCIIconverterE    ; Prepared to insert a decimal point and convert to ASCII in one go
    paddb       xmm1,xmm0               ; Convert the number to ASCII

    mov         edx,Scientific_sz[edx*4]; Get the 4 byte scientific notation string
    mov         ecx,Real4string

    movq        qword ptr [ecx+1],xmm1  ; Write the 7 significant digits
    mov         [ecx+9],edx             ; Write the scientific notation string
;    mov         byte ptr [ecx+13],0     ; Terminate the string ( not needed we are inside the 16 bytes )
    ret

TestExceptions:
    cmp         eax,07F800000h
    je          message_Inf
    cmp         eax,07F800001h
    je          message_SNaN
    cmp         eax,07FBFFFFFh
    je          message_SNaN
    cmp         eax,07FC00000h
    je          message_QNaN
    cmp         eax,07FFFFFFFh
    je          message_QNaN
    cmp         eax,0FFC00001h
    je          message_QnegNaN
    cmp         eax,0FFBFFFFFh
    je          message_SnegNaN
    cmp         eax,0FF800001h
    je          message_SnegNaN
    cmp         eax,0FFC00000h
    je          message_Indeterm
    cmp         eax,0FF800000h
    je          message_NegInf
    cmp         eax,0FFFFFFFFh
    je          message_QnegNaN
    jmp         ProcessReal4            ; No exceptions found, proceed...
message_QnegNaN:
    movaps  xmm0,oword ptr szQnegNaN
    movaps  oword ptr [ecx],xmm0
    ret
message_SnegNaN:
    movaps  xmm0,oword ptr szSnegNaN
    movaps  oword ptr [ecx],xmm0
    ret
message_Indeterm:
    movaps  xmm0,oword ptr szIndeterm
    movaps  oword ptr [ecx],xmm0
    ret
message_NegInf:
    movaps  xmm0,oword ptr szNegInf
    movaps  oword ptr [ecx],xmm0
    ret
;message_NegNorm:
;    movaps  xmm0,oword ptr szNegNorm
;    movaps  oword ptr [ecx],xmm0
;    ret
;message_Norm:
;    movaps  xmm0,oword ptr szNorm
;    movaps  oword ptr [ecx],xmm0
;    ret
message_Inf:
    movaps  xmm0,oword ptr szInf
    movaps  oword ptr [ecx],xmm0
    ret
message_SNaN:
    movaps  xmm0,oword ptr szSNaN
    movaps  oword ptr [ecx],xmm0
    ret
message_QNaN:
    movaps  xmm0,oword ptr szQNaN
    movaps  oword ptr [ecx],xmm0
    ret
message_PosZero:
    movaps  xmm0,oword ptr szPosZero
    movaps  oword ptr [ecx],xmm0
    ret
message_NegZero:
    movaps  xmm0,oword ptr szNegZero
    movaps  oword ptr [ecx],xmm0
    ret

Real4_2_ASCII endp
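
For readers who do not want to trace the SSE code, the idea behind the "Fast Log10(x)-1" step and the 7-digit scaling can be sketched in scalar C. This is a simplified illustration under stated assumptions, not a translation of the routine: pow10_int is a hypothetical helper standing in for the PowersOfTen table, the arithmetic is done in double rather than real4, the input must be a positive normal float, and the estimate is corrected with a floor plus one comparison instead of the routine's rounding and table offsets.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: 10^n as a double (stands in for a lookup table). */
static double pow10_int(int n)
{
    double p = 1.0;
    for (int i = 0; i < (n < 0 ? -n : n); i++)
        p *= 10.0;
    return n < 0 ? 1.0 / p : p;
}

/* floor(log10(x)) for a positive, normal float, estimated from the
   binary exponent and corrected with a single comparison. */
static int decimal_exponent(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    uint32_t biased = (bits >> 23) & 0xFFu;
    assert((bits >> 31) == 0 && biased != 0 && biased != 255);

    int e2 = (int)biased - 127;             /* remove the exponent bias */
    double t = e2 * 0.30102999566398119;    /* log10(2) * binary exponent */
    int e10 = (int)t;                       /* truncates toward zero */
    if (t < e10)
        e10--;                              /* turn truncation into floor */

    /* x lies in [2^e2, 2^(e2+1)), so the true result is e10 or e10 + 1. */
    if ((double)x >= pow10_int(e10 + 1))
        e10++;
    return e10;
}

/* Seven significant digits as an integer in [1000000, 9999999];
   *exp10 receives the exponent for the d.dddddde+XX notation. */
static int32_t seven_digits(float x, int *exp10)
{
    int e10 = decimal_exponent(x);
    long digits = (long)((double)x / pow10_int(e10 - 6) + 0.5);
    if (digits == 10000000L) {              /* rounding carried into an 8th digit */
        digits = 1000000L;
        e10++;
    }
    *exp10 = e10;
    return (int32_t)digits;
}
```

For the largest real4 value this gives the digits 3402823 with exponent 38, which corresponds to the 3.402823e+38 string format the routine writes. Zero, infinities, NaNs and denormals are outside the scope of this sketch; the routine above routes some of them to the message_ labels.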

guga · May 29, 2025, 12:53:32 AM

Thanks a lot, Siekmanski.

I'll take a look.

Quote

64 bit exp: 4 float: 1.000000e+004 power10: 1.000000e+002 result: 1000000.000000

64 bit exp: 0 float: 1.000000e-045

mxcsr_register: 1FA2h exp: 0 mantissa: 00000001h
PowerOfTen: 1.401298e-045 00000001h

SIMD Real4 to ASCII conversion by Siekmanski 2018.

1000000 calls per Run for the Cycle counter and the Routine timer.

AMD Ryzen 5 2400G with Radeon Vega Graphics

Routine timers running now....

Real4_2_ASCII Cycles: 86 RoutineTime: 0.021150000 seconds
sprintf Cycles: 1960 RoutineTime: 0.554756400 seconds

Result Real4_2_ASCII: 3.402823e+38
Result sprintf : 3.402823e+038

Press any key to continue...

Btw...This article is really interesting

daydreamer · May 29, 2025, 12:09:15 PM

Great code Siekmanski

The MASM Forum
