Author Topic: High-accuracy floating point arithmetic  (Read 8592 times)

Gunther

  • Member
  • *****
  • Posts: 3515
  • Forgive your enemies, but never forget their names
High-accuracy floating point arithmetic
« on: July 24, 2012, 09:45:17 AM »
Given are 5 REAL4 numbers, organized in an array. There are 6 arrays with the same elements, but in different element order:

A = (1e20,17.0,-10.0,130.0,-1e20)
B = (1e20,-10.0,130.0,-1e20,17.0)
C = (1e20,17.0,-1e20,-10.0,130.0)
D = (1e20,-10.0,-1e20,130.0,17.0)
E = (1e20,-1e20,17.0,-10.0,130.0)
F = (1e20,17.0,130.0,-1e20,-10.0)

The task is: add the array elements and calculate the array sum. For that easy example, we don't need a computer or a pocket calculator; a bit mental arithmetic leads to 137 as the right result. Please note also that no number has a fractional part, so classical rounding errors can't occur. Nevertheless, the entire floating point calculation with REAL4, REAL8, or REAL10 values will crash and produce garbage.

I've realized the floating point calculation with the FPU under PowerBASIC 3.5 (the DOS compiler). But it should also compile with PB 3.0. It's simple 16 bit real mode code; therefore I did the upload in the 16 bit sub-forum and not in the PB forum.

The left column of the screen shot above shows the classic floating point results. Only the sum of array E is correct; the rest is junk. The reason for the calculation crash is the special data constellation. I've dozens of such ill conditioned examples; a normal naive floating point computation would fail in any case.

The right column shows the right results with the extra long dot product accumulator. The software is at this point in experimental stage, nothing is finished or fixed. The basic idea comes from the old desk calculators, for example the model MONROMATIC ASMD from 1956 produced by Monroe Calculating Machine Company, New Jersey, USA.

This desk calculator did addition, subtraction, multiplication, division and the accumulation of partial results in a long accumulator. In other words: high-accuracy arithmetic.

I'm using now REAL4 numbers and nothing else - no higher precision, no BCD arithmetic, no big number library etc. But for the multiplication the software uses a long accumulator of 48 bits for the product mantissa. For details of the REAL4 data layout, please check Raymond Filiatreault's excellent site: http://www.website.masmforum.com/tutorials/fptute/. Thank you, Raymond. The addition results are accumulated in the dot product accumulator with a length of (theoretical) 556 bits. For design reasons my accumulator is 576 bits long (72 bytes). So, there are 20 bits (2.5 bytes) for temporary overflows at the top. Furthermore, there are 3 extra bytes at the tail for the handling of underflows.

I did the same thing 30 years ago; I've found my old Z80 sources. I used this algorithms and converted the code to the 8086 platform for usage with PowerBASIC. With a lot bit witchcraft and many debugging sessions: that's the result. The software calculates sums and the dot product in high accuracy - at the moment only for REAL4 values. It handles underflows and overflows properly. It has several rounding methods: round to nearest, round to infinity, round up, round down, and intervall rounding - that's for intervall mathematics.

At the moment I'm developing a 64 bit version. I think that's the future and one has larger and more registers. That's more comfortable. If this is done, the 32 bit version will follow.

Please don't download capture.zip and monroe.zip; these are only the images above. The software archive for download is exact.zip.

Gunther                             
Get your facts first, and then you can distort them.

qWord

  • Member
  • *****
  • Posts: 1454
  • The base type of a type is the type itself
    • SmplMath macros
Re: High-accuracy floating point arithmetic
« Reply #1 on: July 24, 2012, 10:16:22 AM »
I'm not a BASIC expert, but I would say that the following loop is invalid:
Code: [Select]
  s = 0
  FOR i = 1 TO m
    s = s + V(i) 'sum up
  NEXT i
AFAIKS this loop always skip the first Element in current array and instead use the first one from the next array.
Shouldn't it be:
Code: [Select]
  s = 0
  FOR i = 0 TO 4
    s = s + V(i) 'sum up
  NEXT i
MREAL macros - when you need floating point arithmetic while assembling!

Gunther

  • Member
  • *****
  • Posts: 3515
  • Forgive your enemies, but never forget their names
Re: High-accuracy floating point arithmetic
« Reply #2 on: July 24, 2012, 10:29:22 AM »
Hi qWord,

I'm not a BASIC expert, but I would say that the following loop is invalid:
Code: [Select]
  s = 0
  FOR i = 1 TO m
    s = s + V(i) 'sum up
  NEXT i
AFAIKS this loop always skip the first Element in current array and instead use the first one from the next array.
Shouldn't it be:
Code: [Select]
  s = 0
  FOR i = 0 TO 4
    s = s + V(i) 'sum up
  NEXT i

No, not really. Please check out the option base statement near the top of the BASIC source. It's not C. I decided to start my array element count with 1. But for the C programmer's which are not so familiar with PowerBASIC, I've included the C source which leads to the same garbage, but with double values (REAL8). Here starts the array count with 0, of course. Please check it out.


Gunther
Get your facts first, and then you can distort them.

Antariy

  • Member
  • ****
  • Posts: 541
Re: High-accuracy floating point arithmetic
« Reply #3 on: July 24, 2012, 10:47:33 AM »
With a lot bit witchcraft and many debugging sessions: that's the result. The software calculates sums and the dot product in high accuracy - at the moment only for REAL4 values. It handles underflows and overflows properly. It has several rounding methods: round to nearest, round to infinity, round up, round down, and intervall rounding - that's for intervall mathematics.

Seems to be a great work, Gunther! :t

Code: [Select]
BASIC results               Extra Long Accumulator results
=============               ==============================

sum A =  0                  sum A =  137
sum B =  17                 sum B =  137
sum C =  120                sum C =  137
sum D =  147                sum D =  137
sum E =  137                sum E =  137
sum F = -10                 sum F =  137

The right result is 137 and nothing else.


Press a key to continue ...

Antariy

  • Member
  • ****
  • Posts: 541
Re: High-accuracy floating point arithmetic
« Reply #4 on: July 24, 2012, 10:52:50 AM »
C version output:

Code: [Select]
Sum 1 (FPU) = 136.00 Sum 1 (XMM) = 0.00
Sum 2 (FPU) = 137.00 Sum 2 (XMM) = 17.00
Sum 3 (FPU) = 136.00 Sum 3 (XMM) = 120.00
Sum 4 (FPU) = 139.00 Sum 4 (XMM) = 147.00
Sum 5 (FPU) = 137.00 Sum 5 (XMM) = 137.00
Sum 6 (FPU) = 134.00 Sum 6 (XMM) = -10.00

Right Sum   = 137.00

Gunther

  • Member
  • *****
  • Posts: 3515
  • Forgive your enemies, but never forget their names
Re: High-accuracy floating point arithmetic
« Reply #5 on: July 24, 2012, 11:20:57 AM »
Hi Alex,

[Seems to be a great work, Gunther! :t

Thank you, Alex. I've burned my midnight oil.

C version output:

Code: [Select]
Sum 1 (FPU) = 136.00 Sum 1 (XMM) = 0.00
Sum 2 (FPU) = 137.00 Sum 2 (XMM) = 17.00
Sum 3 (FPU) = 136.00 Sum 3 (XMM) = 120.00
Sum 4 (FPU) = 139.00 Sum 4 (XMM) = 147.00
Sum 5 (FPU) = 137.00 Sum 5 (XMM) = 137.00
Sum 6 (FPU) = 134.00 Sum 6 (XMM) = -10.00

Right Sum   = 137.00

As you can see: data rubbish. The crucial point is: the 64 bit compilers are using the multi media registers for floating point calculations - not the good old FPU, which has more accuracy with it's internal 80 bit format. That's very dangerous. 

Gunther
Get your facts first, and then you can distort them.

qWord

  • Member
  • *****
  • Posts: 1454
  • The base type of a type is the type itself
    • SmplMath macros
Re: High-accuracy floating point arithmetic
« Reply #6 on: July 24, 2012, 11:33:14 AM »
I like the idea  :t

has your internal FP representation also a exponent?
MREAL macros - when you need floating point arithmetic while assembling!

Gunther

  • Member
  • *****
  • Posts: 3515
  • Forgive your enemies, but never forget their names
Re: High-accuracy floating point arithmetic
« Reply #7 on: July 24, 2012, 12:25:52 PM »
Hi qWord,

I like the idea  :t

Thank you.

has your internal FP representation also a exponent?

Yes, of course. We've the real exponent (E) and the BIAS (7Fh for REAL4), which gives the characteristic C = E + BIAS. That's the pointer into the dot product accumulator.

But to be honest: I wasn't the first with that idea. Although my old Z80 code is from 1981, there are some precursors. First of all, Ulrich Kulisch from the University Kaiserslautern wrote an interesting book about interval mathematics; that's my main theoretical background in this area. He also wrote together with some students the Pascal XSC compiler. Furthermore, the high-accuracy arithmetic subroutine library for the IBM 370 series. Both, Pascal XSC and the IBM library are using similar ideas. Only the 30 year old Z80 code is my product.

The basic idea behind all that jazz is to bring the good old analogue accumulator from the desk calculators into the digital world. Anyway, we're standing on the shoulders of giants. Therefore it's worth the effort to check old algorithms and port those techniques to our modern hardware, especially with the modern compiler design in mind (see my answer to Alex above).

Gunther
Get your facts first, and then you can distort them.

jj2007

  • Member
  • *****
  • Posts: 7542
  • Assembler is fun ;-)
    • MasmBasic
Re: High-accuracy floating point arithmetic
« Reply #8 on: July 24, 2012, 12:29:29 PM »
C version output:

Code: [Select]
Sum 1 (FPU) = 136.00 Sum 1 (XMM) = 0.00
Sum 2 (FPU) = 137.00 Sum 2 (XMM) = 17.00
Sum 3 (FPU) = 136.00 Sum 3 (XMM) = 120.00
Sum 4 (FPU) = 139.00 Sum 4 (XMM) = 147.00
Sum 5 (FPU) = 137.00 Sum 5 (XMM) = 137.00
Sum 6 (FPU) = 134.00 Sum 6 (XMM) = -10.00

Right Sum   = 137.00

First of all: excellent work, Gunther :t

But I stumbled over a problem:

Code: [Select]
Test size is REAL4
Result=136.0000000000000000
Result=137.0000000000000000
Result=136.0000000000000000
Result=139.0000000000000000
Result=137.0000000000000000
Result=134.0000000000000000

Test size is REAL10
Result=136.0000000000000000
Result=137.0000000000000000
Result=136.0000000000000000
Result=139.0000000000000000
Result=137.0000000000000000
Result=134.0000000000000000

I hope you are as surprised as I was there is no difference between REAL4 and REAL10...  but I checked it several times with Olly, and that is the outcome. Now one may ask what C compilers are using ::)

The "surprise" happens below at the "on exit" lines. You can launch the attached exe with int 3 to set the flag for Olly.

include \masm32\MasmBasic\MasmBasic.inc   ; download
.data
A_s   REAL4 1.0e20,17.0,-10.0,130.0,-1.0e20
B_s   REAL4 1.0e20,-10.0,130.0,-1.0e20,17.0
C_s   REAL4 1.0e20,17.0,-1.0e20,-10.0,130.0
D_s   REAL4 1.0e20,-10.0,-1.0e20,130.0,17.0
E_s   REAL4 1.0e20,-1.0e20,17.0,-10.0,130.0
F_s   REAL4 1.0e20,17.0,130.0,-1.0e20,-10.0

A_x   REAL10 1.0e20,17.0,-10.0,130.0,-1.0e20
B_x   REAL10 1.0e20,-10.0,130.0,-1.0e20,17.0
C_x   REAL10 1.0e20,17.0,-1.0e20,-10.0,130.0
D_x   REAL10 1.0e20,-10.0,-1.0e20,130.0,17.0
E_x   REAL10 1.0e20,-1.0e20,17.0,-10.0,130.0
F_x   REAL10 1.0e20,17.0,130.0,-1.0e20,-10.0

   Init
   FpuSet MbNear64
   
   TestSize equ <REAL4>
   mov esi, offset A_s
   mov edi, Instr_(CL$(), "INT 3", 1)   ; set a flag via commandline (case-insensitive)
   REPEAT 2
      tmp$ CATSTR <PrintLine CrLf$, "Test size is >, TestSize, <">
      tmp$
      mov ebx, 6
      .Repeat
         xor ecx, ecx
         fldz
         .if edi
            INT 3   ; greetings to Olly
         .endif
         .Repeat
            imul eax, ecx, TestSize
            fld TestSize ptr [esi+eax]
            faddp
            inc ecx
         .Until ecx>=5

         ; on exit, last element, REAL4:
            ; ST0 valid -1.0000000200408773427e+20
            ; ST1 valid 1.0000000200408773441e+20
            
         ; on exit, last element, REAL10:
            ; ST0 valid -1.000000000000000000e+20
            ; ST1 valid 1.0000000000000000014e+20

         Print Str$("Result=%Jf\n", ST(0))
         fstp st
         add esi, 5*TestSize
         dec ebx
      .Until Zero?
      mov esi, offset A_x
      TestSize equ <REAL10>
   ENDM
   Inkey
   Exit
end start

Antariy

  • Member
  • ****
  • Posts: 541
Re: High-accuracy floating point arithmetic
« Reply #9 on: July 24, 2012, 12:44:35 PM »
I hope you are as surprised as I was there is no difference between REAL4 and REAL10...  but I checked it several times with Olly, and that is the outcome. Now one may ask what C compilers are using ::)

Probably that's because after loading a REAL4 or REAL8 or REAL10 into the FPU, it anyway has internal size of a 80 bits - so, the errors should be the same. The rounding mode and the precision still has its value, though.

Gunther

  • Member
  • *****
  • Posts: 3515
  • Forgive your enemies, but never forget their names
Re: High-accuracy floating point arithmetic
« Reply #10 on: July 24, 2012, 12:56:10 PM »
Hi Jochen,

First of all: excellent work, Gunther :t

Thank you.

But I stumbled over a problem:

Code: [Select]
Test size is REAL4
Result=136.0000000000000000
Result=137.0000000000000000
Result=136.0000000000000000
Result=139.0000000000000000
Result=137.0000000000000000
Result=134.0000000000000000

Test size is REAL10
Result=136.0000000000000000
Result=137.0000000000000000
Result=136.0000000000000000
Result=139.0000000000000000
Result=137.0000000000000000
Result=134.0000000000000000

No, I'm not surprised. None of the array elements has a fractional part. The only difference between REAL4 and REAL10 is in that case the characteristic with a different BIAS. Furthermore, REAL10 hasn't the hidden bit, but never mind, because that must be ored into the REAL4 mantissa. So, the REAL10 mantissa has a lot more zeros - nothing else. But for that example is the 80 bit format (REAL10) not sufficient, although it's bit better as the XMM results. But false is false.

I hope you are as surprised as I was there is no difference between REAL4 and REAL10...  but I checked it several times with Olly, and that is the outcome. Now one may ask what C compilers are using ::)

That depends. In the 32 bit world, FPU code should be the standard, at least with the current gcc version. What VC produces must be tested. In the 64 bit world, both gcc and VC are using XMM registers with that awful results. That's the scandal.

Gunther
Get your facts first, and then you can distort them.