FPU use, C vs MASM

Started by KeepingRealBusy, December 26, 2012, 01:38:53 PM

KeepingRealBusy

I have recently converted ENT to handle huge files. With such files, it became too slow, so I re-implemented it in MASM.

I created several test files to check the handling of the various ENT options for folding UPPER case to lower case (but as ISO), and used an early version of ENT as a test file (117 MB; it would not change with any new additions made while debugging ENT, so the base test case stayed constant across executable versions). I created a large batch check function to test ENT vs MASMENT for all of the test files with all of the applicable options, then ran my DIFF against the ENT/MASM files for each test case to see if the results were the same. I found and fixed various trivial errors, primarily line spacing, so that the files compared with no differences other than the start and end times.

I was looking at the results for executing against ENT.EXE as a data file. There were differences in the Monte Carlo PI simulation value compared with PI. I printed out several hundred lines of the intermediate values used to create the MCPI for both the ENT and MASMENT executions. The problem was that ENT was coming up with more points in the circle than my MASM version (the same number of input bytes was reported by both programs). I toyed around with this for a bit, and finally decided that the problem may have been rounding errors (the problem did not occur with the very short test files, but was present in the 117 MB ent.exe test file). I even went so far as to change the ENT code to use my MASM code for the calculations (enclosing the code in an __asm { } block in the C program and commenting out the C code). The problem still persisted.

I was not sure about C I/O, especially using a C executable as a data file, especially since the supplied source files came up with warnings about "unknown publisher". I did not know whether any of my problems were due to C reading the file differently (it was opened as binary read and also forced to binary in case this was a console input execution). I changed the test file to use my MASMENT executable (33 MB). This resulted in matching output, but the file was much smaller than the 117 MB ENT executable test file. So I copied the MASMENT executable 4 times to create a 131 MB file. This larger file also had the count differences. Note, the input data BYTE counts matched, the code to create the test was a copy of my MASM code, but it still came up mismatched.

I guess I will have to create a file in the Monte Carlo function, write out the bytes used for each coordinate pair (just 6 BYTE blocks from the file, 3 BYTES for each x or y coordinate), write out any spare bytes (fewer than 6) at the end of the input processing, then close the file. Then I can use FC to compare the data against the test file to see where the files start differing. I don't know what else I can do - the intermediate data is 4 lines for each coordinate pair for 131 KB/6 pairs - too much to look through manually.

Any thoughts?

Dave.

MichaelW

FWIW, I was able to compile ent.c and the supporting modules (all with no changes) to a Win32 console app with the Microsoft VC Toolkit 2003 compiler, using this batch file:

set PATH=C:\Program Files\Microsoft Visual C++ Toolkit 2003\bin;%PATH%
set INCLUDE=C:\Program Files\Microsoft SDK\include;C:\Program Files\Microsoft Visual C++ Toolkit 2003\include;%INCLUDE%
set LIB=C:\Program Files\Microsoft SDK\lib;C:\Program Files\Microsoft Visual C++ Toolkit 2003\lib;%LIB%

cl /D _MSDOS /W4 ent.c iso8859.c randtest.c

pause


And with the original test, defined in entest.bat and using entitle.gif as the target, it produced exactly the same results as the original DOS app.

Next I'm going to remove the iso8859 stuff and try doing my own MASM version.


Well Microsoft, here's another nice mess you've gotten us into.

jj2007

Dave,

That sounds like a very thorough and tedious job you have done, so don't get offended if what I suggest is so obvious that you've done it already:

The FPU hardware is the same in both cases, the data formats, too, I suppose. So it could be FPU settings. Precision is max, I assume, so at REAL10 it may produce such rare differences if the rounding bit is set to near or down. It may even change over the course of a long complex program, if the compiler feels like doing that. So one could try to introduce breakpoints in various stages and check the FPU control word - Olly is quite handy for that.

I have no better ideas, sorry - and anyway I am curious to see this project working :t

Jochen

MichaelW

After I modified ent.c to display the value of the FPU control word at the start of main I got 27Fh. After I added:

__asm finit

above the code that stored and displayed the control word, I got 37Fh. The finit changed the precision control from 10b to 11b, as expected, but did not change the rounding control. With the finit the results for the original test were exactly the same as they were without the finit. AFAIK the only FPU operations that are affected by the precision control are sin, cos, reciprocal, and reciprocal square root. In the C source I see a likely reciprocal and none of the others, but I'm not sure what operations the pow function, called to initialize the Monte Carlo in-circle distance, performs.

And thinking more about it, that statement seems a likely candidate for a translation error.

incirc = pow(pow(256.0, (double) (MONTEN / 2)) - 1, 2.0);


Dave,

If you provide your code for that statement, or any other suspect statement, I can probably include it as inline assembly and test it against the original statement.


KeepingRealBusy

Michael and Jochen

Here is what I see so far, but I may be doing something wrong in the C code; my MASM code works for this debug function.


        The indented code is from my modified ENT.C code
The left justified code is my additions to try to copy the input to the created
file "entdbg.dbg"

I can create the file (I started with an __asm int 3 in the fopen code), but my
fputc fails (I take the int 3).

What is the value of EOF? I will be writing characters such as '\0' (a null)
among others.

All I know is that without the int 3 in the fputc, ENT runs to completion, but
does not produce the output file "entdbg.dbg".

My masm version correctly executes, copies the input data to masmdbg.dbg and
this file matches my input file (FC reports no errors for a binary compare).

long long totalc = 0;       /* Total character count */ /* note that this is modified to hold huge counts */

int DebugMode = TRUE;

FILE *fp = stdin;
FILE *fd;

           if ((fp = fopen(argv[optind], "rb")) == NULL) {
              printf("Cannot open file %s\n", argv[optind]);
      return 2;
   }

if (DebugMode) {
   if ((fd = fopen("entdbg.dbg", "wb")) == NULL) {
      printf("Cannot create file entdbg.dbg\n");
      return 2;
   }
}

   ocb = (unsigned char) oc;
if (DebugMode) {
   if (fputc(ocb, fd) != EOF) {
      __asm int 3
   }
}

fclose(fp);
if (DebugMode) {
   __asm int 3
   fclose(fd);
}



Dave

KeepingRealBusy

Quote from: MichaelW on December 26, 2012, 04:57:29 PM

Michael,

I will look over your code, maybe I'll find out what I am doing wrong.

Concerning the ISO code. What a learning experience. It turns out that C is using a big endian bit map and you need to modify the bitmap array values (except for the 255's) if you want to use an ASM BT instruction (which expects a little endian array). The end points are incorrect, but you do not want to reverse the order of the bits, just shift them left for the front of a string, and right at the end of the string.

Dave.

KeepingRealBusy

Quote from: jj2007 on December 26, 2012, 07:58:37 PM

Jochen,

The reason I mentioned precision is that ENT is not using real10, it is using doubles. My MASM version does use tbytes (real10).

I would like to post my version, but I am uncertain about the legalities.

Dave.

KeepingRealBusy

Quote from: KeepingRealBusy on December 27, 2012, 11:54:46 AM

I found my error in this code:


if (DebugMode) {
   if (fputc(ocb, fd) != EOF) {
      __asm int 3
   }
}


should be:


if (DebugMode) {
   if (fputc(ocb, fd) == EOF) {     /* this is the change, == and not != */
      __asm int 3
   }
}


Now the ENT program runs to completion, and the debug file matches the input file, but the original program output still has a different in_circle count.

Note: I was writing the file immediately after the fgetc, before it was passed to randtest.c. I will have randtest.c pass the character back and then write it out to the debug file and see if it is being modified in randtest.c.

Dave.

KeepingRealBusy

Michael, Jochen,

I moved the fputc code to follow the call to randtest.c. Still runs to completion and writes the debug file and the debug file matches the input file, but the normal output results still show a difference in the in_circle counts:


------------------------------------------------------------------ 2
The current time is: 20:28:26.20
-------------------------------------------------------------- 2
The current time is: 20:28:26.29
------------------------------------------------------------------ 5
                            inmont = 15564
                            mcount = 22187
      (((double) inmont) / mcount) =  0.701492
(4.0*(((double) inmont) / mcount)) =  2.805967
-------------------------------------------------------------- 5
------------------------------------------------------------------ 13
of this 133124 byte file by 39 percent.
-------------------------------------------------------------- 9
of this 133124 byte file by 40 percent.
------------------------------------------------------------------ 19
Monte Carlo value for Pi is 2.805967458 (error 10.68 percent).
-------------------------------------------------------------- 15
                            inmont = 15604
                            mcount = 22187
      (((double) inmont) / mcount) =  0.703295
(4.0*(((double) inmont) / mcount)) =  2.813179
Monte Carlo value for Pi is 2.813178889 (error 10.45 percent).
------------------------------------------------------------------ 22
The current time is: 20:28:26.27
-------------------------------------------------------------- 22
The current time is: 20:28:26.55


Also note that the percentage difference differs 39 vs 40. This appears on several of the test cases and I can say that this is definitely a rounding problem by the C version because I took a debug break at the printf time and the MASM version calculation showed 39.7 which was rounded to 40 for output. The C code runs off to some dll code to do some sse manipulations, and probably does not round but instead it truncates.

Remember, the calculations for Monte Carlo are now being made in integer calculations, identical to my MASM code version, so rounding should not be a problem.

Enough for tonight.

PS: Michael,  My C compile was with VS 2008, in case that makes any difference.

Dave.

jj2007

Quote from: KeepingRealBusy on December 27, 2012, 12:18:12 PM
The reason I mentioned precision is that ENT is not using real10, it is using doubles. My MASM version does use tbytes (real10).

There you have one obvious source of differences. Your Masm version is "better" but different...
Try what happens if you use REAL8 in Masm, too.

KeepingRealBusy

Jochen,

I tried that, but it didn't help. The reason is the optimizing compiler. The compiler sets up strings of FPU instructions with few saves and loads between segments of processing. In that case, the FPU is using real10 values for all intermediate calculations anyway. However, the Monte Carlo calculations were separate and had saves into real8.

Trying to read the .cod file to see what the compiler was doing was interesting, to say the least. The .cod file is a sight to CAUSE sore eyes, not a sight FOR sore eyes:


$LN32@main:
; Line 186
  0020a 8b 5c 24 44 mov ebx, DWORD PTR _fp$[esp+2240]
  0020e 68 00 80 00 00 push 32768 ; 00008000H
  00213 53 push ebx
  00214 e8 00 00 00 00 call __fileno
  00219 83 c4 04 add esp, 4
  0021c 50 push eax
  0021d e8 00 00 00 00 call __setmode
; Line 189
  00222 8b 7c 24 34 mov edi, DWORD PTR _binary$[esp+2248]
  00226 83 c4 08 add esp, 8
  00229 c7 44 24 3c 00
00 00 00 mov DWORD PTR _samp$[esp+2240], OFFSET ??_C@_03EDMBFDDG@bit?$AA@
  00231 3b fe cmp edi, esi
  00233 75 08 jne SHORT $LN53@main
  00235 c7 44 24 3c 00
00 00 00 mov DWORD PTR _samp$[esp+2240], OFFSET ??_C@_04IHGKJMLH@byte?$AA@
$LN53@main:
; Line 190


And the VS 2008 compiler no longer supports adding the source code lines to the .cod file; at best you get a line number, and not all of them, and not in order.

Dave.

jj2007

Quote from: KeepingRealBusy on December 28, 2012, 02:13:47 AM
the FPU is using real10 values for all intermediate calculations anyway

Which are affected by the precision set:

include \masm32\MasmBasic\MasmBasic.inc        ; download
        Init

        finit
        mLoops=25
        FpuSet MbNear64
        push 12345
        fild dword ptr [esp]
        fldpi
        fld st(1)
        REPEAT mLoops
                fmul st, st(1)
        ENDM
        deb 4, "Near64", ST(0), ST(1), ST(2)

        finit
        FpuSet MbNear24
        push 12345
        fild dword ptr [esp]
        fldpi
        fld st(1)
        REPEAT mLoops
                fmul st, st(1)
        ENDM
        deb 4, "Near24", ST(0), ST(1), ST(2)

        Inkey
        Exit
end start

Output:
Near64
ST(0)           33131256869752803.4
ST(1)           3.14159265358979324
ST(2)           12345.0000000000000

Near24
ST(0)           33131259609743360.0
ST(1)           3.14159265358979324
ST(2)           12345.0000000000000


MichaelW

I see now that in my post above I made a conceptual error, and it's not the first time for this particular error:

Quote
AFAIK the only FPU operations that are affected by the precision control are sin, cos, reciprocal, and reciprocal square root.

The correct concept is:

The only FPU operations where the execution speed is affected by the precision control are sin, cos, reciprocal, and reciprocal square root.

That said, in my tests I got the same results whether the precision was 53 bits or 64 bits.


dedndave

i suppose it takes slightly longer to transfer 10 bytes, rather than 8   :P
of course, you can use real8's even when in 64-bit (80-bit) precision

KeepingRealBusy

To all,

The problem was mine (I will not display the stupid dumb error, but it was fun finding). I now match the counts from ENT.

Still have rounding errors. I deleted all my debug output and added an extra value (floating) where the rounding error occurred in both the masm and ent versions:


------------------------------------------------------------------ 15
of this 6 byte file by 91 (91.87) percent.
-------------------------------------------------------------- 15
of this 6 byte file by 92 (91.87) percent.


This appears to happen when a floating value is cast as a (short):


           printf("of this %ld %s file by %d percent.\n\n", totalc, samp,
    (short) ((100 * ((binary ? 1 : 8) - ent) /
      (binary ? 1.0 : 8.0))));


I will try to get around the error in ENT by manually adding 0.5 to the floating value before the cast.

There is another rounding difference when outputting the counts:


------------------------------------------------------------------ 77
97   a                             2   0.007813
98   b                             2   0.007813
99   c                             2   0.007813
100   d                             2   0.007813
101   e                             2   0.007813
102   f                             2   0.007813
103   g                             2   0.007813
104   h                             2   0.007813
105   i                             2   0.007813
106   j                             2   0.007813
107   k                             2   0.007813
108   l                             2   0.007813
109   m                             2   0.007813
110   n                             2   0.007813
111   o                             2   0.007813
112   p                             2   0.007813
113   q                             2   0.007813
114   r                             2   0.007813
115   s                             2   0.007813
116   t                             2   0.007813
117   u                             2   0.007813
118   v                             2   0.007813
119   w                             2   0.007813
120   x                             2   0.007813
121   y                             2   0.007813
122   z                             2   0.007813
-------------------------------------------------------------- 77
97   a                             2  0.007812
98   b                             2  0.007812
99   c                             2  0.007812
100   d                             2  0.007812
101   e                             2  0.007812
102   f                             2  0.007812
103   g                             2  0.007812
104   h                             2  0.007812
105   i                             2  0.007812
106   j                             2  0.007812
107   k                             2  0.007812
108   l                             2  0.007812
109   m                             2  0.007812
110   n                             2  0.007812
111   o                             2  0.007812
112   p                             2  0.007812
113   q                             2  0.007812
114   r                             2  0.007812
115   s                             2  0.007812
116   t                             2  0.007812
117   u                             2  0.007812
118   v                             2  0.007812
119   w                             2  0.007812
120   x                             2  0.007812
121   y                             2  0.007812
122   z                             2  0.007812


I have not added the code to output the float with an extra digit to check for rounding errors in both versions. This is my next task.

About real10 vs real8. I think a timing test will show that it takes longer with real8 because the FPU must translate it to a real10 internally. With real10 you have alignment problems which can slow down the process, unless you use an ALIGN 16 in front of each real10 definition (I even use OWORDS in my arrays where each OWORD contains a real10).

Dave.