Author Topic: fanny test  (Read 550 times)

TimoVJL

  • Member
  • ***
  • Posts: 476
fanny test
« on: September 23, 2019, 06:52:10 PM »
when-is-assembly-faster-than-c
Code: [Select]
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
// https://stackoverflow.com/questions/577554/when-is-assembly-faster-than-c
float KahanSum(const float *data, int n)
{
float sum = 0.0f, C = 0.0f, Y, T;

for (int i = 0; i < n; i++)
{
Y = *data++ - C;
T = sum + Y;
C = T - sum - Y;
sum = T;
}

return sum;
}

float AsmSum(const float *data, int n)
{
float result = 0.0f;

_asm
{
mov esi,data
mov ecx,n
fldz
fldz
l1:
fsubr dword ptr [esi]
add esi,4
fld st(0)
fadd st(0),st(2)
fld st(0)
fsub st(0),st(3)
fsub st(0),st(2)
fstp st(2)
fstp st(2)
loop l1
fstp result
fstp result
}
return result;
}

int main(int c, char **a)
{
int count = 1000000;
float *source = malloc(count * sizeof(float));

for (int i = 0; i < count; ++i)
source[i] = (float)(rand()) / (float)(RAND_MAX);

LARGE_INTEGER start, mid, end;
float sum1 = 0.0f, sum2 = 0.0f;
QueryPerformanceCounter(&start);
sum1 = KahanSum(source, count);
QueryPerformanceCounter(&mid);
sum2 = AsmSum(source, count);
QueryPerformanceCounter(&end);
printf("  C code: %.1f in %u\n", sum1, (mid.QuadPart - start.QuadPart));
printf("asm code: %.1f in %u\n", sum2, (end.QuadPart - mid.QuadPart));
return 0;
}
Code: [Select]
using System;
using System.Diagnostics;
// c:\Windows\Microsoft.NET\Framework\v2.0.50727\csc.exe KahanSumCS.cs

class KahanSumCS
{

static float KahanSum(ref float[] data, int n)
{
float sum = 0.0f, C = 0.0f, Y, T;
for (int i = 0; i < n; i++)
{
Y = data[i] - C;
T = sum + Y;
C = T - sum - Y;
sum = T;
}
return sum;
}

static void Main()
{
Stopwatch sw;
float sum;
int count = 1000000;
float[] arr = new float[count];
Random rnd = new Random();
for (int i = 0 ; i < count ; ++i)
arr[i] = (float)rnd.NextDouble();
sw = Stopwatch.StartNew();
sum = KahanSum(ref arr, count);
sw.Stop();
long ticks = sw.ElapsedTicks;
Console.WriteLine(" C# code: {0} in {1}", sum, ticks);
}
}
results:
Pelles C
Code: [Select]
  C code: 499699.1 in 463469
asm code: 499699.1 in 200123
msvc 2019:
Code: [Select]
  C code: 500136.8 in 147582
asm code: 500136.8 in 200744
Code: [Select]
C# code: 499816 in 157791 :rolleyes:
May the source be with you

AW

  • Member
  • *****
  • Posts: 2435
  • Let's Make ASM Great Again!
Re: fanny test
« Reply #1 on: September 23, 2019, 10:15:55 PM »
C# is becoming so fast   
C# code: 500158.6 in 28429

MSVC :badgrin:
  C code: 500136.8 in 77576
asm code: 500136.8 in 36740

LLVM
  C code: 500136.8 in 36896
asm code: 500136.8 in 34181

Caché GB

  • Member
  • **
  • Posts: 89
  • MASM IS HOT
Re: fanny test
« Reply #2 on: September 23, 2019, 10:27:57 PM »
https://gcc.godbolt.org/

x86 msvc v19.22   -O2 -Gv  __vectorcall calling convention. (x86 and x64 only)

So thay want to compare hypersonic missiles:-

Code: [Select]
float KahanSum(float const *,int) PROC ; KahanSum, COMDAT
; _data$ = ecx
; _n$ = edx
  push esi
  xor esi, esi
  xorps xmm4, xmm4
  xorps xmm3, xmm3
  cmp edx, 4
  jl $LC12@KahanSum
  lea eax, DWORD PTR [edx-4]
  shr eax, 2
  inc eax
  lea esi, DWORD PTR [eax*4]
$LL13@KahanSum:
  movss xmm1, DWORD PTR [ecx]
  subss xmm1, xmm3
  movaps xmm2, xmm1
  addss xmm2, xmm4
  movaps xmm0, xmm2
  subss xmm0, xmm4
  movaps xmm4, xmm2
  subss xmm0, xmm1
  movss xmm1, DWORD PTR [ecx+4]
  subss xmm1, xmm0
  addss xmm4, xmm1
  movaps xmm0, xmm4
  subss xmm0, xmm2
  movss xmm2, DWORD PTR [ecx+8]
  subss xmm0, xmm1
  movaps xmm1, xmm4
  subss xmm2, xmm0
  addss xmm4, xmm2
  movaps xmm0, xmm4
  subss xmm0, xmm1
  movaps xmm1, xmm4
  subss xmm0, xmm2
  movss xmm2, DWORD PTR [ecx+12]
  add ecx, 16 ; 00000010H
  subss xmm2, xmm0
  addss xmm4, xmm2
  movaps xmm3, xmm4
  subss xmm3, xmm1
  subss xmm3, xmm2
  sub eax, 1
  jne SHORT $LL13@KahanSum
$LC12@KahanSum:
  cmp esi, edx
  jge SHORT $LN16@KahanSum
  sub edx, esi
 ;; npad 7
$LC8@KahanSum:
  movss xmm1, DWORD PTR [ecx]
  add ecx, 4
  subss xmm1, xmm3
  movaps xmm0, xmm1
  addss xmm0, xmm4
  movaps xmm3, xmm0
  subss xmm3, xmm4
  movaps xmm4, xmm0
  subss xmm3, xmm1
  sub edx, 1
  jne SHORT $LC8@KahanSum
$LN16@KahanSum:
  movaps xmm0, xmm4
  pop esi
  ret 0
float KahanSum(float const *,int) ENDP ; KahanSum

with supersonic missiles.

Code: [Select]
float AsmSum(float const *,int) PROC ; AsmSum, COMDAT

  push ebp
  mov ebp, esp
  sub esp, 12 ; 0000000cH
  push esi
  mov DWORD PTR _n$[ebp], edx
  mov DWORD PTR _data$[ebp], ecx
  mov DWORD PTR _result$[ebp], 0
  mov esi, DWORD PTR _data$[ebp]
  mov ecx, DWORD PTR _n$[ebp]

  fldz
  fldz
$l1$4:
  fsubr DWORD PTR [esi]
  add esi, 4
  fld ST(0)
  fadd ST(0), ST(2)
  fld ST(0)
  fsub ST(0), ST(3)
  fsub ST(0), ST(2)
  fstp ST(2)
  fstp ST(2)
  loop $l1$4
  fstp DWORD PTR _result$[ebp]
  fstp DWORD PTR _result$[ebp]
  movss xmm0, DWORD PTR _result$[ebp]
  pop esi
  mov esp, ebp
  pop ebp
  ret 0

float AsmSum(float const *,int) ENDP ; AsmSum

And then convince themselves "The compiler is king"


assembly programmers do not beat this horse to death, c++ ers do.

Now I could write a very long post about the difference between
real time software and event driven software, but that would
be too C++ ish.
Caché GB's 1 and 0-nly language:MASM

AW

  • Member
  • *****
  • Posts: 2435
  • Let's Make ASM Great Again!
Re: fanny test
« Reply #3 on: September 23, 2019, 10:52:28 PM »
When the article was written MSVC x86 C defaulted to x87 Floating Point.
SIMD instructions are indeed twice as faster in this case, so reviewed for MSVC 2019:
  C code: 500136.8 in 37074
asm code: 500136.8 in 35449

C compiler with SIMD instructions almost as fast as ASM with x87 Floating Point. Of course, all this can be optimized.

nidud

  • Member
  • *****
  • Posts: 1799
    • https://github.com/nidud/asmc
Re: fanny test
« Reply #4 on: September 24, 2019, 12:01:17 AM »
 :biggrin:

Lets at least try to make a realistic measure of both code size and speed and see how this plays out. We have here two similar projects derived from the same source base. One written using the latest version of Microsoft 32-bit C/C++ Optimizing Compiler and the other written (mostly) in assembler.

uasm32 - 723K
asmc32 - 363K

Lets add MASM as well and see how they perform.
asm.bat:

@echo off
%1 -c -nologo \masm32\m32lib\*.asm > NUL

timeit.bat:

ti asm \masm32\bin\ml.exe >> result.txt
echo ..
del *.obj
ti asm \uasm\uasm32.exe >> result.txt
echo ..
del *.obj
ti asm \asmc\bin\asmc.exe >> result.txt
echo ..
del *.obj
echo. >> result.txt
type result.txt
pause

This produce the following result:

13447 ClockTicks: asm \masm32\bin\ml.exe
 6864 ClockTicks: asm \uasm\uasm32.exe
 3494 ClockTicks: asm \asmc\bin\asmc.exe

There are many ways to optimize C++ code and the same obviously also apply to assembler, so you find good and badly written code in both of these languages. The main difference is that the assembler wont optimize badly written code but the compiler have this ability.

The samples given in the link will not be optimized just by converting small parts of it to asm, so there are no magic involved. In most cases the compiler will do same, or in the way the actual test is laid out it will probably be worse (alignment/SIMD/FPU).

This means you have to write the code yourself as oppose to just translate the C code line-by-line if you plan on squeeze more juice out of it.

Example:

unsigned char FFMul(unsigned char a, unsigned char b) {
   unsigned char aa = a, bb = b, r = 0, t;
   while (aa != 0) {
      if ((aa & 1) != 0)
         r = r ^ bb;
      t = bb & 0x80;
      bb = bb << 1;
      if (t != 0)
         bb = bb ^ 0x1b;
      aa = aa >> 1;
   }
   return r;
}

The first ting you notice here is the test and the shift of aa may be done at the same time:

    shr aa,1
    .if CARRY?

The same also apply to bb and given it's a byte you may just skip the loop all together and end up with something like this:

    xor     eax,eax
    mov     dl,a
    mov     dh,b
  repeat 7
    mov     ecx,eax
    xor     cl,dl
    shr     dh,1
    cmovc   eax,ecx
    jz      toend
    shl     dl,1
    .ifc
        xor dl,0x1b
    .endif
    endm
toend:
    ret

Well, you may even optimize it more but this is what the compiler will do:

_a$ = 8    ; size = 1
_b$ = 12  ; size = 1
;unsigned char FFMul(unsigned char,unsigned char); FFMul, COMDAT
        push    ebx
        mov     bl, BYTE PTR _a$[esp]
        xor     ah, ah
        mov     bh, BYTE PTR _b$[esp]
        test    bl, bl
        je      SHORT $LN11@FFMul
        ;npad    1
        nop
$LL2@FFMul:
        mov     al, ah
        mov     dl, bl
        xor     al, bh
        and     dl, 1
        movzx   ecx, al
        movzx   eax, ah
        cmove   ecx, eax
        shr     bl, 1
        mov     ah, cl
        mov     cl, bh
        add     cl, cl
        mov     al, cl
        movzx   ecx, cl
        xor     al, 27 ; 0000001bH
        test    bh, bh
        movzx   edx, al
        cmovns  edx, ecx
        mov     bh, dl
        test    bl, bl
        jne     SHORT $LL2@FFMul
$LN11@FFMul:
        mov     al, ah
        pop     ebx
        ret     


Intel(R) Core(TM) i5-6500T CPU @ 2.50GHz (AVX2)
----------------------------------------------
-- test(0)
    16975 cycles, rep(1000), code(141) 0.asm: FFMul - asm
    84800 cycles, rep(1000), code( 69) 1.asm: FFMul - C++
-- test(1)
    16970 cycles, rep(1000), code(141) 0.asm: FFMul - asm
    85597 cycles, rep(1000), code( 69) 1.asm: FFMul - C++
-- test(2)
    17396 cycles, rep(1000), code(141) 0.asm: FFMul - asm
    85647 cycles, rep(1000), code( 69) 1.asm: FFMul - C++
-- test(3)
    16547 cycles, rep(1000), code(141) 0.asm: FFMul - asm
    85049 cycles, rep(1000), code( 69) 1.asm: FFMul - C++

total [0 .. 3], 1++
    67888 cycles 0.asm: FFMul - asm
   341093 cycles 1.asm: FFMul - C++

So, smaller code not necessarily faster, assembly code not necessarily better than optimized HLL but you need to test it to find out.

TimoVJL

  • Member
  • ***
  • Posts: 476
Re: funny test
« Reply #5 on: September 24, 2019, 12:01:47 AM »
sources + bins
May the source be with you

nidud

  • Member
  • *****
  • Posts: 1799
    • https://github.com/nidud/asmc
Re: fanny test
« Reply #6 on: September 24, 2019, 03:15:00 AM »
Code: [Select]
float KahanSum(const float *data, int n)
{
float sum = 0.0f, C = 0.0f, Y, T;

for (int i = 0; i < n; i++)
{
Y = *data++ - C;
T = sum + Y;
C = T - sum - Y;
sum = T;
}

return sum;
}

From the looks of it C will always be zero here.

T = sum + Y;
C = T - sum - Y;
...
C = sum + Y - sum - Y;
...
KahanSum proc fp:ptr, n:sdword
    xorps xmm0,xmm0
    .for ( edx = fp, ecx = n : ecx : ecx--, edx += 4 )
        addss xmm0,[edx]
    .endf
    ret
KahanSum endp


Well, maybe I missing something here but then again it would be strange if the compiler didn't pick up on this or maybe that's the funny part.

AW

  • Member
  • *****
  • Posts: 2435
  • Let's Make ASM Great Again!
Re: fanny test
« Reply #7 on: September 24, 2019, 04:06:03 AM »
Well, maybe I missing something here but then again it would be strange if the compiler didn't pick up on this or maybe that's the funny part.

Yes, you are missing something  (that is, you forgot to see how KahanSum works) and that's the funny part :biggrin:.

nidud

  • Member
  • *****
  • Posts: 1799
    • https://github.com/nidud/asmc
Re: fanny test
« Reply #8 on: September 24, 2019, 06:46:14 AM »
Yes, floats and a large array. At first glance I was thinking maybe this was done on purpose to test the compiler but I guess not.

It is however technically correct if the values are within limits, so if you use a smaller array size or doubles the result should be the same.
Code: [Select]
double KahanSum(const double *data, int n)
{
double sum = 0.0f, C = 0.0f, Y, T;

for (int i = 0; i < n; i++)
{
Y = *data++ - C;
T = sum + Y;
C = T - sum - Y;
sum = T;
}

return sum;
}

double KahanSum2(const double *data, int n)
{
double sum = 0.0f;

for (int i = 0; i < n; i++)
{
sum += *data++;
}
return sum;
}

int main(int c, char **a)
{
int count = 1000000;
double *source = malloc(count * sizeof(double));

for (int i = 0; i < count; ++i)
source[i] = (double)(rand()) / (double)(RAND_MAX);

LARGE_INTEGER start, mid, end;
double sum1 = 0.0f, sum2 = 0.0f;
QueryPerformanceCounter(&start);
sum1 = KahanSum(source, count);
QueryPerformanceCounter(&mid);
sum2 = KahanSum2(source, count);
QueryPerformanceCounter(&end);
printf("  sum1: %.1f in %u\n", sum1, (mid.QuadPart - start.QuadPart));
printf("  sum2: %.1f in %u\n", sum2, (end.QuadPart - mid.QuadPart));
return 0;
}

  sum1: 499538.5 in 44493
  sum2: 499538.5 in 8287

AW

  • Member
  • *****
  • Posts: 2435
  • Let's Make ASM Great Again!
Re: fanny test
« Reply #9 on: September 24, 2019, 06:08:20 PM »
From your example:
   printf("  sum1: %.17g in %u\n", sum1, (mid.QuadPart - start.QuadPart));
   printf("  sum2: %.17g in %u\n", sum2, (end.QuadPart - mid.QuadPart));

Output:
  sum1: 500136.81582079531 in 31403
  sum2: 500136.81582080509 in 7872

That is, you continue to miss the point:

"The Kahan summation algorithm is a method of summing a series of numbers represented in a limited precision in a way that minimises the loss of precision in the result."

Try to get the idea here:
https://rosettacode.org/wiki/Kahan_summation#C


nidud

  • Member
  • *****
  • Posts: 1799
    • https://github.com/nidud/asmc
Re: fanny test
« Reply #10 on: September 25, 2019, 01:44:05 AM »
That is, you continue to miss the point:

 :biggrin:

If you analyze the code objectively (without using Wikipedia that is) it's still just an erroneous function which adds up random numbers. Given the purpose of the Wiki algorithm is to significantly reduces the numerical error in the total obtained by adding a sequence of numbers, admittedly finite-precision floating-point in this case, the leeway for optimizing is rather limited unless you break the above rules.

  Sum1: 499538.5312500000000000 in 27139
  sum2: 499535.1562500000000000 in 7955
             3.3750000000000000 - float
  sum1: 499538.5450193875003606 in 26153
  sum2: 499538.5450193666620180 in 7549
             0.0000000208383426 - double

So you end up with something like this:
Code: [Select]
KahanSum proc fp:ptr, n:sdword
    pxor xmm1,xmm1
    xorps xmm0,xmm0
    .for ( edx = fp, ecx = n : ecx : ecx--, edx += 4 )
        movss xmm2,[edx] ; Y = data[i] - C;
        subss xmm2,xmm1
        movss xmm3,xmm0  ; T = sum + Y;
        addss xmm3,xmm2
        movss xmm1,xmm3  ; C = T - sum - Y;
        subss xmm1,xmm0
        subss xmm1,xmm2
        movss xmm0,xmm3  ; sum = T;
    .endf
    ret
KahanSum endp

total [0 .. 2], 1++
   392045 cycles 0.asm: - C++
   415142 cycles 1.asm: - asm
Quote
The samples given in the link will not be optimized just by converting small parts of it to asm, so there are no magic involved. In most cases the compiler will do same, or in the way the actual test is laid out it will probably be worse (alignment/SIMD/FPU).

AW

  • Member
  • *****
  • Posts: 2435
  • Let's Make ASM Great Again!
Re: fanny test
« Reply #11 on: September 25, 2019, 04:52:39 AM »
I am not particularly interested in the Kahan summation algorithm. Anyway, another round.

On the first intervention for this thread, you jumped in with some fantastic optimization results (see below) that only God knows the magic powders you give ASMC for such an astonishing performance.

Quote
13447 ClockTicks: asm \masm32\bin\ml.exe
 6864 ClockTicks: asm \uasm\uasm32.exe
 3494 ClockTicks: asm \asmc\bin\asmc.exe

Later, here you come again to tell that everybody has been distracted all these years because you found:

Quote
From the looks of it C will always be zero here


So, you reinvented a KahanSum2 function which uses 5 times less cycles than the original one:

Quote
double KahanSum2(const double *data, int n)
{
   double sum = 0.0f;

   for (int i = 0; i < n; i++)
   {
      sum += *data++;
   }
   return sum;
}

Now, you are recommending that we should "analyze the code objectively (without using Wikipedia that is)". So we are using a Wiki algorithm not an algorithm invented by a guy that received the Turing Award.

What is the next chapter?

nidud

  • Member
  • *****
  • Posts: 1799
    • https://github.com/nidud/asmc
Re: fanny test
« Reply #12 on: September 25, 2019, 07:20:03 AM »
I am not particularly interested in the Kahan summation algorithm.

I never heard about it before you posted the link but it is indeed an interesting concept.

Quote
Anyway, another round.

When-is-assembly-faster-than-c:
I guess you will stick to your usual argument: Never!

Quote
On the first intervention for this thread, you jumped in with some fantastic optimization results (see below)

Thanks I guess.

Quote
that only God knows the magic powders you give ASMC for such an astonishing performance.

Sorry to disappoint you but as I pointed out there is no magic involved. It's just assembly but thanks for the compliment(s).

Quote
Later, here you come again to tell that everybody has been distracted all these years because you found:

 :biggrin:

Well, what I actually said was:
Quote
From the looks of it C will always be zero here


Which is objectively true if no error occur, but boy was I wrong about that one! After scrolling down seeing what fed into the function, realizing the whole concept was erroneous and Mr. Kahan being all cool about it and all: Great stuff  :thumbsup:

Quote
So, you reinvented a KahanSum2 function which uses 5 times less cycles than the original one:

Yes. Come to think of it, with the right type of assembler you could actually extend this to quadruple precision, but I guess Mr. Kahan wouldn't be to cool about that thought, given that would remove all the errors.

Quote
Now, you are recommending that we should

I didn't actually address you in plural there.

Quote
"analyze the code objectively (without using Wikipedia that is)".

Yes, that's what I did.

Quote
So we are using a Wiki algorithm not an algorithm invented by a guy that received the Turing Award. What is the next chapter?

You sort of lost me there but if Mr. Kahan have received an award: good on him  :thumbsup:

AW

  • Member
  • *****
  • Posts: 2435
  • Let's Make ASM Great Again!
Re: fanny test
« Reply #13 on: October 13, 2019, 05:31:44 AM »
Here we perform the KahanSum test using the Intel SPMD Program Compiler and compare with the C/C++ results.
Note that I am showing the full precision of the sums under both alternatives to confirm that results are exactly the same.

Impressive  :thumbsup:

Code: [Select]
32-bit application
  KahanSum C/C++:                        500136.8158207953093 in 3.754 ms
  KahanSum Intel SPMD Program Compiler:  500136.8158207953093 in 0.5272 ms
 
64-bit application
  KahanSum C/C++:                        500136.8158207953093 in 3.827 ms
  KahanSum Intel SPMD Program Compiler:  500136.8158207953093 in 0.4221 ms 

Full source code and .exes, built with VS 2019.