Author Topic: Yeppp! High performance mathematical library  (Read 26389 times)

jj2007

  • Member
  • *****
  • Posts: 12676
  • Assembler is fun ;-)
    • MasmBasic
Re: Yeppp! High performance mathematical library
« Reply #30 on: October 12, 2013, 07:18:17 AM »
This simple demo adds two byte-sized arrays element by element:

include \masm32\MasmBasic\MasmBasic.inc        ; download

.data        ; define two source arrays:
arrs1        db 10, 20, 30
arrs2        db 100, 100, 100

.data?        ; destination array
arrdest        db 3 dup(?)

Enum 0:YepStatusOk, YepStatusNullPointer, YepStatusMisalignedPointer, YepStatusInvalidArgument
Enum YepStatusInvalidData, YepStatusInvalidState, YepStatusUnsupportedHardware, YepStatusUnsupportedSoftware
Enum YepStatusInsufficientBuffer, YepStatusOutOfMemory, YepStatusSystemError, YepStatusAccessDenied

  Init
  Dll ExpandEnv$("%ProgramFiles%\Yeppp!!!! SDK\binaries\windows\x86\yeppp.dll")
  Declare yepCore_Add_V8sV8s_V8s, C:4        ; C calling convention, 4 arguments
  Declare yepLibrary_Init, 0

  void yepLibrary_Init()
  .if yepCore_Add_V8sV8s_V8s(offset arrs1, offset arrs2, offset arrdest, lengthof arrdest)==YepStatusOk
        movzx eax, byte ptr arrdest[0]
        Print Str$("Element 0=\t%i\n", eax)
        movzx eax, byte ptr arrdest[1]
        Print Str$("Element 1=\t%i\n", eax)
        movzx eax, byte ptr arrdest[2]
        Inkey Str$("Element 2=\t%i\n", eax)
  .else
        Inkey Str$("Yeppp!!!! error %i", eax)
  .endif
  Exit
end start


Output:
Element 0=      110
Element 1=      120
Element 2=      130


Here is the innermost loop:
10001E02   8A06       mov al, [esi]
10001E04   8A0C13     mov cl, [edx+ebx]
10001E07   02C8       add cl, al
10001E09   880A       mov [edx], cl
10001E0B   8D76 01    lea esi, [esi+1]
10001E0E   8D52 01    lea edx, [edx+1]
10001E11   4F         dec edi
10001E12   75 EE      jnz short 10001E02


- movzx eax, byte ptr [esi] is often faster than mov al, [esi]
- lea is rarely faster than inc esi
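
For readers who do not read disassembly, the loop shown above appears to compute the same thing as the scalar C loop below: load one byte from each source, add with 8-bit wraparound, store, advance. This is only a sketch reconstructed from the listing, not Yeppp! source code, and the name add_bytes is made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference for an element-wise byte add.
 * Each sum is truncated to 8 bits, like "add cl, al" in the listing. */
void add_bytes(const uint8_t *x, const uint8_t *y, uint8_t *sum, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        sum[i] = (uint8_t)(x[i] + y[i]);   /* wraps modulo 256 */
    }
}
```

Called with the demo arrays {10, 20, 30} and {100, 100, 100}, it fills the destination with 110, 120, 130, matching the output above.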

Marat Dukhan

  • Guest
Re: Yeppp! High performance mathematical library
« Reply #31 on: October 12, 2013, 07:27:40 AM »
That's the approximation error, of course.
No, I use it to bound both the error of approximating a function with a polynomial and the error of evaluating the polynomial in IEEE arithmetic (i.e. roundoff error in polynomial evaluation).

Marat Dukhan

Re: Yeppp! High performance mathematical library
« Reply #32 on: October 12, 2013, 07:34:27 AM »
Yes, a double-double is an unevaluated sum of two doubles. From a high-level perspective you can think of it as a double with a 106-bit mantissa.
Am I right in assuming that this is the method described by T.J. Dekker[1]?
Yes, I use Dekker's double-double format.
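
To make "unevaluated sum of two doubles" concrete, here is a minimal C sketch of the idea. It is not taken from Yeppp!; the type dd and the functions two_sum and dd_add_double are hypothetical names. two_sum is the branch-free error-free addition usually credited to Knuth, and the renormalization step is the shorter Fast2Sum form from Dekker's paper, which assumes the first operand is not smaller in magnitude than the second. The sketch assumes plain IEEE double arithmetic (no -ffast-math, no x87 excess precision).

```c
/* Hypothetical double-double value: hi + lo, with |lo| much smaller than |hi|. */
typedef struct {
    double hi;
    double lo;
} dd;

/* Error-free addition: hi is the rounded sum, lo the exact rounding error. */
dd two_sum(double a, double b)
{
    dd r;
    double bb;
    r.hi = a + b;
    bb   = r.hi - a;                        /* the part of b that made it into hi */
    r.lo = (a - (r.hi - bb)) + (b - bb);    /* what was lost to rounding */
    return r;
}

/* Add a plain double to a double-double and renormalize. */
dd dd_add_double(dd x, double y)
{
    dd s = two_sum(x.hi, y);
    double lo = s.lo + x.lo;
    dd r;
    /* Fast2Sum: assumes |s.hi| >= |lo|, which holds for normalized inputs. */
    r.hi = s.hi + lo;
    r.lo = lo - (r.hi - s.hi);
    return r;
}
```

For example, two_sum(1.0, 0x1p-60) keeps the tiny addend in lo instead of losing it, which is the extra precision the 106-bit picture describes.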

For the polynomials, it might be possible to reformulate them in such a way that the packed instructions (SSE2, add/mulpd) can be used.
I do use SSE2/AVX/FMA3/FMA4/NEON. Here are the implementations.

jj2007

Re: Yeppp! High performance mathematical library
« Reply #33 on: October 12, 2013, 08:04:06 AM »
- movzx eax, byte ptr [esi] is often faster than mov al, [esi]
- lea is rarely faster than inc esi
Let's test it:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 188/100 cycles

417     cycles for 100 * Yeppp! original
318     cycles for 100 * Yeppp! movzx inc

417     cycles for 100 * Yeppp! original
318     cycles for 100 * Yeppp! movzx inc

417     cycles for 100 * Yeppp! original
318     cycles for 100 * Yeppp! movzx inc

37      bytes for Yeppp! original
35      bytes for Yeppp! movzx inc


(not yet a serious exercise, just a demo that C compilers can be beaten...)

Gunther

  • Member
  • *****
  • Posts: 4067
  • Forgive your enemies, but never forget their names
Re: Yeppp! High performance mathematical library
« Reply #34 on: October 12, 2013, 08:06:22 AM »
Jochen,

YepppTest.exe doesn't work on my machine. A fatal error occurs: GetLastError (line ??). Module not found.

Gunther
Get your facts first, and then you can distort them.

qWord

  • Member
  • *****
  • Posts: 1475
  • The base type of a type is the type itself
    • SmplMath macros
Re: Yeppp! High performance mathematical library
« Reply #35 on: October 12, 2013, 08:08:06 AM »
I do use SSE2/AVX/FMA3/FMA4/NEON.
Yes, but there is code that evaluates polynomials using scalar SSE2 instructions:
Quote from: log.x64-ms.asm
   MOVSD xmm5, [rel _yepMath_Log_V64f_V64f_Nehalem_constants.c8]
   MULSD xmm5, xmm4
   ADDSD xmm5, [rel _yepMath_Log_V64f_V64f_Nehalem_constants.c9]
   MULSD xmm5, xmm4
   ADDSD xmm5, [rel _yepMath_Log_V64f_V64f_Nehalem_constants.c10]
This code can be parallelized using the packed SSE2 instructions MULPD/ADDPD:
P(x) = c0 + c1*x + c2*x^2 + c3*x^3 + c4*x^4 + ...
     = c0 +   x*(c1 + c3*x^2 + c5*x^4 + ...)
          + x^2*(c2 + c4*x^2 + c6*x^4 + ...)

The two bracketed expressions can be evaluated in parallel, each with the Horner scheme in x^2.
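
A minimal C sketch of this split, assuming coefficients stored lowest order first in c[0..n-1]. The names poly_horner, poly_split and coeff_or_zero are hypothetical; the two accumulators stand in for the two lanes of an XMM register, and the code is scalar C, not packed SSE2 and not Yeppp! code.

```c
#include <stddef.h>

/* Plain Horner evaluation, used here only as a reference. */
double poly_horner(const double *c, size_t n, double x)
{
    double acc = 0.0;
    for (size_t i = n; i-- > 0; ) {
        acc = acc * x + c[i];
    }
    return acc;
}

/* Pads the shorter chain with zeros so both chains have the same length. */
double coeff_or_zero(const double *c, size_t n, size_t i)
{
    return i < n ? c[i] : 0.0;
}

/* P(x) = c0 + x*(c1 + c3*x^2 + ...) + x^2*(c2 + c4*x^2 + ...) */
double poly_split(const double *c, size_t n, double x)
{
    if (n == 0) {
        return 0.0;
    }
    const double x2 = x * x;
    double odd = 0.0;    /* chain c1, c3, c5, ... ("lane 0") */
    double even = 0.0;   /* chain c2, c4, c6, ... ("lane 1") */
    for (size_t k = n / 2; k-- > 0; ) {
        /* The two updates are independent, so they could share one MULPD/ADDPD pair. */
        odd  = odd  * x2 + coeff_or_zero(c, n, 2 * k + 1);
        even = even * x2 + coeff_or_zero(c, n, 2 * k + 2);
    }
    return c[0] + x * odd + x2 * even;
}
```

With c = {1, 2, 3, 4, 5} and x = 2, both functions return 129. In general the two orderings round differently, so results may differ in the last bits.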
MREAL macros - when you need floating point arithmetic while assembling!

jj2007

Re: Yeppp! High performance mathematical library
« Reply #36 on: October 12, 2013, 08:10:23 AM »
YepppTest.exe doesn't work on my machine. A fatal error occurs: GetLastError (line ??). Module not found.

Gunther,
Sorry, I forgot to specify that you need a Yeppp! installation. More precisely, you need C:\Program Files\Yeppp! SDK\binaries\windows\x86\yeppp.dll, as in the demo above:

  Dll ExpandEnv$("%ProgramFiles%\Yeppp!!!! SDK\binaries\windows\x86\yeppp.dll")

ExpandEnv$() keeps the path language-neutral; for example, my "Program Files" folder is called "Programmi" because the OS is Italian.

Gunther

Re: Yeppp! High performance mathematical library
« Reply #37 on: October 12, 2013, 08:15:43 AM »
Jochen,

Sorry, I forgot to specify that you need a Yeppp! installation. More precisely, you need C:\Program Files\Yeppp! SDK\binaries\windows\x86\yeppp.dll, as in the demo above:
  Dll ExpandEnv$("%ProgramFiles%\Yeppp!!!! SDK\binaries\windows\x86\yeppp.dll")

Ah that's it. I'll do the installation forthwith. Anyway, here is your test result:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
+++++++++++++7 of 20 tests valid, loop overhead is approx. 179/100 cycles

226     cycles for 100 * Yeppp! original
198     cycles for 100 * Yeppp! movzx inc

228     cycles for 100 * Yeppp! original
199     cycles for 100 * Yeppp! movzx inc

226     cycles for 100 * Yeppp! original
200     cycles for 100 * Yeppp! movzx inc

37      bytes for Yeppp! original
35      bytes for Yeppp! movzx inc

--- ok ---

Gunther
Get your facts first, and then you can distort them.

Marat Dukhan

Re: Yeppp! High performance mathematical library
« Reply #38 on: October 12, 2013, 11:17:51 AM »
Here is the innermost loop:
10001E02   8A06       mov al, [esi]
10001E04   8A0C13     mov cl, [edx+ebx]
10001E07   02C8       add cl, al
10001E09   880A       mov [edx], cl
10001E0B   8D76 01    lea esi, [esi+1]
10001E0E   8D52 01    lea edx, [edx+1]
10001E11   4F         dec edi
10001E12   75 EE      jnz short 10001E02


- movzx eax, byte ptr [esi] is often faster than mov al, [esi]
- lea is rarely faster than inc esi

This is compiler-generated code. Yeppp! 1.0.0 does not have optimized kernels for 32-bit x86. Try 64-bit Yeppp! for much better performance. BTW, it is optimized for arrays of a hundred elements or more.

jj2007

Re: Yeppp! High performance mathematical library
« Reply #39 on: October 12, 2013, 04:37:01 PM »
This is compiler-generated code. Yeppp! 1.0.0 does not have optimized kernels for 32-bit x86. Try 64-bit Yeppp! for much better performance. BTW, it is optimized for arrays of a hundred elements or more.

Marat,
This was just an example (and the array, btw, has exactly 100 bytes ;)). Compilers have their strengths and their limitations...
Still, if you could identify a routine that...
- is frequently needed in the scientific & mathematical community
- maybe is frequently used in benchmarking mathematical libraries ;)
- is so slow that it forces people to have a coffee break
... then we could create a testbed and call it a challenge.
 :icon14:?