The MASM Forum

Miscellaneous => The Orphanage => Topic started by: Gunther on October 08, 2013, 12:28:08 PM

Title: Yeppp! High performance mathematical library
Post by: Gunther on October 08, 2013, 12:28:08 PM
The Intel Math Kernel Library - or short - MKL (http://software.intel.com/en-us/intel-math-kernel-library-intel-mkl-purchase) is well known, but not so cheap. Of course, there are alternatives. For example, AMD has a similar solution. (http://developer.amd.com/tools-and-sdks/cpu-development/amd-core-math-library-acml/) But both libraries work perfectly with the corresponding CPU. This is a clear disadvantage.

There is a relative new, cross-platform alternative: Yeppp! (http://www.yeppp.info/index.html) It's used by the CERN data department and seems to be not so bad. Interesting enough: The Yeppp! kernel is written in assembly language (with a bit Python help). They have 3 simple design principles:
Very interesting and the results speak for themselves.

Gunther
Title: Re: Yeppp! High performance mathematical library
Post by: dedndave on October 09, 2013, 04:45:21 AM
this is an ASM forum, Gunther
we try to make our own stuff    :biggrin:
Title: Re: Yeppp! High performance mathematical library
Post by: Gunther on October 09, 2013, 05:01:29 AM
Hi Dave,

Quote from: dedndave on October 09, 2013, 04:45:21 AM
this is an ASM forum, Gunther
we try to make our own stuff    :biggrin:

Good idea. We can try to make our own HPC library. Why not? Any ideas?

Gunther
Title: Re: Yeppp! High performance mathematical library
Post by: dedndave on October 09, 2013, 06:30:11 AM
not at the moment - lol
(brain fart)

over time, forum members have introduced a number of routines, though
the old forum archive is full of great code and information
Title: Re: Yeppp! High performance mathematical library
Post by: Gunther on October 09, 2013, 07:10:59 AM
Quote from: dedndave on October 09, 2013, 06:30:11 AM
over time, forum members have introduced a number of routines, though
the old forum archive is full of great code and information

Right, let's think about that.

Gunther
Title: Re: Yeppp! High performance mathematical library
Post by: qWord on October 09, 2013, 09:58:09 AM
A while back I've played a lot with table based function approximations, especially of exp(x) and log2(x). What I've found is that there is (IMO) no way around table based methods to get acceptable precision and speed. After taking a quick look to Yeppp's source, I currently have doubts on the efficient of used algorithms, because they use polynomials of 1xth degree.

BTW, a good book on this subject is "Elementary Functions Algorithms and Implementation" by Jean-Michel Muller.
An also interesting paper is "An Accurate Elementary Mathematical Library for the IEEE Floating Point Standard" by S. Gal and B. Bachells. I've used the accurate tables method they descripted for exp(x). After implementing it, I did a compare to the FPU and HPALIB (112 bit precision) with the disillusioning result that the relative error is less than 4E-14 over the whole range (REAL8). However, I've noticed that the MSVCRT use exact the same method (=same error) thus I stay with it  :biggrin:.
Title: Re: Yeppp! High performance mathematical library
Post by: Gunther on October 09, 2013, 10:21:37 AM
Hi qWord,

Quote from: qWord on October 09, 2013, 09:58:09 AM
BTW, a good book on this subject is "Elementary Functions Algorithms and Implementation" by Jean-Michel Muller.
An also interesting paper is "An Accurate Elementary Mathematical Library for the IEEE Floating Point Standard" by S. Gal and B. Bachells. I've used the accurate tables method they descripted for exp(x). After implementing it, I did a compare to the FPU and HPALIB (112 bit precision) with the disillusioning result that the relative error is less than 4E-14 over the whole range (REAL8). However, I've noticed that the MSVCRT use exact the same method (=same error) thus I stay with it  :biggrin:.

An interesting point of view. Do you have a link to the paper by Gal and Bachells?

Gunther
Title: Re: Yeppp! High performance mathematical library
Post by: qWord on October 09, 2013, 10:30:10 AM
Quote from: Gunther on October 09, 2013, 10:21:37 AMDo you have a link to the paper by Gal and Bachells?
I'm afraid is not free downloadable. IIRC I've got  from the ACM Digital Library database (or maybe IEEE Xplore).
Title: Re: Yeppp! High performance mathematical library
Post by: Gunther on October 09, 2013, 11:33:45 AM
Quote from: qWord on October 09, 2013, 10:30:10 AM
I'm afraid is not free downloadable. IIRC I've got  from the ACM Digital Library database (or maybe IEEE Xplore).

That's bad luck. I would be interested to read the paper.

Gunther
Title: Re: Yeppp! High performance mathematical library
Post by: dedndave on October 09, 2013, 01:07:33 PM
http://en.wikipedia.org/wiki/Gal%27s_accurate_tables (http://en.wikipedia.org/wiki/Gal%27s_accurate_tables)

at the bottom of the page is a link to a PDF: "Gal's Accurate Tables Method Revisited"
Title: Re: Yeppp! High performance mathematical library
Post by: anta40 on October 09, 2013, 02:46:08 PM
Quote
Yeppp! binaries for Windows, Android, Mac OS X, and Linux.
Bindings for Java, .Net/Mono, and FORTRAN.

Whoa. This is awesome.
But wait... my numerical analysis skill is already rusty  :redface:
Need some refreshing, I guess  :biggrin:
Title: Re: Yeppp! High performance mathematical library
Post by: Gunther on October 09, 2013, 10:07:09 PM
Dave,

Quote from: dedndave on October 09, 2013, 01:07:33 PM
http://en.wikipedia.org/wiki/Gal%27s_accurate_tables (http://en.wikipedia.org/wiki/Gal%27s_accurate_tables)

at the bottom of the page is a link to a PDF: "Gal's Accurate Tables Method Revisited"

Thank you for the link. It's an interesting article.

Gunther
Title: Re: Yeppp! High performance mathematical library
Post by: KeepingRealBusy on October 10, 2013, 05:45:49 AM
Dave,

What I saw was a (I believe) a French paper that was being automatically being translated into English (if you jump from one  page past the next page, the displayed page is blank and is filled in line by line).

I wonder how accurate the translation is?

Dave.
Title: Re: Yeppp! High performance mathematical library
Post by: Gunther on October 10, 2013, 05:50:53 AM
Hi Dave,

Quote from: KeepingRealBusy on October 10, 2013, 05:45:49 AM
What I saw was a (I believe) a French paper that was being automatically being translated into English (if you jump from one  page past the next page, the displayed page is blank and is filled in line by line).

I'm not sure. What they have done is the Resume (abstract) in French. The text is probably written in English.

Gunther
Title: Re: Yeppp! High performance mathematical library
Post by: KeepingRealBusy on October 10, 2013, 05:53:48 AM
Gunther,

I guess with a math paper, the translation would probably be accurate in any case.

Dave.
Title: Re: Yeppp! High performance mathematical library
Post by: Gunther on October 10, 2013, 09:32:01 AM
Dave,

Quote from: KeepingRealBusy on October 10, 2013, 05:53:48 AM
I guess with a math paper, the translation would probably be accurate in any case.

Dave.

I think no one would read it if it were not written in English. It is the lingua franca in science.

Gunther
Title: Re: Yeppp! High performance mathematical library
Post by: Adamanteus on October 11, 2013, 08:33:14 AM
Quote from: Gunther on May 19, 1974, 02:12:08 AM

Good idea. We can try to make our own HPC library. Why not? Any ideas?

Gunther

That's possible to join entusiasts on Math DLL Project (http://www.harrix.org), on old site version there is it.
Title: Re: Yeppp! High performance mathematical library
Post by: Gunther on October 11, 2013, 09:02:07 AM
Quote from: Adamanteus on October 11, 2013, 08:33:14 AM
That's possible to join entusiasts on Math DLL Project (http://www.harrix.org), on old site version there is it.

Couldn't find a HPC approach.

Gunther
Title: Re: Yeppp! High performance mathematical library
Post by: Marat Dukhan on October 11, 2013, 06:06:32 PM
Yeppp! author here and ready to answer your questions!
Title: Re: Yeppp! High performance mathematical library
Post by: jj2007 on October 11, 2013, 07:05:51 PM
Quote from: Marat Dukhan on October 11, 2013, 06:06:32 PM
Yeppp! author here and ready to answer your questions!

Yeppp, that's an excellent service!

Welcome to the Forum. We have some looooong threads aiming at beating the best C compilers with even better hand-crafted assembler. If you have a frequently-used Yeppp! algo that needs a speedup, this is your chance :bgrin:

But of course, some of us will also be interested to produce an interface Yeppp! - Masm. GSL has already one. (http://www.masmforum.com/board/index.php?topic=16725.0)
Title: Re: Yeppp! High performance mathematical library
Post by: Marat Dukhan on October 11, 2013, 07:42:32 PM
Quote from: jj2007 on October 11, 2013, 07:05:51 PM
If you have a frequently-used Yeppp! algo that needs a speedup, this is your chance :bgrin:
All of Yeppp! is a collection of algos which need speedups. Contributors are welcome! :icon14:

Quote from: jj2007 on October 11, 2013, 07:05:51 PM
But of course, some of us will also be interested to produce an interface Yeppp! - Masm.
This is easy. Most of Yeppp! is auto-generated, so one would only need to patch the code-generator (https://bitbucket.org/MDukhan/yeppp/src/52b388a8ddfef957384b7bb6096bf4aba89d31b3/codegen/yeppp/codegen.py?at=default#cl-317).
Title: Re: Yeppp! High performance mathematical library
Post by: dedndave on October 11, 2013, 08:11:17 PM
welcome to the forum   :t

i haven't used the library yet, but i have looked it over
looks very interesting   :biggrin:
Title: Re: Yeppp! High performance mathematical library
Post by: Gunther on October 11, 2013, 10:25:25 PM
Marat,

first things first: Welcome to the forum.

It's good to know that another experienced programmer is floating around here. I hope you'll have fun.

Gunther
Title: Re: Yeppp! High performance mathematical library
Post by: Adamanteus on October 12, 2013, 02:33:19 AM
Quote from: Gunther on October 11, 2013, 09:02:07 AM
Couldn't find a HPC approach.

Gunther

YES - there only algos, but using threads or standalone MPI-like libraries is deal for it - or, "two in one" only modern  :eusa_boohoo:
Title: Re: Yeppp! High performance mathematical library
Post by: qWord on October 12, 2013, 04:29:00 AM
Marat Dukhan,

is there any documentation (or literature) on the used methods for approximation? What precision-reference did you use to test the math functions (the folder unit-test/source/math is empty).

regards, qWord
Title: Re: Yeppp! High performance mathematical library
Post by: Marat Dukhan on October 12, 2013, 04:49:15 AM
Quote from: qWord on October 12, 2013, 04:29:00 AM
is there any documentation (or literature) on the used methods for approximation? What precision-reference did you use to test the math functions (the folder unit-test/source/math is empty).

The framework I developed for testing performance & accuracy is open sourced at bitbucket.org/MDukhan/hysteria (http://bitbucket.org/MDukhan/hysteria). For accuracy testing it computes functions to 160-bit precision with GMP, and then converts the result to double-double. I also work on mathematical proofs of error bounds, and plan to release a tech report on that.
Title: Re: Yeppp! High performance mathematical library
Post by: Gunther on October 12, 2013, 05:11:36 AM
Marat,

Quote from: Marat Dukhan on October 12, 2013, 04:49:15 AM
The framework I developed for testing performance & accuracy is open sourced at bitbucket.org/MDukhan/hysteria (http://bitbucket.org/MDukhan/hysteria). For accuracy testing it computes functions to 160-bit precision with GMP, and then converts the result to double-double. I also work on mathematical proofs of error bounds, and plan to release a tech report on that.

Thank you for your fast reply. Your test framework looks impressive. Is double-double 128 bit? Which kind of error analysis do you use?

Gunther
Title: Re: Yeppp! High performance mathematical library
Post by: Marat Dukhan on October 12, 2013, 05:41:21 AM
Quote from: Gunther on October 12, 2013, 05:11:36 AM
Is double-double 128 bit?
Yes, double-double is an unevaluated sum of two doubles. From high-level perspective you can imagine it as double with 106-bit mantissa.

Quote from: Gunther on October 12, 2013, 05:11:36 AM
Which kind of error analysis do you use?
To analyse error of interpolating polynomial, I split function domain into intervals of constant ULP, bound the absolute error of approximation (with Sollya (http://sollya.gforge.inria.fr/)) and evaluation (with Gappa (http://gappa.gforge.inria.fr/)), compute the total ULP error on each interval, and choose the maximum of all intervals. Other kinds of errors (range reduction and reconstruction) are analysed on case-by-case basis.
Title: Re: Yeppp! High performance mathematical library
Post by: Gunther on October 12, 2013, 06:21:26 AM
Quote from: Marat Dukhan on October 12, 2013, 05:41:21 AM
To analyse error of interpolating polynomial, I split function domain into intervals of constant ULP, bound the absolute error of approximation (with Sollya (http://sollya.gforge.inria.fr/)) and evaluation (with Gappa (http://gappa.gforge.inria.fr/)), compute the total ULP error on each interval, and choose the maximum of all intervals. Other kinds of errors (range reduction and reconstruction) are analysed on case-by-case basis.

That's the approximation error, of course. My question has two directions. The other is: What's with the numerical stability? Do you use some backward error analysis (introduced by Wilkinson (http://computer.org/computer-pioneers/wilkinson.html)) or an error interval (see my PM)?

Gunther
Title: Re: Yeppp! High performance mathematical library
Post by: qWord on October 12, 2013, 06:25:15 AM
Quote from: Marat Dukhan on October 12, 2013, 05:41:21 AMYes, double-double is an unevaluated sum of two doubles. From high-level perspective you can imagine it as double with 106-bit mantissa.
I'm right in when assuming the method described by T.J. Dekker[1]?

Quote from: Marat Dukhan on October 11, 2013, 07:42:32 PM
All of Yeppp! is a collection of algos which need speedups. Contributors are welcome! :icon14:
For the polynomials, it might be possible to reformulate them in such a way that the packed instructions (SSE2, add/mulpd) can be used.

[1]T.J. Dekker, "A Floating-Point Technique for Extending the Available Precision", Springer
Title: Re: Yeppp! High performance mathematical library
Post by: jj2007 on October 12, 2013, 07:18:17 AM
This simple demo sums up two byte size arrays:

include \masm32\MasmBasic\MasmBasic.inc        ; download (http://masm32.com/board/index.php?topic=94.0)

.data        ; define two source arrays:
arrs1        db 10, 20, 30
arrs2        db 100, 100, 100

.data?        ; destination array
arrdest        db 3 dup(?)

Enum 0:YepStatusOk, YepStatusNullPointer, YepStatusMisalignedPointer, YepStatusInvalidArgument
Enum YepStatusInvalidData, YepStatusInvalidState, YepStatusUnsupportedHardware, YepStatusUnsupportedSoftware
Enum YepStatusInsufficientBuffer, YepStatusOutOfMemory, YepStatusSystemError, YepStatusAccessDenied

  Init
  Dll ExpandEnv$("%ProgramFiles%\Yeppp!!!! SDK\binaries\windows\x86\yeppp.dll")
  Declare yepCore_Add_V8sV8s_V8s, C:4        ; C calling convention, 4 arguments
  Declare yepLibrary_Init, 0

  void yepLibrary_Init()
  .if yepCore_Add_V8sV8s_V8s(offset arrs1, offset arrs2, offset arrdest, lengthof arrdest)==YepStatusOk
        movzx eax, byte ptr arrdest[0]
        Print Str$("Element 0=\t%i\n", eax)
        movzx eax, byte ptr arrdest[1]
        Print Str$("Element 1=\t%i\n", eax)
        movzx eax, byte ptr arrdest[2]
        Inkey Str$("Element 2=\t%i\n", eax)
  .else
        Inkey Str$("Yeppp!!!! error %i", eax)
  .endif
  Exit
end start


Output:
Element 0=      110
Element 1=      120
Element 2=      130


Here is the innermost loop:
10001E02                ³> Ú8A06                Úmov al, [esi]
10001E04                ³. ³8A0C13              ³mov cl, [edx+ebx]
10001E07                ³. ³02C8                ³add cl, al
10001E09                ³. ³880A                ³mov [edx], cl
10001E0B                ³. ³8D76 01             ³lea esi, [esi+1]
10001E0E                ³. ³8D52 01             ³lea edx, [edx+1]
10001E11                ³. ³4F                  ³dec edi
10001E12                ³.À75 EE               Àjnz short 10001E02


- movzx eax, byte ptr [esi] is often faster than mov al, [esi]
- lea is rarely faster than inc esi
Title: Re: Yeppp! High performance mathematical library
Post by: Marat Dukhan on October 12, 2013, 07:27:40 AM
Quote from: Gunther on October 12, 2013, 06:21:26 AM
That's the approximation error, of course.
No, I use it to bound both the error of approximating a function with a polynomial and the error of evaluating the polynomial in IEEE arithmetic (i.e. roundoff error in polynomial evaluation).
Title: Re: Yeppp! High performance mathematical library
Post by: Marat Dukhan on October 12, 2013, 07:34:27 AM
Quote from: qWord on October 12, 2013, 06:25:15 AM
Quote from: Marat Dukhan on October 12, 2013, 05:41:21 AMYes, double-double is an unevaluated sum of two doubles. From high-level perspective you can imagine it as double with 106-bit mantissa.
I'm right in when assuming the method described by T.J. Dekker[1]?
Yes, I use Dekker's double-double format.

Quote from: qWord on October 12, 2013, 06:25:15 AM
For the polynomials, it might be possible to reformulate them in such a way that the packed instructions (SSE2, add/mulpd) can be used.
I do use SSE2/AVX/FMA3/FMA4/NEON. Here are the implementations (https://bitbucket.org/MDukhan/yeppp/src/52b388a8ddfef957384b7bb6096bf4aba89d31b3/codegen/yeppp/library/math/x64.py?at=default#cl-5935).
Title: Re: Yeppp! High performance mathematical library
Post by: jj2007 on October 12, 2013, 08:04:06 AM
Quote from: jj2007 on October 12, 2013, 07:18:17 AM
- movzx eax, byte ptr [esi] is often faster than mov al, [esi]
- lea is rarely faster than inc esi
Let's test it:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 188/100 cycles

417     cycles for 100 * Yeppp! original
318     cycles for 100 * Yeppp! movzx inc

417     cycles for 100 * Yeppp! original
318     cycles for 100 * Yeppp! movzx inc

417     cycles for 100 * Yeppp! original
318     cycles for 100 * Yeppp! movzx inc

37      bytes for Yeppp! original
35      bytes for Yeppp! movzx inc


(not yet a serious exercise, just a demo that C compilers can be beaten...)
Title: Re: Yeppp! High performance mathematical library
Post by: Gunther on October 12, 2013, 08:06:22 AM
Jochen,

YepppTest.exe doesn't work with my machine. It occurs a fatal error: GetLastError (line ??). Modul not found.

Gunther
Title: Re: Yeppp! High performance mathematical library
Post by: qWord on October 12, 2013, 08:08:06 AM
Quote from: Marat Dukhan on October 12, 2013, 07:34:27 AMI do use SSE2/AVX/FMA3/FMA4/NEON.
yes, but there is code that evaluates polynomials using scalar SSE2 instructions:
Quote from: log.x64-ms.asm   MOVSD xmm5, [rel _yepMath_Log_V64f_V64f_Nehalem_constants.c8]
   MULSD xmm5, xmm4
   ADDSD xmm5, [rel _yepMath_Log_V64f_V64f_Nehalem_constants.c9]
   MULSD xmm5, xmm4
   ADDSD xmm5, [rel _yepMath_Log_V64f_V64f_Nehalem_constants.c10]
this code can be parallelized using the packed SSE2 instructions mulPd/ addPd:
P(x) = c0 + c1x + c2x2 + c3x3 + c4x4 + ...
     = c0 +  x(c1 + c3x2 + c5x4 + ...)
          + x2(c2 + c4x2 + c6x4 + ...)

the two bracket expressions can be evaluated parallel (Horner scheme).
Title: Re: Yeppp! High performance mathematical library
Post by: jj2007 on October 12, 2013, 08:10:23 AM
Quote from: Gunther on October 12, 2013, 08:06:22 AM
YepppTest.exe doesn't work with my machine. It occurs a fatal error: GetLastError (line ??). Modul not found.

Gunther,
Sorry, I forgot to specify that you need a Yeppp! installation. More precisely: C:\Program Files\Yeppp! SDK\binaries\windows\x86\yeppp.dll - see above,

  Dll ExpandEnv$("%ProgramFiles%\Yeppp!!!! SDK\binaries\windows\x86\yeppp.dll")

The ExpandEnv$() serves to make it language neutral; for example, my "Program Files" are "Programmi" because the OS is Italian.
Title: Re: Yeppp! High performance mathematical library
Post by: Gunther on October 12, 2013, 08:15:43 AM
Jochen,

Quote from: jj2007 on October 12, 2013, 08:10:23 AM
Sorry, I forgot to specify that you need a Yeppp! installation. More precisely: C:\Program Files\Yeppp! SDK\binaries\windows\x86\yeppp.dll - see above,
  Dll ExpandEnv$("%ProgramFiles%\Yeppp!!!! SDK\binaries\windows\x86\yeppp.dll")

Ah that's it. I'll do the installation forthwith. Anyway, here is your test result:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
+++++++++++++7 of 20 tests valid, loop overhead is approx. 179/100 cycles

226     cycles for 100 * Yeppp! original
198     cycles for 100 * Yeppp! movzx inc

228     cycles for 100 * Yeppp! original
199     cycles for 100 * Yeppp! movzx inc

226     cycles for 100 * Yeppp! original
200     cycles for 100 * Yeppp! movzx inc

37      bytes for Yeppp! original
35      bytes for Yeppp! movzx inc

--- ok ---


Gunther
Title: Re: Yeppp! High performance mathematical library
Post by: Marat Dukhan on October 12, 2013, 11:17:51 AM
Quote from: jj2007 on October 12, 2013, 07:18:17 AM
Here is the innermost loop:
10001E02                ³> Ú8A06                Úmov al, [esi]
10001E04                ³. ³8A0C13              ³mov cl, [edx+ebx]
10001E07                ³. ³02C8                ³add cl, al
10001E09                ³. ³880A                ³mov [edx], cl
10001E0B                ³. ³8D76 01             ³lea esi, [esi+1]
10001E0E                ³. ³8D52 01             ³lea edx, [edx+1]
10001E11                ³. ³4F                  ³dec edi
10001E12                ³.À75 EE               Àjnz short 10001E02


- movzx eax, byte ptr [esi] is often faster than mov al, [esi]
- lea is rarely faster than inc esi

This is compiler-generated code. Yeppp 1.0.0 does not have optimized kernels for 32-bit x86. Try 64-bit Yeppp! for much better performance. BTW, it is optimized for array sizes of hundred elements or more.
Title: Re: Yeppp! High performance mathematical library
Post by: jj2007 on October 12, 2013, 04:37:01 PM
Quote from: Marat Dukhan on October 12, 2013, 11:17:51 AM
This is compiler-generated code. Yeppp 1.0.0 does not have optimized kernels for 32-bit x86. Try 64-bit Yeppp! for much better performance. BTW, it is optimized for array sizes of hundred elements or more.

Marat,
This was just an example (and the array, btw, has exactly 100 bytes ;)). Compilers have their strengths and their limitations...
Still, if you could identify a routine that...
- is frequently needed in the scientific & mathematical community
- maybe is frequently used in benchmarking mathematical libraries ;)
- is so slow that it forces people to have a coffee break
... then we could create a testbed and call it a challenge.
:icon14:?