The Intel Math Kernel Library - MKL for short (http://software.intel.com/en-us/intel-math-kernel-library-intel-mkl-purchase) - is well known, but not exactly cheap. Of course, there are alternatives; AMD, for example, has a similar solution (http://developer.amd.com/tools-and-sdks/cpu-development/amd-core-math-library-acml/). But each library works best only on its own vendor's CPUs, which is a clear disadvantage.
There is a relatively new, cross-platform alternative: Yeppp! (http://www.yeppp.info/index.html) It's used by the CERN data department and doesn't seem bad at all. Interestingly enough, the Yeppp! kernels are written in assembly language (with a bit of Python help). They follow three simple design principles:
- Avoid table lookups, because these are hard to use with SIMD instructions.
- Avoid divisions and square roots; use polynomial approximations and Newton-Raphson iteration instead.
- Avoid piecewise approximations and branches in the primary code paths.
Very interesting, and the results speak for themselves.
Gunther
this is an ASM forum, Gunther
we try to make our own stuff :biggrin:
Hi Dave,
Quote from: dedndave on October 09, 2013, 04:45:21 AM
this is an ASM forum, Gunther
we try to make our own stuff :biggrin:
Good idea. We can try to make our own HPC library. Why not? Any ideas?
Gunther
not at the moment - lol
(brain fart)
over time, forum members have introduced a number of routines, though
the old forum archive is full of great code and information
Quote from: dedndave on October 09, 2013, 06:30:11 AM
over time, forum members have introduced a number of routines, though
the old forum archive is full of great code and information
Right, let's think about that.
Gunther
A while back I played a lot with table-based function approximations, especially of exp(x) and log2(x). What I found is that there is (IMO) no way around table-based methods to get acceptable precision and speed. After taking a quick look at Yeppp!'s source, I currently have doubts about the efficiency of the algorithms used, because they use polynomials whose degree is in the teens.
BTW, a good book on this subject is "Elementary Functions: Algorithms and Implementation" by Jean-Michel Muller.
Another interesting paper is "An Accurate Elementary Mathematical Library for the IEEE Floating Point Standard" by S. Gal and B. Bachelis. I've used the accurate tables method they described for exp(x). After implementing it, I compared it to the FPU and HPALIB (112-bit precision), with the disillusioning result that the relative error is less than 4E-14 over the whole range (REAL8). However, I've noticed that MSVCRT uses exactly the same method (= same error), thus I'll stay with it :biggrin:.
Hi qWord,
Quote from: qWord on October 09, 2013, 09:58:09 AM
BTW, a good book on this subject is "Elementary Functions: Algorithms and Implementation" by Jean-Michel Muller.
Another interesting paper is "An Accurate Elementary Mathematical Library for the IEEE Floating Point Standard" by S. Gal and B. Bachelis. I've used the accurate tables method they described for exp(x). After implementing it, I compared it to the FPU and HPALIB (112-bit precision), with the disillusioning result that the relative error is less than 4E-14 over the whole range (REAL8). However, I've noticed that MSVCRT uses exactly the same method (= same error), thus I'll stay with it :biggrin:.
An interesting point of view. Do you have a link to the paper by Gal and Bachelis?
Gunther
Quote from: Gunther on October 09, 2013, 10:21:37 AM
Do you have a link to the paper by Gal and Bachelis?
I'm afraid it's not freely downloadable. IIRC I got it from the ACM Digital Library (or maybe IEEE Xplore).
Quote from: qWord on October 09, 2013, 10:30:10 AM
I'm afraid it's not freely downloadable. IIRC I got it from the ACM Digital Library (or maybe IEEE Xplore).
That's bad luck. I would be interested to read the paper.
Gunther
http://en.wikipedia.org/wiki/Gal%27s_accurate_tables
at the bottom of the page is a link to a PDF: "Gal's Accurate Tables Method Revisited"
Quote
Yeppp! binaries for Windows, Android, Mac OS X, and Linux.
Bindings for Java, .Net/Mono, and FORTRAN.
Whoa. This is awesome.
But wait... my numerical analysis skill is already rusty :redface:
Need some refreshing, I guess :biggrin:
Dave,
Quote from: dedndave on October 09, 2013, 01:07:33 PM
http://en.wikipedia.org/wiki/Gal%27s_accurate_tables
at the bottom of the page is a link to a PDF: "Gal's Accurate Tables Method Revisited"
Thank you for the link. It's an interesting article.
Gunther
Dave,
What I saw was (I believe) a French paper that was being automatically translated into English (if you jump from one page past the next, the displayed page is blank and is filled in line by line).
I wonder how accurate the translation is.
Dave.
Hi Dave,
Quote from: KeepingRealBusy on October 10, 2013, 05:45:49 AM
What I saw was (I believe) a French paper that was being automatically translated into English (if you jump from one page past the next, the displayed page is blank and is filled in line by line).
I'm not sure. What they have done is put the résumé (abstract) in French; the text itself is probably written in English.
Gunther
Gunther,
I guess with a math paper, the translation would probably be accurate in any case.
Dave.
Dave,
Quote from: KeepingRealBusy on October 10, 2013, 05:53:48 AM
I guess with a math paper, the translation would probably be accurate in any case.
Dave.
I think no one would read it if it were not written in English. It is the lingua franca in science.
Gunther
Quote from: Gunther on May 19, 1974, 02:12:08 AM
Good idea. We can try to make our own HPC library. Why not? Any ideas?
Gunther
It's possible to join the enthusiasts of the Math DLL Project (http://www.harrix.org); it's on the old version of the site.
Quote from: Adamanteus on October 11, 2013, 08:33:14 AM
It's possible to join the enthusiasts of the Math DLL Project (http://www.harrix.org); it's on the old version of the site.
Couldn't find an HPC approach there.
Gunther
Yeppp! author here and ready to answer your questions!
Quote from: Marat Dukhan on October 11, 2013, 06:06:32 PM
Yeppp! author here and ready to answer your questions!
Yeppp, that's an excellent service!
Welcome to the Forum. We have some looooong threads aiming at beating the best C compilers with even better hand-crafted assembler. If you have a frequently-used Yeppp! algo that needs a speedup, this is your chance :bgrin:
But of course, some of us will also be interested in producing a Yeppp! - Masm interface. GSL already has one (http://www.masmforum.com/board/index.php?topic=16725.0).
Quote from: jj2007 on October 11, 2013, 07:05:51 PM
If you have a frequently-used Yeppp! algo that needs a speedup, this is your chance :bgrin:
All of Yeppp! is a collection of algos which need speedups. Contributors are welcome! :icon14:
Quote from: jj2007 on October 11, 2013, 07:05:51 PM
But of course, some of us will also be interested in producing a Yeppp! - Masm interface.
This is easy. Most of Yeppp! is auto-generated, so one would only need to patch the code-generator (https://bitbucket.org/MDukhan/yeppp/src/52b388a8ddfef957384b7bb6096bf4aba89d31b3/codegen/yeppp/codegen.py?at=default#cl-317).
welcome to the forum :t
i haven't used the library yet, but i have looked it over
looks very interesting :biggrin:
Marat,
first things first: Welcome to the forum.
It's good to know that another experienced programmer is floating around here. I hope you'll have fun.
Gunther
Quote from: Gunther on October 11, 2013, 09:02:07 AM
Couldn't find an HPC approach there.
Gunther
YES - there are only algos there, but using threads or a standalone MPI-like library would do the job for it - or a modern "two in one" :eusa_boohoo:
Marat Dukhan,
is there any documentation (or literature) on the approximation methods used? What precision reference did you use to test the math functions (the folder unit-test/source/math is empty)?
regards, qWord
Quote from: qWord on October 12, 2013, 04:29:00 AM
is there any documentation (or literature) on the approximation methods used? What precision reference did you use to test the math functions (the folder unit-test/source/math is empty)?
The framework I developed for testing performance & accuracy is open-sourced at bitbucket.org/MDukhan/hysteria. For accuracy testing it computes functions to 160-bit precision with GMP, and then converts the result to double-double. I'm also working on mathematical proofs of error bounds, and plan to release a tech report on that.
Marat,
Quote from: Marat Dukhan on October 12, 2013, 04:49:15 AM
The framework I developed for testing performance & accuracy is open-sourced at bitbucket.org/MDukhan/hysteria. For accuracy testing it computes functions to 160-bit precision with GMP, and then converts the result to double-double. I'm also working on mathematical proofs of error bounds, and plan to release a tech report on that.
Thank you for your fast reply. Your test framework looks impressive. Is double-double 128 bit? Which kind of error analysis do you use?
Gunther
Quote from: Gunther on October 12, 2013, 05:11:36 AM
Is double-double 128 bit?
Yes, double-double is an unevaluated sum of two doubles. From a high-level perspective you can imagine it as a double with a 106-bit mantissa.
Quote from: Gunther on October 12, 2013, 05:11:36 AM
Which kind of error analysis do you use?
To analyse the error of the interpolating polynomial, I split the function domain into intervals of constant ULP, bound the absolute error of the approximation (with Sollya (http://sollya.gforge.inria.fr/)) and of the evaluation (with Gappa (http://gappa.gforge.inria.fr/)), compute the total ULP error on each interval, and take the maximum over all intervals. Other kinds of errors (range reduction and reconstruction) are analysed on a case-by-case basis.
Quote from: Marat Dukhan on October 12, 2013, 05:41:21 AM
To analyse the error of the interpolating polynomial, I split the function domain into intervals of constant ULP, bound the absolute error of the approximation (with Sollya (http://sollya.gforge.inria.fr/)) and of the evaluation (with Gappa (http://gappa.gforge.inria.fr/)), compute the total ULP error on each interval, and take the maximum over all intervals. Other kinds of errors (range reduction and reconstruction) are analysed on a case-by-case basis.
That's the approximation error, of course. My question has two directions. The other one is: what about numerical stability? Do you use some backward error analysis (introduced by Wilkinson (http://computer.org/computer-pioneers/wilkinson.html)) or an error interval (see my PM)?
Gunther
Quote from: Marat Dukhan on October 12, 2013, 05:41:21 AM
Yes, double-double is an unevaluated sum of two doubles. From a high-level perspective you can imagine it as a double with a 106-bit mantissa.
Am I right in assuming the method described by T.J. Dekker [1]?
Quote from: Marat Dukhan on October 11, 2013, 07:42:32 PM
All of Yeppp! is a collection of algos which need speedups. Contributors are welcome! :icon14:
For the polynomials, it might be possible to reformulate them in such a way that the packed instructions (SSE2, add/mulpd) can be used.
[1] T.J. Dekker, "A Floating-Point Technique for Extending the Available Precision", Numerische Mathematik 18 (1971), Springer.
This simple demo sums up two byte size arrays:
include \masm32\MasmBasic\MasmBasic.inc ; download (http://masm32.com/board/index.php?topic=94.0)
.data ; define two source arrays:
arrs1 db 10, 20, 30
arrs2 db 100, 100, 100
.data? ; destination array
arrdest db 3 dup(?)
Enum 0:YepStatusOk, YepStatusNullPointer, YepStatusMisalignedPointer, YepStatusInvalidArgument
Enum YepStatusInvalidData, YepStatusInvalidState, YepStatusUnsupportedHardware, YepStatusUnsupportedSoftware
Enum YepStatusInsufficientBuffer, YepStatusOutOfMemory, YepStatusSystemError, YepStatusAccessDenied
Init
Dll ExpandEnv$("%ProgramFiles%\Yeppp!!!! SDK\binaries\windows\x86\yeppp.dll")
Declare yepCore_Add_V8sV8s_V8s, C:4 ; C calling convention, 4 arguments
Declare yepLibrary_Init, 0
void yepLibrary_Init()
.if yepCore_Add_V8sV8s_V8s(offset arrs1, offset arrs2, offset arrdest, lengthof arrdest)==YepStatusOk
    movzx eax, byte ptr arrdest[0]
    Print Str$("Element 0=\t%i\n", eax)
    movzx eax, byte ptr arrdest[1]
    Print Str$("Element 1=\t%i\n", eax)
    movzx eax, byte ptr arrdest[2]
    Inkey Str$("Element 2=\t%i\n", eax)
.else
    Inkey Str$("Yeppp!!!! error %i", eax)
.endif
Exit
end start
Output:
Element 0= 110
Element 1= 120
Element 2= 130
Here is the innermost loop:
10001E02   8A06      mov al, [esi]
10001E04   8A0C13    mov cl, [edx+ebx]
10001E07   02C8      add cl, al
10001E09   880A      mov [edx], cl
10001E0B   8D76 01   lea esi, [esi+1]
10001E0E   8D52 01   lea edx, [edx+1]
10001E11   4F        dec edi
10001E12   75 EE     jnz short 10001E02
- movzx eax, byte ptr [esi] is often faster than mov al, [esi]
- lea is rarely faster than inc esi
Quote from: Gunther on October 12, 2013, 06:21:26 AM
That's the approximation error, of course.
No, I use it to bound both the error of approximating a function with a polynomial and the error of evaluating the polynomial in IEEE arithmetic (i.e. roundoff error in polynomial evaluation).
Quote from: qWord on October 12, 2013, 06:25:15 AM
Quote from: Marat Dukhan on October 12, 2013, 05:41:21 AM
Yes, double-double is an unevaluated sum of two doubles. From a high-level perspective you can imagine it as a double with a 106-bit mantissa.
Am I right in assuming the method described by T.J. Dekker [1]?
Yes, I use Dekker's double-double format.
Quote from: qWord on October 12, 2013, 06:25:15 AM
For the polynomials, it might be possible to reformulate them in such a way that the packed instructions (SSE2, add/mulpd) can be used.
I do use SSE2/AVX/FMA3/FMA4/NEON. Here are the implementations (https://bitbucket.org/MDukhan/yeppp/src/52b388a8ddfef957384b7bb6096bf4aba89d31b3/codegen/yeppp/library/math/x64.py?at=default#cl-5935).
Quote from: jj2007 on October 12, 2013, 07:18:17 AM
- movzx eax, byte ptr [esi] is often faster than mov al, [esi]
- lea is rarely faster than inc esi
Let's test it:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 188/100 cycles
417 cycles for 100 * Yeppp! original
318 cycles for 100 * Yeppp! movzx inc
417 cycles for 100 * Yeppp! original
318 cycles for 100 * Yeppp! movzx inc
417 cycles for 100 * Yeppp! original
318 cycles for 100 * Yeppp! movzx inc
37 bytes for Yeppp! original
35 bytes for Yeppp! movzx inc
(not yet a serious exercise, just a demo that C compilers can be beaten...)
Jochen,
YepppTest.exe doesn't work on my machine. It throws a fatal error: GetLastError (line ??). Module not found.
Gunther
Quote from: Marat Dukhan on October 12, 2013, 07:34:27 AM
I do use SSE2/AVX/FMA3/FMA4/NEON.
yes, but there is code that evaluates polynomials using scalar SSE2 instructions:
Quote from: log.x64-ms.asm
MOVSD xmm5, [rel _yepMath_Log_V64f_V64f_Nehalem_constants.c8]
MULSD xmm5, xmm4
ADDSD xmm5, [rel _yepMath_Log_V64f_V64f_Nehalem_constants.c9]
MULSD xmm5, xmm4
ADDSD xmm5, [rel _yepMath_Log_V64f_V64f_Nehalem_constants.c10]
this code can be parallelised using the packed SSE2 instructions mulpd/addpd:
P(x) = c0 + c1*x + c2*x^2 + c3*x^3 + c4*x^4 + ...
     = c0 + x*(c1 + c3*x^2 + c5*x^4 + ...) + x^2*(c2 + c4*x^2 + c6*x^4 + ...)
The two parenthesised expressions can be evaluated in parallel (Horner's scheme).
Quote from: Gunther on October 12, 2013, 08:06:22 AM
YepppTest.exe doesn't work on my machine. It throws a fatal error: GetLastError (line ??). Module not found.
Gunther,
Sorry, I forgot to specify that you need a Yeppp! installation. More precisely: C:\Program Files\Yeppp! SDK\binaries\windows\x86\yeppp.dll - see above,
Dll ExpandEnv$("%ProgramFiles%\Yeppp!!!! SDK\binaries\windows\x86\yeppp.dll")
The ExpandEnv$() serves to make it language neutral; for example, my "Program Files" are "Programmi" because the OS is Italian.
Jochen,
Quote from: jj2007 on October 12, 2013, 08:10:23 AM
Sorry, I forgot to specify that you need a Yeppp! installation. More precisely: C:\Program Files\Yeppp! SDK\binaries\windows\x86\yeppp.dll - see above,
Dll ExpandEnv$("%ProgramFiles%\Yeppp!!!! SDK\binaries\windows\x86\yeppp.dll")
Ah that's it. I'll do the installation forthwith. Anyway, here is your test result:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
+++++++++++++7 of 20 tests valid, loop overhead is approx. 179/100 cycles
226 cycles for 100 * Yeppp! original
198 cycles for 100 * Yeppp! movzx inc
228 cycles for 100 * Yeppp! original
199 cycles for 100 * Yeppp! movzx inc
226 cycles for 100 * Yeppp! original
200 cycles for 100 * Yeppp! movzx inc
37 bytes for Yeppp! original
35 bytes for Yeppp! movzx inc
--- ok ---
Gunther
Quote from: jj2007 on October 12, 2013, 07:18:17 AM
Here is the innermost loop:
10001E02   8A06      mov al, [esi]
10001E04   8A0C13    mov cl, [edx+ebx]
10001E07   02C8      add cl, al
10001E09   880A      mov [edx], cl
10001E0B   8D76 01   lea esi, [esi+1]
10001E0E   8D52 01   lea edx, [edx+1]
10001E11   4F        dec edi
10001E12   75 EE     jnz short 10001E02
- movzx eax, byte ptr [esi] is often faster than mov al, [esi]
- lea is rarely faster than inc esi
This is compiler-generated code. Yeppp! 1.0.0 does not have optimized kernels for 32-bit x86; try 64-bit Yeppp! for much better performance. BTW, it is optimized for array sizes of a hundred elements or more.
Quote from: Marat Dukhan on October 12, 2013, 11:17:51 AM
This is compiler-generated code. Yeppp! 1.0.0 does not have optimized kernels for 32-bit x86; try 64-bit Yeppp! for much better performance. BTW, it is optimized for array sizes of a hundred elements or more.
Marat,
This was just an example (and the array, btw, has exactly 100 bytes ;)). Compilers have their strengths and their limitations...
Still, if you could identify a routine that...
- is frequently needed in the scientific & mathematical community
- maybe is frequently used in benchmarking mathematical libraries ;)
- is so slow that it forces people to have a coffee break
... then we could create a testbed and call it a challenge.
:icon14:?