Yeppp! High performance mathematical library

Started by Gunther, October 08, 2013, 12:28:08 PM

jj2007

This simple demo adds two byte-sized arrays element by element:

include \masm32\MasmBasic\MasmBasic.inc        ; download

.data        ; define two source arrays:
arrs1        db 10, 20, 30
arrs2        db 100, 100, 100

.data?        ; destination array
arrdest        db 3 dup(?)

Enum 0:YepStatusOk, YepStatusNullPointer, YepStatusMisalignedPointer, YepStatusInvalidArgument
Enum YepStatusInvalidData, YepStatusInvalidState, YepStatusUnsupportedHardware, YepStatusUnsupportedSoftware
Enum YepStatusInsufficientBuffer, YepStatusOutOfMemory, YepStatusSystemError, YepStatusAccessDenied

  Init
  Dll ExpandEnv$("%ProgramFiles%\Yeppp!!!! SDK\binaries\windows\x86\yeppp.dll")
  Declare yepCore_Add_V8sV8s_V8s, C:4        ; C calling convention, 4 arguments
  Declare yepLibrary_Init, 0

  void yepLibrary_Init()
  .if yepCore_Add_V8sV8s_V8s(offset arrs1, offset arrs2, offset arrdest, lengthof arrdest)==YepStatusOk
        movzx eax, byte ptr arrdest[0]
        Print Str$("Element 0=\t%i\n", eax)
        movzx eax, byte ptr arrdest[1]
        Print Str$("Element 1=\t%i\n", eax)
        movzx eax, byte ptr arrdest[2]
        Inkey Str$("Element 2=\t%i\n", eax)
  .else
        Inkey Str$("Yeppp!!!! error %i", eax)
  .endif
  Exit
end start


Output:
Element 0=      110
Element 1=      120
Element 2=      130


Here is the innermost loop:
10001E02   8A06       mov al, [esi]
10001E04   8A0C13     mov cl, [edx+ebx]
10001E07   02C8       add cl, al
10001E09   880A       mov [edx], cl
10001E0B   8D76 01    lea esi, [esi+1]
10001E0E   8D52 01    lea edx, [edx+1]
10001E11   4F         dec edi
10001E12   75 EE      jnz short 10001E02


- movzx eax, byte ptr [esi] is often faster than mov al, [esi]
- lea is rarely faster than inc esi
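
For readers who do not read disassembly: the loop above does the work of a plain byte-wise add. A minimal sketch in C, assuming unsigned bytes with wraparound modulo 256; the name add_bytes is made up for illustration and is not taken from the Yeppp! sources:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference loop, not Yeppp! source code:
   dest[i] = (a[i] + b[i]) modulo 256 for each of the n elements. */
static void add_bytes(const uint8_t *a, const uint8_t *b, uint8_t *dest, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* One load per source, one 8-bit add, one store:
           the same work per element as the disassembled loop. */
        dest[i] = (uint8_t)(a[i] + b[i]);
    }
}
```

With the arrays from the demo (10, 20, 30 and 100, 100, 100) this yields 110, 120, 130, matching the output shown earlier.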

Marat Dukhan

Quote from: Gunther on October 12, 2013, 06:21:26 AM
That's the approximation error, of course.
No, I use it to bound both the error of approximating a function with a polynomial and the error of evaluating the polynomial in IEEE arithmetic (i.e. roundoff error in polynomial evaluation).

Marat Dukhan

Quote from: qWord on October 12, 2013, 06:25:15 AM
Quote from: Marat Dukhan on October 12, 2013, 05:41:21 AM
Yes, double-double is an unevaluated sum of two doubles. From a high-level perspective you can imagine it as a double with a 106-bit mantissa.
Am I right in assuming this is the method described by T.J. Dekker[1]?
Yes, I use Dekker's double-double format.

Quote from: qWord on October 12, 2013, 06:25:15 AM
For the polynomials, it might be possible to reformulate them in such a way that the packed instructions (SSE2, add/mulpd) can be used.
I do use SSE2/AVX/FMA3/FMA4/NEON. Here are the implementations.
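
To make "unevaluated sum of two doubles" concrete, here is a small sketch in C. It is not Yeppp! code: the type dd_t and the function names are invented for illustration, and the error-free addition used is Knuth's TwoSum rather than anything confirmed from the library. It assumes strict IEEE double arithmetic (SSE2 code generation, no -ffast-math).

```c
/* Hypothetical double-double type: the value is the unevaluated sum hi + lo,
   where lo holds the bits that did not fit into hi. */
typedef struct { double hi, lo; } dd_t;

/* Error-free addition (Knuth's TwoSum): hi = fl(a + b), lo = exact rounding error. */
static dd_t two_sum(double a, double b)
{
    double s  = a + b;
    double bb = s - a;                      /* the part of b that made it into s */
    double e  = (a - (s - bb)) + (b - bb);  /* what the rounding of a + b discarded */
    dd_t r = { s, e };
    return r;
}

/* Simple double-double addition: add the high parts exactly,
   fold in the low parts, then renormalise. Adequate for illustration only. */
static dd_t dd_add(dd_t x, dd_t y)
{
    dd_t s = two_sum(x.hi, y.hi);
    double lo = s.lo + (x.lo + y.lo);
    return two_sum(s.hi, lo);
}
```

For example, adding 1.0 and 1e-20 in plain double arithmetic gives exactly 1.0, while two_sum(1.0, 1e-20) keeps the small term in the lo field.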

jj2007

Quote from: jj2007 on October 12, 2013, 07:18:17 AM
- movzx eax, byte ptr [esi] is often faster than mov al, [esi]
- lea is rarely faster than inc esi
Let's test it:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 188/100 cycles

417     cycles for 100 * Yeppp! original
318     cycles for 100 * Yeppp! movzx inc

417     cycles for 100 * Yeppp! original
318     cycles for 100 * Yeppp! movzx inc

417     cycles for 100 * Yeppp! original
318     cycles for 100 * Yeppp! movzx inc

37      bytes for Yeppp! original
35      bytes for Yeppp! movzx inc


(not yet a serious exercise, just a demo that C compilers can be beaten...)

Gunther

Jochen,

YepppTest.exe doesn't work on my machine. A fatal error occurs: GetLastError (line ??). Module not found.

Gunther
You have to know the facts before you can distort them.

qWord

Quote from: Marat Dukhan on October 12, 2013, 07:34:27 AM
I do use SSE2/AVX/FMA3/FMA4/NEON.
Yes, but there is code that evaluates polynomials using scalar SSE2 instructions:
Quote from: log.x64-ms.asm
   MOVSD xmm5, [rel _yepMath_Log_V64f_V64f_Nehalem_constants.c8]
   MULSD xmm5, xmm4
   ADDSD xmm5, [rel _yepMath_Log_V64f_V64f_Nehalem_constants.c9]
   MULSD xmm5, xmm4
   ADDSD xmm5, [rel _yepMath_Log_V64f_V64f_Nehalem_constants.c10]
This code can be parallelized using the packed SSE2 instructions MULPD/ADDPD:
P(x) = c0 + c1*x + c2*x^2 + c3*x^3 + c4*x^4 + ...
     = c0 +   x*(c1 + c3*x^2 + c5*x^4 + ...)
          + x^2*(c2 + c4*x^2 + c6*x^4 + ...)

The two bracketed expressions can be evaluated in parallel (Horner scheme in x^2).
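
The even/odd split can be sketched in scalar C to show that the two chains are independent. This is an illustration only, not the Yeppp! implementation: the function names are invented, and a packed version would keep the odd and even accumulators in the two lanes of one XMM register instead of in two variables.

```c
#include <stddef.h>

/* Plain Horner scheme: one long dependency chain.
   c[i] is the coefficient of x^i, n is the number of coefficients. */
static double horner(const double *c, size_t n, double x)
{
    double p = 0.0;
    for (size_t i = n; i-- > 0; )
        p = p * x + c[i];
    return p;
}

/* Even/odd split: P(x) = c0 + x*odd(x^2) + x^2*even(x^2), with
   odd(t)  = c1 + c3*t + c5*t^2 + ...
   even(t) = c2 + c4*t + c6*t^2 + ...
   The two accumulators never depend on each other. */
static double horner_split(const double *c, size_t n, double x)
{
    if (n == 0)
        return 0.0;
    double x2 = x * x;
    double odd = 0.0, even = 0.0;
    for (size_t i = n - 1; i >= 1; i--) {
        if (i & 1)
            odd = odd * x2 + c[i];    /* chain for c1, c3, c5, ... */
        else
            even = even * x2 + c[i];  /* chain for c2, c4, c6, ... */
    }
    return c[0] + x * odd + x2 * even;
}
```

Both functions compute the same polynomial, but the rounding differs slightly in general, so results may not be bit-identical for arbitrary coefficients.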
MREAL macros - when you need floating point arithmetic while assembling!

jj2007

Quote from: Gunther on October 12, 2013, 08:06:22 AM
YepppTest.exe doesn't work on my machine. A fatal error occurs: GetLastError (line ??). Module not found.

Gunther,
Sorry, I forgot to specify that you need a Yeppp! installation. More precisely: C:\Program Files\Yeppp! SDK\binaries\windows\x86\yeppp.dll - see above:

  Dll ExpandEnv$("%ProgramFiles%\Yeppp!!!! SDK\binaries\windows\x86\yeppp.dll")

The ExpandEnv$() serves to make it language neutral; for example, my "Program Files" are "Programmi" because the OS is Italian.
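
The same idea in C, as a rough sketch: take the folder from the environment instead of hard-coding "C:\Program Files". The helper name yeppp_dll_path is invented for this example; only the relative path below the Program Files folder comes from the post above.

```c
#include <stdio.h>

/* Hypothetical helper, not part of Yeppp! or MasmBasic.
   Builds "<program_files>\Yeppp! SDK\binaries\windows\x86\yeppp.dll" in out.
   Returns 0 on success, -1 if the folder is unknown or the buffer is too small. */
static int yeppp_dll_path(const char *program_files, char *out, size_t cap)
{
    if (program_files == NULL)
        return -1;
    int n = snprintf(out, cap, "%s\\Yeppp! SDK\\binaries\\windows\\x86\\yeppp.dll",
                     program_files);
    return (n >= 0 && (size_t)n < cap) ? 0 : -1;
}
```

A caller would typically pass getenv("ProgramFiles") from <stdlib.h> as the first argument; if the variable is not set, getenv returns NULL and the helper reports failure.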

Gunther

Jochen,

Quote from: jj2007 on October 12, 2013, 08:10:23 AM
Sorry, I forgot to specify that you need a Yeppp! installation. More precisely: C:\Program Files\Yeppp! SDK\binaries\windows\x86\yeppp.dll - see above:
  Dll ExpandEnv$("%ProgramFiles%\Yeppp!!!! SDK\binaries\windows\x86\yeppp.dll")

Ah, that's it. I'll do the installation forthwith. Anyway, here is your test result:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
+++++++++++++7 of 20 tests valid, loop overhead is approx. 179/100 cycles

226     cycles for 100 * Yeppp! original
198     cycles for 100 * Yeppp! movzx inc

228     cycles for 100 * Yeppp! original
199     cycles for 100 * Yeppp! movzx inc

226     cycles for 100 * Yeppp! original
200     cycles for 100 * Yeppp! movzx inc

37      bytes for Yeppp! original
35      bytes for Yeppp! movzx inc

--- ok ---



Marat Dukhan

Quote from: jj2007 on October 12, 2013, 07:18:17 AM
Here is the innermost loop:
10001E02   8A06       mov al, [esi]
10001E04   8A0C13     mov cl, [edx+ebx]
10001E07   02C8       add cl, al
10001E09   880A       mov [edx], cl
10001E0B   8D76 01    lea esi, [esi+1]
10001E0E   8D52 01    lea edx, [edx+1]
10001E11   4F         dec edi
10001E12   75 EE      jnz short 10001E02


- movzx eax, byte ptr [esi] is often faster than mov al, [esi]
- lea is rarely faster than inc esi

This is compiler-generated code. Yeppp! 1.0.0 does not have optimized kernels for 32-bit x86. Try 64-bit Yeppp! for much better performance. BTW, it is optimized for array sizes of a hundred elements or more.

jj2007

Quote from: Marat Dukhan on October 12, 2013, 11:17:51 AM
This is compiler-generated code. Yeppp! 1.0.0 does not have optimized kernels for 32-bit x86. Try 64-bit Yeppp! for much better performance. BTW, it is optimized for array sizes of a hundred elements or more.

Marat,
This was just an example (and the array, btw, has exactly 100 bytes ;)). Compilers have their strengths and their limitations...
Still, if you could identify a routine that...
- is frequently needed in the scientific & mathematical community
- maybe is frequently used in benchmarking mathematical libraries ;)
- is so slow that it forces people to have a coffee break
... then we could create a testbed and call it a challenge.
:icon14:?