Yeppp! High performance mathematical library

jj2007 · October 12, 2013, 07:18:17 AM

This simple demo adds two byte-sized arrays element by element:

include \masm32\MasmBasic\MasmBasic.inc ; download

.data ; define two source arrays:
arrs1 db 10, 20, 30
arrs2 db 100, 100, 100

.data? ; destination array
arrdest db 3 dup(?)

Enum 0:YepStatusOk, YepStatusNullPointer, YepStatusMisalignedPointer, YepStatusInvalidArgument
Enum YepStatusInvalidData, YepStatusInvalidState, YepStatusUnsupportedHardware, YepStatusUnsupportedSoftware
Enum YepStatusInsufficientBuffer, YepStatusOutOfMemory, YepStatusSystemError, YepStatusAccessDenied

Init
Dll ExpandEnv$("%ProgramFiles%\Yeppp!!!! SDK\binaries\windows\x86\yeppp.dll")
Declare yepCore_Add_V8sV8s_V8s, C:4 ; C calling convention, 4 arguments
Declare yepLibrary_Init, 0

void yepLibrary_Init()
.if yepCore_Add_V8sV8s_V8s(offset arrs1, offset arrs2, offset arrdest, lengthof arrdest)==YepStatusOk
movzx eax, byte ptr arrdest[0]
Print Str$("Element 0=\t%i\n", eax)
movzx eax, byte ptr arrdest[1]
Print Str$("Element 1=\t%i\n", eax)
movzx eax, byte ptr arrdest[2]
Inkey Str$("Element 2=\t%i\n", eax)
.else
Inkey Str$("Yeppp!!!! error %i", eax)
.endif
Exit
end start

Output:
Element 0= 110
Element 1= 120
Element 2= 130
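
For readers without MasmBasic, here is a minimal C sketch of what the call above appears to compute. The name add_bytes is hypothetical and this is a plain reference loop, not Yeppp!'s source. It treats the elements as unsigned bytes, the same way the demo zero-extends them with movzx before printing, and each sum is truncated to 8 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference loop (not Yeppp! code):
   dst[i] = a[i] + b[i], truncated to 8 bits, i.e. wrapping modulo 256. */
void add_bytes(const uint8_t *a, const uint8_t *b, uint8_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* a[i] + b[i] is computed as int; the cast keeps the low 8 bits. */
        dst[i] = (uint8_t)(a[i] + b[i]);
    }
}
```

Calling add_bytes with {10, 20, 30} and {100, 100, 100} gives {110, 120, 130}, matching the output above. The "8s" in the Yeppp! function name suggests signed 8-bit elements; if so, the bit pattern for 130 would read as -126 when interpreted as signed.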

Here is the innermost loop:
10001E02  8A06      mov al, [esi]
10001E04  8A0C13    mov cl, [edx+ebx]
10001E07  02C8      add cl, al
10001E09  880A      mov [edx], cl
10001E0B  8D76 01   lea esi, [esi+1]
10001E0E  8D52 01   lea edx, [edx+1]
10001E11  4F        dec edi
10001E12  75 EE     jnz short 10001E02

- movzx eax, byte ptr [esi] is often faster than mov al, [esi]
- lea is rarely faster than inc esi

Marat Dukhan · October 12, 2013, 07:27:40 AM

Quote from: Gunther on October 12, 2013, 06:21:26 AM
That's the approximation error, of course.

No, I use it to bound both the error of approximating a function with a polynomial and the error of evaluating the polynomial in IEEE arithmetic (i.e. roundoff error in polynomial evaluation).

Marat Dukhan · October 12, 2013, 07:34:27 AM

Quote from: qWord on October 12, 2013, 06:25:15 AM
Quote from: Marat Dukhan on October 12, 2013, 05:41:21 AM
Yes, double-double is an unevaluated sum of two doubles. From a high-level perspective you can imagine it as a double with a 106-bit mantissa.
Am I right in assuming the method described by T.J. Dekker [1]?
Yes, I use Dekker's double-double format.
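
To make "an unevaluated sum of two doubles" concrete, here is a minimal C sketch of the two classic error-free building blocks for double-double arithmetic. The type dd and the function names are hypothetical and this is not taken from Yeppp!; the sum is Knuth's branch-free TwoSum, and the product is Dekker's product using Veltkamp splitting. It assumes strict IEEE double arithmetic (no x87 extended precision, no fast-math options).

```c
/* Hypothetical double-double value: represents hi + lo, with |lo| much smaller than |hi|. */
typedef struct { double hi, lo; } dd;

/* Knuth's TwoSum: hi is the rounded sum, lo is the exact rounding error,
   so hi + lo equals a + b exactly. No ordering of |a| and |b| is required. */
dd two_sum(double a, double b)
{
    double s = a + b;
    double bb = s - a;
    double err = (a - (s - bb)) + (b - bb);
    return (dd){ s, err };
}

/* Veltkamp split: a == *hi + *lo, each part short enough that
   products of two parts are exact in double precision. */
static void split(double a, double *hi, double *lo)
{
    const double c = 134217729.0 * a;   /* 2^27 + 1 */
    *hi = c - (c - a);
    *lo = a - *hi;
}

/* Dekker's product: hi is the rounded product, lo the exact rounding error
   (barring overflow and underflow). */
dd two_prod(double a, double b)
{
    double p = a * b;
    double ah, al, bh, bl;
    split(a, &ah, &al);
    split(b, &bh, &bl);
    double err = ((ah * bh - p) + ah * bl + al * bh) + al * bl;
    return (dd){ p, err };
}
```

On hardware with FMA, the error term of the product can instead be obtained with a single fused multiply-add, which is presumably one reason FMA-capable targets are attractive for this kind of code.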

Quote from: qWord on October 12, 2013, 06:25:15 AM
For the polynomials, it might be possible to reformulate them in such a way that the packed instructions (SSE2, add/mulpd) can be used.
I do use SSE2/AVX/FMA3/FMA4/NEON. Here are the implementations.

jj2007 · October 12, 2013, 08:04:06 AM

Quote from: jj2007 on October 12, 2013, 07:18:17 AM
- movzx eax, byte ptr [esi] is often faster than mov al, [esi]
- lea is rarely faster than inc esi

Let's test it:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 188/100 cycles

417 cycles for 100 * Yeppp! original
318 cycles for 100 * Yeppp! movzx inc

417 cycles for 100 * Yeppp! original
318 cycles for 100 * Yeppp! movzx inc

417 cycles for 100 * Yeppp! original
318 cycles for 100 * Yeppp! movzx inc

37 bytes for Yeppp! original
35 bytes for Yeppp! movzx inc

(not yet a serious exercise, just a demo that C compilers can be beaten...)

Gunther · October 12, 2013, 08:06:22 AM

Jochen,

YepppTest.exe doesn't work on my machine. A fatal error occurs: GetLastError (line ??). Module not found.

Gunther

qWord · October 12, 2013, 08:08:06 AM

Quote from: Marat Dukhan on October 12, 2013, 07:34:27 AM
I do use SSE2/AVX/FMA3/FMA4/NEON.

Yes, but there is code that evaluates polynomials using scalar SSE2 instructions:

Quote from: log.x64-ms.asm
   MOVSD xmm5, [rel _yepMath_Log_V64f_V64f_Nehalem_constants.c8]
   MULSD xmm5, xmm4
   ADDSD xmm5, [rel _yepMath_Log_V64f_V64f_Nehalem_constants.c9]
   MULSD xmm5, xmm4
   ADDSD xmm5, [rel _yepMath_Log_V64f_V64f_Nehalem_constants.c10]

This code can be parallelized using the packed SSE2 instructions mulpd/addpd:
P(x) = c₀ + c₁x + c₂x² + c₃x³ + c₄x⁴ + ...
     = c₀ + x(c₁ + c₃x² + c₅x⁴ + ...)
          + x²(c₂ + c₄x² + c₆x⁴ + ...)
The two bracketed expressions are both polynomials in x², so they can be evaluated in parallel (Horner scheme), one per lane of a packed register.
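
As a rough illustration of that split, here is a minimal C sketch with two independent accumulators, one per bracket. The function names are hypothetical and this is not Yeppp! code; it uses scalar C rather than intrinsics, with each accumulator standing in for one lane of a mulpd/addpd pair. Both functions assume at least one coefficient (n >= 1).

```c
#include <stddef.h>

/* Plain Horner scheme: c[0] + x*(c[1] + x*(c[2] + ...)).
   One long dependency chain of n-1 multiply/add pairs. */
double horner(const double *c, size_t n, double x)
{
    double acc = c[n - 1];
    for (size_t i = n - 1; i-- > 0; )
        acc = acc * x + c[i];
    return acc;
}

/* Split form: c[0] + x*(c1 + c3*y + c5*y^2 + ...) + y*(c2 + c4*y + ...), y = x*x.
   The odd and even accumulators never depend on each other, so the two
   chains could run side by side in the two halves of a packed register. */
double horner_split(const double *c, size_t n, double x)
{
    const double y = x * x;
    double odd = 0.0, even = 0.0;
    for (size_t i = n - 1; i >= 1; i--) {
        if (i & 1)
            odd = odd * y + c[i];    /* c1, c3, c5, ... */
        else
            even = even * y + c[i];  /* c2, c4, c6, ... */
    }
    return c[0] + x * odd + y * even;
}
```

Each chain is about half as long as the original one. The operations are reordered, so in floating point the result may differ from plain Horner in the last bits, which matters when the error bound of the polynomial evaluation has been analysed for one specific order.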

jj2007 · October 12, 2013, 08:10:23 AM

Quote from: Gunther on October 12, 2013, 08:06:22 AM
YepppTest.exe doesn't work on my machine. A fatal error occurs: GetLastError (line ??). Module not found.

Gunther,
Sorry, I forgot to specify that you need a Yeppp! installation. More precisely, the demo expects C:\Program Files\Yeppp! SDK\binaries\windows\x86\yeppp.dll (see above):

Dll ExpandEnv$("%ProgramFiles%\Yeppp!!!! SDK\binaries\windows\x86\yeppp.dll")

The ExpandEnv$() serves to make it language neutral; for example, my "Program Files" are "Programmi" because the OS is Italian.

Gunther · October 12, 2013, 08:15:43 AM

Jochen,

Quote from: jj2007 on October 12, 2013, 08:10:23 AM
Sorry, I forgot to specify that you need a Yeppp! installation. More precisely: C:\Program Files\Yeppp! SDK\binaries\windows\x86\yeppp.dll - see above,
Dll ExpandEnv$("%ProgramFiles%\Yeppp!!!! SDK\binaries\windows\x86\yeppp.dll")

Ah, that's it. I'll do the installation right away. Anyway, here is your test result:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
+++++++++++++7 of 20 tests valid, loop overhead is approx. 179/100 cycles

226     cycles for 100 * Yeppp! original
198     cycles for 100 * Yeppp! movzx inc

228     cycles for 100 * Yeppp! original
199     cycles for 100 * Yeppp! movzx inc

226     cycles for 100 * Yeppp! original
200     cycles for 100 * Yeppp! movzx inc

37      bytes for Yeppp! original
35      bytes for Yeppp! movzx inc

--- ok ---

Gunther

Marat Dukhan · October 12, 2013, 11:17:51 AM

Quote from: jj2007 on October 12, 2013, 07:18:17 AM
Here is the innermost loop:
10001E02  8A06      mov al, [esi]
10001E04  8A0C13    mov cl, [edx+ebx]
10001E07  02C8      add cl, al
10001E09  880A      mov [edx], cl
10001E0B  8D76 01   lea esi, [esi+1]
10001E0E  8D52 01   lea edx, [edx+1]
10001E11  4F        dec edi
10001E12  75 EE     jnz short 10001E02

- movzx eax, byte ptr [esi] is often faster than mov al, [esi]
- lea is rarely faster than inc esi

This is compiler-generated code. Yeppp! 1.0.0 does not have optimized kernels for 32-bit x86. Try 64-bit Yeppp! for much better performance. BTW, it is optimized for array sizes of a hundred elements or more.

jj2007 · October 12, 2013, 04:37:01 PM

Quote from: Marat Dukhan on October 12, 2013, 11:17:51 AM
This is compiler-generated code. Yeppp! 1.0.0 does not have optimized kernels for 32-bit x86. Try 64-bit Yeppp! for much better performance. BTW, it is optimized for array sizes of a hundred elements or more.

Marat,
This was just an example (and the array, btw, has exactly 100 bytes ;)). Compilers have their strengths and their limitations...
Still, if you could identify a routine that...
- is frequently needed in the scientific & mathematical community
- maybe is frequently used in benchmarking mathematical libraries ;)
- is so slow that it forces people to have a coffee break
... then we could create a testbed and call it a challenge.
:icon14:?
