Topic: Creativity question.

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 4922
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Creativity question.
« on: November 19, 2017, 05:04:52 PM »
I wonder how hard it would be to make a 64-bit SSE (or later) library to perform a range of common maths tasks? I know you can do it up to 80-bit FP using the FP registers and mnemonics, but I have done little work with maths in SSE or later. The aim of such an idea is to take advantage of the extra speed of SSE and later instruction sets over the older FP registers and mnemonics.
hutch at movsd dot com
http://www.masm32.com    :biggrin:  :biggrin:

jj2007

  • Member
  • *****
  • Posts: 7728
  • Assembler is fun ;-)
    • MasmBasic
Re: Creativity question.
« Reply #1 on: November 19, 2017, 06:28:22 PM »
Quick test with the GSL, picking this line arbitrarily...
Code: [Select]
int 3
Print Str$("\ns_variance= \t %5f", gsl_stats_variance(ecx, 1, ebx))

... and hitting F7 a few times:
Code: [Select]
Address         Hex dump              Command                     Comments
630E7AB0        ┌> └55                push ebp
630E7AB1        │.  8BEC              mov ebp, esp
...
630E7ACC        │.  F20F1045 F8       movsd xmm0, [ebp-8]          ; <<<< SIMD #######
630E7AD1        │.  F20F110424        movsd [esp], xmm0           ; ┌Arg4_5
630E7AD6        │.  53                push ebx                    ; │Arg3 => [ARG.3]
630E7AD7        │.  FF75 0C           push dword ptr [ebp+0C]     ; │Arg2 => [ARG.2]
630E7ADA        │.  FF75 08           push dword ptr [ebp+8]      ; │Arg1 => [ARG.1]
630E7ADD        │.  E8 AECAFFFF       call 630E4590               ; └gsl.630E4590
630E7AE2        │.  660F6ECB          movd xmm1, ebx          ; <<<< SIMD #######
630E7AE6        │.  8BC3              mov eax, ebx
630E7AE8        │.  F30FE6C9          cvtdq2pd xmm1, xmm1          ; <<<< SIMD #######
630E7AEC        │.  C1E8 1F           shr eax, 1F
630E7AEF        │.  83C4 14           add esp, 14
630E7AF2        │.  F20F580CC5 00B912 addsd xmm1, [eax*8+6312B900]
630E7AFB        │.  8D43 FF           lea eax, [ebx-1]
630E7AFE        │.  660F6EC0          movd xmm0, eax
630E7B02        │.  F30FE6C0          cvtdq2pd xmm0, xmm0          ; <<<< SIMD #######
630E7B06        │.  C1E8 1F           shr eax, 1F
630E7B09        │.  5B                pop ebx
630E7B0A        │.  F20F5804C5 00B912 addsd xmm0, [eax*8+6312B900]
630E7B13        │.  F20F5EC8          divsd xmm1, xmm0          ; <<<< SIMD #######
630E7B17        │.  F20F114D F8       movsd [ebp-8], xmm1          ; <<<< SIMD #######
630E7B1C        │.  DC4D F8           fmul qword ptr [ebp-8]
630E7B1F        │.  8BE5              mov esp, ebp
630E7B21        │.  5D                pop ebp
630E7B22        └.  C3                retn

That's the 32-bit version, of course. But it seems obvious that the guys have had the same idea before. I know that you have a problem with the GSL, but this stuff is so complicated, why reinvent the wheel if a bunch of real mathematicians have done the job already, and give you over 1,000 functions ready to be linked in?
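
For anyone puzzled by the repeated cvtdq2pd / shr eax, 1F / addsd groups in the listing: they look like the usual way a compiler converts an unsigned 32-bit count to a double when only a signed conversion is available. Below is a minimal C sketch of that idea. The names u32_to_double and n_over_n_minus_1 are made up for illustration, and the contents of the two-entry table (0.0 and 2^32) are an assumption, because the listing shows only the table's address.

```c
#include <stdint.h>
#include <string.h>

/* ASSUMED table contents: the listing shows only the address 6312B900. */
static const double fixup[2] = { 0.0, 4294967296.0 };   /* 0 and 2^32 */

/* Hypothetical helper: unsigned 32-bit -> double via a signed conversion. */
double u32_to_double(uint32_t n)
{
    int32_t as_signed;
    memcpy(&as_signed, &n, sizeof as_signed);  /* reinterpret the bits (movd) */
    double d = (double)as_signed;              /* signed conversion (cvtdq2pd) */
    return d + fixup[n >> 31];                 /* sign bit selects the fixup (shr, addsd) */
}

/* Hypothetical helper: the n / (n - 1) ratio formed by divsd before the fmul. */
double n_over_n_minus_1(uint32_t n)
{
    return u32_to_double(n) / u32_to_double(n - 1u);
}
```

With n = 5 the ratio is 1.25. In the listing that ratio is stored to [ebp-8] and then multiplied by whatever the call at 630E7ADD left on the FPU stack, which looks like the n/(n-1) correction of a sample variance.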

There is a discussion comparing Eigen, GSL and others at ResearchGate:
Quote
But if you want speed, GSL is the best choice: http://www.gnu.org/software/gsl/

A comparison between Boost and GSL says the latter is much faster (and Boost is a behemoth, too).

There is a good overview at University of Utah, CHPC - Research Computing Support for the University: Math Libraries

We had a long thread about Yeppp!, and while I just confirmed that this snippet still builds and runs fine (provided you use ML 6.15 ... 10.0 - one of the rare cases where UAsm is not compatible...), it also turns out that the latest DLL doesn't use SIMD instructions. Yeppp! looks pretty dead :(

And then there is the Intel MKL, and guess what? We tested it already :biggrin:

LiaoMi

  • Member
  • **
  • Posts: 162
Re: Creativity question.
« Reply #2 on: November 19, 2017, 08:10:44 PM »
Hi everybody,

Here is an interesting library from which you can estimate the amount of work involved: https://github.com/VcDevel/Vc

Vc-master.zip\Vc-master\attic\sse - ZIP archive, unpacked size 4 280 387 bytes

Vc-master\attic\sse\casts.h
Vc-master\attic\sse\const.h
Vc-master\attic\sse\const_data.h
Vc-master\attic\sse\debug.h
Vc-master\attic\sse\deinterleave.tcc
Vc-master\attic\sse\detail.h
Vc-master\attic\sse\helperimpl.h
Vc-master\attic\sse\intrinsics.h
Vc-master\attic\sse\limits.h
Vc-master\attic\sse\macros.h
Vc-master\attic\sse\mask.h
Vc-master\attic\sse\mask.tcc
Vc-master\attic\sse\math.h
Vc-master\attic\sse\prefetches.tcc
Vc-master\attic\sse\shuffle.h
Vc-master\attic\sse\simd_cast.h
Vc-master\attic\sse\simd_cast_caller.tcc
Vc-master\attic\sse\type_traits.h
Vc-master\attic\sse\types.h
Vc-master\attic\sse\vector.h
Vc-master\attic\sse\vector.tcc
Vc-master\attic\sse\vectorhelper.h
Vc-master\attic\sse\vectorhelper.tcc

Vc is a free software library to ease explicit vectorization of C++ code. It has an intuitive API and provides portability between different compilers and compiler versions as well as portability between different vector instruction sets. Thus an application written with Vc can be compiled for:

AVX and AVX2
SSE2 up to SSE4.2 or SSE4a
Scalar
MIC (only before Vc 2.0)
AVX-512
NEON (in development)
NVIDIA GPUs / CUDA (research)

hutch--

Re: Creativity question.
« Reply #3 on: November 19, 2017, 08:13:29 PM »
I think you already know my response: I ask from GPL what I give to them, which is nothing, as I don't need their licence or the strings that come with it. We once had programmers who wrote useful things; it is sad to see that disappear.

aw27

  • Member
  • ****
  • Posts: 851
  • Let's Make ASM Great Again!
Re: Creativity question.
« Reply #4 on: November 19, 2017, 08:25:09 PM »
Quote from: hutch--
I wonder how hard it would be to make a 64-bit SSE (or later) library to perform a range of common maths tasks? I know you can do it up to 80-bit FP using the FP registers and mnemonics, but I have done little work with maths in SSE or later. The aim of such an idea is to take advantage of the extra speed of SSE and later instruction sets over the older FP registers and mnemonics.

SSE (MMX and AVX as well) really shines for vectors, but the FPU has lots of internal primitives (sin, cos, ln) that have no counterpart in SSE and have to be done in software. This can be quite involved.

As an example, the MASM equivalent of the DirectX XMScalarCos (64-bit).

Code: [Select]

.code

XMScalarCos proc public
    movaps xmm1, xmm0                   ; xmm1 = input value
    movaps xmm2, xmm0
    mulss xmm2, _XM_1DIV2PI             ; xmm2 = quotient
    comiss xmm1, _XM_REAL4ZERO
    .if ABOVEEQUAL?
        addss xmm2, _XM_REAL4HALF       ; add 0.5, then truncate
        cvttss2si eax, xmm2
        cvtsi2ss xmm2, eax
    .else
        subss xmm2, _XM_REAL4HALF       ; subtract 0.5, then truncate
        cvttss2si eax, xmm2
        cvtsi2ss xmm2, eax
    .endif
    mulss xmm2, _XM_2PI
    movss xmm1, xmm0
    subss xmm1, xmm2                    ; xmm1 = y
    movss xmm0, xmm1
    comiss xmm0, _XM_PIDIV2
    .if ABOVE?
        movss xmm2, _XM_PI
        subss xmm2, xmm1
        movss xmm1, xmm2                ; y = pi - y
        movss xmm2, _XM_MINUSONE        ; xmm2 = sign
    .else
        movss xmm0, xmm1
        comiss xmm0, _XM_MINUSPIDIV2
        .if BELOW?
            movss xmm2, _XM_MINUSPI
            subss xmm2, xmm1
            movss xmm1, xmm2            ; y = -pi - y
            movss xmm2, _XM_MINUSONE    ; xmm2 = sign
        .else
            movss xmm2, _XM_PLUSONE
        .endif
    .endif
    mulss xmm1, xmm1                    ; xmm1 now = y^2

    movss xmm3, _Constant1_XMScalarCos  ; polynomial in y^2, evaluated Horner style
    mulss xmm3, xmm1
    addss xmm3, _Constant2_XMScalarCos
    mulss xmm3, xmm1
    subss xmm3, _Constant3_XMScalarCos
    mulss xmm3, xmm1
    addss xmm3, _Constant4_XMScalarCos
    mulss xmm3, xmm1
    subss xmm3, _XM_REAL4HALF
    mulss xmm3, xmm1
    addss xmm3, _XM_PLUSONE

    mulss xmm2, xmm3                    ; apply the sign
    movss xmm0, xmm2                    ; result is returned in xmm0
    ret
XMScalarCos endp

This is just for illustration; if anyone is interested, I can supply the complete data needed to make the function(s) work. It was actually written for Jwasm/Uasm.
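
For readers who want to follow the logic without an assembler, here is a rough C sketch of the same three steps: reduce to [-pi, pi], fold into [-pi/2, pi/2] with a sign, then evaluate a polynomial in y^2. It is not the DMath or DirectXMath source. The function name scalar_cos_sketch is made up, and because the _ConstantN values are not listed in the post, plain Taylor-series coefficients are swapped in as an assumption. A tuned library would use fitted coefficients that differ slightly.

```c
/* Constants named after the ones the assembly references. */
#define XM_PI      3.141592654f
#define XM_2PI     6.283185307f
#define XM_1DIV2PI 0.159154943f
#define XM_PIDIV2  1.570796327f

/* Hypothetical C rendering of the steps in the MASM routine. */
float scalar_cos_sketch(float value)
{
    /* Step 1: reduce value to y in [-pi, pi]. */
    float quotient = XM_1DIV2PI * value;
    quotient = (float)(int)(value >= 0.0f ? quotient + 0.5f
                                          : quotient - 0.5f);
    float y = value - XM_2PI * quotient;

    /* Step 2: fold y into [-pi/2, pi/2] and remember the sign flip. */
    float sign = 1.0f;
    if (y > XM_PIDIV2) {
        y = XM_PI - y;
        sign = -1.0f;
    } else if (y < -XM_PIDIV2) {
        y = -XM_PI - y;
        sign = -1.0f;
    }

    /* Step 3: Horner evaluation in y^2.
       ASSUMPTION: Taylor coefficients stand in for the unlisted
       _Constant1.._Constant4 values. */
    const float c1 = -1.0f / 3628800.0f;  /* -1/10! */
    const float c2 =  1.0f / 40320.0f;    /*  1/8!  */
    const float c3 =  1.0f / 720.0f;      /*  1/6!, subtracted below */
    const float c4 =  1.0f / 24.0f;       /*  1/4!  */
    float y2 = y * y;
    float p = ((((c1 * y2 + c2) * y2 - c3) * y2 + c4) * y2 - 0.5f) * y2 + 1.0f;
    return sign * p;
}
```

The cast to int mirrors cvttss2si, so the sketch is only meaningful while the rounded quotient fits in a 32-bit integer.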



Siekmanski

  • Member
  • *****
  • Posts: 1139
Re: Creativity question.
« Reply #5 on: November 20, 2017, 01:05:48 AM »
Hi aw27,
I'm interested.  :t

jj2007

Re: Creativity question.
« Reply #6 on: November 20, 2017, 01:50:18 AM »
Quote from: aw27
As an example, the MASM equivalent of the DirectX XMScalarCos (64-bit).

When I post examples or code it is always complete and ready to be built
::)

aw27

Re: Creativity question.
« Reply #7 on: November 20, 2017, 02:11:28 AM »
Welcome JJ!  :t
This is part of my DMath library, which has more than 600 functions and is mostly an ASM translation of DirectXMath. I can make available the source of a few functions that have a counterpart in the FPU. But you can always see how it works by looking at the Microsoft C++ open source code.  :icon_rolleyes:

@Siekmanski
All right, I will PM you tomorrow.

HSE

  • Member
  • ****
  • Posts: 550
  • <AMD>< 7-32>
Re: Creativity question.
« Reply #8 on: November 20, 2017, 02:42:01 AM »
Hi Hutch!
You can try the SSE backend of SmplMath. In some 32-bit tests the calculations take only 75% of the time of the FPU backend; the gain is limited because only a portion of the program uses SmplMath.

HSE

Re: Creativity question.
« Reply #9 on: November 27, 2017, 09:02:30 AM »
Hi Atelier!

Apparently I have forgotten how to compile C#.  Would it be possible for you to upload testApp.exe? Just to compare some results. Thanks.

The DMath library is very interesting, only a little limited if it follows DirectXMath:
Quote from: MSDN
DirectXMath supports vectors of 4 single-precision floating-point or four 32-bit (signed or unsigned) values.



aw27

Re: Creativity question.
« Reply #10 on: November 27, 2017, 09:04:25 PM »
Hi HSE,

Quote
Apparently I have forgotten how to compile C#.  Would it be possible for you to upload testApp.exe?
It is attached.

Quote
only a little limited if it follows DirectXMath
I have also done all the collision functions (BoundingBox, BoundingFrustum, BoundingOrientedBox, BoundingSphere), which are not part of DirectXMath proper, and have tested them with a modification of the Collision sample from the DirectX SDK. (You can download the attachment in the next message, because it goes over 512KB here.) It outperforms the original.
However, I will not publish anything about these collision functions, because I can't produce a decent Collision sample of my own.



aw27

Re: Creativity question.
« Reply #11 on: November 27, 2017, 09:06:49 PM »
Collision sample mentioned above.

HSE

Re: Creativity question.
« Reply #12 on: November 28, 2017, 01:22:16 AM »
 :t Thanks

Advanced graphics are beyond my scope, but I see the collision idea. Text (buttons and listbox) is not visible on the computer where I ran the program.
I think you could modify the colors of both objects at every collision, and randomly (or not) change the intensities of R, G or B, or follow a sequence of colors, ...

aw27

Re: Creativity question.
« Reply #13 on: November 28, 2017, 02:00:43 AM »
Quote from: HSE
Advanced graphics are beyond my scope, but I see the collision idea. Text (buttons and listbox) is not visible on the computer where I ran the program.
I know there is a problem on some systems, but it is an issue with the Microsoft sample itself, which may have been fixed in the meantime in the GitHub repository. I really have no clue.

Siekmanski

Re: Creativity question.
« Reply #14 on: November 28, 2017, 02:53:56 AM »
Thanks, cool stuff.  8)