Print Page - Creativity question.

Title: Creativity question.
Post by: hutch-- on November 19, 2017, 05:04:52 PM

I wonder how hard it would be to make a 64 bit SSE or later library to perform a range of common maths tasks ? I know you can do it up to 80 bit FP using the FP registers and mnemonics but I have done little work with maths in SSE or later and the aim with such an idea is to take advantage of the extra speed of SSE and later as against the older FP registers and mnemonics.

Title: Re: Creativity question.
Post by: jj2007 on November 19, 2017, 06:28:22 PM

Quick test with the GSL (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1389), picking arbitrarily this line...

Code Select

	int 3
	Print Str$("\ns_variance= \t %5f", gsl_stats_variance(ecx, 1, ebx))

... and hitting F7 a few times:

Code Select

Address         Hex dump              Command                     Comments
630E7AB0        Ú> À55                push ebp
630E7AB1        ³.  8BEC              mov ebp, esp
...
630E7ACC        ³.  F20F1045 F8       movsd xmm0, [ebp-8]          ; <<<< SIMD #######
630E7AD1        ³.  F20F110424        movsd [esp], xmm0           ; ÚArg4_5
630E7AD6        ³.  53                push ebx                    ; ³Arg3 => [ARG.3]
630E7AD7        ³.  FF75 0C           push dword ptr [ebp+0C]     ; ³Arg2 => [ARG.2]
630E7ADA        ³.  FF75 08           push dword ptr [ebp+8]      ; ³Arg1 => [ARG.1]
630E7ADD        ³.  E8 AECAFFFF       call 630E4590               ; Àgsl.630E4590
630E7AE2        ³.  660F6ECB          movd xmm1, ebx          ; <<<< SIMD #######
630E7AE6        ³.  8BC3              mov eax, ebx
630E7AE8        ³.  F30FE6C9          cvtdq2pd xmm1, xmm1          ; <<<< SIMD #######
630E7AEC        ³.  C1E8 1F           shr eax, 1F
630E7AEF        ³.  83C4 14           add esp, 14
630E7AF2        ³.  F20F580CC5 00B912 addsd xmm1, [eax*8+6312B900
630E7AFB        ³.  8D43 FF           lea eax, [ebx-1]
630E7AFE        ³.  660F6EC0          movd xmm0, eax
630E7B02        ³.  F30FE6C0          cvtdq2pd xmm0, xmm0          ; <<<< SIMD #######
630E7B06        ³.  C1E8 1F           shr eax, 1F
630E7B09        ³.  5B                pop ebx
630E7B0A        ³.  F20F5804C5 00B912 addsd xmm0, [eax*8+6312B900
630E7B13        ³.  F20F5EC8          divsd xmm1, xmm0          ; <<<< SIMD #######
630E7B17        ³.  F20F114D F8       movsd [ebp-8], xmm1          ; <<<< SIMD #######
630E7B1C        ³.  DC4D F8           fmul qword ptr [ebp-8]
630E7B1F        ³.  8BE5              mov esp, ebp
630E7B21        ³.  5D                pop ebp
630E7B22        À.  C3                retn

That's the 32-bit version, of course. But it seems obvious that the guys have had the same idea before. I know that you have a problem with the GSL, but this stuff is so complicated, why reinvent the wheel if a bunch of real mathematicians have done the job already, and give you over 1,000 functions ready to be linked in?

There is a discussion comparing Eigen, GSL and others at ResearchGate (https://www.researchgate.net/post/What_is_the_best_numerical_library_c_c):

QuoteBut if you want speed, GSL is the best choice: http://www.gnu.org/software/gsl/

A comparison between Boost and GSL (http://boost%20vs%20gsl) says the latter is much faster (and Boost is a behemoth, too).

There is a good overview at University of Utah, CHPC - Research Computing Support for the University: Math Libraries (https://www.chpc.utah.edu/documentation/software/mathlibraries.php)

We had a long thread about Yeppp!, and while I just confirmed that this snippet still builds and runs fine (http://masm32.com/board/index.php?topic=2445.msg25742#msg25742) (provided you use ML 6.15 ... 10.0 - one of the rare cases where UAsm is not compatible...), it also turns out that the latest DLL doesn't use SIMD instructions. Yeppp! looks pretty dead :(

And then there is the Intel MKL, and guess what? We tested it already (http://masm32.com/board/index.php?topic=5735.msg61158#msg61158) :biggrin:

Title: Re: Creativity question.
Post by: LiaoMi on November 19, 2017, 08:10:44 PM

Hi everybody,

here an interesting library, you can estimate the amount of work https://github.com/VcDevel/Vc

Vc-master.zip\Vc-master\attic\sse - ZIP archive, unpacked size 4 280 387 bytes

Vc-master\attic\sse\casts.h
Vc-master\attic\sse\const.h
Vc-master\attic\sse\const_data.h
Vc-master\attic\sse\debug.h
Vc-master\attic\sse\deinterleave.tcc
Vc-master\attic\sse\detail.h
Vc-master\attic\sse\helperimpl.h
Vc-master\attic\sse\intrinsics.h
Vc-master\attic\sse\limits.h
Vc-master\attic\sse\macros.h
Vc-master\attic\sse\mask.h
Vc-master\attic\sse\mask.tcc
Vc-master\attic\sse\math.h
Vc-master\attic\sse\prefetches.tcc
Vc-master\attic\sse\shuffle.h
Vc-master\attic\sse\simd_cast.h
Vc-master\attic\sse\simd_cast_caller.tcc
Vc-master\attic\sse\type_traits.h
Vc-master\attic\sse\types.h
Vc-master\attic\sse\vector.h
Vc-master\attic\sse\vector.tcc
Vc-master\attic\sse\vectorhelper.h
Vc-master\attic\sse\vectorhelper.tcc

Vc is a free software library to ease explicit vectorization of C++ code. It has an intuitive API and provides portability between different compilers and compiler versions as well as portability between different vector instruction sets. Thus an application written with Vc can be compiled for:

AVX and AVX2
SSE2 up to SSE4.2 or SSE4a
Scalar
MIC (only before Vc 2.0)
AVX-512
NEON (in development)
NVIDIA GPUs / CUDA (research)

Title: Re: Creativity question.
Post by: hutch-- on November 19, 2017, 08:13:29 PM

I think you already know my response, I ask from GPL what I give to them, nothing as I don't need their licence or the strings that come with it. We used to have programmers once that wrote useful things, sad to see it disappear.

Title: Re: Creativity question.
Post by: aw27 on November 19, 2017, 08:25:09 PM

Quote from: hutch-- on November 19, 2017, 05:04:52 PM
I wonder how hard it would be to make a 64 bit SSE or later library to perform a range of common maths tasks ? I know you can do it up to 80 bit FP using the FP registers and mnemonics but I have done little work with maths in SSE or later and the aim with such an idea is to take advantage of the extra speed of SSE and later as against the older FP registers and mnemonics.

SSE (MMX and AVX as well) really shine for vectors but FP has lots of internal primitives (sin, cos, ln) that have no counterpart in SSE and have to be done in software. This can be quite involving.

As an example, the MASM equivalent of the DirectX XMScalarCos (64-bit).

Code Select



.code

XMScalarCos proc public 
	movaps xmm1, xmm0
	movaps xmm2, xmm0
	mulss xmm2, _XM_1DIV2PI ; xmm2=quocient
	comiss xmm1, _XM_REAL4ZERO
	.if ABOVEEQUAL?
		addss xmm2, _XM_REAL4HALF
		cvttss2si eax, xmm2
		cvtsi2ss xmm2, eax
	.else
		subss xmm2, _XM_REAL4HALF	
		cvttss2si eax, xmm2
		cvtsi2ss xmm2, eax
	.endif
	mulss xmm2, _XM_2PI 
	movss xmm1, xmm0
	subss xmm1, xmm2 ; xmm1=y
	movss xmm0, xmm1
	comiss xmm0, _XM_PIDIV2
	.if ABOVE?
		movss xmm2, _XM_PI
		subss xmm2, xmm1
		movss xmm1, xmm2
		movss xmm2, _XM_MINUSONE ; xmm2=sign
	.else	
		movss xmm0, xmm1
		comiss xmm0, _XM_MINUSPIDIV2
		.if BELOW?
			movss xmm2, _XM_MINUSPI
			subss xmm2, xmm1
			movss xmm1, xmm2
			movss xmm2, _XM_MINUSONE ; xmm2=sign			
		.else
			movss xmm2, _XM_PLUSONE
		.endif
	.endif
	mulss xmm1, xmm1 ; xmm1 now =y^2
	
	movss xmm3, _Constant1_XMScalarCos
	mulss xmm3, xmm1
	addss xmm3, _Constant2_XMScalarCos
	mulss xmm3, xmm1
	subss xmm3, _Constant3_XMScalarCos
	mulss xmm3, xmm1
	addss xmm3, _Constant4_XMScalarCos
	mulss xmm3, xmm1
	subss xmm3, _XM_REAL4HALF
	mulss xmm3, xmm1
	addss xmm3, _XM_PLUSONE
	
	mulss xmm2, xmm3
	movss xmm0, xmm2
	ret
XMScalarCos endp

This is just for illustration, if interested I can produce complete data to make the function(s) work. Actually it was done for Jwasm/Uasm.

Title: Re: Creativity question.
Post by: Siekmanski on November 20, 2017, 01:05:48 AM

Hi aw27,
I'm interested. :t

Title: Re: Creativity question.
Post by: jj2007 on November 20, 2017, 01:50:18 AM

Quote from: aw27 on November 19, 2017, 08:25:09 PMAs an example, the MASM equivalent of the DirectX XMScalarCos (64-bit).

Quote from: aw27 on November 18, 2017, 12:35:00 AMWhen I post examples or code it is always complete and ready to be built

::)

Title: Re: Creativity question.
Post by: aw27 on November 20, 2017, 02:11:28 AM

Welcome JJ! :t
This is part of my DMath library (https://www.codeproject.com/Tips/1174521/DMath-is-DirectXMath-for-All) which has more than 600 functions and is mostly a ASM translation of the DirectXMath. I can make available the source of a few functions that have some correspondence in the FP unit. But you can always see how it works looking at the Microsoft C++ open source code. :icon_rolleyes:

@Siekmanski
All right, I will PM you tomorrow.

Title: Re: Creativity question.
Post by: HSE on November 20, 2017, 02:42:01 AM

Hi Hutch!
You can try the SSE backend of SmplMath. In some test with 32bit the calculations take only 75% of FPU backend time, because just a portion of the program is using SmplMath.

Title: Re: Creativity question.
Post by: HSE on November 27, 2017, 09:02:30 AM

Hi Atelier!

Apparently I have forgotten how to compile C#. It's posible that you upload testApp.exe? Just to compare some results. Thanks.

The library DMath is very interesting, only a little limited if follow DirectXMath:

Quote from: MSDNDirectXMath supports vectors of 4 single-precision floating-point or four 32-bit (signed or unsigned) values.

Title: Re: Creativity question.
Post by: aw27 on November 27, 2017, 09:04:25 PM

Hi HSE,

Quote
Apparently I have forgotten how to compile C#. It's posible that you upload testApp.exe?

It is attached.

Quote
only a little limited if follow DirectXMath

I have also done all the collision functions, BoundingBox, BoundingFrustrum,BoundingOrientedBox,BoundingSphere, which are not part of the DirectxMath proper, and have tested with a modification of the Collision sample from the DirectxSdk (You can download the attachment in the next message because it goes over 512KB here). It outperforms the original.
However, I will not publish anything about these collision functions because I can't myself produce a decent Collision sample of my own.

Title: Re: Creativity question.
Post by: aw27 on November 27, 2017, 09:06:49 PM

Collision sample mentioned above.

Title: Re: Creativity question.
Post by: HSE on November 28, 2017, 01:22:16 AM

:t Thanks

Advanced graphics are beyond my scope but I see the collision idea. Text (buttons and listbox) are not visible in the computer that I runned the program.
I think you can modifed the colors of both objects in every collision, and randomly (or not) change intensities of R, G or B, or to follow a sequence of colors, ...

Title: Re: Creativity question.
Post by: aw27 on November 28, 2017, 02:00:43 AM

Quote from: HSE on November 28, 2017, 01:22:16 AM
Advanced graphics are beyond my scope but I see the collision idea. Text (buttons and listbox) are not visible in the computer that I runned the program.

I know there is a problem in some systems but it is an issue of the Microsoft sample itself, which may have been fixed in the meantime in the github repository. I really have no clue.

Title: Re: Creativity question.
Post by: Siekmanski on November 28, 2017, 02:53:56 AM

Thanks, cool stuff. 8)

The MASM Forum

General => The Workshop => Topic started by: hutch-- on November 19, 2017, 05:04:52 PM