Code Snippets

johnsa · February 14, 2018, 08:22:52 PM

I thought it might be useful (fun?) to start a sticky board of all our favourite code snippets, especially little pieces of SIMD logic. People often go looking for the best way to accomplish some small, simple task, and there are usually multiple solutions, so it might be nice to catalogue them all in one place and rank them by performance testing and compatibility (SSE2, AVX, etc.).
It's also a good opportunity to ensure that anything we've got lying around and are using is as tight as possible.

As a starter from me:

SIMD FLOOR (Float to Dword) : (XMM0 -> EAX)

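sngMinusOneHalf REAL4 -0.5
	movss xmm0,val              ; xmm0 = x
	addss xmm0,xmm0             ; xmm0 = 2x
	addss xmm0,sngMinusOneHalf  ; xmm0 = 2x - 0.5
	cvtss2si eax,xmm0           ; eax = round(2x - 0.5), nearest-even under the default MXCSR
	sar eax,1                   ; eax = floor(x)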

SIMD CEILING (Float to Dword) : (XMM0 -> EAX)

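sngMinusOneHalf REAL4 -0.5
	movss xmm0,val              ; xmm0 = x
	addss xmm0,xmm0             ; xmm0 = 2x
	movss xmm1,sngMinusOneHalf  ; xmm1 = -0.5
	subss xmm1,xmm0             ; xmm1 = -0.5 - 2x
	cvtss2si eax,xmm1           ; eax = round(-0.5 - 2x)
	sar eax,1                   ; eax = floor(-x)
	neg eax                     ; eax = ceil(x)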
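
For anyone who wants to sanity-check the floor and ceiling tricks above outside an assembler, here is a minimal scalar sketch in C. The function names are made up for illustration, and `round_nearest_even` only stands in for CVTSS2SI under the default rounding mode by way of a magic-constant trick that is valid for |v| < 2^22. It is a model of the arithmetic, not a drop-in replacement.

```c
#include <stdint.h>

/* Round to nearest integer, ties to even, like CVTSS2SI with the default
   MXCSR rounding mode. Adding 1.5 * 2^23 moves the value into a range where
   one float ULP is exactly 1.0, so the addition itself performs the rounding.
   Assumption: only valid for |v| < 2^22. */
static int32_t round_nearest_even(float v)
{
    const float magic = 12582912.0f;   /* 1.5 * 2^23 */
    volatile float t = v + magic;      /* volatile: force the sum to float precision */
    return (int32_t)(t - magic);
}

/* floor(x) == round(2x - 0.5) >> 1 */
static int32_t floor_via_round(float x)
{
    int32_t r = round_nearest_even(x + x + -0.5f);
    return r >> 1;                     /* assumes an arithmetic shift, like SAR */
}

/* ceil(x) == -(round(-0.5 - 2x) >> 1) */
static int32_t ceil_via_round(float x)
{
    int32_t r = round_nearest_even(-0.5f - (x + x));
    return -(r >> 1);
}
```

The doubling is what makes this work. For x in [n, n+1), 2x - 0.5 lies in [2n - 0.5, 2n + 1.5), which rounds to 2n or 2n + 1 (the tie at 2n - 0.5 goes to the even value 2n), and the shift then drops the low bit. This ignores float rounding at large magnitudes.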

SIMD 4 packed ARGB pixels to normalized floating point.
[RDI] -> (xmm0, xmm1, xmm2, xmm3)

INT2FPCOL     __m128f < 0.00392157, 0.00392157, 0.00392157, 0.00392157 >        ; [0 - 255] -> [0.0 - 1.0] colour conversion factor.

    vmovdqa xmm0,[rdi]          ; xmm0 = | argb4 | argb3 | argb2 | argb1 |
    
    vpshufd xmm1,xmm0,00000001b ; xmm1 = | - | - | - | argb2 |
    vpshufd xmm2,xmm0,00000010b ; xmm2 = | - | - | - | argb3 |
    vpshufd xmm3,xmm0,00000011b ; xmm3 = | - | - | - | argb4 |

    vmovaps xmm4,INT2FPCOL     
    
    vpmovzxbd xmm0,xmm0
    vpmovzxbd xmm1,xmm1
    vcvtdq2ps xmm0,xmm0         ; xmm0 = | (float)a1 | (float)r1 | (float)g1 | (float)b1 | 
    vcvtdq2ps xmm1,xmm1         ; xmm1 = | (float)a2 | (float)r2 | (float)g2 | (float)b2 | 
    vpmovzxbd xmm2,xmm2
    vpmovzxbd xmm3,xmm3
    vcvtdq2ps xmm2,xmm2         ; xmm2 = | (float)a3 | (float)r3 | (float)g3 | (float)b3 | 
    vcvtdq2ps xmm3,xmm3         ; xmm3 = | (float)a4 | (float)r4 | (float)g4 | (float)b4 | 

    vmulps xmm0,xmm0,xmm4
    vmulps xmm1,xmm1,xmm4
    vmulps xmm2,xmm2,xmm4
    vmulps xmm3,xmm3,xmm4

SIMD 4 normalized floating point pixels to packed ARGB.
(xmm0, xmm1, xmm2, xmm3) -> [RDI]

FP2INTCOL     __m128f < 255.0, 255.0, 255.0, 255.0 >                                            ; [0.0 - 1.0] -> [0 - 255] colour conversion factor.   

    vmovaps xmm4,FP2INTCOL     
    vmulps xmm0,xmm0,xmm4
    vmulps xmm1,xmm1,xmm4
    vmulps xmm2,xmm2,xmm4
    vmulps xmm3,xmm3,xmm4

    vcvttps2dq xmm0,xmm0
    vcvttps2dq xmm1,xmm1
    vpackssdw xmm0,xmm0,xmm1
    vcvttps2dq xmm2,xmm2
    vcvttps2dq xmm3,xmm3   
    vpackssdw xmm2,xmm2,xmm3
    vpackuswb xmm0,xmm0,xmm2
    vmovdqa [rdi],xmm0
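
To make the saturation behaviour of the pack sequence above easier to see, here is a rough scalar model in C for one pixel. The names `pack_channel` and `pack_argb` are hypothetical, and the sketch assumes the scaled value fits in a 32-bit integer (the real VCVTTPS2DQ returns 0x80000000 on overflow, which this does not model).

```c
#include <stdint.h>

/* One channel: scale by 255, truncate toward zero (VCVTTPS2DQ), then
   saturate twice: to int16 (VPACKSSDW) and to uint8 (VPACKUSWB).
   Assumption: v * 255 fits in int32. */
static uint8_t pack_channel(float v)
{
    int32_t i = (int32_t)(v * 255.0f);  /* C float-to-int casts truncate */
    if (i > 32767)  i = 32767;          /* signed word saturation */
    if (i < -32768) i = -32768;
    if (i > 255)    i = 255;            /* unsigned byte saturation */
    if (i < 0)      i = 0;
    return (uint8_t)i;
}

/* lanes[] is in register lane order, low to high: b, g, r, a */
static uint32_t pack_argb(const float lanes[4])
{
    return  (uint32_t)pack_channel(lanes[0])
         | ((uint32_t)pack_channel(lanes[1]) << 8)
         | ((uint32_t)pack_channel(lanes[2]) << 16)
         | ((uint32_t)pack_channel(lanes[3]) << 24);
}
```

Because the conversion truncates, 0.5 maps to 127 rather than 128, and out-of-range inputs such as 1.5 or -0.2 clamp to 255 and 0 without any explicit min/max.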

SIMD PACKED FLOAT ABSOLUTE VALUE (without a mask)

vpslld xmm0,xmm0,1          ; shift the sign bit out of every lane
vpsrld xmm0,xmm0,1          ; shift back, filling bit 31 with zero

SIMD PACKED FLOAT ABSOLUTE VALUE (using a mask outside a loop)

vpcmpeqd xmm5,xmm5,xmm5     ; xmm5 = all ones
vpsrld xmm5,xmm5,1          ; xmm5 = 0x7fffffff in every lane
; loop here...
vandps xmm0, xmm0, xmm5     ; clear the sign bits
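
Both absolute-value variants above do the same thing to the IEEE-754 bit pattern, which a small scalar sketch in C can show for a single lane. The function names are hypothetical and the code assumes 32-bit IEEE-754 floats.

```c
#include <stdint.h>
#include <string.h>

/* Shift variant: push the sign bit out the top, then shift zero back in. */
static float abs_by_shift(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);     /* reinterpret the float as raw bits */
    bits = (bits << 1) >> 1;
    memcpy(&x, &bits, sizeof bits);
    return x;
}

/* Mask variant: AND with 0x7fffffff, the constant built by VPCMPEQD + VPSRLD. */
static float abs_by_mask(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits &= 0x7FFFFFFFu;
    memcpy(&x, &bits, sizeof bits);
    return x;
}
```

The mask version costs one instruction per iteration instead of two, at the price of tying up a register for the constant.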

SINGLE DWORD or QWORD absolute value in GPRs

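;32bit (clobbers edx)
cdq             ; edx = 0 if eax >= 0, else -1 (sign of eax replicated)
xor eax, edx    ; flip all bits if negative
sub eax, edx    ; add 1 if negative, completing the two's-complement negate

;64bit (clobbers rdx)
cqo             ; rdx = 0 if rax >= 0, else -1
xor rax,rdx
sub rax,rdx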
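
The same sign-mask idea can be written in C as a small sketch. The function name is hypothetical, and the arithmetic is done on unsigned values so the most negative input does not trigger signed overflow.

```c
#include <stdint.h>

/* Branchless absolute value using the CDQ / XOR / SUB pattern.
   mask plays the role of EDX: all ones for negative x, zero otherwise. */
static uint32_t abs_branchless(int32_t x)
{
    uint32_t mask = (x < 0) ? 0xFFFFFFFFu : 0u;
    return ((uint32_t)x ^ mask) - mask;  /* (x XOR mask) - mask */
}
```

Note that for the most negative value the register result is still 0x80000000, which is only a correct magnitude if it is read as unsigned.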

PACKED INTEGER ABS

; Just use VPABSD or equivalents (VPABSB / VPABSW; the legacy PABSx forms need SSSE3).

Feel free to add to this list or suggest improved versions.

johnsa · February 16, 2018, 10:58:31 PM

Two more...

SIMD PACKED SINE and COSINE approximations:
Input range is -Pi to Pi. This produces more accurate results than a Taylor series expansion and is considerably faster:

PI   REAL4 3.1415926535898
PI_2 REAL4 1.5707963267949
PIx2 REAL4 6.2831853071796

align 16
P       __m128f < 0.225, 0.225, 0.225, 0.225 >
B1      __m128f < 1.2732395447351626861510701069801, 1.2732395447351626861510701069801, 1.2732395447351626861510701069801, 1.2732395447351626861510701069801 >     ; 4/pi
C1      __m128f < -0.40528473456935108577551785283891, -0.40528473456935108577551785283891, -0.40528473456935108577551785283891, -0.40528473456935108577551785283891 > ;-4/(pi*pi)
ABSMASK __m128i < 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff >
VPI     __m128f < 3.1415926535898, 3.1415926535898, 3.1415926535898, 3.1415926535898 >
VPI_2   __m128f < 1.5707963267949, 1.5707963267949, 1.5707963267949, 1.5707963267949 >
VPIx2   __m128f < 6.2831853071796, 6.2831853071796, 6.2831853071796, 6.2831853071796 >

FSINE MACRO
;// Parabola
vandps xmm1, xmm0, ABSMASK
vmulps xmm1, xmm1, xmm0
vmulps xmm0, xmm0, B1
vmulps xmm1, xmm1, C1
vaddps xmm0, xmm0, xmm1
;// Extra precision
vandps xmm1, xmm0, ABSMASK
vmulps xmm1, xmm1, xmm0
vsubps xmm1, xmm1, xmm0
vmulps xmm1, xmm1, P
vaddps xmm0, xmm0, xmm1
ENDM

FCOSS MACRO
;// cos(x) = sin(x + pi/2)
vaddps xmm0, xmm0, VPI_2
vcmpnltps xmm1, xmm0, VPI
vandps xmm1, xmm1, VPIx2
vsubps xmm0, xmm0, xmm1
;// Parabola
vandps xmm1, xmm0, ABSMASK
vmulps xmm1, xmm1, xmm0
vmulps xmm0, xmm0, B1
vmulps xmm1, xmm1, C1
vaddps xmm0, xmm0, xmm1
;// Extra precision
vandps xmm1, xmm0, ABSMASK
vmulps xmm1, xmm1, xmm0
vsubps xmm1, xmm1, xmm0
vmulps xmm1, xmm1, P
vaddps xmm0, xmm0, xmm1
ENDM
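
For checking the macros above against known values, here is a scalar sketch of the same arithmetic in C. The function names are hypothetical; the constants are the ones from the data block (4/pi, -4/(pi*pi) and 0.225), and the input is assumed to already be in the range -Pi to Pi.

```c
/* Local absolute value so the sketch needs no math library. */
static float abs_f(float v) { return v < 0.0f ? -v : v; }

/* Parabolic sine approximation, same steps as FSINE.
   Assumption: x is already in [-pi, pi]. */
static float fast_sin(float x)
{
    const float B = 1.27323954f;    /* 4/pi        */
    const float C = -0.40528473f;   /* -4/(pi*pi)  */
    const float P = 0.225f;

    float y = B * x + C * x * abs_f(x);   /* parabola        */
    return P * (y * abs_f(y) - y) + y;    /* extra precision */
}

/* cos(x) = sin(x + pi/2), wrapped back into [-pi, pi] like FCOSS. */
static float fast_cos(float x)
{
    const float PI_F = 3.14159265f;
    x += 0.5f * PI_F;
    if (!(x < PI_F))        /* same predicate as VCMPNLTPS: not less than */
        x -= 2.0f * PI_F;
    return fast_sin(x);
}
```

Spot checks at 0, pi/6 and pi/2 land within roughly 0.001 of the true values, which is about the accuracy this family of approximations is usually quoted at.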

It would be good to add decent approximations for log/exp/pow/tan/atan too, if anyone has anything lying around.

HSE · February 16, 2018, 11:52:14 PM

Hi Johnsa!

This is an old but very interesting library. There are newer libraries, but they require 64-bit for installation (there was a thread about this not so long ago). Regards.

The MASM Forum
