Code Snippets

Started by johnsa, February 14, 2018, 08:22:52 PM

johnsa

I thought it might be useful (fun?) to start a sticky board of all our favourite code snippets, especially little pieces of SIMD logic. People often go looking for the best way to accomplish some small, simple task, and there are usually multiple solutions, so it might be nice to catalogue them all in one place and rank them by measured performance and compatibility (SSE2, AVX, etc.).
It's also a good opportunity to make sure that anything we've got lying around and are using is as tight as possible.

As a starter from me:

SIMD FLOOR (Float to Dword) : (XMM0 -> EAX)

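sngMinusOneHalf REAL4 -0.5      ; data, declare once in .data / .const
movss xmm0,val                  ; xmm0 = x
addss xmm0,xmm0                 ; xmm0 = 2x
addss xmm0,sngMinusOneHalf      ; xmm0 = 2x - 0.5
cvtss2si eax,xmm0               ; eax = round(2x - 0.5), assumes default MXCSR rounding (nearest even)
sar eax,1                       ; eax = floor(x)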


SIMD CEILING (Float to Dword) : (XMM0 -> EAX)

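sngMinusOneHalf REAL4 -0.5      ; same constant as in the floor snippet, declare it only once
movss xmm0,val                  ; xmm0 = x
addss xmm0,xmm0                 ; xmm0 = 2x
movss xmm1,sngMinusOneHalf      ; xmm1 = -0.5
subss xmm1,xmm0                 ; xmm1 = -0.5 - 2x
cvtss2si eax,xmm1               ; eax = round(-0.5 - 2x), assumes default MXCSR rounding (nearest even)
sar eax,1                       ; eax = floor(-x)
neg eax                         ; eax = ceil(x)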
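
For anyone who wants to sanity-check the floor and ceiling tricks away from the target machine, here is a minimal Python sketch of the same arithmetic. It is only a model, not the original code: `f32`, `cvtss2si`, `simd_floor` and `simd_ceil` are hypothetical helper names, and it assumes MXCSR is left in its default round-to-nearest-even mode, which Python's `round()` happens to match.

```python
import math
import struct

def f32(x):
    """Round a Python float to the nearest IEEE-754 single (like a REAL4)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def cvtss2si(x):
    """Model CVTSS2SI under the assumed default mode: nearest, ties to even."""
    return round(x)  # Python's round() on a float also rounds ties to even

def simd_floor(val):
    t = f32(f32(val) + f32(val))   # addss xmm0,xmm0            -> 2x
    t = f32(t + -0.5)              # addss xmm0,sngMinusOneHalf -> 2x - 0.5
    return cvtss2si(t) >> 1        # cvtss2si + sar eax,1 (arithmetic shift)

def simd_ceil(val):
    t = f32(f32(val) + f32(val))   # addss xmm0,xmm0            -> 2x
    t = f32(-0.5 - t)              # subss xmm1,xmm0            -> -0.5 - 2x
    return -(cvtss2si(t) >> 1)     # cvtss2si + sar eax,1 + neg eax

for x in (-2.5, -0.25, 0.0, 0.5, 1.0, 2.75):
    print(x, simd_floor(x), simd_ceil(x))
```

The doubling is what makes this work: for x in [k, k+1) the value 2x - 0.5 always rounds to 2k or 2k+1, so the shift recovers k. The price is range, since the doubled value still has to fit in a signed dword.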


SIMD 4 Packed ARGB pixels to 4 normalized floating point. [ARGB|ARGB|ARGB|ARGB] (0 .. 255) -> [A|R|G|B] (0.0 .. 1.0) x 4
[RDI] -> (xmm0, xmm1, xmm2, xmm3)

   
INT2FPCOL     __m128f < 0.00392157, 0.00392157, 0.00392157, 0.00392157 >        ; [0 - 255] -> [0.0 - 1.0] colour conversion factor.

    vmovdqa xmm0,[rdi]          ; xmm0 = | argb4 | argb3 | argb2 | argb1 | (lane 3 .. lane 0, [rdi] must be 16-byte aligned)
   
    vpshufd xmm1,xmm0,00000001b ; xmm1 = | - | - | - | argb2 |
    vpshufd xmm2,xmm0,00000010b ; xmm2 = | - | - | - | argb3 |
    vpshufd xmm3,xmm0,00000011b ; xmm3 = | - | - | - | argb4 |

    vmovaps xmm4,INT2FPCOL     
   
    vpmovzxbd xmm0,xmm0
    vpmovzxbd xmm1,xmm1
    vcvtdq2ps xmm0,xmm0         ; xmm0 = | (float)a1 | (float)r1 | (float)g1 | (float)b1 |
    vcvtdq2ps xmm1,xmm1         ; xmm1 = | (float)a2 | (float)r2 | (float)g2 | (float)b2 |
    vpmovzxbd xmm2,xmm2
    vpmovzxbd xmm3,xmm3
    vcvtdq2ps xmm2,xmm2         ; xmm2 = | (float)a3 | (float)r3 | (float)g3 | (float)b3 |
    vcvtdq2ps xmm3,xmm3         ; xmm3 = | (float)a4 | (float)r4 | (float)g4 | (float)b4 |

    vmulps xmm0,xmm0,xmm4
    vmulps xmm1,xmm1,xmm4
    vmulps xmm2,xmm2,xmm4
    vmulps xmm3,xmm3,xmm4
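
A scalar model can help when checking lane order. The sketch below is an assumption-laden Python illustration, not the original code: `unpack_argb` and `argb_to_floats` are hypothetical names, and each pixel is assumed to be a little-endian 0xAARRGGBB dword, so lane 0 of each result holds blue and lane 3 holds alpha.

```python
INT2FPCOL = 0.00392157  # same rounded 1/255 factor as the snippet above

def unpack_argb(pixel):
    """Split one 0xAARRGGBB dword into its bytes in memory order (B, G, R, A).

    This is the lane order VPMOVZXBD produces: lane 0 = lowest byte.
    """
    return [(pixel >> shift) & 0xFF for shift in (0, 8, 16, 24)]

def argb_to_floats(pixels):
    """Model the VPSHUFD / VPMOVZXBD / VCVTDQ2PS / VMULPS chain per pixel."""
    return [[byte * INT2FPCOL for byte in unpack_argb(p)] for p in pixels]

# One opaque pixel with R=0x80, G=0x40, B=0x00, printed as [B, G, R, A].
print(argb_to_floats([0xFF804000]))
```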



SIMD 4 normalized floating point pixels to packed ARGB.
(xmm0, xmm1, xmm2, xmm3) -> [RDI]


FP2INTCOL     __m128f < 255.0, 255.0, 255.0, 255.0 >                                            ; [0.0 - 1.0] -> [0 - 255] colour conversion factor.   

    vmovaps xmm4,FP2INTCOL     
    vmulps xmm0,xmm0,xmm4
    vmulps xmm1,xmm1,xmm4
    vmulps xmm2,xmm2,xmm4
    vmulps xmm3,xmm3,xmm4

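    vcvttps2dq xmm0,xmm0        ; float -> signed dword, truncating toward zero
    vcvttps2dq xmm1,xmm1
    vpackssdw xmm0,xmm0,xmm1    ; pixels 1,2: dwords -> words (signed saturation)
    vcvttps2dq xmm2,xmm2
    vcvttps2dq xmm3,xmm3
    vpackssdw xmm2,xmm2,xmm3    ; pixels 3,4: dwords -> words (signed saturation)
    vpackuswb xmm0,xmm0,xmm2    ; words -> bytes (unsigned saturation clamps to 0 .. 255)
    vmovdqa [rdi],xmm0          ; store 4 packed ARGB pixels ([rdi] must be 16-byte aligned)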
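
The truncation and the two saturating packs are easy to get wrong, so here is a small Python sketch of what the sequence does to each pixel. It is a model under stated assumptions only: `saturate` and `floats_to_argb` are hypothetical names, and each pixel is given as four floats in lane order (B, G, R, A).

```python
def saturate(value, lo, hi):
    """Clamp value into [lo, hi], as the saturating pack instructions do."""
    return max(lo, min(hi, value))

def floats_to_argb(pixels):
    """pixels: list of [B, G, R, A] float lists (lane 0 first) -> dwords."""
    out = []
    for lanes in pixels:
        dwords = [int(c * 255.0) for c in lanes]              # vmulps + vcvttps2dq (truncate)
        words = [saturate(d, -32768, 32767) for d in dwords]  # vpackssdw
        b = [saturate(w, 0, 255) for w in words]              # vpackuswb
        out.append(b[0] | b[1] << 8 | b[2] << 16 | b[3] << 24)
    return out

print([hex(p) for p in floats_to_argb([[0.0, 0.25, 0.5, 1.0]])])
print([hex(p) for p in floats_to_argb([[-0.5, 2.0, 1.0, 0.0]])])
```

Two things fall out of the model: out-of-range inputs are clamped for free by the packs, and truncation means 0.5 maps to 127 rather than 128. Swapping in `vcvtps2dq` would round instead, if that bias matters for a given use.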



SIMD PACKED FLOAT ABSOLUTE VALUE (without a mask)

vpslld xmm0,xmm0,1              ; shift the sign bit out of each dword
vpsrld xmm0,xmm0,1              ; logical shift back, sign bit is now 0


SIMD PACKED FLOAT ABSOLUTE  VALUE (using a mask outside a loop)

vpcmpeqd xmm5,xmm5,xmm5         ; xmm5 = all ones
vpsrld xmm5,xmm5,1              ; xmm5 = 0x7fffffff in each dword
; loop here...
vandps xmm0, xmm0, xmm5         ; clear the sign bit of each float
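
Both variants just clear bit 31 of each single-precision lane. A quick way to convince yourself is to model one lane in Python. This is a sketch with hypothetical helper names (`float_bits`, `bits_float`, `abs_shift`, `abs_mask`), not part of the original snippets.

```python
import struct

def float_bits(x):
    """Return the 32-bit pattern of x stored as an IEEE-754 single."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_float(b):
    """Reinterpret a 32-bit pattern as an IEEE-754 single."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def abs_shift(x):
    b = float_bits(x)
    b = (b << 1) & 0xFFFFFFFF   # vpslld xmm0,xmm0,1 (sign bit falls off)
    b >>= 1                     # vpsrld xmm0,xmm0,1
    return bits_float(b)

def abs_mask(x):
    mask = 0xFFFFFFFF >> 1      # vpcmpeqd + vpsrld -> 0x7FFFFFFF
    return bits_float(float_bits(x) & mask)   # vandps

for x in (-1.5, 2.25, -0.0):
    print(x, abs_shift(x), abs_mask(x))
```

The mask version costs one instruction inside the loop instead of two, at the price of tying up a register for the constant.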


SINGLE DWORD or QWORD absolute value in GPRS

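;32bit (clobbers edx)
cdq                             ; edx = 0 if eax >= 0, else -1
xor eax, edx                    ; flip all bits if negative
sub eax, edx                    ; add 1 if negative (two's complement negate)

;64bit (clobbers rdx)
cqo                             ; rdx = 0 if rax >= 0, else -1
xor rax,rdx
sub rax,rdx                     ; the most negative value has no positive counterpart and comes back unchanged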
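
The xor/sub pair is a conditional two's complement negate. The Python sketch below models it with explicit register widths. `abs_gpr` is a hypothetical name, and values are passed and returned as unsigned bit patterns, which is an assumption of the model rather than anything in the original.

```python
def abs_gpr(value, bits=32):
    """Model cdq/cqo + xor + sub on a register that is `bits` wide."""
    mask = (1 << bits) - 1
    acc = value & mask
    sign = mask if acc >> (bits - 1) else 0   # cdq / cqo: 0 or all ones
    acc ^= sign                               # xor eax,edx
    acc = (acc - sign) & mask                 # sub eax,edx
    return acc

print(abs_gpr(-5))                # 5
print(abs_gpr(7))                 # 7
print(hex(abs_gpr(0x80000000)))   # 0x80000000
```

The last line shows the one edge case: the most negative value wraps back to itself, exactly as `neg` would.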


PACKED INTEGER ABS

; Just use VPABSD or equivalents.


Feel free to add to this list or suggest improved versions.

johnsa

Two more ..

SIMD PACKED SINE and COSINE approximations:
Valid for inputs in the range -Pi to Pi. Over that range it produces more accurate results than a short Taylor series expansion, and it is considerably faster:



PI   REAL4 3.1415926535898
PI_2 REAL4 1.5707963267949
PIx2 REAL4 6.2831853071796

align 16
P       __m128f < 0.225, 0.225, 0.225, 0.225 >
B1      __m128f < 1.2732395447351626861510701069801, 1.2732395447351626861510701069801, 1.2732395447351626861510701069801, 1.2732395447351626861510701069801 >     ; 4/pi
C1      __m128f < -0.40528473456935108577551785283891, -0.40528473456935108577551785283891, -0.40528473456935108577551785283891, -0.40528473456935108577551785283891 > ;-4/(pi*pi)
ABSMASK __m128i < 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff >
VPI     __m128f < 3.1415926535898, 3.1415926535898, 3.1415926535898, 3.1415926535898 >
VPI_2   __m128f < 1.5707963267949, 1.5707963267949, 1.5707963267949, 1.5707963267949 >
VPIx2   __m128f < 6.2831853071796, 6.2831853071796, 6.2831853071796, 6.2831853071796 >

FSINE MACRO
;// Parabola
vandps xmm1, xmm0, ABSMASK
vmulps xmm1, xmm1, xmm0
vmulps xmm0, xmm0, B1
vmulps xmm1, xmm1, C1
vaddps xmm0, xmm0, xmm1
;// Extra precision
vandps xmm1, xmm0, ABSMASK
vmulps xmm1, xmm1, xmm0
vsubps xmm1, xmm1, xmm0
vmulps xmm1, xmm1, P
vaddps xmm0, xmm0, xmm1
ENDM

FCOSS MACRO
;// cos(x) = sin(x + pi/2)
vaddps xmm0, xmm0, VPI_2
vcmpnltps xmm1, xmm0, VPI
vandps xmm1, xmm1, VPIx2
vsubps xmm0, xmm0, xmm1
;// Parabola
vandps xmm1, xmm0, ABSMASK
vmulps xmm1, xmm1, xmm0
vmulps xmm0, xmm0, B1
vmulps xmm1, xmm1, C1
vaddps xmm0, xmm0, xmm1
;// Extra precision
vandps xmm1, xmm0, ABSMASK
vmulps xmm1, xmm1, xmm0
vsubps xmm1, xmm1, xmm0
vmulps xmm1, xmm1, P
vaddps xmm0, xmm0, xmm1
ENDM
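
To get a feel for the accuracy, the two macros can be modelled one lane at a time in Python. This is only a sketch: `fsine` and `fcos` are hypothetical names, it runs in double precision rather than single, and it assumes every lane of P is meant to be 0.225.

```python
import math

B = 4.0 / math.pi                   # B1 in the macros
C = -4.0 / (math.pi * math.pi)      # C1 in the macros
P = 0.225                           # assumed value for every lane of P

def fsine(x):
    """Parabola plus one correction step, for x in [-pi, pi]."""
    y = B * x + C * x * abs(x)          # parabola
    return y + P * (y * abs(y) - y)     # extra precision

def fcos(x):
    """cos(x) = sin(x + pi/2), wrapped back into [-pi, pi)."""
    x += math.pi / 2
    if x >= math.pi:                    # vcmpnltps: not less than pi
        x -= 2.0 * math.pi
    return fsine(x)

xs = [-math.pi + i * (2.0 * math.pi / 2000) for i in range(2001)]
print(max(abs(fsine(x) - math.sin(x)) for x in xs))
print(max(abs(fcos(x) - math.cos(x)) for x in xs))
```

On this grid the worst-case absolute error comes out a little above 0.001 for both functions, which is a useful baseline when comparing against other approximations.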



It would be good to add decent approximations for log/exp/pow/tan/atan too, if anyone has anything lying around.

HSE

Hi Johnsa!

This is an old but very interesting library. There are newer libraries, but they require 64-bit for installation (there was a thread about that not so long ago). Regards.
Equations in Assembly: SmplMath