I thought it might be useful (fun?) to start a sticky board of all our favourite code snippets, especially little pieces of SIMD logic. People often go looking for the best way to accomplish some small, simple task, and there are usually multiple solutions, so it might be nice to catalogue them all in one place and rank them based on performance testing and compatibility (SSE2 / AVX etc).
It's also a good opportunity to ensure that anything we've got lying around and are using is as tight as possible.
As a starter from me:
SIMD FLOOR (Float to Dword) : (XMM0 -> EAX)
sngMinusOneHalf REAL4 -0.5
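movss xmm0,val ; xmm0 = x
addss xmm0,xmm0 ; xmm0 = 2x
addss xmm0,sngMinusOneHalf ; xmm0 = 2x - 0.5
cvtss2si eax,xmm0 ; eax = round(2x - 0.5), MXCSR rounding mode (nearest-even by default)
sar eax,1 ; eax = floor(x)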
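For anyone who wants to sanity-check the floor trick outside an assembler, here is a scalar C sketch of the same arithmetic. It assumes the default round-to-nearest-even mode (what cvtss2si uses unless MXCSR has been changed), two's complement integers, and inputs small enough that 2x fits in a signed 32-bit integer. The helper names are hypothetical and not part of the snippet above.

```c
/* Round to nearest, ties to even: models cvtss2si under the default
   MXCSR rounding mode. Only meant for values well inside the int range. */
static int round_half_even(float v)
{
    int t = (int)v;               /* truncates toward zero */
    float frac = v - (float)t;    /* signed fractional part */
    if (frac > 0.5f || (frac == 0.5f && (t & 1)))
        t++;
    else if (frac < -0.5f || (frac == -0.5f && (t & 1)))
        t--;
    return t;
}

/* Models sar eax,1: division by two that rounds toward minus infinity. */
static int sar1(int r)
{
    return (r - (r & 1)) / 2;
}

/* floor(x) computed as round(2x - 0.5) >> 1. */
static int floor_trick(float x)
{
    return sar1(round_half_even(x + x - 0.5f));
}
```

Doubling the input first means the -0.5 bias never lands an exact integer on a rounding tie that goes the wrong way, which is why the result is shifted back down afterwards.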
SIMD CEILING (Float to Dword) : (XMM0 -> EAX)
sngMinusOneHalf REAL4 -0.5
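movss xmm0,val ; xmm0 = x
addss xmm0,xmm0 ; xmm0 = 2x
movss xmm1,sngMinusOneHalf ; xmm1 = -0.5
subss xmm1,xmm0 ; xmm1 = -0.5 - 2x
cvtss2si eax,xmm1 ; eax = round(-2x - 0.5), MXCSR rounding mode (nearest-even by default)
sar eax,1 ; eax = floor(-x)
neg eax ; eax = ceil(x)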
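The ceiling version is the floor trick applied to -x and then negated. A scalar C sketch under the same assumptions (default round-to-nearest-even, two's complement, 2x within the signed 32-bit range) might look like this; the helper names are hypothetical.

```c
/* Round to nearest, ties to even: models cvtss2si under the default
   MXCSR rounding mode. Only meant for values well inside the int range. */
static int round_half_even(float v)
{
    int t = (int)v;               /* truncates toward zero */
    float frac = v - (float)t;    /* signed fractional part */
    if (frac > 0.5f || (frac == 0.5f && (t & 1)))
        t++;
    else if (frac < -0.5f || (frac == -0.5f && (t & 1)))
        t--;
    return t;
}

/* Models sar eax,1: division by two that rounds toward minus infinity. */
static int sar1(int r)
{
    return (r - (r & 1)) / 2;
}

/* ceil(x) = -floor(-x) = -(round(-0.5 - 2x) >> 1). */
static int ceil_trick(float x)
{
    return -sar1(round_half_even(-0.5f - (x + x)));
}
```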
SIMD 4 Packed ARGB pixels to 4 normalized floating point pixels. [ARGB|ARGB|ARGB|ARGB] (0 .. 255) -> [A|R|G|B] (0.0 .. 1.0) x 4
[RDI] -> (xmm0, xmm1, xmm2, xmm3)
INT2FPCOL __m128f < 0.00392157, 0.00392157, 0.00392157, 0.00392157 > ; [0 - 255] -> [0.0 - 1.0] colour conversion factor.
vmovdqa xmm0,[rdi] ; xmm0 = | argb1 | argb2 | argb3 | argb4 |
vpshufd xmm1,xmm0,00000001b ; xmm1 = | - | - | - | argb2 |
vpshufd xmm2,xmm0,00000010b ; xmm2 = | - | - | - | argb3 |
vpshufd xmm3,xmm0,00000011b ; xmm3 = | - | - | - | argb4 |
vmovaps xmm4,INT2FPCOL
vpmovzxbd xmm0,xmm0
vpmovzxbd xmm1,xmm1
vcvtdq2ps xmm0,xmm0 ; xmm0 = | (float)a1 | (float)r1 | (float)g1 | (float)b1 |
vcvtdq2ps xmm1,xmm1 ; xmm1 = | (float)a2 | (float)r2 | (float)g2 | (float)b2 |
vpmovzxbd xmm2,xmm2
vpmovzxbd xmm3,xmm3
vcvtdq2ps xmm2,xmm2 ; xmm2 = | (float)a3 | (float)r3 | (float)g3 | (float)b3 |
vcvtdq2ps xmm3,xmm3 ; xmm3 = | (float)a4 | (float)r4 | (float)g4 | (float)b4 |
vmulps xmm0,xmm0,xmm4
vmulps xmm1,xmm1,xmm4
vmulps xmm2,xmm2,xmm4
vmulps xmm3,xmm3,xmm4
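As a reference for checking the lane order, here is a scalar C sketch of what happens to a single pixel: each byte is zero-extended, converted to float and scaled by roughly 1/255. It assumes a pixel stored as a 0xAARRGGBB dword; the function name is hypothetical.

```c
#include <stdint.h>

/* ~1/255, the same factor as INT2FPCOL. */
static const float INT2FP = 0.00392157f;

/* Lane 0 is the lowest byte of the pixel. For a 0xAARRGGBB dword that is
   B, then G, R, A, which is the order vpmovzxbd leaves them in. */
static float argb_channel(uint32_t argb, int lane)
{
    uint32_t byte = (argb >> (8 * lane)) & 0xFFu;  /* zero-extend one byte */
    return (float)byte * INT2FP;                   /* int -> float, scale  */
}
```

Calling argb_channel(pixel, lane) for lanes 0 to 3 gives the four floats that end up in one XMM register, lowest lane first.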
SIMD 4 normalized floating point pixels to packed ARGB.
(xmm0, xmm1, xmm2, xmm3) -> [RDI]
FP2INTCOL __m128f < 255.0, 255.0, 255.0, 255.0 > ; [0.0 - 1.0] -> [0 - 255] colour conversion factor.
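vmovaps xmm4,FP2INTCOL
vmulps xmm0,xmm0,xmm4 ; scale 0.0 .. 1.0 up to 0.0 .. 255.0
vmulps xmm1,xmm1,xmm4
vmulps xmm2,xmm2,xmm4
vmulps xmm3,xmm3,xmm4
vcvttps2dq xmm0,xmm0 ; float -> dword, truncating
vcvttps2dq xmm1,xmm1
vpackssdw xmm0,xmm0,xmm1 ; pixels 1-2: dwords -> words, signed saturation
vcvttps2dq xmm2,xmm2
vcvttps2dq xmm3,xmm3
vpackssdw xmm2,xmm2,xmm3 ; pixels 3-4: dwords -> words, signed saturation
vpackuswb xmm0,xmm0,xmm2 ; words -> bytes, unsigned saturation (all 4 pixels)
vmovdqa [rdi],xmm0 ; store, [rdi] must be 16-byte aligned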
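A scalar C sketch of the packing step for one pixel, to make the truncation and saturation behaviour explicit. It assumes finite inputs of moderate size (a float-to-int cast is undefined in C once the value is outside the int range) and lane 0 holding the blue channel; the function names are hypothetical.

```c
#include <stdint.h>

/* Clamp to 0..255. vpackssdw followed by vpackuswb has this net effect
   on each dword lane. */
static uint32_t saturate_u8(int32_t v)
{
    if (v < 0)   return 0u;
    if (v > 255) return 255u;
    return (uint32_t)v;
}

/* One channel: scale, truncate toward zero (as vcvttps2dq does), saturate. */
static uint32_t channel_to_u8(float c)
{
    return saturate_u8((int32_t)(c * 255.0f));
}

/* Lane 0 (b) ends up in the lowest byte, giving 0xAARRGGBB. */
static uint32_t pack_argb(float b, float g, float r, float a)
{
    return channel_to_u8(b)
         | (channel_to_u8(g) << 8)
         | (channel_to_u8(r) << 16)
         | (channel_to_u8(a) << 24);
}
```

Note that truncation maps 0.5 to 127 rather than 128. If rounding is preferred, vcvtps2dq (which follows the MXCSR rounding mode) could be tried in place of vcvttps2dq, at the same cost.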
SIMD PACKED FLOAT ABSOLUTE VALUE (without a mask)
vpslld xmm0,xmm0,1 ; shift the sign bit out of each dword
vpsrld xmm0,xmm0,1 ; shift a zero back in, sign bit now clear
SIMD PACKED FLOAT ABSOLUTE VALUE (using a mask outside a loop)
vpcmpeqd xmm5,xmm5,xmm5 ; xmm5 = all ones
vpsrld xmm5,xmm5,1 ; xmm5 = 0x7fffffff in each dword
; loop here...
vandps xmm0, xmm0, xmm5
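Both float absolute value variants do the same thing: clear bit 31 of each IEEE 754 single. A scalar C sketch of the two forms, assuming 32-bit IEEE floats (the helper names are hypothetical):

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a float as its raw 32-bit pattern and back. */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

static float bits_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Shift form: vpslld by 1, then vpsrld by 1. */
static float abs_shift(float x)
{
    return bits_float((float_bits(x) << 1) >> 1);
}

/* Mask form: all ones shifted right by 1 gives 0x7fffffff, then AND. */
static float abs_mask(float x)
{
    uint32_t mask = 0xFFFFFFFFu >> 1;
    return bits_float(float_bits(x) & mask);
}
```

The shift form needs no constant at all, while the mask form costs a single instruction per iteration once the mask has been built outside the loop.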
SINGLE DWORD or QWORD absolute value in GPRS
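;32bit
cdq ; edx = sign of eax (0 or -1)
xor eax, edx ; flip all bits if negative
sub eax, edx ; add 1 if negative, completing the negation
;64bit
cqo ; rdx = sign of rax (0 or -1)
xor rax,rdx
sub rax,rdx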
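The same branchless idea in scalar C, as a sketch: build a sign word that is 0 or -1 (what cdq / cqo leave in edx / rdx), then xor and subtract. The arithmetic is done on unsigned values here to stay clear of signed overflow in C; the function names are hypothetical.

```c
#include <stdint.h>

static int32_t abs32(int32_t x)
{
    uint32_t sign = (x < 0) ? 0xFFFFFFFFu : 0u;      /* cdq: 0 or -1   */
    return (int32_t)((((uint32_t)x) ^ sign) - sign); /* xor, then sub  */
}

static int64_t abs64(int64_t x)
{
    uint64_t sign = (x < 0) ? UINT64_MAX : 0u;       /* cqo: 0 or -1   */
    return (int64_t)((((uint64_t)x) ^ sign) - sign); /* xor, then sub  */
}
```

As with the assembly, the most negative value has no positive counterpart and comes back unchanged on two's complement machines. The assembly versions also clobber edx / rdx.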
PACKED INTEGER ABS
; Just use VPABSD or equivalents.
Feel free to add to this list or suggest improved versions.
Two more ..
SIMD PACKED SINE and COSINE approximations:
Valid for inputs in the range -Pi to Pi. Over that range it produces more accurate results than a short Taylor series expansion and is considerably faster:
PI REAL4 3.1415926535898
PI_2 REAL4 1.5707963267949
PIx2 REAL4 6.2831853071796
align 16
P __m128f < 0.225, 0.225, 0.225, 0.225 >
B1 __m128f < 1.2732395447351626861510701069801, 1.2732395447351626861510701069801, 1.2732395447351626861510701069801, 1.2732395447351626861510701069801 > ; 4/pi
C1 __m128f < -0.40528473456935108577551785283891, -0.40528473456935108577551785283891, -0.40528473456935108577551785283891, -0.40528473456935108577551785283891 > ;-4/(pi*pi)
ABSMASK __m128i < 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff >
VPI __m128f < 3.1415926535898, 3.1415926535898, 3.1415926535898, 3.1415926535898 >
VPI_2 __m128f < 1.5707963267949, 1.5707963267949, 1.5707963267949, 1.5707963267949 >
VPIx2 __m128f < 6.2831853071796, 6.2831853071796, 6.2831853071796, 6.2831853071796 >
FSINE MACRO
;// Parabola
vandps xmm1, xmm0, ABSMASK
vmulps xmm1, xmm1, xmm0
vmulps xmm0, xmm0, B1
vmulps xmm1, xmm1, C1
vaddps xmm0, xmm0, xmm1
;// Extra precision
vandps xmm1, xmm0, ABSMASK
vmulps xmm1, xmm1, xmm0
vsubps xmm1, xmm1, xmm0
vmulps xmm1, xmm1, P
vaddps xmm0, xmm0, xmm1
ENDM
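For reference, a scalar C sketch of what FSINE computes per lane: a parabola y = Bx + Cx|x| through the zeros and peaks of the sine, followed by the correction y + P(y|y| - y). It assumes the input is already in -Pi to Pi and that P is 0.225 in every lane; the function name is hypothetical.

```c
/* Parabolic sine approximation for x in [-pi, pi]. */
static float fast_sin(float x)
{
    const float B = 1.27323954f;    /* 4/pi        */
    const float C = -0.40528473f;   /* -4/(pi*pi)  */
    const float P = 0.225f;         /* blend weight for the correction */

    float ax = (x < 0.0f) ? -x : x; /* |x|, the vandps with ABSMASK */
    float y  = B * x + C * (ax * x);         /* parabola          */

    float ay = (y < 0.0f) ? -y : y; /* |y| */
    return y + P * (ay * y - y);             /* extra precision   */
}
```

The result is exact (up to float rounding) at 0, Pi/6, Pi/2 and Pi, and the commonly quoted maximum absolute error for this form is about 0.001 in between.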
FCOSS MACRO
;// cos(x) = sin(x + pi/2)
vaddps xmm0, xmm0, VPI_2
vcmpnltps xmm1, xmm0, VPI
vandps xmm1, xmm1, VPIx2
vsubps xmm0, xmm0, xmm1
;// Parabola
vandps xmm1, xmm0, ABSMASK
vmulps xmm1, xmm1, xmm0
vmulps xmm0, xmm0, B1
vmulps xmm1, xmm1, C1
vaddps xmm0, xmm0, xmm1
;// Extra precision
vandps xmm1, xmm0, ABSMASK
vmulps xmm1, xmm1, xmm0
vsubps xmm1, xmm1, xmm0
vmulps xmm1, xmm1, P
vaddps xmm0, xmm0, xmm1
ENDM
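The only part of FCOSS that differs from FSINE is the first four instructions, which turn the cosine argument into a sine argument and wrap it back into range. A scalar C sketch of just that step, assuming the input is in -Pi to Pi (the function name is hypothetical):

```c
/* cos(x) = sin(x + pi/2). Adding pi/2 can push the argument past pi,
   so subtract 2*pi whenever the sum is not less than pi. */
static float cos_arg_to_sin_arg(float x)
{
    const float PI_F    = 3.14159265f;
    const float HALF_PI = 1.57079633f;
    const float TWO_PI  = 6.28318531f;

    float t = x + HALF_PI;          /* vaddps with VPI_2                 */
    if (!(t < PI_F))                /* vcmpnltps: "not less than" mask   */
        t -= TWO_PI;                /* vandps with VPIx2, then vsubps    */
    return t;
}
```

In the SIMD version the comparison produces an all-ones or all-zeros mask per lane, so ANDing it with 2*Pi gives either 2*Pi or 0.0 to subtract, which keeps the whole thing branch-free.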
It would be good to add decent approximations for log/exp/pow/tan/atan too, if anyone has anything lying around.
Hi Johnsa!
This is a very interesting old library. There are newer libraries, but they require 64-bit for installation (there was a thread about this not so long ago). Regards.