Author Topic: Fast SIMD transpose routines  (Read 4604 times)

nidud

  • Member
  • *****
  • Posts: 1980
    • https://github.com/nidud/asmc
Re: Fast SIMD transpose routines
« Reply #30 on: July 14, 2018, 08:45:29 AM »
Well, the arguments are passed in XMM0..3 and returned in XMM0..3, so the test will fail without this.
Code: [Select]
    movaps      [rdx+0],xmm4    ; [0 4 8 C]
    movaps      [rdx+16],xmm5   ; [1 5 9 D]
    movaps      [rdx+32],xmm6   ; [2 6 A E]
    movaps      [rdx+48],xmm2   ; [3 7 B F]
    ...
    movaps      xmm0,xmm4       ; [0 4 8 C]
    movaps      xmm1,xmm5       ; [1 5 9 D]
    movaps      xmm3,xmm2       ; [3 7 B F]
    movaps      xmm2,xmm6       ; [2 6 A E]

Maybe it's possible to rearrange the regs above to avoid some of this.
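For reference, this is roughly what the C side of such a test could look like, assuming the MSVC-style __vectorcall convention (the function and type names below are invented for illustration): four __m128 arguments arrive in XMM0..XMM3, and a homogeneous vector aggregate of four __m128 is returned in XMM0..XMM3.
Code: [Select]
#include <xmmintrin.h>

/* Hypothetical sketch, not the actual test code. */
typedef struct { __m128 r0, r1, r2, r3; } mat4;     /* HVA: returned in XMM0..XMM3 */

mat4 __vectorcall transpose4(__m128 c0, __m128 c1, __m128 c2, __m128 c3)
{
    _MM_TRANSPOSE4_PS(c0, c1, c2, c3);  /* 4x4 transpose; exact instruction mix depends on the header */
    mat4 m = { c0, c1, c2, c3 };
    return m;
}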

Siekmanski

  • Member
  • *****
  • Posts: 2330
Re: Fast SIMD transpose routines
« Reply #31 on: July 14, 2018, 08:54:30 AM »
No need to rearrange the regs if you include the memory reads and writes in the speed test.
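Something along these lines, I mean (just a C-intrinsics sketch with invented names, not the actual test code) - when the routine does its own loads and stores, it no longer matters which registers end up holding the result rows:
Code: [Select]
#include <xmmintrin.h>

/* Hypothetical memory-to-memory version; src and dst are 16-byte aligned
   row-major 4x4 float matrices. */
void transpose4_mem(const float *src, float *dst)
{
    __m128 r0 = _mm_load_ps(src +  0);
    __m128 r1 = _mm_load_ps(src +  4);
    __m128 r2 = _mm_load_ps(src +  8);
    __m128 r3 = _mm_load_ps(src + 12);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);
    _mm_store_ps(dst +  0, r0);
    _mm_store_ps(dst +  4, r1);
    _mm_store_ps(dst +  8, r2);
    _mm_store_ps(dst + 12, r3);
}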
Creative coders use backward thinking techniques as a strategy.

nidud

  • Member
  • *****
  • Posts: 1980
    • https://github.com/nidud/asmc
Re: Fast SIMD transpose routines
« Reply #32 on: July 14, 2018, 09:35:26 AM »
Well, I'm writing vector call tests at the moment where values are kept in registers over multiple calls, so the thinking and implementation are a bit different. As for AVX, in this case there doesn't seem to be (as you pointed out) any speed improvement apart from saving register copies. There is also VMOVHLPS, which can be used in the same way.

Quote
removed xmm6 (to make it simpler :biggrin:)

Or flipping 4 and 0.. :biggrin:
Code: [Select]
    movaps      xmm4,xmm0       ; [0 1 2 3]
    movaps      xmm5,xmm2       ; [8 9 A B]

    unpcklps    xmm0,xmm1       ; [0 4 1 5]
    unpcklps    xmm5,xmm3       ; [8 C 9 D]
    unpckhps    xmm4,xmm1       ; [2 6 3 7]
    unpckhps    xmm2,xmm3       ; [A E B F]
    movaps      xmm1,xmm0       ; [0 4 1 5]
    movaps      xmm6,xmm4       ; [2 6 3 7]
    movlhps     xmm0,xmm5       ; [0 4 8 C]
    movlhps     xmm6,xmm2       ; [2 6 A E]
    movhlps     xmm5,xmm1       ; [1 5 9 D]
    movhlps     xmm2,xmm4       ; [3 7 B F]

;   movaps      xmm0,xmm4       ; [0 4 8 C]
    movaps      xmm1,xmm5       ; [1 5 9 D]
    movaps      xmm3,xmm2       ; [3 7 B F]
    movaps      xmm2,xmm6       ; [2 6 A E]
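
For reference, the same unpck/movlhps transpose written with intrinsics (just a sketch, the function name is made up; the comments mirror the lane comments above):
Code: [Select]
#include <xmmintrin.h>

/* c0..c3 hold [0 1 2 3] [4 5 6 7] [8 9 A B] [C D E F] on entry */
void transpose4_regs(__m128 *c0, __m128 *c1, __m128 *c2, __m128 *c3)
{
    __m128 t0 = _mm_unpacklo_ps(*c0, *c1);  /* [0 4 1 5] */
    __m128 t1 = _mm_unpacklo_ps(*c2, *c3);  /* [8 C 9 D] */
    __m128 t2 = _mm_unpackhi_ps(*c0, *c1);  /* [2 6 3 7] */
    __m128 t3 = _mm_unpackhi_ps(*c2, *c3);  /* [A E B F] */
    *c0 = _mm_movelh_ps(t0, t1);            /* [0 4 8 C] */
    *c1 = _mm_movehl_ps(t1, t0);            /* [1 5 9 D] */
    *c2 = _mm_movelh_ps(t2, t3);            /* [2 6 A E] */
    *c3 = _mm_movehl_ps(t3, t2);            /* [3 7 B F] */
}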

nidud

  • Member
  • *****
  • Posts: 1980
    • https://github.com/nidud/asmc
Re: Fast SIMD transpose routines
« Reply #33 on: July 14, 2018, 10:37:03 AM »
Simpler and faster..
Code: [Select]
    vunpckhps xmm4,xmm2,xmm3    ; [A E B F]
    vunpcklps xmm2,xmm2,xmm3    ; [8 C 9 D]
    vunpckhps xmm3,xmm0,xmm1    ; [2 6 3 7]
    vunpcklps xmm1,xmm0,xmm1    ; [0 4 1 5]
    vmovlhps  xmm0,xmm1,xmm2    ; [0 4 8 C]
    vmovhlps  xmm1,xmm2,xmm1    ; [1 5 9 D]
    vmovlhps  xmm2,xmm3,xmm4    ; [2 6 A E]
    vmovhlps  xmm3,xmm4,xmm3    ; [3 7 B F]

Siekmanski

  • Member
  • *****
  • Posts: 2330
Re: Fast SIMD transpose routines
« Reply #34 on: July 14, 2018, 10:45:38 AM »
 :biggrin:
Quote
Well, I'm writing vector call tests at the moment where values are kept in registers over multiple calls, so the thinking and implementation are a bit different.

OK, that makes sense.

Quote
As for AVX, in this case there doesn't seem to be (as you pointed out) any speed improvement apart from saving register copies. There is also VMOVHLPS, which can be used in the same way.

You must have confused me with someone else; I've never done an AVX version yet.
But I will try it out some day.

I remember that we can replace VSHUFPS or VPERM2F128 with VBLENDPS instructions.
AVX shuffles are executed only on port 5, while blends are also executed on port 0.
VPERM2F128 instructions are not that fast.
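
For example (a hedged sketch, variable and function names invented): a shuffle that keeps every selected element in its original position can be swapped for a blend, which is not tied to port 5.
Code: [Select]
#include <immintrin.h>   /* needs AVX enabled (-mavx or /arch:AVX) */

/* Both return [a0 a1 b2 b3 | a4 a5 b6 b7]. */
__m256 pick_shuf (__m256 a, __m256 b) { return _mm256_shuffle_ps(a, b, _MM_SHUFFLE(3, 2, 1, 0)); }
__m256 pick_blend(__m256 a, __m256 b) { return _mm256_blend_ps(a, b, 0xCC); }

Shuffles that actually move elements around (like the unpacks) can't be replaced this way, so the gain depends on the mix.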

Maybe we can get some gain out of it.

I will look this up.

EDIT: Found it, it's in chapter 12, section 11.1:
http://members.home.nl/siekmanski/Intel_Optimization_Reference_Manual_248966-037.pdf
Creative coders use backward thinking techniques as a strategy.

Siekmanski

  • Member
  • *****
  • Posts: 2330
Re: Fast SIMD transpose routines
« Reply #35 on: July 14, 2018, 12:01:38 PM »
Quote
Simpler and faster..
Code: [Select]
    vunpckhps xmm4,xmm2,xmm3    ; [A E B F]
    vunpcklps xmm2,xmm2,xmm3    ; [8 C 9 D]
    vunpckhps xmm3,xmm0,xmm1    ; [2 6 3 7]
    vunpcklps xmm1,xmm0,xmm1    ; [0 4 1 5]
    vmovlhps  xmm0,xmm1,xmm2    ; [0 4 8 C]
    vmovhlps  xmm1,xmm2,xmm1    ; [1 5 9 D]
    vmovlhps  xmm2,xmm3,xmm4    ; [2 6 A E]
    vmovhlps  xmm3,xmm4,xmm3    ; [3 7 B F]

Timed this AVX piece on my computer; it's a little slower than the SSE version.
Creative coders use backward thinking techniques as a strategy.

nidud

  • Member
  • *****
  • Posts: 1980
    • https://github.com/nidud/asmc
Re: Fast SIMD transpose routines
« Reply #36 on: July 14, 2018, 01:38:28 PM »
I fail to see any difference between them except for the code size, at least on this machine. The AMD I used didn't have AVX, but the new box does, so there is the possibility of testing.

The vector calling convention demands a more conservative use of registers, so there are a few practical things you can do with AVX here, but I haven't really used it (yet).

Conservative SSE version:
Code: [Select]
    movaps   xmm4,xmm2          ; [8 9 A B]
    unpckhps xmm4,xmm3          ; [A E B F]
    unpcklps xmm2,xmm3          ; [8 C 9 D]
    movaps   xmm3,xmm0          ; [0 1 2 3]
    unpckhps xmm3,xmm1          ; [2 6 3 7]
    unpcklps xmm0,xmm1          ; [0 4 1 5]
    movaps   xmm1,xmm2          ; [8 C 9 D]
    movhlps  xmm1,xmm0          ; [1 5 9 D]
    movlhps  xmm0,xmm2          ; [0 4 8 C]
    movaps   xmm2,xmm3          ; [2 6 3 7]
    movlhps  xmm2,xmm4          ; [2 6 A E]
    movhlps  xmm4,xmm3          ; [3 7 B F]
    movaps   xmm3,xmm4          ; [3 7 B F]

daydreamer

  • Member
  • *****
  • Posts: 1360
  • building nextdoor
Re: Fast SIMD transpose routines
« Reply #37 on: July 14, 2018, 05:00:13 PM »
Great work!
So this works great for feeding D3D9 with loads of different matrices?
Quote from Flashdance
Nick  :  When you give up your dream, you die
*wears a flameproof asbestos suit*
Gone serverside programming p:  :D
I love assembly,because its legal to write
princess:lea eax,luke
:)