Author Topic: Fast SIMD transpose routines  (Read 4604 times)

nidud

  • Member
  • *****
  • Posts: 1980
    • https://github.com/nidud/asmc
Re: Fast SIMD transpose routines
« Reply #30 on: July 14, 2018, 08:45:29 AM »
Well, the arguments are passed in XMM0..3 and returned in XMM0..3, so the test will fail without this.
Code: [Select]
    movaps      [rdx+0],xmm4    ; [0 4 8 C]
    movaps      [rdx+16],xmm5   ; [1 5 9 D]
    movaps      [rdx+32],xmm6   ; [2 6 A E]
    movaps      [rdx+48],xmm2   ; [3 7 B F]
    ...
    movaps      xmm0,xmm4       ; [0 4 8 C]
    movaps      xmm1,xmm5       ; [1 5 9 D]
    movaps      xmm3,xmm2       ; [3 7 B F]
    movaps      xmm2,xmm6       ; [2 6 A E]

Maybe it's possible to rearrange the regs above to avoid some of this.
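For reference, this is roughly what the C side of such a test could look like, assuming the MSVC-style __vectorcall convention (the function and type names below are invented for illustration): four __m128 arguments arrive in XMM0..XMM3, and a homogeneous vector aggregate of four __m128 is returned in XMM0..XMM3.
Code: [Select]
#include <xmmintrin.h>

/* Hypothetical sketch, not the actual test code. */
typedef struct { __m128 r0, r1, r2, r3; } mat4;     /* HVA: returned in XMM0..XMM3 */

mat4 __vectorcall transpose4(__m128 c0, __m128 c1, __m128 c2, __m128 c3)
{
    _MM_TRANSPOSE4_PS(c0, c1, c2, c3);  /* 4x4 transpose; exact instruction mix depends on the header */
    mat4 m = { c0, c1, c2, c3 };
    return m;
}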

Siekmanski

  • Member
  • *****
  • Posts: 2330
Re: Fast SIMD transpose routines
« Reply #31 on: July 14, 2018, 08:54:30 AM »
No need to rearrange the regs if you include the memory reads and writes in the speed test.
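Something along these lines, I mean (just a C-intrinsics sketch with invented names, not the actual test code) - when the routine does its own loads and stores, it no longer matters which registers end up holding the result rows:
Code: [Select]
#include <xmmintrin.h>

/* Hypothetical memory-to-memory version; src and dst are 16-byte aligned
   row-major 4x4 float matrices. */
void transpose4_mem(const float *src, float *dst)
{
    __m128 r0 = _mm_load_ps(src +  0);
    __m128 r1 = _mm_load_ps(src +  4);
    __m128 r2 = _mm_load_ps(src +  8);
    __m128 r3 = _mm_load_ps(src + 12);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);
    _mm_store_ps(dst +  0, r0);
    _mm_store_ps(dst +  4, r1);
    _mm_store_ps(dst +  8, r2);
    _mm_store_ps(dst + 12, r3);
}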
Creative coders use backward thinking techniques as a strategy.

nidud

  • Member
  • *****
  • Posts: 1980
    • https://github.com/nidud/asmc
Re: Fast SIMD transpose routines
« Reply #32 on: July 14, 2018, 09:35:26 AM »
Well, I'm writing vector call tests at the moment where values are kept in registers over multiple calls, so the thinking and implementation are a bit different. As for AVX, in this case there doesn't seem to be (as you pointed out) any speed improvement apart from saving register copies. There is also VMOVHLPS, which can be used in the same way.

Quote
removed xmm6 (to make it simpler :biggrin:)

Or flipping 4 and 0.. :biggrin:
Code: [Select]
    movaps      xmm4,xmm0       ; [0 1 2 3]
    movaps      xmm5,xmm2       ; [8 9 A B]

    unpcklps    xmm0,xmm1       ; [0 4 1 5]
    unpcklps    xmm5,xmm3       ; [8 C 9 D]
    unpckhps    xmm4,xmm1       ; [2 6 3 7]
    unpckhps    xmm2,xmm3       ; [A E B F]
    movaps      xmm1,xmm0       ; [0 4 1 5]
    movaps      xmm6,xmm4       ; [2 6 3 7]
    movlhps     xmm0,xmm5       ; [0 4 8 C]
    movlhps     xmm6,xmm2       ; [2 6 A E]
    movhlps     xmm5,xmm1       ; [1 5 9 D]
    movhlps     xmm2,xmm4       ; [3 7 B F]

;   movaps      xmm0,xmm4       ; [0 4 8 C]
    movaps      xmm1,xmm5       ; [1 5 9 D]
    movaps      xmm3,xmm2       ; [3 7 B F]
    movaps      xmm2,xmm6       ; [2 6 A E]
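
For reference, the same unpck/movlhps transpose written with intrinsics (just a sketch, the function name is made up; the comments mirror the lane comments above):
Code: [Select]
#include <xmmintrin.h>

/* c0..c3 hold [0 1 2 3] [4 5 6 7] [8 9 A B] [C D E F] on entry */
void transpose4_regs(__m128 *c0, __m128 *c1, __m128 *c2, __m128 *c3)
{
    __m128 t0 = _mm_unpacklo_ps(*c0, *c1);  /* [0 4 1 5] */
    __m128 t1 = _mm_unpacklo_ps(*c2, *c3);  /* [8 C 9 D] */
    __m128 t2 = _mm_unpackhi_ps(*c0, *c1);  /* [2 6 3 7] */
    __m128 t3 = _mm_unpackhi_ps(*c2, *c3);  /* [A E B F] */
    *c0 = _mm_movelh_ps(t0, t1);            /* [0 4 8 C] */
    *c1 = _mm_movehl_ps(t1, t0);            /* [1 5 9 D] */
    *c2 = _mm_movelh_ps(t2, t3);            /* [2 6 A E] */
    *c3 = _mm_movehl_ps(t3, t2);            /* [3 7 B F] */
}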

nidud

  • Member
  • *****
  • Posts: 1980
    • https://github.com/nidud/asmc
Re: Fast SIMD transpose routines
« Reply #33 on: July 14, 2018, 10:37:03 AM »
Simpler and faster..
Code: [Select]
    vunpckhps xmm4,xmm2,xmm3    ; [A E B F]
    vunpcklps xmm2,xmm2,xmm3    ; [8 C 9 D]
    vunpckhps xmm3,xmm0,xmm1    ; [2 6 3 7]
    vunpcklps xmm1,xmm0,xmm1    ; [0 4 1 5]
    vmovlhps  xmm0,xmm1,xmm2    ; [0 4 8 C]
    vmovhlps  xmm1,xmm2,xmm1    ; [1 5 9 D]
    vmovlhps  xmm2,xmm3,xmm4    ; [2 6 A E]
    vmovhlps  xmm3,xmm4,xmm3    ; [3 7 B F]

Siekmanski

  • Member
  • *****
  • Posts: 2330
Re: Fast SIMD transpose routines
« Reply #34 on: July 14, 2018, 10:45:38 AM »
 :biggrin:
Quote
Well, I'm writing vector call tests at the moment where values are kept in registers over multiple calls, so the thinking and implementation are a bit different.

OK, that makes sense.

Quote
As for AVX, in this case there doesn't seem to be (as you pointed out) any speed improvement apart from saving register copies. There is also VMOVHLPS, which can be used in the same way.

You must have confused me with someone else; I've never done an AVX version yet.
But I will try it out some day.

I remember that we can replace VSHUFPS or VPERM2F128 with VBLENDPS instructions.
AVX shuffles are executed only on port 5, while blends are also executed on port 0.
VPERM2F128 instructions are not that fast.
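
For example (a hedged sketch, variable and function names invented): a shuffle that keeps every selected element in its original position can be swapped for a blend, which is not tied to port 5.
Code: [Select]
#include <immintrin.h>   /* needs AVX enabled (-mavx or /arch:AVX) */

/* Both return [a0 a1 b2 b3 | a4 a5 b6 b7]. */
__m256 pick_shuf (__m256 a, __m256 b) { return _mm256_shuffle_ps(a, b, _MM_SHUFFLE(3, 2, 1, 0)); }
__m256 pick_blend(__m256 a, __m256 b) { return _mm256_blend_ps(a, b, 0xCC); }

Shuffles that actually move elements around (like the unpacks) can't be replaced this way, so the gain depends on the mix.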

Maybe we can get some gain out of it.

I will look this up.

EDIT: Found it, it's in chapter 12, section 11.1:
http://members.home.nl/siekmanski/Intel_Optimization_Reference_Manual_248966-037.pdf
Creative coders use backward thinking techniques as a strategy.

Siekmanski

  • Member
  • *****
  • Posts: 2330
Re: Fast SIMD transpose routines
« Reply #35 on: July 14, 2018, 12:01:38 PM »
Quote
Simpler and faster..
Code: [Select]
    vunpckhps xmm4,xmm2,xmm3    ; [A E B F]
    vunpcklps xmm2,xmm2,xmm3    ; [8 C 9 D]
    vunpckhps xmm3,xmm0,xmm1    ; [2 6 3 7]
    vunpcklps xmm1,xmm0,xmm1    ; [0 4 1 5]
    vmovlhps  xmm0,xmm1,xmm2    ; [0 4 8 C]
    vmovhlps  xmm1,xmm2,xmm1    ; [1 5 9 D]
    vmovlhps  xmm2,xmm3,xmm4    ; [2 6 A E]
    vmovhlps  xmm3,xmm4,xmm3    ; [3 7 B F]

Timed this AVX piece on my computer; it's a little slower than the SSE version.
Creative coders use backward thinking techniques as a strategy.

nidud

  • Member
  • *****
  • Posts: 1980
    • https://github.com/nidud/asmc
Re: Fast SIMD transpose routines
« Reply #36 on: July 14, 2018, 01:38:28 PM »
I fail to see any difference between them except for the code size, at least on this machine. The AMD I used didn't have AVX, but the new box does, so there is the possibility of testing.

The vector calling convention demands a more conservative use of registers, so there are a few practical things you can do with AVX here, but I haven't really used it (yet).

Conservative SSE version:
Code: [Select]
    movaps   xmm4,xmm2          ; [8 9 A B]
    unpckhps xmm4,xmm3          ; [A E B F]
    unpcklps xmm2,xmm3          ; [8 C 9 D]
    movaps   xmm3,xmm0          ; [0 1 2 3]
    unpckhps xmm3,xmm1          ; [2 6 3 7]
    unpcklps xmm0,xmm1          ; [0 4 1 5]
    movaps   xmm1,xmm2          ; [8 C 9 D]
    movhlps  xmm1,xmm0          ; [1 5 9 D]
    movlhps  xmm0,xmm2          ; [0 4 8 C]
    movaps   xmm2,xmm3          ; [2 6 3 7]
    movlhps  xmm2,xmm4          ; [2 6 A E]
    movhlps  xmm4,xmm3          ; [3 7 B F]
    movaps   xmm3,xmm4          ; [3 7 B F]

daydreamer

  • Member
  • *****
  • Posts: 1360
  • building nextdoor
Re: Fast SIMD transpose routines
« Reply #37 on: July 14, 2018, 05:00:13 PM »
Great work!
So this works great for feeding D3D9 with loads of different matrices?
Quote from Flashdance
Nick  :  When you give up your dream, you die
*wears a flameproof asbestos suit*
Gone serverside programming p:  :D
I love assembly,because its legal to write
princess:lea eax,luke
:)