Fastest ever 32-bit Matrix Transposing

aw27 · November 06, 2018, 03:01:19 AM

Except for some very small matrices, the attached code should be as fast as we can go on 32-bit matrix transposing (well, using CUDA may be faster, or not be). It will work for any ROWSxCOLS matrix, simply edit the values for ROWS and COLS in the source code and rebuild.

The core code is only this for any ROWSxCOLS matrix:

Code Select


; ************ TRANSPOSING CODE START ***********************	
@anotherCol:	
	mov ebx, 0
	mov eax, 0
	.while ebx<vertLoads
		vmovdqa ymm2, YMMWORD PTR _mask
		VPGATHERDD ymm1, [edx+ymm0*1], ymm2
		vmovdqu YMMWORD PTR [edi+eax], ymm1
		inc ebx
		add edx, REGWIDTH*COLS*4
		add eax, REGWIDTH*4
	.endw	
	add esi, sizeof DWORD
	mov edx, esi
	add edi, ROWS*sizeof DWORD
	loop @anotherCol	
; ************ TRANSPOSING CODE ENDS ************************

You will need AVX2 to run the program due to the instruction VPGATHERDD, that is you need a Haswell CPU or later (some AMDs support it as well, I believe). Note the special form of addressing used by VPGATHERDD, called VSIB addressing.

A few output samples:

Code Select


Transposing a 1x3 Matrix
Before:
row 0   1       2       3
After:
row 0   1
row 1   2
row 2   3

Code Select


Transposing a 11x9 Matrix
Before:
row 0   1       2       3       4       5       6       7       8       9
row 1   10      11      12      13      14      15      16      17      18
row 2   19      20      21      22      23      24      25      26      27
row 3   28      29      30      31      32      33      34      35      36
row 4   37      38      39      40      41      42      43      44      45
row 5   46      47      48      49      50      51      52      53      54
row 6   55      56      57      58      59      60      61      62      63
row 7   64      65      66      67      68      69      70      71      72
row 8   73      74      75      76      77      78      79      80      81
row 9   82      83      84      85      86      87      88      89      90
row 10  91      92      93      94      95      96      97      98      99
After:
row 0   1       10      19      28      37      46      55      64      73      82      91
row 1   2       11      20      29      38      47      56      65      74      83      92
row 2   3       12      21      30      39      48      57      66      75      84      93
row 3   4       13      22      31      40      49      58      67      76      85      94
row 4   5       14      23      32      41      50      59      68      77      86      95
row 5   6       15      24      33      42      51      60      69      78      87      96
row 6   7       16      25      34      43      52      61      70      79      88      97
row 7   8       17      26      35      44      53      62      71      80      89      98
row 8   9       18      27      36      45      54      63      72      81      90      99

aw27 · November 06, 2018, 09:53:54 PM

We can also do it with AVX-512, which is arriving to the mainstream namely with the Intel 7800X CPU.
If you don't have AVX-512 you can use the Intel Emulator, as I did, which works very well.

Note that while using AVX-512 we will have to deal with another new concept called OPMASK REGISTERS.

* Tested as follows in the Intel Emulator (emulating Knights mill CPU):
"H:\Intel Emulator\sde-external-8.16.0-2018-01-30-win\sde" -knm -- trans32z.exe

The core code is this:

Code Select


; ************ TRANSPOSING CODE START ***********************	
@anotherCol:
	mov ebx, 0FFFFh
	kmovw k1, ebx
	mov ebx, 0
	xor eax, eax
	.while ebx<vertLoads
		VPGATHERDD zmm1{k1}, [edx+zmm0]
		vmovdqu32 ZMMWORD PTR [edi+eax], zmm1
		inc ebx
		add edx, REGWIDTH*COLS*4
		add eax, REGWIDTH*4
	.endw	
	add esi, sizeof DWORD
	mov edx, esi
	add edi, ROWS*sizeof DWORD
	loop @anotherCol	
; ************ TRANSPOSING CODE ENDS ************************

Sample output:

Code Select


Transposing a 9x7 Matrix
Before:
row 0   1       2       3       4       5       6       7
row 1   8       9       10      11      12      13      14
row 2   15      16      17      18      19      20      21
row 3   22      23      24      25      26      27      28
row 4   29      30      31      32      33      34      35
row 5   36      37      38      39      40      41      42
row 6   43      44      45      46      47      48      49
row 7   50      51      52      53      54      55      56
row 8   57      58      59      60      61      62      63
After:
row 0   1       8       15      22      29      36      43      50      57
row 1   2       9       16      23      30      37      44      51      58
row 2   3       10      17      24      31      38      45      52      59
row 3   4       11      18      25      32      39      46      53      60
row 4   5       12      19      26      33      40      47      54      61
row 5   6       13      20      27      34      41      48      55      62
row 6   7       14      21      28      35      42      49      56      63

The MASM Forum

News:

Fastest ever 32-bit Matrix Transposing

aw27

aw27