Matrix Transposing - the CUDA way

Started by aw27, May 24, 2019, 12:59:00 AM


aw27

I believe some people have been missing discussions about Matrix Transposing (an irony emoticon is needed here; I will make a note to roll one as soon as possible).
I remember when people stacked up algorithms against other algorithms, based on results obtained under unclear conditions, and produced lengthy reports that nobody in their right mind would bother to read.

As expected, Matrix Transposing using CUDA really makes a big difference, not a small one but a difference approaching two orders of magnitude (in other words, not far from 100 times).

Based on this article and on a sample from the NVidia SDK (most likely by the same author), I performed some tests with that sample and got this:

For a 1024x1024 matrix size:



These results differ somewhat from those in the article, but the article is six years old, so I concluded that, except for "transpose naive", all other 6 transpose methods are roughly equivalent. However, my choice went to "transpose coalesced" because the name left me a bit intrigued (maybe I will try other methods later, as an exercise).

To make a long story short, I made a program using UASM; for the CUDA part I used the Driver API and selected the "transpose coalesced" method, as mentioned before.
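In outline, the Driver API sequence the program goes through is the usual one: initialize, load the PTX module, get the kernel, copy the matrix to the device, launch, copy back. A minimal C sketch of that sequence follows (illustrative only, not the actual UASM source; the .ptx file name is a placeholder, and the real program embeds the PTX image and loads it with cuModuleLoadData):

#include <cuda.h>      /* CUDA Driver API */
#include <stdlib.h>

#define TILE_DIM 16
#define N        2048                        /* matrix is N x N floats */

int main(void)
{
    CUdevice dev; CUcontext ctx; CUmodule mod; CUfunction fn;
    CUdeviceptr dIn, dOut;
    size_t bytes = (size_t)N * N * sizeof(float);
    float *hIn  = (float *)malloc(bytes);    /* fill with the source matrix */
    float *hOut = (float *)malloc(bytes);
    int width = N, height = N;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    /* load the pre-compiled kernel; the real program uses cuModuleLoadData
       on an embedded PTX image instead of a file */
    cuModuleLoad(&mod, "transposeCoalesced.ptx");
    cuModuleGetFunction(&fn, mod, "transposeCoalesced");

    cuMemAlloc(&dIn, bytes);
    cuMemAlloc(&dOut, bytes);
    cuMemcpyHtoD(dIn, hIn, bytes);

    /* 2048/16 = 128, hence the "128x128 tiles" in the output below */
    void *args[] = { &dOut, &dIn, &width, &height };
    cuLaunchKernel(fn, N / TILE_DIM, N / TILE_DIM, 1,   /* grid  */
                       TILE_DIM, TILE_DIM, 1,           /* block */
                       0, NULL, args, NULL);
    cuCtxSynchronize();

    cuMemcpyDtoH(hOut, dOut, bytes);

    cuMemFree(dIn); cuMemFree(dOut);
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    free(hIn); free(hOut);
    return 0;
}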

The performance of my program is better than that of the NVidia sample above, both because I used (U)ASM (maybe ASM is a reason) and because I used the Driver API (which is faster than the Runtime API used in the NVidia sample). For the 1024x1024 matrix I got:
Throughput = 139.0773 GB/s, Time = 0.06032 ms

The distributed program, in the attachment below, was assembled for a 2048x2048 matrix. The program performs 100 repetitions and the results are the average (the total divided by 100). If you modify the program for larger matrix sizes, such as 8192x8192, it may take a noticeable time (but under 1 minute for the 100 repetitions).
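For reference, the reported throughput is the usual effective-bandwidth figure: the matrix is read once and written once, so bytes moved = 2 x N x N x 4. For 2048x2048 that is 2 x 2048 x 2048 x 4 = 33,554,432 bytes; divided by the 0.24845 ms of the GPU run below this gives about 135 GB/s, and divided by the 19.65 ms of the CPU run it gives about 1.7 GB/s, matching the printed values.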

I tested the program on 3 computers, the oldest one with a 7-year-old GeForce GT 640, which has compute capability 3 (the minimum that my SDK supports). With an older SDK, I believe it would be possible to build an adequate .PTX and, with the older cuda.lib, produce something that works on older graphics cards.

If you have a relatively recent NVidia card and the program does not work, the likely reason is outdated drivers. Go to Control Panel/Device Manager/Display adapters and choose Update driver from the context menu.


GPU Device 0: 'GeForce GTX 1060 6GB' with compute capability 6.1

*** CPU Transpose - Matrix size: 2048x2048 ***
Throughput = 1.7072 GB/s, Time = 19.65434 ms

*** GPU Transpose - Matrix size: 2048x2048 (128x128 tiles), tile size: 16x16, block size: 16x16 ***
CUDA Transpose Coalesced Method: Throughput = 135.0527 GB/s, Time = 0.24845 ms

Comparation of CPU and GPU Transposed Matrices: PERFECT MATCH!
<Press any key to Exit>
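A note on the CPU figure: the reference transpose that the GPU result is checked against can be as simple as a plain nested loop over the N x N floats; the C sketch below is illustrative only, not the actual code in the attachment.

/* straightforward CPU transpose of a width x height float matrix */
void transposeCPU(float *odata, const float *idata, int width, int height)
{
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            odata[x * height + y] = idata[y * width + x];
}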






LiaoMi

Hi AW,

my test results
GPU Device 0: 'Quadro P4000' with compute capability 6.1

*** CPU Transpose - Matrix size: 2048x2048 ***
Throughput = 0.9495 GB/s, Time = 35.33960 ms

*** GPU Transpose - Matrix size: 2048x2048 (128x128 tiles), tile size: 16x16, block size: 16x16 ***
CUDA Transpose Coalesced Method: Throughput = 155.3354 GB/s, Time = 0.21601 ms

Comparation of CPU and GPU Transposed Matrices: PERFECT MATCH!

<Press any key to Exit>


Cool example  :Thmbsup:

daydreamer


But isn't matrix transpose what the d3d api uses under the hood when it instances loads of the same meshes?
Same meshes, only some matrix and color data for each object that change rotation, size, translation and color
my non-asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

jimg

I thought I'd give this a try, but no luck, even after updating drivers and rebooting.  I get

GPU Device 0: 'GeForce GTX 1080' with compute capability 6.1

cuModuleLoadData failed.


using NVDUDA.DLL version 09.17.0524  (NVIDIA CUDA 9.2.217 driver)
any ideas?


LiaoMi

Quote from: jimg on May 24, 2019, 05:53:40 AM
I thought I'd give this a try, but no luck, even after updating drivers and rebooting.  I get

GPU Device 0: 'GeForce GTX 1080' with compute capability 6.1

cuModuleLoadData failed.


using NVDUDA.DLL version 09.17.0524  (NVIDIA CUDA 9.2.217 driver)
any ideas?

NV DU DA DLL sounds funny in German :biggrin: NVidia Du Da dll!

Can you try installing the Nvidia SDK 10.1?!
Quote from: AW
Later:
I could run my little programs on the computer with the NVidia GeForce GT 640 after installing the Nvidia SDK 10.1 on that computer.
So it was a problem of outdated drivers.

aw27

@LiaoMi
Thank you for testing, I am happy it works great for you.  :biggrin:

@daydreamer
Quote from: daydreamer on May 24, 2019, 04:43:23 AM
But isn't matrix transpose what the d3d api uses under the hood when it instances loads of the same meshes?
Same meshes, only some matrix and color data for each object that change rotation, size, translation and color
Both Direct3D and CUDA (and others) interface directly with the graphics card drivers. However, Direct3D is more directed at graphics, while CUDA is more general purpose. CUDA can also work with Direct3D (and OpenGL), and there are samples in the SDK for that. A few things that CUDA does can be done in D3D and vice versa. CUDA is probably more directed at the scientific community, which doesn't want to do matrix calculations in pixel shaders.  :biggrin:

Quote from: jimg on May 24, 2019, 05:53:40 AM
I thought I'd give this a try, but no luck, even after updating drivers and rebooting.  I get

GPU Device 0: 'GeForce GTX 1080' with compute capability 6.1
cuModuleLoadData failed.
using NVDUDA.DLL version 09.17.0524  (NVIDIA CUDA 9.2.217 driver)
any ideas?

I have today updated one computer with a GeForce GTX 1050 which was producing that same error.
It updated to NVidia CUDA 10.1.135 (23 April 2019).

jimg

Got it.  Had to load that ungawdly GeForce Experience and then load the "Creator Ready Driver".

results:
GPU Device 0: 'GeForce GTX 1080' with compute capability 6.1

*** CPU Transpose - Matrix size: 2048x2048 ***
Throughput = 1.4749 GB/s, Time = 22.75060 ms

*** GPU Transpose - Matrix size: 2048x2048 (128x128 tiles), tile size: 16x16, block size: 16x16 ***
CUDA Transpose Coalesced Method: Throughput = 231.3093 GB/s, Time = 0.14506 ms

Comparation of CPU and GPU Transposed Matrices: PERFECT MATCH!

aw27

Quote from: jimg on May 24, 2019, 10:08:34 AM
CUDA Transpose Coalesced Method: Throughput = 231.3093 GB/s, Time = 0.14506 ms

Pornographically fast graphics card.  :dazzled:

Siekmanski

Have you guys tried 256*256 tiles? I think they are processed faster.
Creative coders use backward thinking techniques as a strategy.

LiaoMi

Quote from: AW on May 24, 2019, 04:17:37 PM
Quote from: jimg on May 24, 2019, 10:08:34 AM
CUDA Transpose Coalesced Method: Throughput = 231.3093 GB/s, Time = 0.14506 ms

Pornographically fast graphics card.  :dazzled:

:biggrin: It is surprising that the speed results almost always correspond to previous tests; the results are fairly stable. It would be interesting to check an RTX card for the statistics :badgrin: On the Quadro there are different settings for CAD programs; I tried every option, and the best result was with the standard settings. Surprisingly, NVIDIA estimates the relative performance of video cards quite accurately; in percentage terms the results are very close. According to the documentation, the performance gain on the RTX version should be around 20 percent.

aw27

Quote from: Siekmanski on May 24, 2019, 06:30:08 PM
Have you guys tried 256*256 tiles? I think they are processed faster.
I don't think it will work properly for this "coalesced" model. However, it may work for 32x32, but I need to check, because for compute capability 2+ "coalescing" is done in warps. I am only speculating.

Quote from: LiaoMi on May 24, 2019, 07:14:57 PM
Then, according to the documentation, the performance gain on the RTX version will be 20 percent.
Hopefully, someone will step forward and state he/she has an RTX board.  :cool:

aw27

I performed some tests and it will not work with tiles and blocks above 32.
BTW, this is the kernel's original code:


#define TILE_DIM    16
#define BLOCK_ROWS  16

// coalesced transpose (with bank conflicts)
extern "C" __global__ void transposeCoalesced(float *odata, float *idata, int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM];

    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
    int index_in = xIndex + (yIndex)*width;

    xIndex = blockIdx.y * TILE_DIM + threadIdx.x;
    yIndex = blockIdx.x * TILE_DIM + threadIdx.y;
    int index_out = xIndex + (yIndex)*height;

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
    {
        tile[threadIdx.y + i][threadIdx.x] = idata[index_in + i * width];
    }

    __syncthreads();

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
    {
        odata[index_out + i * height] = tile[threadIdx.x][threadIdx.y + i];
    }
}


So,

Tiles 32, blocks 32
Tiles 32, blocks 16
Tiles 32, blocks 8

work, but perform worse than my original proposal.
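For what it is worth, the 32 limit is consistent with the hardware limits (my reading, not a tested explanation): the sample launches TILE_DIM x BLOCK_ROWS threads per block, so 32x32 is already 1024 threads, the per-block maximum on these cards, and 64x64 would need 4096; the shared tile also grows as TILE_DIM squared (32 x 32 x 4 = 4 KB, 64 x 64 x 4 = 16 KB per block).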

If you want to test, you need to recompile the kernel with the new values for tiles/blocks and also modify uinc.inc.
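The PTX can be rebuilt with nvcc, for example (assuming the kernel is saved as transposeCoalesced.cu; the file name is just an example):

nvcc -ptx -arch=compute_30 transposeCoalesced.cu -o transposeCoalesced.ptx

compute_30 corresponds to compute capability 3.0, the GT 640 mentioned earlier; newer cards can JIT the same PTX through the driver.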

PS:
Actually, the function
extern "C" __global__ void transposeCoalesced(float *odata, float *idata, int width, int height)
should have been declared as
extern "C" __global__ void transposeCoalesced(int *odata, int *idata, int width, int height)
but that makes no difference in this case.