Hi all,

Here we have one procedure that multiplies any square matrix MxM by any square matrix KxK (with M = K) using SSE instructions, and one that does the same using the FPU. With this work we can now handle all the cases:

    invoke MultiplyMxN_NxK_v6SSE, pMatX, pMatY, pMatXY   ; general case
    invoke MultiplyMxM_MxM_v1SSE, pMatX, pMatXY          ; particular case Y=X
    invoke MultiplyMxM_KxK_v1SSE, pMatX, pMatY, pMatXY   ; particular case Y<>X

Note: each matrix stores its dimensions just behind its address, which seems to be the best way to test the procedures. If you prefer, modify the procedures and pass the dimensions of each matrix as arguments. Note also that no procedure calls any other procedure, so the code is large or very large, especially MultiplyMxN_NxK_v6SSE.
VERSION 1:
PROCEDURE: MultiplyMxM_KxK_v1SSE
FILE: multiplySSEMxM_KxK_v1.inc
MACROS: multiplyMxN_KxK_v2A.mac
        multiplyMxN_KxK_v2B.mac
        basicmulMxN_KxK_v1.mac
VERSION FPU:
PROCEDURE: MultiplyMxM_KxK_v1FPU
FILE: multiplyFPUMxM_KxK_v1.inc
DOCUMENTATION: TEXT_ABOUT_MULTIPLY_SSE_REAL4.txt
MATRIX DEFINITION: Every matrix (here matrixX) must be defined like this:
ALIGN 16
dd ?
dd ?
dd M ; <<--- number of columns
dd M ; <<--- number of lines
matrixX dd (M*M) dup (?)
If you want to allocate the matrices dynamically, see the file AllocMemory.inc.
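As a rough illustration of the layout above, here is a C sketch of a matrix block with the 16-byte header in front of the data. It is not taken from AllocMemory.inc. The helper names mat_alloc, mat_rows, mat_cols and mat_free are hypothetical; only the header layout (two unused dwords, then columns, then lines, then the 16-byte-aligned data) comes from the definition above. It assumes a C11 library that provides aligned_alloc.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helpers: only the 16-byte header layout comes from the post. */
enum { MAT_HEADER_BYTES = 16 };

/* Allocate a rows x cols REAL4 matrix with its header just before the data.
   Returns a pointer to the first element (16-byte aligned), or NULL. */
float *mat_alloc(uint32_t rows, uint32_t cols)
{
    size_t bytes = MAT_HEADER_BYTES + (size_t)rows * cols * sizeof(float);
    bytes = (bytes + 15u) & ~(size_t)15u;  /* aligned_alloc wants a multiple of 16 */

    uint32_t *block = aligned_alloc(16, bytes);
    if (block == NULL)
        return NULL;

    block[0] = 0;      /* dd ?  (unused)            */
    block[1] = 0;      /* dd ?  (unused)            */
    block[2] = cols;   /* dd M  ; number of columns */
    block[3] = rows;   /* dd M  ; number of lines   */

    float *data = (float *)(block + 4);
    assert(((uintptr_t)data & 15u) == 0);  /* header is 16 bytes, so data stays aligned */
    return data;
}

/* The dimensions sit at data-8 (columns) and data-4 (lines). */
uint32_t mat_cols(const float *data) { return ((const uint32_t *)data)[-2]; }
uint32_t mat_rows(const float *data) { return ((const uint32_t *)data)[-1]; }

/* Release the whole block, header included. */
void mat_free(float *data)
{
    if (data != NULL)
        free((uint32_t *)data - 4);
}
```

Because the header is exactly 16 bytes, aligning the start of the block to 16 also aligns the first matrix element, which is what aligned SSE loads need.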
VERIFY SSE PROCEDURES: Use multiplyMxM_KxK_v1.exe/asm
Please test it on your CPU (i5/i7/AMD).
Use ExecuteTestmultiplyMxM_KxK_SSEv1.bat and post the file ResultsmultiplyMxM_KxK_v1.txt.
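If you want to check the SSE output independently of the test program, one option is to compare it against a plain triple loop. The following is a minimal C sketch, assuming the matrices are stored row by row (the post does not state the storage order, so treat that as an assumption). The names ref_multiply and max_abs_diff are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* out = x * y for n x n REAL4 matrices, assumed to be stored row by row. */
void ref_multiply(const float *x, const float *y, float *out, size_t n)
{
    assert(out != x && out != y);  /* the result must not overwrite an input */
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t k = 0; k < n; ++k)
                acc += x[i * n + k] * y[k * n + j];
            out[i * n + j] = acc;
        }
    }
}

/* Largest absolute element-wise difference between two buffers of count floats. */
float max_abs_diff(const float *a, const float *b, size_t count)
{
    float worst = 0.0f;
    for (size_t i = 0; i < count; ++i) {
        float d = a[i] - b[i];
        if (d < 0.0f)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

An SSE routine may add the products in a different order than this loop, so it is safer to compare against a small tolerance than to expect bit-identical results.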
Good luck
RuiLoureiro

Some results:
***** Time table - LoopCount =100 000 *****
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
50 cycles, MultiplyMxM_KxK_v1SSE, MatrixX4x4 * MatrixY4x4
319 cycles, MultiplyMxM_KxK_v1FPU, MatrixX4x4 * MatrixY4x4
264 cycles, MultiplyMxM_KxK_v1SSE, MatrixX8x8 * MatrixY8x8
2031 cycles, MultiplyMxM_KxK_v1FPU, MatrixX8x8 * MatrixY8x8
745 cycles, MultiplyMxM_KxK_v1SSE, MatrixX10x10 * MatrixY10x10
3512 cycles, MultiplyMxM_KxK_v1FPU, MatrixX10x10 * MatrixY10x10
769 cycles, MultiplyMxM_KxK_v1SSE, MatrixX12x12 * MatrixY12x12
5827 cycles, MultiplyMxM_KxK_v1FPU, MatrixX12x12 * MatrixY12x12
891 cycles, MultiplyMxM_KxK_v1SSE, MatrixX11x11 * MatrixY11x11
4573 cycles, MultiplyMxM_KxK_v1FPU, MatrixX11x11 * MatrixY11x11
1402 cycles, MultiplyMxM_KxK_v1SSE, MatrixX13x13 * MatrixY13x13
7312 cycles, MultiplyMxM_KxK_v1FPU, MatrixX13x13 * MatrixY13x13
1792 cycles, MultiplyMxM_KxK_v1SSE, MatrixX14x14 * MatrixY14x14
9026 cycles, MultiplyMxM_KxK_v1FPU, MatrixX14x14 * MatrixY14x14
1806 cycles, MultiplyMxM_KxK_v1SSE, MatrixX16x16 * MatrixY16x16
13263 cycles, MultiplyMxM_KxK_v1FPU, MatrixX16x16 * MatrixY16x16
1980 cycles, MultiplyMxM_KxK_v1SSE, MatrixX15x15 * MatrixY15x15
11000 cycles, MultiplyMxM_KxK_v1FPU, MatrixX15x15 * MatrixY15x15
3343 cycles, MultiplyMxM_KxK_v1SSE, MatrixX17x17 * MatrixY17x17
15792 cycles, MultiplyMxM_KxK_v1FPU, MatrixX17x17 * MatrixY17x17
3410 cycles, MultiplyMxM_KxK_v1SSE, MatrixX20x20 * MatrixY20x20 ; <<< align 16 effect
25353 cycles, MultiplyMxM_KxK_v1FPU, MatrixX20x20 * MatrixY20x20
3690 cycles, MultiplyMxM_KxK_v1SSE, MatrixX18x18 * MatrixY18x18
18647 cycles, MultiplyMxM_KxK_v1FPU, MatrixX18x18 * MatrixY18x18
4119 cycles, MultiplyMxM_KxK_v1SSE, MatrixX19x19 * MatrixY19x19
21822 cycles, MultiplyMxM_KxK_v1FPU, MatrixX19x19 * MatrixY19x19
14019 cycles, MultiplyMxM_KxK_v1SSE, MatrixX32x32 * MatrixY32x32 ; 13.8% ...
101255 cycles, MultiplyMxM_KxK_v1FPU, MatrixX32x32 * MatrixY32x32 ; 7.2 * 14019
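The two trailing comments on the 32x32 lines above appear to be ratios between the SSE and FPU timings. A small C sketch of that arithmetic, with hypothetical function names:

```c
#include <assert.h>

/* How many times more cycles the FPU version takes than the SSE version. */
double speedup(double fpu_cycles, double sse_cycles)
{
    assert(sse_cycles > 0.0);
    return fpu_cycles / sse_cycles;
}

/* SSE time expressed as a percentage of the FPU time. */
double sse_share_percent(double fpu_cycles, double sse_cycles)
{
    assert(fpu_cycles > 0.0);
    return 100.0 * sse_cycles / fpu_cycles;
}
```

With the 32x32 figures from the i7-4930K table, speedup(101255, 14019) is about 7.2 and sse_share_percent(101255, 14019) is about 13.8, which matches the two comments.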
LiaoMi:
***** Time table - LoopCount =100 000 *****
Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz (SSE4)
39 cycles, MultiplyMxM_KxK_v1SSE, MatrixX4x4 * MatrixY4x4
229 cycles, MultiplyMxM_KxK_v1FPU, MatrixX4x4 * MatrixY4x4
177 cycles, MultiplyMxM_KxK_v1SSE, MatrixX8x8 * MatrixY8x8
1066 cycles, MultiplyMxM_KxK_v1FPU, MatrixX8x8 * MatrixY8x8
476 cycles, MultiplyMxM_KxK_v1SSE, MatrixX11x11 * MatrixY11x11
2692 cycles, MultiplyMxM_KxK_v1FPU, MatrixX11x11 * MatrixY11x11
490 cycles, MultiplyMxM_KxK_v1SSE, MatrixX10x10 * MatrixY10x10
2086 cycles, MultiplyMxM_KxK_v1FPU, MatrixX10x10 * MatrixY10x10
553 cycles, MultiplyMxM_KxK_v1SSE, MatrixX12x12 * MatrixY12x12
3375 cycles, MultiplyMxM_KxK_v1FPU, MatrixX12x12 * MatrixY12x12
921 cycles, MultiplyMxM_KxK_v1SSE, MatrixX13x13 * MatrixY13x13
4261 cycles, MultiplyMxM_KxK_v1FPU, MatrixX13x13 * MatrixY13x13
1208 cycles, MultiplyMxM_KxK_v1SSE, MatrixX16x16 * MatrixY16x16
7441 cycles, MultiplyMxM_KxK_v1FPU, MatrixX16x16 * MatrixY16x16
1270 cycles, MultiplyMxM_KxK_v1SSE, MatrixX14x14 * MatrixY14x14
5142 cycles, MultiplyMxM_KxK_v1FPU, MatrixX14x14 * MatrixY14x14
1341 cycles, MultiplyMxM_KxK_v1SSE, MatrixX15x15 * MatrixY15x15
6201 cycles, MultiplyMxM_KxK_v1FPU, MatrixX15x15 * MatrixY15x15
2080 cycles, MultiplyMxM_KxK_v1SSE, MatrixX17x17 * MatrixY17x17
8869 cycles, MultiplyMxM_KxK_v1FPU, MatrixX17x17 * MatrixY17x17
2348 cycles, MultiplyMxM_KxK_v1SSE, MatrixX20x20 * MatrixY20x20
14253 cycles, MultiplyMxM_KxK_v1FPU, MatrixX20x20 * MatrixY20x20
2531 cycles, MultiplyMxM_KxK_v1SSE, MatrixX18x18 * MatrixY18x18
10382 cycles, MultiplyMxM_KxK_v1FPU, MatrixX18x18 * MatrixY18x18
2542 cycles, MultiplyMxM_KxK_v1SSE, MatrixX19x19 * MatrixY19x19
12381 cycles, MultiplyMxM_KxK_v1FPU, MatrixX19x19 * MatrixY19x19
9494 cycles, MultiplyMxM_KxK_v1SSE, MatrixX32x32 * MatrixY32x32
60893 cycles, MultiplyMxM_KxK_v1FPU, MatrixX32x32 * MatrixY32x32
LiaoMi:
***** Time table - LoopCount =100 000 *****
AMD Ryzen 7 1700 Eight-Core Processor (SSE4)
39 cycles, MultiplyMxM_KxK_v1SSE, MatrixX4x4 * MatrixY4x4
291 cycles, MultiplyMxM_KxK_v1FPU, MatrixX4x4 * MatrixY4x4
164 cycles, MultiplyMxM_KxK_v1SSE, MatrixX8x8 * MatrixY8x8
1460 cycles, MultiplyMxM_KxK_v1FPU, MatrixX8x8 * MatrixY8x8
494 cycles, MultiplyMxM_KxK_v1SSE, MatrixX10x10 * MatrixY10x10
2790 cycles, MultiplyMxM_KxK_v1FPU, MatrixX10x10 * MatrixY10x10
538 cycles, MultiplyMxM_KxK_v1SSE, MatrixX11x11 * MatrixY11x11
3559 cycles, MultiplyMxM_KxK_v1FPU, MatrixX11x11 * MatrixY11x11
569 cycles, MultiplyMxM_KxK_v1SSE, MatrixX12x12 * MatrixY12x12
4586 cycles, MultiplyMxM_KxK_v1FPU, MatrixX12x12 * MatrixY12x12
976 cycles, MultiplyMxM_KxK_v1SSE, MatrixX13x13 * MatrixY13x13
5687 cycles, MultiplyMxM_KxK_v1FPU, MatrixX13x13 * MatrixY13x13
1252 cycles, MultiplyMxM_KxK_v1SSE, MatrixX15x15 * MatrixY15x15
8492 cycles, MultiplyMxM_KxK_v1FPU, MatrixX15x15 * MatrixY15x15
1304 cycles, MultiplyMxM_KxK_v1SSE, MatrixX16x16 * MatrixY16x16
10877 cycles, MultiplyMxM_KxK_v1FPU, MatrixX16x16 * MatrixY16x16
1366 cycles, MultiplyMxM_KxK_v1SSE, MatrixX14x14 * MatrixY14x14
6774 cycles, MultiplyMxM_KxK_v1FPU, MatrixX14x14 * MatrixY14x14
2048 cycles, MultiplyMxM_KxK_v1SSE, MatrixX17x17 * MatrixY17x17
13001 cycles, MultiplyMxM_KxK_v1FPU, MatrixX17x17 * MatrixY17x17
2446 cycles, MultiplyMxM_KxK_v1SSE, MatrixX19x19 * MatrixY19x19
19182 cycles, MultiplyMxM_KxK_v1FPU, MatrixX19x19 * MatrixY19x19
2562 cycles, MultiplyMxM_KxK_v1SSE, MatrixX20x20 * MatrixY20x20
22599 cycles, MultiplyMxM_KxK_v1FPU, MatrixX20x20 * MatrixY20x20
2691 cycles, MultiplyMxM_KxK_v1SSE, MatrixX18x18 * MatrixY18x18
15169 cycles, MultiplyMxM_KxK_v1FPU, MatrixX18x18 * MatrixY18x18
10404 cycles, MultiplyMxM_KxK_v1SSE, MatrixX32x32 * MatrixY32x32
99621 cycles, MultiplyMxM_KxK_v1FPU, MatrixX32x32 * MatrixY32x32
EDIT: Please replace the file StringTable..., because the procedure name in the test results should be ...MxM_KxK, not ...MxM_MxM.
The results,
Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz (SSE4)
AMD Ryzen 7 1700 Eight-Core Processor (SSE4)