Hi all,
Here we have one version that multiplies any MxM matrix by itself using SSE instructions and one that uses the FPU. This is a particular case of MxN * NxK, already posted here, and a particular case of MxM * K*K, also posted here.
If we need to multiply a matrix by itself, this is a better solution than using the general case.
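For checking the output of either procedure, a plain scalar version of the same product can be handy. The C sketch below is only an illustration and is not taken from the posted files: the name square_matrix_ref is hypothetical, and it assumes the matrix is stored row by row (row-major), which the post does not state.

```c
#include <stddef.h>

/* Reference (scalar) square of a row-major m x m matrix: out = a * a.
   Hypothetical helper for comparison only; `out` must not overlap `a`. */
static void square_matrix_ref(const float *a, float *out, size_t m)
{
    for (size_t i = 0; i < m; ++i) {          /* row of the result */
        for (size_t j = 0; j < m; ++j) {      /* column of the result */
            float sum = 0.0f;
            for (size_t k = 0; k < m; ++k)    /* dot product of row i and column j */
                sum += a[i * m + k] * a[k * m + j];
            out[i * m + j] = sum;
        }
    }
}
```

Because both operands are the same matrix, only one input pointer is needed. The result still has to go to a separate buffer, since every output element reads a whole row and a whole column of the input.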
VERSION 1:
PROCEDURE: MultiplyMxM_MxM_v1SSE
FILE: multiplySSEMxM_MxM_v1.inc
MACROS: multiplyMxM_MxM_v2A.mac
multiplyMxM_MxM_v2B.mac
basicmulMxM_MxM_v1.mac
VERSION FPU:
PROCEDURE: MultiplyMxM_MxM_v1FPU
FILE: multiplyFPUMxM_MxM_v1.inc
DOCUMENTATION: TEXT_ABOUT_MULTIPLY_SSE_REAL4.txt
MATRIX DEFINITION: Every matrixX must be defined like this:
ALIGN 16
dd ?
dd ?
dd M ; <<--- number of columns
dd M ; <<--- number of rows
matrixX dd (M*M) dup (?)
If we want to allocate memory dynamically, see the file AllocMemory.inc
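To make the layout concrete, here is a minimal C sketch of the same definition: a 16-byte header (two unused dwords, then columns, then rows) followed by the M*M REAL4 values, with the whole block aligned to 16 bytes so that the data also starts on a 16-byte boundary. This is not the content of AllocMemory.inc. The names MatrixHeader, matrix_alloc, matrix_columns, matrix_rows and matrix_free are hypothetical, and the sketch assumes a C11 library that provides aligned_alloc.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Mirrors the header from the post; only the layout comes from the post. */
typedef struct {
    uint32_t reserved0;   /* dd ?  (purpose not stated in the post) */
    uint32_t reserved1;   /* dd ?  (purpose not stated in the post) */
    uint32_t columns;     /* dd M  number of columns */
    uint32_t rows;        /* dd M  number of rows */
} MatrixHeader;

/* Allocates header + data and returns a pointer to the data,
   i.e. what the post labels matrixX. Returns NULL on failure. */
static float *matrix_alloc(uint32_t m)
{
    size_t bytes = sizeof(MatrixHeader) + (size_t)m * m * sizeof(float);
    bytes = (bytes + 15u) & ~(size_t)15u;   /* round up to a multiple of 16 */
    MatrixHeader *h = aligned_alloc(16, bytes);
    if (h == NULL)
        return NULL;
    h->reserved0 = 0;
    h->reserved1 = 0;
    h->columns = m;
    h->rows = m;
    return (float *)(h + 1);                /* data begins right after the header */
}

/* The dimensions sit just below the data pointer: columns at -8, rows at -4. */
static uint32_t matrix_columns(const float *x)
{
    return ((const MatrixHeader *)x - 1)->columns;
}

static uint32_t matrix_rows(const float *x)
{
    return ((const MatrixHeader *)x - 1)->rows;
}

static void matrix_free(float *x)
{
    if (x != NULL)
        free((MatrixHeader *)x - 1);        /* free from the start of the header */
}
```

Since the header is exactly 16 bytes and the block is 16-byte aligned, the first data element is 16-byte aligned too, which is what aligned SSE loads and stores require.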
VERIFY SSE PROCEDURE: Use multiplyMxM_MxM_v1.exe/asm
Please test it on your CPU (i5/i7/AMD).
Use ExecuteTestmultiplyMxM_MxM_SSEv1.bat and
post the file ResultsmultiplyMxM_MxM_v1.txt.
Good luck
RuiLoureiro
Some results:
***** Time table - LoopCount =100 000 *****
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
52 cycles, MultiplyMxM_MxM_v1SSE, MatrixX4x4 * MatrixX4x4
381 cycles, MultiplyMxM_MxM_v1FPU, MatrixX4x4 * MatrixX4x4
258 cycles, MultiplyMxM_MxM_v1SSE, MatrixX8x8 * MatrixX8x8
1902 cycles, MultiplyMxM_MxM_v1FPU, MatrixX8x8 * MatrixX8x8
763 cycles, MultiplyMxM_MxM_v1SSE, MatrixX10x10 * MatrixX10x10
4157 cycles, MultiplyMxM_MxM_v1FPU, MatrixX10x10 * MatrixX10x10
768 cycles, MultiplyMxM_MxM_v1SSE, MatrixX12x12 * MatrixX12x12
5797 cycles, MultiplyMxM_MxM_v1FPU, MatrixX12x12 * MatrixX12x12
991 cycles, MultiplyMxM_MxM_v1SSE, MatrixX11x11 * MatrixX11x11
4606 cycles, MultiplyMxM_MxM_v1FPU, MatrixX11x11 * MatrixX11x11
1371 cycles, MultiplyMxM_MxM_v1SSE, MatrixX13x13 * MatrixX13x13
7289 cycles, MultiplyMxM_MxM_v1FPU, MatrixX13x13 * MatrixX13x13
1895 cycles, MultiplyMxM_MxM_v1SSE, MatrixX16x16 * MatrixX16x16
14628 cycles, MultiplyMxM_MxM_v1FPU, MatrixX16x16 * MatrixX16x16
2217 cycles, MultiplyMxM_MxM_v1SSE, MatrixX15x15 * MatrixX15x15
11565 cycles, MultiplyMxM_MxM_v1FPU, MatrixX15x15 * MatrixX15x15
2412 cycles, MultiplyMxM_MxM_v1SSE, MatrixX14x14 * MatrixX14x14
9141 cycles, MultiplyMxM_MxM_v1FPU, MatrixX14x14 * MatrixX14x14
3080 cycles, MultiplyMxM_MxM_v1SSE, MatrixX17x17 * MatrixX17x17
16757 cycles, MultiplyMxM_MxM_v1FPU, MatrixX17x17 * MatrixX17x17
3650 cycles, MultiplyMxM_MxM_v1SSE, MatrixX20x20 * MatrixX20x20
26873 cycles, MultiplyMxM_MxM_v1FPU, MatrixX20x20 * MatrixX20x20
3823 cycles, MultiplyMxM_MxM_v1SSE, MatrixX18x18 * MatrixX18x18
20278 cycles, MultiplyMxM_MxM_v1FPU, MatrixX18x18 * MatrixX18x18
4174 cycles, MultiplyMxM_MxM_v1SSE, MatrixX19x19 * MatrixX19x19
23342 cycles, MultiplyMxM_MxM_v1FPU, MatrixX19x19 * MatrixX19x19
14668 cycles, MultiplyMxM_MxM_v1SSE, MatrixX32x32 * MatrixX32x32
108478 cycles, MultiplyMxM_MxM_v1FPU, MatrixX32x32 * MatrixX32x32
LiaoMi:
***** Time table - LoopCount =100 000 *****
Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz (SSE4)
36 cycles, MultiplyMxM_MxM_v1SSE, MatrixX4x4 * MatrixX4x4
224 cycles, MultiplyMxM_MxM_v1FPU, MatrixX4x4 * MatrixX4x4
178 cycles, MultiplyMxM_MxM_v1SSE, MatrixX8x8 * MatrixX8x8
1038 cycles, MultiplyMxM_MxM_v1FPU, MatrixX8x8 * MatrixX8x8
481 cycles, MultiplyMxM_MxM_v1SSE, MatrixX10x10 * MatrixX10x10
2024 cycles, MultiplyMxM_MxM_v1FPU, MatrixX10x10 * MatrixX10x10
481 cycles, MultiplyMxM_MxM_v1SSE, MatrixX11x11 * MatrixX11x11
2623 cycles, MultiplyMxM_MxM_v1FPU, MatrixX11x11 * MatrixX11x11
524 cycles, MultiplyMxM_MxM_v1SSE, MatrixX12x12 * MatrixX12x12
3801 cycles, MultiplyMxM_MxM_v1FPU, MatrixX12x12 * MatrixX12x12
1024 cycles, MultiplyMxM_MxM_v1SSE, MatrixX13x13 * MatrixX13x13
4576 cycles, MultiplyMxM_MxM_v1FPU, MatrixX13x13 * MatrixX13x13
1120 cycles, MultiplyMxM_MxM_v1SSE, MatrixX15x15 * MatrixX15x15
6321 cycles, MultiplyMxM_MxM_v1FPU, MatrixX15x15 * MatrixX15x15
1178 cycles, MultiplyMxM_MxM_v1SSE, MatrixX14x14 * MatrixX14x14
5293 cycles, MultiplyMxM_MxM_v1FPU, MatrixX14x14 * MatrixX14x14
1350 cycles, MultiplyMxM_MxM_v1SSE, MatrixX16x16 * MatrixX16x16
7535 cycles, MultiplyMxM_MxM_v1FPU, MatrixX16x16 * MatrixX16x16
2062 cycles, MultiplyMxM_MxM_v1SSE, MatrixX17x17 * MatrixX17x17
8843 cycles, MultiplyMxM_MxM_v1FPU, MatrixX17x17 * MatrixX17x17
2214 cycles, MultiplyMxM_MxM_v1SSE, MatrixX19x19 * MatrixX19x19
11990 cycles, MultiplyMxM_MxM_v1FPU, MatrixX19x19 * MatrixX19x19
2385 cycles, MultiplyMxM_MxM_v1SSE, MatrixX20x20 * MatrixX20x20
14333 cycles, MultiplyMxM_MxM_v1FPU, MatrixX20x20 * MatrixX20x20
2855 cycles, MultiplyMxM_MxM_v1SSE, MatrixX18x18 * MatrixX18x18
10295 cycles, MultiplyMxM_MxM_v1FPU, MatrixX18x18 * MatrixX18x18
9592 cycles, MultiplyMxM_MxM_v1SSE, MatrixX32x32 * MatrixX32x32 ; 15%
62165 cycles, MultiplyMxM_MxM_v1FPU, MatrixX32x32 * MatrixX32x32
LiaoMi:
***** Time table - LoopCount =100 000 *****
AMD Ryzen 7 1700 Eight-Core Processor (SSE4)
37 cycles, MultiplyMxM_MxM_v1SSE, MatrixX4x4 * MatrixX4x4
547 cycles, MultiplyMxM_MxM_v1FPU, MatrixX4x4 * MatrixX4x4
171 cycles, MultiplyMxM_MxM_v1SSE, MatrixX8x8 * MatrixX8x8
1317 cycles, MultiplyMxM_MxM_v1FPU, MatrixX8x8 * MatrixX8x8
509 cycles, MultiplyMxM_MxM_v1SSE, MatrixX10x10 * MatrixX10x10
2489 cycles, MultiplyMxM_MxM_v1FPU, MatrixX10x10 * MatrixX10x10
542 cycles, MultiplyMxM_MxM_v1SSE, MatrixX12x12 * MatrixX12x12
4233 cycles, MultiplyMxM_MxM_v1FPU, MatrixX12x12 * MatrixX12x12
544 cycles, MultiplyMxM_MxM_v1SSE, MatrixX11x11 * MatrixX11x11
3222 cycles, MultiplyMxM_MxM_v1FPU, MatrixX11x11 * MatrixX11x11
863 cycles, MultiplyMxM_MxM_v1SSE, MatrixX13x13 * MatrixX13x13
5213 cycles, MultiplyMxM_MxM_v1FPU, MatrixX13x13 * MatrixX13x13
1102 cycles, MultiplyMxM_MxM_v1SSE, MatrixX15x15 * MatrixX15x15
8103 cycles, MultiplyMxM_MxM_v1FPU, MatrixX15x15 * MatrixX15x15
1224 cycles, MultiplyMxM_MxM_v1SSE, MatrixX14x14 * MatrixX14x14
6523 cycles, MultiplyMxM_MxM_v1FPU, MatrixX14x14 * MatrixX14x14
1251 cycles, MultiplyMxM_MxM_v1SSE, MatrixX16x16 * MatrixX16x16
9801 cycles, MultiplyMxM_MxM_v1FPU, MatrixX16x16 * MatrixX16x16
1908 cycles, MultiplyMxM_MxM_v1SSE, MatrixX17x17 * MatrixX17x17
11814 cycles, MultiplyMxM_MxM_v1FPU, MatrixX17x17 * MatrixX17x17
2188 cycles, MultiplyMxM_MxM_v1SSE, MatrixX19x19 * MatrixX19x19
17601 cycles, MultiplyMxM_MxM_v1FPU, MatrixX19x19 * MatrixX19x19
2382 cycles, MultiplyMxM_MxM_v1SSE, MatrixX20x20 * MatrixX20x20
21288 cycles, MultiplyMxM_MxM_v1FPU, MatrixX20x20 * MatrixX20x20
2462 cycles, MultiplyMxM_MxM_v1SSE, MatrixX18x18 * MatrixX18x18
14464 cycles, MultiplyMxM_MxM_v1FPU, MatrixX18x18 * MatrixX18x18
9465 cycles, MultiplyMxM_MxM_v1SSE, MatrixX32x32 * MatrixX32x32 ; 10%
90823 cycles, MultiplyMxM_MxM_v1FPU, MatrixX32x32 * MatrixX32x32
Hi RuiLoureiro,
The results:
AMD Ryzen 7 1700 Eight-Core Processor
Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz (SSE4)
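The "; 15%" and "; 10%" notes on the 32x32 rows look consistent with the SSE time expressed as a percentage of the FPU time. That reading is an interpretation, not something the posts state. A small C sketch of the arithmetic, with hypothetical function names:

```c
/* SSE cycle count as a percentage of the FPU cycle count. */
static double sse_percent_of_fpu(unsigned long sse_cycles, unsigned long fpu_cycles)
{
    return 100.0 * (double)sse_cycles / (double)fpu_cycles;
}

/* How many times faster the SSE version ran than the FPU version. */
static double sse_speedup(unsigned long sse_cycles, unsigned long fpu_cycles)
{
    return (double)fpu_cycles / (double)sse_cycles;
}
```

With the 32x32 figures, sse_percent_of_fpu(9592, 62165) is about 15.4 for the i7-4810MQ and sse_percent_of_fpu(9465, 90823) is about 10.4 for the Ryzen 7 1700. For the 4x4 case on the i7-4930K, sse_speedup(52, 381) is about 7.3.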