Multiply matrix MxM by KxK real4 any size
August 17, 2018, 03:44:39 AM
Hi all,
Here is one version that multiplies any square matrix MxM by
any square matrix KxK (M=K) using SSE instructions, and one that uses the FPU.

With this work we can now handle every case we want:

invoke      MultiplyMxN_NxK_v6SSE, pMatX, pMatY, pMatXY     ; general case
invoke      MultiplyMxM_MxM_v1SSE, pMatX, pMatXY              ; particular case Y=X
invoke      MultiplyMxM_KxK_v1SSE, pMatX, pMatY, pMatXY     ; particular case Y<>X

Note: each matrix stores its dimensions just before its address, which seems to be the best
way to test the procedures. If you prefer, modify the procedure and pass the dimensions
of each matrix as arguments. Note also that no procedure calls any other procedure
internally, so the code is large or very large, especially MultiplyMxN_NxK_v6SSE.

VERSION 1:
PROCEDURE:  MultiplyMxM_KxK_v1SSE

FILE:       multiplySSEMxM_KxK_v1.inc

MACROS:     multiplyMxN_KxK_v2A.mac
            multiplyMxN_KxK_v2B.mac
            basicmulMxN_KxK_v1.mac

VERSION FPU:
PROCEDURE:  MultiplyMxM_KxK_v1FPU

FILE:       multiplyFPUMxM_KxK_v1.inc
MATRIX DEFINITION:      Every matrixX must be defined like this:

ALIGN 16
dd ?
dd ?
dd M   ; <<--- number of columns
dd M   ; <<--- number of rows (lines)
matrixX  dd (M*M) dup (?)

To allocate the memory dynamically instead, see the file AllocMemory.inc.
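
For readers who think in C, the definition above can be mirrored roughly as follows. This is only a sketch and not part of the posted package: the type MatrixMxM and the helpers mat_header, mat_rows and mat_cols are hypothetical names, and M = 4 is just an example size. It shows why the data label stays 16-byte aligned (the header is exactly 16 bytes) and how a procedure that receives only the element pointer can find the dimensions at the two dwords just before it.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define M 4   /* example size only; the layout is the same for any M */

/* C mirror of the MASM definition: 16 header bytes, then M*M real4 values.
   All names here are illustrative assumptions, not taken from the .inc files. */
typedef struct {
    alignas(16) uint32_t unused[2];  /* the two "dd ?" slots                */
    uint32_t cols;                   /* dd M, number of columns             */
    uint32_t rows;                   /* dd M, number of rows (lines)        */
    float    data[M * M];            /* the matrixX label points here       */
} MatrixMxM;

MatrixMxM matrixX = { {0, 0}, M, M, {0.0f} };

/* Step back from the element pointer to the start of the 16-byte header. */
const MatrixMxM *mat_header(const float *p)
{
    return (const MatrixMxM *)((const char *)p - offsetof(MatrixMxM, data));
}

/* Number of rows: the dword 4 bytes before the first element. */
uint32_t mat_rows(const float *p) { return mat_header(p)->rows; }

/* Number of columns: the dword 8 bytes before the first element. */
uint32_t mat_cols(const float *p) { return mat_header(p)->cols; }
```

Calling mat_rows(matrixX.data) and mat_cols(matrixX.data) corresponds to what a procedure would read from the dwords before pMatX.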

VERIFY SSE PROCEDURES:  Use multiplyMxM_KxK_v1.exe/asm

Run ExecuteTestmultiplyMxM_KxK_SSEv1.bat and post the resulting file ResultsmultiplyMxM_KxK_v1.txt.
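
If you want to cross-check the SSE output independently of the test program, a plain scalar product can serve as a reference. The sketch below is C and is not part of the posted files. It assumes the elements are stored row by row (row-major), which the post does not state, and the names ref_multiply and nearly_equal are made up for illustration. A small tolerance is used because SSE and FPU code may round intermediate sums differently.

```c
#include <stddef.h>

/* Reference product of two n x n real4 matrices, assumed row-major: xy = x * y. */
void ref_multiply(const float *x, const float *y, float *xy, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; ++k)
                sum += x[i * n + k] * y[k * n + j];
            xy[i * n + j] = sum;
        }
    }
}

/* Absolute value without pulling in the math library. */
static float abs_f(float v) { return v < 0.0f ? -v : v; }

/* Returns 1 when all 'count' elements of a and b agree within a relative
   tolerance 'tol' (scaled by the larger magnitude, at least 1), else 0. */
int nearly_equal(const float *a, const float *b, size_t count, float tol)
{
    for (size_t i = 0; i < count; ++i) {
        float scale = 1.0f;
        if (abs_f(a[i]) > scale) scale = abs_f(a[i]);
        if (abs_f(b[i]) > scale) scale = abs_f(b[i]);
        if (abs_f(a[i] - b[i]) > tol * scale)
            return 0;
    }
    return 1;
}
```

The idea is to fill two matrices, run both ref_multiply and the procedure under test, and compare the two results with nearly_equal.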

Good luck
RuiLoureiro

Some results:
***** Time table - LoopCount =100 000 *****

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

50  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX4x4   * MatrixY4x4
319  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX4x4   * MatrixY4x4

264  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX8x8   * MatrixY8x8
2031  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX8x8   * MatrixY8x8

745  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX10x10 * MatrixY10x10
3512  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX10x10 * MatrixY10x10

769  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX12x12 * MatrixY12x12
5827  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX12x12 * MatrixY12x12

891  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX11x11 * MatrixY11x11
4573  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX11x11 * MatrixY11x11

1402  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX13x13 * MatrixY13x13
7312  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX13x13 * MatrixY13x13

1792  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX14x14 * MatrixY14x14
9026  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX14x14 * MatrixY14x14

1806  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX16x16 * MatrixY16x16
13263  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX16x16 * MatrixY16x16

1980  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX15x15 * MatrixY15x15
11000  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX15x15 * MatrixY15x15

3343  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX17x17 * MatrixY17x17
15792  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX17x17 * MatrixY17x17

3410  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX20x20 * MatrixY20x20  ; <<< align 16 effect
25353  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX20x20 * MatrixY20x20

3690  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX18x18 * MatrixY18x18
18647  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX18x18 * MatrixY18x18

4119  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX19x19 * MatrixY19x19
21822  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX19x19 * MatrixY19x19

14019  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX32x32 * MatrixY32x32  ; SSE takes 13.8% of the FPU time
101255  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX32x32 * MatrixY32x32  ; FPU = 7.2 * 14019
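
One possible reading of the "align 16 effect" marked on the 20x20 line, offered as an assumption rather than something the post confirms: a row of M real4 values occupies 4*M bytes, so if the matrix starts on a 16-byte boundary and is stored row by row, every row also starts on a 16-byte boundary only when M is a multiple of 4. That would fit the table, where 12x12, 16x16 and 20x20 finish ahead of the smaller 11x11, 15x15 and 18x18/19x19. The C sketch below just checks that arithmetic; the function names are hypothetical.

```c
#include <stddef.h>

/* Byte offset of row r from the start of an m x m real4 matrix,
   assuming row-major storage (an assumption, not stated in the post). */
size_t row_offset_bytes(size_t m, size_t r)
{
    return r * m * sizeof(float);
}

/* Returns 1 if every row of an m x m real4 matrix starts on a 16-byte
   boundary, given that the matrix itself starts on one; otherwise 0. */
int rows_stay_aligned(size_t m)
{
    for (size_t r = 0; r < m; ++r) {
        if (row_offset_bytes(m, r) % 16 != 0)
            return 0;
    }
    return 1;
}
```

For example, rows_stay_aligned(20) holds because each row is 80 bytes, while for M = 19 the second row already starts at byte 76.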

LiaoMi:
***** Time table - LoopCount =100 000 *****

Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz (SSE4)

39  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX4x4   * MatrixY4x4
229  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX4x4   * MatrixY4x4

177  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX8x8   * MatrixY8x8
1066  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX8x8   * MatrixY8x8

476  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX11x11 * MatrixY11x11
2692  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX11x11 * MatrixY11x11

490  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX10x10 * MatrixY10x10
2086  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX10x10 * MatrixY10x10

553  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX12x12 * MatrixY12x12
3375  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX12x12 * MatrixY12x12

921  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX13x13 * MatrixY13x13
4261  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX13x13 * MatrixY13x13

1208  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX16x16 * MatrixY16x16
7441  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX16x16 * MatrixY16x16

1270  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX14x14 * MatrixY14x14
5142  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX14x14 * MatrixY14x14

1341  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX15x15 * MatrixY15x15
6201  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX15x15 * MatrixY15x15

2080  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX17x17 * MatrixY17x17
8869  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX17x17 * MatrixY17x17

2348  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX20x20 * MatrixY20x20
14253  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX20x20 * MatrixY20x20

2531  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX18x18 * MatrixY18x18
10382  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX18x18 * MatrixY18x18

2542  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX19x19 * MatrixY19x19
12381  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX19x19 * MatrixY19x19

9494  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX32x32 * MatrixY32x32
60893  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX32x32 * MatrixY32x32

LiaoMi:
***** Time table - LoopCount =100 000 *****

AMD Ryzen 7 1700 Eight-Core Processor (SSE4)

39  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX4x4   * MatrixY4x4
291  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX4x4   * MatrixY4x4

164  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX8x8   * MatrixY8x8
1460  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX8x8   * MatrixY8x8

494  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX10x10 * MatrixY10x10
2790  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX10x10 * MatrixY10x10

538  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX11x11 * MatrixY11x11
3559  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX11x11 * MatrixY11x11

569  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX12x12 * MatrixY12x12
4586  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX12x12 * MatrixY12x12

976  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX13x13 * MatrixY13x13
5687  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX13x13 * MatrixY13x13

1252  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX15x15 * MatrixY15x15
8492  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX15x15 * MatrixY15x15

1304  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX16x16 * MatrixY16x16
10877  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX16x16 * MatrixY16x16

1366  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX14x14 * MatrixY14x14
6774  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX14x14 * MatrixY14x14

2048  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX17x17 * MatrixY17x17
13001  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX17x17 * MatrixY17x17

2446  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX19x19 * MatrixY19x19
19182  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX19x19 * MatrixY19x19

2562  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX20x20 * MatrixY20x20
22599  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX20x20 * MatrixY20x20

2691  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX18x18 * MatrixY18x18
15169  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX18x18 * MatrixY18x18

10404  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX32x32 * MatrixY32x32
99621  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX32x32 * MatrixY32x32
EDIT: please replace the file StringTable... because the procedure name in the test results should be ...MxM_KxK, not ...MxM_MxM.
« Last Edit: August 18, 2018, 12:18:30 AM by RuiLoureiro »

Re: Multiply matrix MxM by KxK real4 any size
Reply #1 on: August 17, 2018, 05:09:34 AM
The results,
Re: Multiply matrix MxM by KxK real4 any size
Reply #2 on: August 17, 2018, 09:01:19 PM
Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz (SSE4)

Re: Multiply matrix MxM by KxK real4 any size
Reply #3 on: August 17, 2018, 09:26:16 PM
AMD Ryzen 7 1700 Eight-Core Processor (SSE4)