Multiply matrix MxM by KxK real4 any size

Started by RuiLoureiro, August 17, 2018, 03:44:39 AM

RuiLoureiro

Hi all,
       Here is one version that multiplies any square matrix MxM by
       any square matrix KxK (M=K) using SSE instructions, and one that uses the FPU.

       With these procedures we can handle every case we want:

            invoke      MultiplyMxN_NxK_v6SSE, pMatX, pMatY, pMatXY     ; general case
            invoke      MultiplyMxM_MxM_v1SSE, pMatX, pMatXY              ; particular case Y=X
            invoke      MultiplyMxM_KxK_v1SSE, pMatX, pMatY, pMatXY     ; particular case Y<>X

Note: each matrix stores its dimensions just before its address, which seems to be the best
         way to test the procedures. If you prefer, modify the procedures to take the dimensions
         of each matrix as parameters. Note also that no procedure calls any other procedure
         internally, so the code is large or very large, especially MultiplyMxN_NxK_v6SSE.
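
As a point of reference for what these procedures compute, here is a minimal C sketch (not the author's MASM code) of the square real4 product XY = X * Y: first a plain scalar version, then one using SSE intrinsics that produces four result columns per step and finishes with a scalar tail when M is not a multiple of 4. The names `ref_multiply` and `sse_multiply`, the row-major storage, and passing the dimension `m` as a parameter are all assumptions made for illustration.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics (x86 only) */

/* Scalar reference: xy = x * y for m x m float matrices.
   Row-major storage is an assumption, not taken from the post. */
static void ref_multiply(const float *x, const float *y, float *xy, int m)
{
    assert(m > 0);
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < m; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < m; ++k)
                sum += x[i * m + k] * y[k * m + j];
            xy[i * m + j] = sum;
        }
    }
}

/* SSE sketch: four result columns at a time, scalar tail for the rest.
   Unaligned loads/stores are used so any m works. */
static void sse_multiply(const float *x, const float *y, float *xy, int m)
{
    assert(m > 0);
    for (int i = 0; i < m; ++i) {
        int j = 0;
        for (; j + 4 <= m; j += 4) {
            __m128 acc = _mm_setzero_ps();
            for (int k = 0; k < m; ++k) {
                __m128 xik = _mm_set1_ps(x[i * m + k]);    /* broadcast X[i][k]  */
                __m128 yk  = _mm_loadu_ps(&y[k * m + j]);  /* load Y[k][j..j+3]  */
                acc = _mm_add_ps(acc, _mm_mul_ps(xik, yk));
            }
            _mm_storeu_ps(&xy[i * m + j], acc);
        }
        for (; j < m; ++j) {  /* columns left over when m % 4 != 0 */
            float sum = 0.0f;
            for (int k = 0; k < m; ++k)
                sum += x[i * m + k] * y[k * m + j];
            xy[i * m + j] = sum;
        }
    }
}
```

In this sketch, m = 20 splits every row into exactly five 4-float steps and the tail loop never runs, while m = 18 or 19 leaves two or three columns for the tail.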

Quote
      VERSION 1:
                PROCEDURE:  MultiplyMxM_KxK_v1SSE

                FILE:       multiplySSEMxM_KxK_v1.inc

                MACROS:     multiplyMxN_KxK_v2A.mac
                            multiplyMxN_KxK_v2B.mac
                            basicmulMxN_KxK_v1.mac

      VERSION FPU:
                PROCEDURE:  MultiplyMxM_KxK_v1FPU

                FILE:       multiplyFPUMxM_KxK_v1.inc


    DOCUMENTATION:          TEXT_ABOUT_MULTIPLY_SSE_REAL4.txt

    MATRIX DEFINITION:      Every matrixX must be defined like this:

                            ALIGN 16
                            dd ?
                            dd ?
                            dd M   ; <<--- number of columns
                            dd M   ; <<--- number of rows
                   matrixX  dd (M*M) dup (?)

                            To allocate memory, see the file AllocMemory.inc
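
The same layout can be mimicked in C to see where the dimensions sit relative to the matrix address. This is only a sketch: `matrix_alloc`, `matrix_cols`, `matrix_rows` and `matrix_free` are hypothetical helpers, not the contents of AllocMemory.inc, and C11 `aligned_alloc` stands in for whatever allocator the real code uses.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { HEADER_BYTES = 16 };  /* 4 dwords: 2 unused, columns, rows */

/* Returns a 16-byte aligned pointer to the first element, or NULL.
   The header sits in the 16 bytes just before the returned address. */
static float *matrix_alloc(uint32_t rows, uint32_t cols)
{
    assert(rows > 0 && cols > 0);
    size_t bytes = HEADER_BYTES + (size_t)rows * cols * sizeof(float);
    bytes = (bytes + 15u) & ~(size_t)15u;     /* size must be a multiple of 16 */
    unsigned char *block = aligned_alloc(16, bytes);
    if (block == NULL)
        return NULL;
    memset(block, 0, bytes);
    memcpy(block + 8,  &cols, sizeof cols);   /* dd M ; number of columns */
    memcpy(block + 12, &rows, sizeof rows);   /* dd M ; number of rows    */
    return (float *)(block + HEADER_BYTES);
}

/* Number of columns, stored 8 bytes before the matrix address. */
static uint32_t matrix_cols(const float *p)
{
    uint32_t v;
    memcpy(&v, (const unsigned char *)p - 8, sizeof v);
    return v;
}

/* Number of rows, stored 4 bytes before the matrix address. */
static uint32_t matrix_rows(const float *p)
{
    uint32_t v;
    memcpy(&v, (const unsigned char *)p - 4, sizeof v);
    return v;
}

/* Releases a matrix obtained from matrix_alloc. */
static void matrix_free(float *p)
{
    if (p != NULL)
        free((unsigned char *)p - HEADER_BYTES);
}
```

Because the header is exactly 16 bytes and the block itself is 16-byte aligned, the first matrix element also lands on a 16-byte boundary, which is what `ALIGN 16` followed by four dwords gives in the definition above.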

    VERIFY SSE PROCEDURES:  Use multiplyMxM_KxK_v1.exe/asm

    Please test it on your CPU (i5/i7/AMD).
    Run ExecuteTestmultiplyMxM_KxK_SSEv1.bat and post the file ResultsmultiplyMxM_KxK_v1.txt.
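
When comparing the output of the SSE procedure with the FPU one, bit-identical results should not be taken for granted: the x87 FPU can carry intermediate sums at higher precision than real4, and the order of the additions may differ. A minimal C sketch of a tolerance check follows; `max_rel_diff` and the threshold mentioned afterwards are assumptions, not part of the posted test program.

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value without pulling in the math library. */
static float absf(float v)
{
    return v < 0.0f ? -v : v;
}

/* Largest difference between two result arrays of n floats.
   Each difference is divided by max(|a|, |b|, 1), so it is relative
   for large values and absolute for values near zero. */
static float max_rel_diff(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);
    float worst = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float scale = 1.0f;
        if (absf(a[i]) > scale) scale = absf(a[i]);
        if (absf(b[i]) > scale) scale = absf(b[i]);
        float diff = absf(a[i] - b[i]) / scale;
        if (diff > worst)
            worst = diff;
    }
    return worst;
}
```

A check such as `max_rel_diff(sse_out, fpu_out, m * m) < 1e-5f` would then flag real errors while tolerating rounding noise; the threshold is a guess and may need loosening for large M.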

Good luck
RuiLoureiro

Some results:
Quote
***** Time table - LoopCount = 100 000 *****

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

    50  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX4x4   * MatrixY4x4
   319  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX4x4   * MatrixY4x4

   264  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX8x8   * MatrixY8x8
  2031  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX8x8   * MatrixY8x8

   745  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX10x10 * MatrixY10x10
  3512  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX10x10 * MatrixY10x10

   769  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX12x12 * MatrixY12x12
  5827  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX12x12 * MatrixY12x12

   891  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX11x11 * MatrixY11x11
  4573  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX11x11 * MatrixY11x11

  1402  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX13x13 * MatrixY13x13
  7312  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX13x13 * MatrixY13x13

  1792  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX14x14 * MatrixY14x14
  9026  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX14x14 * MatrixY14x14

  1806  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX16x16 * MatrixY16x16
 13263  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX16x16 * MatrixY16x16

  1980  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX15x15 * MatrixY15x15
 11000  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX15x15 * MatrixY15x15

  3343  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX17x17 * MatrixY17x17
 15792  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX17x17 * MatrixY17x17

  3410  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX20x20 * MatrixY20x20  ; <<< align 16 effect
 25353  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX20x20 * MatrixY20x20

  3690  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX18x18 * MatrixY18x18
 18647  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX18x18 * MatrixY18x18

  4119  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX19x19 * MatrixY19x19
 21822  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX19x19 * MatrixY19x19

 14019  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX32x32 * MatrixY32x32  ; 13.8% ...
101255  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX32x32 * MatrixY32x32  ; 7.2 * 14019
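
The two comments on the 32x32 rows are simple ratios of the measured cycle counts, and the same arithmetic applies to any pair of rows. The C sketch below reproduces them and also divides a cycle count by M*M*M, the number of multiply-add steps in a square product; the helper names are made up for illustration.

```c
#include <assert.h>

/* How many times slower the FPU run is than the SSE run. */
static double fpu_over_sse(double fpu_cycles, double sse_cycles)
{
    assert(sse_cycles > 0.0);
    return fpu_cycles / sse_cycles;
}

/* SSE time expressed as a percentage of the FPU time. */
static double sse_percent_of_fpu(double sse_cycles, double fpu_cycles)
{
    assert(fpu_cycles > 0.0);
    return 100.0 * sse_cycles / fpu_cycles;
}

/* Cycles per multiply-add for an m x m by m x m product (m*m*m steps). */
static double cycles_per_mul_add(double cycles, int m)
{
    assert(m > 0);
    return cycles / ((double)m * m * m);
}
```

For the i7-4930K 32x32 rows, `fpu_over_sse(101255, 14019)` is about 7.2 and `sse_percent_of_fpu(14019, 101255)` is about 13.8, matching the comments; `cycles_per_mul_add(14019, 32)` comes to roughly 0.43.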

LiaoMi:
***** Time table - LoopCount = 100 000 *****

Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz (SSE4)

    39  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX4x4   * MatrixY4x4
   229  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX4x4   * MatrixY4x4

   177  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX8x8   * MatrixY8x8
  1066  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX8x8   * MatrixY8x8

   476  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX11x11 * MatrixY11x11
  2692  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX11x11 * MatrixY11x11

   490  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX10x10 * MatrixY10x10
  2086  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX10x10 * MatrixY10x10

   553  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX12x12 * MatrixY12x12
  3375  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX12x12 * MatrixY12x12

   921  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX13x13 * MatrixY13x13
  4261  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX13x13 * MatrixY13x13

  1208  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX16x16 * MatrixY16x16
  7441  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX16x16 * MatrixY16x16

  1270  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX14x14 * MatrixY14x14
  5142  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX14x14 * MatrixY14x14

  1341  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX15x15 * MatrixY15x15
  6201  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX15x15 * MatrixY15x15

  2080  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX17x17 * MatrixY17x17
  8869  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX17x17 * MatrixY17x17

  2348  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX20x20 * MatrixY20x20
 14253  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX20x20 * MatrixY20x20

  2531  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX18x18 * MatrixY18x18
 10382  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX18x18 * MatrixY18x18

  2542  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX19x19 * MatrixY19x19
 12381  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX19x19 * MatrixY19x19

  9494  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX32x32 * MatrixY32x32
 60893  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX32x32 * MatrixY32x32

LiaoMi:
***** Time table - LoopCount = 100 000 *****

AMD Ryzen 7 1700 Eight-Core Processor (SSE4)

    39  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX4x4   * MatrixY4x4
   291  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX4x4   * MatrixY4x4

   164  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX8x8   * MatrixY8x8
  1460  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX8x8   * MatrixY8x8

   494  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX10x10 * MatrixY10x10
  2790  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX10x10 * MatrixY10x10

   538  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX11x11 * MatrixY11x11
  3559  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX11x11 * MatrixY11x11

   569  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX12x12 * MatrixY12x12
  4586  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX12x12 * MatrixY12x12

   976  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX13x13 * MatrixY13x13
  5687  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX13x13 * MatrixY13x13

  1252  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX15x15 * MatrixY15x15
  8492  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX15x15 * MatrixY15x15

  1304  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX16x16 * MatrixY16x16
 10877  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX16x16 * MatrixY16x16

  1366  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX14x14 * MatrixY14x14
  6774  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX14x14 * MatrixY14x14

  2048  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX17x17 * MatrixY17x17
 13001  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX17x17 * MatrixY17x17

  2446  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX19x19 * MatrixY19x19
 19182  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX19x19 * MatrixY19x19

  2562  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX20x20 * MatrixY20x20
 22599  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX20x20 * MatrixY20x20

  2691  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX18x18 * MatrixY18x18
 15169  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX18x18 * MatrixY18x18

 10404  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX32x32 * MatrixY32x32
 99621  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX32x32 * MatrixY32x32

EDIT: please replace the file StringTable... because the procedure name printed in the test results should be ...MxM_KxK, not ...MxM_MxM.
