Author Topic: Multiply matrix MxM by KxK real4 any size

RuiLoureiro

Multiply matrix MxM by KxK real4 any size
« on: August 17, 2018, 03:44:39 AM »
Hi all,
       Here are two procedures that multiply any square matrix MxM by
       any square matrix KxK (M=K): one uses SSE instructions and one uses the FPU.

       With this work we can now handle every case:

            invoke      MultiplyMxN_NxK_v6SSE, pMatX, pMatY, pMatXY     ; general case
            invoke      MultiplyMxM_MxM_v1SSE, pMatX, pMatXY              ; particular case Y=X
            invoke      MultiplyMxM_KxK_v1SSE, pMatX, pMatY, pMatXY     ; particular case Y<>X

Note: each matrix stores its dimensions just before its address, which seems to be the best
         way to test the procedures. If you prefer, modify the procedure to take the dimensions
         of each matrix as arguments. Note also that no procedure calls any other procedure
         internally, so the code is large or very large, especially MultiplyMxN_NxK_v6SSE.

Quote
      VERSION 1:
                PROCEDURE:  MultiplyMxM_KxK_v1SSE
               
                FILE:           multiplySSEMxM_KxK_v1.inc
               
                MACROS:     multiplyMxN_KxK_v2A.mac
                                  multiplyMxN_KxK_v2B.mac
                                  basicmulMxN_KxK_v1.mac

      VERSION FPU:
                PROCEDURE:  MultiplyMxM_KxK_v1FPU
               
                FILE:              multiplyFPUMxM_KxK_v1.inc


    DOCUMENTATION:          TEXT_ABOUT_MULTIPLY_SSE_REAL4.txt

    MATRIX DEFINITION:      Every matrix must be defined like this:

                            ALIGN 16
                            dd ?
                            dd ?
                            dd M   ; <<--- number of columns
                            dd M   ; <<--- number of lines (rows)
                   matrixX  dd (M*M) dup (?)

                            To allocate memory dynamically, see the file AllocMemory.inc
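
For readers who want to build such a matrix from another language, here is a minimal C sketch of the same layout. It is not the code in AllocMemory.inc; the names matrix_alloc, matrix_columns, matrix_lines and matrix_free are hypothetical, and it assumes a C11 runtime that provides aligned_alloc. The 16-byte header (two unused dwords, then columns, then lines) sits immediately before the data, so the data pointer stays 16-byte aligned.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: allocate an m x m real4 matrix with the 16-byte
   header placed directly before the data, as in the MASM definition. */
static float *matrix_alloc(uint32_t m)
{
    size_t bytes = 16 + (size_t)m * m * sizeof(float);
    bytes = (bytes + 15) & ~(size_t)15;      /* size must be a multiple of 16 */
    unsigned char *block = aligned_alloc(16, bytes);
    if (block == NULL)
        return NULL;
    /* dd ?, dd ?, dd columns, dd lines */
    uint32_t header[4] = { 0, 0, m, m };
    memcpy(block, header, sizeof header);
    return (float *)(block + 16);            /* address handed to the procedures */
}

/* Number of columns: the dword 8 bytes before the matrix address. */
static uint32_t matrix_columns(const float *mat)
{
    uint32_t value;
    memcpy(&value, (const unsigned char *)mat - 8, sizeof value);
    return value;
}

/* Number of lines: the dword 4 bytes before the matrix address. */
static uint32_t matrix_lines(const float *mat)
{
    uint32_t value;
    memcpy(&value, (const unsigned char *)mat - 4, sizeof value);
    return value;
}

/* Release the whole block, header included. */
static void matrix_free(float *mat)
{
    if (mat != NULL)
        free((unsigned char *)mat - 16);
}
```

Because the header is exactly 16 bytes, aligning the start of the block is enough to align the first element as well.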

    VERIFY SSE PROCEDURES:  Use multiplyMxM_KxK_v1.exe/asm

    Please test it on your CPU (i5/i7/AMD).
    Run ExecuteTestmultiplyMxM_KxK_SSEv1.bat and post the file ResultsmultiplyMxM_KxK_v1.txt.
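
If you want an independent check of the output, a plain scalar multiply is enough to compare against. The sketch below is in C and is not part of the attached files; multiply_reference and matrices_match are hypothetical names, the dimension is passed explicitly instead of being read from the header, and it assumes the elements are stored line by line (row-major), which the post does not state. The comparison uses a relative tolerance because SSE and FPU code can round differently.

```c
#include <stddef.h>

/* Scalar reference: xy = x * y for two m x m real4 matrices.
   Assumption: elements are stored line by line (row-major). */
static void multiply_reference(const float *x, const float *y,
                               float *xy, size_t m)
{
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < m; ++j) {
            float sum = 0.0f;
            for (size_t k = 0; k < m; ++k)
                sum += x[i * m + k] * y[k * m + j];
            xy[i * m + j] = sum;
        }
    }
}

/* Returns 1 when every element of got is within a relative tolerance
   of want (absolute tolerance for values below 1), otherwise 0. */
static int matrices_match(const float *got, const float *want,
                          size_t count, float tol)
{
    for (size_t i = 0; i < count; ++i) {
        float diff = got[i] - want[i];
        if (diff < 0.0f)
            diff = -diff;
        float scale = want[i] < 0.0f ? -want[i] : want[i];
        if (scale < 1.0f)
            scale = 1.0f;
        if (diff > tol * scale)
            return 0;
    }
    return 1;
}
```

Fill two matrices with known values, run the SSE procedure and multiply_reference on the same input, then call matrices_match on the two results with a tolerance such as 1e-5f.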

Good luck
RuiLoureiro

Some results:
Quote
***** Time table - LoopCount =100 000 *****

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

    50  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX4x4   * MatrixY4x4
   319  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX4x4   * MatrixY4x4
   
   264  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX8x8   * MatrixY8x8
  2031  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX8x8   * MatrixY8x8
   
   745  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX10x10 * MatrixY10x10
  3512  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX10x10 * MatrixY10x10
   
   769  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX12x12 * MatrixY12x12
  5827  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX12x12 * MatrixY12x12
   
   891  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX11x11 * MatrixY11x11
  4573  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX11x11 * MatrixY11x11
   
  1402  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX13x13 * MatrixY13x13
  7312  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX13x13 * MatrixY13x13
 
  1792  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX14x14 * MatrixY14x14
  9026  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX14x14 * MatrixY14x14
 
  1806  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX16x16 * MatrixY16x16
13263  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX16x16 * MatrixY16x16
 
  1980  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX15x15 * MatrixY15x15
11000  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX15x15 * MatrixY15x15
 
  3343  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX17x17 * MatrixY17x17
15792  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX17x17 * MatrixY17x17
 
  3410  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX20x20 * MatrixY20x20  ; <<< align 16 effect
25353  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX20x20 * MatrixY20x20
 
  3690  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX18x18 * MatrixY18x18
18647  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX18x18 * MatrixY18x18
 
  4119  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX19x19 * MatrixY19x19
21822  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX19x19 * MatrixY19x19
 
  14019  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX32x32 * MatrixY32x32  ; 13.8% ...
101255  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX32x32 * MatrixY32x32  ; 7.2 * 14019
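
The two comments on the 32x32 lines are just the ratio of the two cycle counts. As a small C sketch (the names timing, speedup, sse_share_percent and print_speedups are hypothetical; the numbers are copied from the i7-4930K table above), the same ratio can be computed for any size:

```c
#include <stdio.h>

/* One row of the time table: matrix size and the two cycle counts. */
struct timing {
    unsigned size;
    unsigned sse_cycles;
    unsigned fpu_cycles;
};

/* A few rows copied from the i7-4930K table. */
static const struct timing i7_4930k[] = {
    {  4,    50,    319 },
    {  8,   264,   2031 },
    { 16,  1806,  13263 },
    { 32, 14019, 101255 },
};

/* How many times faster the SSE version is than the FPU version. */
static double speedup(unsigned fpu_cycles, unsigned sse_cycles)
{
    return (double)fpu_cycles / (double)sse_cycles;
}

/* SSE cycles as a percentage of the FPU cycles. */
static double sse_share_percent(unsigned fpu_cycles, unsigned sse_cycles)
{
    return 100.0 * (double)sse_cycles / (double)fpu_cycles;
}

/* Print one line per table row. */
static void print_speedups(const struct timing *rows, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        printf("%ux%u: FPU/SSE = %.1f, SSE = %.1f%% of FPU\n",
               rows[i].size, rows[i].size,
               speedup(rows[i].fpu_cycles, rows[i].sse_cycles),
               sse_share_percent(rows[i].fpu_cycles, rows[i].sse_cycles));
}
```

Calling print_speedups(i7_4930k, 4) prints the ratios for the four sizes listed.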

LiaoMi:
 ***** Time table - LoopCount =100 000 *****

Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz (SSE4)
 
   39  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX4x4   * MatrixY4x4
  229  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX4x4   * MatrixY4x4
   
  177  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX8x8   * MatrixY8x8
 1066  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX8x8   * MatrixY8x8
 
  476  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX11x11 * MatrixY11x11
 2692  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX11x11 * MatrixY11x11
 
  490  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX10x10 * MatrixY10x10
 2086  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX10x10 * MatrixY10x10
 
  553  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX12x12 * MatrixY12x12
 3375  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX12x12 * MatrixY12x12
 
  921  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX13x13 * MatrixY13x13
 4261  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX13x13 * MatrixY13x13
 
 1208  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX16x16 * MatrixY16x16
 7441  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX16x16 * MatrixY16x16
 
 1270  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX14x14 * MatrixY14x14
 5142  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX14x14 * MatrixY14x14
 
 1341  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX15x15 * MatrixY15x15
 6201  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX15x15 * MatrixY15x15
 
 2080  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX17x17 * MatrixY17x17
 8869  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX17x17 * MatrixY17x17
 
 2348  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX20x20 * MatrixY20x20
14253  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX20x20 * MatrixY20x20
 
 2531  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX18x18 * MatrixY18x18
10382  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX18x18 * MatrixY18x18
 
 2542  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX19x19 * MatrixY19x19
12381  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX19x19 * MatrixY19x19
 
 9494  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX32x32 * MatrixY32x32
60893  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX32x32 * MatrixY32x32

LiaoMi:
 ***** Time table - LoopCount =100 000 *****

AMD Ryzen 7 1700 Eight-Core Processor           (SSE4)
 
   39  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX4x4   * MatrixY4x4
  291  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX4x4   * MatrixY4x4
   
  164  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX8x8   * MatrixY8x8
 1460  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX8x8   * MatrixY8x8
 
  494  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX10x10 * MatrixY10x10
 2790  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX10x10 * MatrixY10x10
 
  538  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX11x11 * MatrixY11x11
 3559  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX11x11 * MatrixY11x11
 
  569  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX12x12 * MatrixY12x12
 4586  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX12x12 * MatrixY12x12
 
  976  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX13x13 * MatrixY13x13
 5687  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX13x13 * MatrixY13x13
 
 1252  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX15x15 * MatrixY15x15
 8492  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX15x15 * MatrixY15x15
 
 1304  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX16x16 * MatrixY16x16
10877  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX16x16 * MatrixY16x16
 
 1366  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX14x14 * MatrixY14x14
 6774  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX14x14 * MatrixY14x14
 
 2048  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX17x17 * MatrixY17x17
13001  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX17x17 * MatrixY17x17
 
 2446  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX19x19 * MatrixY19x19
19182  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX19x19 * MatrixY19x19
 
 2562  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX20x20 * MatrixY20x20
22599  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX20x20 * MatrixY20x20
 
 2691  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX18x18 * MatrixY18x18
15169  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX18x18 * MatrixY18x18
 
10404  cycles, MultiplyMxM_KxK_v1SSE,  MatrixX32x32 * MatrixY32x32
99621  cycles, MultiplyMxM_KxK_v1FPU,  MatrixX32x32 * MatrixY32x32
EDIT: please replace the file StringTable... because the procedure name shown in the test results should be ...MxM_KxK, not ...MxM_MxM.
« Last Edit: August 18, 2018, 12:18:30 AM by RuiLoureiro »

Siekmanski

Re: Multiply matrix MxM by KxK real4 any size
« Reply #1 on: August 17, 2018, 05:09:34 AM »
The results,

LiaoMi

Re: Multiply matrix MxM by KxK real4 any size
« Reply #2 on: August 17, 2018, 09:01:19 PM »
Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz (SSE4)

LiaoMi

Re: Multiply matrix MxM by KxK real4 any size
« Reply #3 on: August 17, 2018, 09:26:16 PM »
AMD Ryzen 7 1700 Eight-Core Processor (SSE4)