Hi all, here are three particular cases: multiplying a 1x4, 1x6 or 1x8 vector by a 4x4, 6x6 or 8x8 square matrix using SSE/FPU instructions. Using these examples, you can write any other particular case.
The procedures are in the .inc files.
By lines:
invoke Multiply1x4_4x4Lin_v1SSE, pMatX, pMatY, pMatXY
invoke Multiply1x4_4x4Lin_v2SSE, pMatX, pMatY, pMatXY
invoke Multiply1x4_4x4Lin_v3SSE, pMatX, pMatY, pMatXY
invoke Multiply1x6_6x6Lin_v2SSE, pMatX, pMatY, pMatXY
invoke Multiply1x6_6x6Lin_v3SSE, pMatX, pMatY, pMatXY
invoke Multiply1x8_8x8Lin_v1SSE, pMatX, pMatY, pMatXY
invoke Multiply1x8_8x8Lin_v2SSE, pMatX, pMatY, pMatXY
By columns (by definition):
invoke Multiply1x4_4x4Col_v1SSE, pMatX, pMatY, pMatXY
invoke Multiply1x6_6x6Col_v1SSE, pMatX, pMatY, pMatXY
invoke Multiply1x8_8x8Col_v1SSE, pMatX, pMatY, pMatXY
FPU VERSIONS:
invoke Multiply1x4_4x4Lin_v1FPU, pMatX, pMatY, pMatXY
invoke Multiply1x4_4x4Lin_v2FPU, pMatX, pMatY, pMatXY
invoke Multiply1x4_4x4Lin_v3FPU, pMatX, pMatY, pMatXY
invoke Multiply1x4_4x4Lin_v4FPU, pMatX, pMatY, pMatXY
invoke Multiply1x6_6x6Lin_v1FPU, pMatX, pMatY, pMatXY
invoke Multiply1x6_6x6Lin_v2FPU, pMatX, pMatY, pMatXY
invoke Multiply1x6_6x6Lin_v3FPU, pMatX, pMatY, pMatXY
invoke Multiply1x6_6x6Lin_v4FPU, pMatX, pMatY, pMatXY
invoke Multiply1x8_8x8Lin_v1FPU, pMatX, pMatY, pMatXY
invoke Multiply1x8_8x8Lin_v2FPU, pMatX, pMatY, pMatXY
invoke Multiply1x8_8x8Lin_v3FPU, pMatX, pMatY, pMatXY
invoke Multiply1x8_8x8Lin_v4FPU, pMatX, pMatY, pMatXY
The same FPU versions exist for columns (Multiply1x4_4x4Col_v1FPU, and so on).
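For anyone who wants to check the output of the procedures, here is a small scalar reference in C. It is only a sketch and not the code from the .inc files. It assumes my reading of the names: "by columns" means each result element is the vector times one column of the matrix (the textbook definition), and "by lines" means the result is built by adding x[i] times line i of the matrix. It also assumes the matrix is stored line after line in memory. The function names are made up for this example; please check the documentation TXT for the real conventions.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: y holds an m x m matrix stored line after line (row-major). */

/* "By columns": xy[j] is the dot product of x with column j of y. */
static void mul_1xM_MxM_by_columns(const float *x, const float *y,
                                   float *xy, size_t m)
{
    assert(x != NULL && y != NULL && xy != NULL);
    for (size_t j = 0; j < m; ++j) {
        float sum = 0.0f;
        for (size_t i = 0; i < m; ++i)
            sum += x[i] * y[i * m + j];
        xy[j] = sum;
    }
}

/* "By lines": start from zero and add x[i] times line i of y to the result. */
static void mul_1xM_MxM_by_lines(const float *x, const float *y,
                                 float *xy, size_t m)
{
    assert(x != NULL && y != NULL && xy != NULL);
    for (size_t j = 0; j < m; ++j)
        xy[j] = 0.0f;
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < m; ++j)
            xy[j] += x[i] * y[i * m + j];
}
```

Both functions compute the same 1xM result; only the loop order differs. With SSE, the "by lines" order maps naturally onto broadcasting one x[i] and multiplying it by a whole line, but whether the posted procedures work exactly that way is not something this sketch claims.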
DOCUMENTATION: TEXT_ABOUT_MULTIPLY_SSE_REAL4.txt
MATRIX DEFINITION: Any matrixX must be defined like this:
ALIGN 16
dd ?
dd ?
dd M ; <<--- number of columns
dd M ; <<--- number of lines
matrixX dd (M*M) dup (?)
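The four dwords before the label add up to 16 bytes, so after ALIGN 16 the label matrixX itself falls on a 16-byte boundary, with the number of columns at matrixX-8 and the number of lines at matrixX-4. The C sketch below mirrors that layout to make the offsets explicit. The struct and function names are hypothetical, M is fixed at 4 just for the example, and the elements are assumed to be REAL4 (float), as the name of the documentation file suggests.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define M 4  /* example size only */

/* Hypothetical C mirror of the MASM block: 16 bytes of header, then the data. */
typedef struct {
    uint32_t reserved0;    /* dd ?  */
    uint32_t reserved1;    /* dd ?  */
    uint32_t columns;      /* dd M  ; number of columns */
    uint32_t lines;        /* dd M  ; number of lines   */
    float    data[M * M];  /* the matrixX label points here */
} MatrixBlock;

/* _Alignas(16) plays the role of ALIGN 16 (needs C11). */
static _Alignas(16) MatrixBlock blockX = { 0, 0, M, M, { 0 } };

/* Step back from a pointer to the data (what pMatX would hold) to the header. */
static const MatrixBlock *block_of(const float *pMat)
{
    assert(pMat != NULL);
    return (const MatrixBlock *)((const char *)pMat - offsetof(MatrixBlock, data));
}
```

A procedure that receives only the data pointer can recover the dimensions this way, for example block_of(blockX.data)->columns. Keeping the header at exactly 16 bytes means the data stays 16-byte aligned, which is what aligned SSE loads such as movaps require.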
VERIFY SSE PROCEDURES: Use multiply1xM_MxMLin_v1.exe/asm and
multiply1xM_MxMCol_v1.exe/asm
Please test it on your CPU (i5/i7/AMD).
Run ExecuteTestmultiply1xM_MxM_SSEv1.bat and post the file Resultsmultiply1xM_MxM_v1.txt.
Good luck
RuiLoureiro

All results: :t :t :t
Thanks Siekmanski (for all your help ;) ), LiaoMi (gave me a lot of work and you run it in 4 cycles, unfair ;) ), Jochen (8 cycles is very good for an i5! ;) )
Siekmanski:
***** Time table - LoopCount =1000000 *****
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
18 cycles, Multiply1x4_4x4Lin_v3SSE, MatrixX1x4 * MatrixY4x4
18 cycles, Multiply1x4_4x4Lin_v1SSE, MatrixX1x4 * MatrixY4x4
22 cycles, Multiply1x4_4x4Lin_v2SSE, MatrixX1x4 * MatrixY4x4
42 cycles, Multiply1x6_6x6Lin_v2SSE, MatrixX1x6 * MatrixY6x6
45 cycles, Multiply1x6_6x6Lin_v3SSE, MatrixX1x6 * MatrixY6x6
69 cycles, Multiply1x8_8x8Lin_v1SSE, MatrixX1x8 * MatrixY8x8
76 cycles, Multiply1x8_8x8Lin_v2SSE, MatrixX1x8 * MatrixY8x8
115 cycles, Multiply1x4_4x4Lin_v1FPU, MatrixX1x4 * MatrixY4x4
117 cycles, Multiply1x4_4x4Lin_v4FPU, MatrixX1x4 * MatrixY4x4
118 cycles, Multiply1x4_4x4Lin_v2FPU, MatrixX1x4 * MatrixY4x4
119 cycles, Multiply1x4_4x4Lin_v3FPU, MatrixX1x4 * MatrixY4x4
153 cycles, Multiply1x6_6x6Lin_v2FPU, MatrixX1x6 * MatrixY6x6
157 cycles, Multiply1x6_6x6Lin_v1FPU, MatrixX1x6 * MatrixY6x6
160 cycles, Multiply1x6_6x6Lin_v4FPU, MatrixX1x6 * MatrixY6x6
166 cycles, Multiply1x6_6x6Lin_v3FPU, MatrixX1x6 * MatrixY6x6
177 cycles, Multiply1x8_8x8Lin_v1FPU, MatrixX1x8 * MatrixY8x8
182 cycles, Multiply1x8_8x8Lin_v4FPU, MatrixX1x8 * MatrixY8x8
206 cycles, Multiply1x8_8x8Lin_v2FPU, MatrixX1x8 * MatrixY8x8
215 cycles, Multiply1x8_8x8Lin_v3FPU, MatrixX1x8 * MatrixY8x8
***** Time table - LoopCount =1000000 *****
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
9 cycles, Multiply1x4_4x4Col_v1SSE, MatrixX1x4 * MatrixY4x4
25 cycles, Multiply1x6_6x6Col_v1SSE, MatrixX1x6 * MatrixY6x6
30 cycles, Multiply1x8_8x8Col_v1SSE, MatrixX1x8 * MatrixY8x8
114 cycles, Multiply1x4_4x4Col_v2FPU, MatrixX1x4 * MatrixY4x4
116 cycles, Multiply1x4_4x4Col_v1FPU, MatrixX1x4 * MatrixY4x4
117 cycles, Multiply1x4_4x4Col_v3FPU, MatrixX1x4 * MatrixY4x4
120 cycles, Multiply1x4_4x4Col_v4FPU, MatrixX1x4 * MatrixY4x4
152 cycles, Multiply1x6_6x6Col_v2FPU, MatrixX1x6 * MatrixY6x6
154 cycles, Multiply1x6_6x6Col_v1FPU, MatrixX1x6 * MatrixY6x6
163 cycles, Multiply1x6_6x6Col_v3FPU, MatrixX1x6 * MatrixY6x6
165 cycles, Multiply1x6_6x6Col_v4FPU, MatrixX1x6 * MatrixY6x6
194 cycles, Multiply1x8_8x8Col_v2FPU, MatrixX1x8 * MatrixY8x8
202 cycles, Multiply1x8_8x8Col_v1FPU, MatrixX1x8 * MatrixY8x8
204 cycles, Multiply1x8_8x8Col_v3FPU, MatrixX1x8 * MatrixY8x8
206 cycles, Multiply1x8_8x8Col_v4FPU, MatrixX1x8 * MatrixY8x8
+++++++++++++++++++++++++++++++++++++++++++
LiaoMi:
***** Time table - LoopCount =1000000 *****
Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz (SSE4)
13 cycles, Multiply1x4_4x4Lin_v1SSE, MatrixX1x4 * MatrixY4x4
14 cycles, Multiply1x4_4x4Lin_v3SSE, MatrixX1x4 * MatrixY4x4
16 cycles, Multiply1x4_4x4Lin_v2SSE, MatrixX1x4 * MatrixY4x4
30 cycles, Multiply1x6_6x6Lin_v2SSE, MatrixX1x6 * MatrixY6x6
33 cycles, Multiply1x6_6x6Lin_v3SSE, MatrixX1x6 * MatrixY6x6
47 cycles, Multiply1x8_8x8Lin_v1SSE, MatrixX1x8 * MatrixY8x8
50 cycles, Multiply1x8_8x8Lin_v2SSE, MatrixX1x8 * MatrixY8x8
86 cycles, Multiply1x4_4x4Lin_v1FPU, MatrixX1x4 * MatrixY4x4
89 cycles, Multiply1x4_4x4Lin_v2FPU, MatrixX1x4 * MatrixY4x4
90 cycles, Multiply1x4_4x4Lin_v4FPU, MatrixX1x4 * MatrixY4x4
94 cycles, Multiply1x4_4x4Lin_v3FPU, MatrixX1x4 * MatrixY4x4
106 cycles, Multiply1x6_6x6Lin_v2FPU, MatrixX1x6 * MatrixY6x6
108 cycles, Multiply1x6_6x6Lin_v4FPU, MatrixX1x6 * MatrixY6x6
111 cycles, Multiply1x6_6x6Lin_v1FPU, MatrixX1x6 * MatrixY6x6
114 cycles, Multiply1x6_6x6Lin_v3FPU, MatrixX1x6 * MatrixY6x6
120 cycles, Multiply1x8_8x8Lin_v1FPU, MatrixX1x8 * MatrixY8x8
121 cycles, Multiply1x8_8x8Lin_v4FPU, MatrixX1x8 * MatrixY8x8
140 cycles, Multiply1x8_8x8Lin_v3FPU, MatrixX1x8 * MatrixY8x8
145 cycles, Multiply1x8_8x8Lin_v2FPU, MatrixX1x8 * MatrixY8x8
***** Time table - LoopCount =1000000 *****
Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz (SSE4)
7 cycles, Multiply1x4_4x4Col_v1SSE, MatrixX1x4 * MatrixY4x4
16 cycles, Multiply1x6_6x6Col_v1SSE, MatrixX1x6 * MatrixY6x6
21 cycles, Multiply1x8_8x8Col_v1SSE, MatrixX1x8 * MatrixY8x8
86 cycles, Multiply1x4_4x4Col_v2FPU, MatrixX1x4 * MatrixY4x4
86 cycles, Multiply1x4_4x4Col_v1FPU, MatrixX1x4 * MatrixY4x4
88 cycles, Multiply1x4_4x4Col_v3FPU, MatrixX1x4 * MatrixY4x4
91 cycles, Multiply1x4_4x4Col_v4FPU, MatrixX1x4 * MatrixY4x4
105 cycles, Multiply1x6_6x6Col_v3FPU, MatrixX1x6 * MatrixY6x6
105 cycles, Multiply1x6_6x6Col_v2FPU, MatrixX1x6 * MatrixY6x6
108 cycles, Multiply1x6_6x6Col_v1FPU, MatrixX1x6 * MatrixY6x6
113 cycles, Multiply1x6_6x6Col_v4FPU, MatrixX1x6 * MatrixY6x6
132 cycles, Multiply1x8_8x8Col_v2FPU, MatrixX1x8 * MatrixY8x8
135 cycles, Multiply1x8_8x8Col_v3FPU, MatrixX1x8 * MatrixY8x8
138 cycles, Multiply1x8_8x8Col_v4FPU, MatrixX1x8 * MatrixY8x8
139 cycles, Multiply1x8_8x8Col_v1FPU, MatrixX1x8 * MatrixY8x8
---------------------------------------------------------------------------
LiaoMi:
***** Time table - LoopCount =1000000 *****
AMD Ryzen 7 1700 Eight-Core Processor (SSE4)
8 cycles, Multiply1x4_4x4Lin_v3SSE, MatrixX1x4 * MatrixY4x4
10 cycles, Multiply1x4_4x4Lin_v1SSE, MatrixX1x4 * MatrixY4x4
12 cycles, Multiply1x4_4x4Lin_v2SSE, MatrixX1x4 * MatrixY4x4
23 cycles, Multiply1x6_6x6Lin_v2SSE, MatrixX1x6 * MatrixY6x6
25 cycles, Multiply1x6_6x6Lin_v3SSE, MatrixX1x6 * MatrixY6x6
38 cycles, Multiply1x8_8x8Lin_v1SSE, MatrixX1x8 * MatrixY8x8
43 cycles, Multiply1x8_8x8Lin_v2SSE, MatrixX1x8 * MatrixY8x8
104 cycles, Multiply1x4_4x4Lin_v1FPU, MatrixX1x4 * MatrixY4x4
109 cycles, Multiply1x4_4x4Lin_v2FPU, MatrixX1x4 * MatrixY4x4
109 cycles, Multiply1x4_4x4Lin_v4FPU, MatrixX1x4 * MatrixY4x4
110 cycles, Multiply1x4_4x4Lin_v3FPU, MatrixX1x4 * MatrixY4x4
142 cycles, Multiply1x6_6x6Lin_v2FPU, MatrixX1x6 * MatrixY6x6
144 cycles, Multiply1x6_6x6Lin_v1FPU, MatrixX1x6 * MatrixY6x6
148 cycles, Multiply1x6_6x6Lin_v4FPU, MatrixX1x6 * MatrixY6x6
156 cycles, Multiply1x6_6x6Lin_v3FPU, MatrixX1x6 * MatrixY6x6
164 cycles, Multiply1x8_8x8Lin_v4FPU, MatrixX1x8 * MatrixY8x8
167 cycles, Multiply1x8_8x8Lin_v1FPU, MatrixX1x8 * MatrixY8x8
199 cycles, Multiply1x8_8x8Lin_v2FPU, MatrixX1x8 * MatrixY8x8
206 cycles, Multiply1x8_8x8Lin_v3FPU, MatrixX1x8 * MatrixY8x8
***** Time table - LoopCount =1000000 *****
AMD Ryzen 7 1700 Eight-Core Processor (SSE4)
4 cycles, Multiply1x4_4x4Col_v1SSE, MatrixX1x4 * MatrixY4x4
16 cycles, Multiply1x6_6x6Col_v1SSE, MatrixX1x6 * MatrixY6x6
19 cycles, Multiply1x8_8x8Col_v1SSE, MatrixX1x8 * MatrixY8x8
105 cycles, Multiply1x4_4x4Col_v3FPU, MatrixX1x4 * MatrixY4x4
107 cycles, Multiply1x4_4x4Col_v4FPU, MatrixX1x4 * MatrixY4x4
107 cycles, Multiply1x4_4x4Col_v2FPU, MatrixX1x4 * MatrixY4x4
112 cycles, Multiply1x4_4x4Col_v1FPU, MatrixX1x4 * MatrixY4x4
144 cycles, Multiply1x6_6x6Col_v3FPU, MatrixX1x6 * MatrixY6x6
144 cycles, Multiply1x6_6x6Col_v4FPU, MatrixX1x6 * MatrixY6x6
149 cycles, Multiply1x6_6x6Col_v1FPU, MatrixX1x6 * MatrixY6x6
151 cycles, Multiply1x6_6x6Col_v2FPU, MatrixX1x6 * MatrixY6x6
185 cycles, Multiply1x8_8x8Col_v3FPU, MatrixX1x8 * MatrixY8x8
192 cycles, Multiply1x8_8x8Col_v1FPU, MatrixX1x8 * MatrixY8x8
193 cycles, Multiply1x8_8x8Col_v2FPU, MatrixX1x8 * MatrixY8x8
195 cycles, Multiply1x8_8x8Col_v4FPU, MatrixX1x8 * MatrixY8x8
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Jochen:
***** Time table - LoopCount =1000000 *****
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
21 cycles, Multiply1x4_4x4Lin_v3SSE, MatrixX1x4 * MatrixY4x4
21 cycles, Multiply1x4_4x4Lin_v1SSE, MatrixX1x4 * MatrixY4x4
25 cycles, Multiply1x4_4x4Lin_v2SSE, MatrixX1x4 * MatrixY4x4
47 cycles, Multiply1x6_6x6Lin_v2SSE, MatrixX1x6 * MatrixY6x6
50 cycles, Multiply1x6_6x6Lin_v3SSE, MatrixX1x6 * MatrixY6x6
81 cycles, Multiply1x8_8x8Lin_v1SSE, MatrixX1x8 * MatrixY8x8
85 cycles, Multiply1x8_8x8Lin_v2SSE, MatrixX1x8 * MatrixY8x8
94 cycles, Multiply1x4_4x4Lin_v4FPU, MatrixX1x4 * MatrixY4x4
94 cycles, Multiply1x4_4x4Lin_v1FPU, MatrixX1x4 * MatrixY4x4
97 cycles, Multiply1x4_4x4Lin_v3FPU, MatrixX1x4 * MatrixY4x4
98 cycles, Multiply1x4_4x4Lin_v2FPU, MatrixX1x4 * MatrixY4x4
114 cycles, Multiply1x6_6x6Lin_v2FPU, MatrixX1x6 * MatrixY6x6
119 cycles, Multiply1x6_6x6Lin_v4FPU, MatrixX1x6 * MatrixY6x6
126 cycles, Multiply1x6_6x6Lin_v1FPU, MatrixX1x6 * MatrixY6x6
126 cycles, Multiply1x6_6x6Lin_v3FPU, MatrixX1x6 * MatrixY6x6
139 cycles, Multiply1x8_8x8Lin_v4FPU, MatrixX1x8 * MatrixY8x8
143 cycles, Multiply1x8_8x8Lin_v1FPU, MatrixX1x8 * MatrixY8x8
157 cycles, Multiply1x8_8x8Lin_v2FPU, MatrixX1x8 * MatrixY8x8
158 cycles, Multiply1x8_8x8Lin_v3FPU, MatrixX1x8 * MatrixY8x8
***** Time table - LoopCount =1000000 *****
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
8 cycles, Multiply1x4_4x4Col_v1SSE, MatrixX1x4 * MatrixY4x4
22 cycles, Multiply1x6_6x6Col_v1SSE, MatrixX1x6 * MatrixY6x6
33 cycles, Multiply1x8_8x8Col_v1SSE, MatrixX1x8 * MatrixY8x8
94 cycles, Multiply1x4_4x4Col_v3FPU, MatrixX1x4 * MatrixY4x4
94 cycles, Multiply1x4_4x4Col_v2FPU, MatrixX1x4 * MatrixY4x4
95 cycles, Multiply1x4_4x4Col_v1FPU, MatrixX1x4 * MatrixY4x4
101 cycles, Multiply1x4_4x4Col_v4FPU, MatrixX1x4 * MatrixY4x4
118 cycles, Multiply1x6_6x6Col_v2FPU, MatrixX1x6 * MatrixY6x6
124 cycles, Multiply1x6_6x6Col_v3FPU, MatrixX1x6 * MatrixY6x6
127 cycles, Multiply1x6_6x6Col_v4FPU, MatrixX1x6 * MatrixY6x6
129 cycles, Multiply1x6_6x6Col_v1FPU, MatrixX1x6 * MatrixY6x6
155 cycles, Multiply1x8_8x8Col_v3FPU, MatrixX1x8 * MatrixY8x8
155 cycles, Multiply1x8_8x8Col_v2FPU, MatrixX1x8 * MatrixY8x8
162 cycles, Multiply1x8_8x8Col_v4FPU, MatrixX1x8 * MatrixY8x8
162 cycles, Multiply1x8_8x8Col_v1FPU, MatrixX1x8 * MatrixY8x8
Hi Rui,
The results,
Hi RuiLoureiro,
AMD Ryzen 7 1700 Eight-Core Processor (SSE4)
+
Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz (SSE4)
Hi Rui,
Core i5 here.