Little test?
Intel's IA-32 Software Developer's Manual gives this warning: (http://gcc.gnu.org/ml/gcc/2007-08/msg00376.html)
Quote: In this example: XORPS or PXOR can be used in place of XORPD
and yield the same correct result. However, because of the type
mismatch between the operand data type and the instruction data
type, a latency penalty will be incurred due to implementations
of the instructions at the microarchitecture level.
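As a rough illustration of why the result is the same whichever variant is used: a bitwise operation does not care how the 128 bits are split into lanes. The Python sketch below models an XMM register as 16 raw bytes and uses OR, as in the tests in this thread. The helper names `por` and `orps` are only labels for this model, and the model says nothing about the latency penalty.

```python
import struct

def por(a: bytes, b: bytes) -> bytes:
    """Model of POR: OR two 128-bit values as one big integer."""
    x = int.from_bytes(a, "little") | int.from_bytes(b, "little")
    return x.to_bytes(16, "little")

def orps(a: bytes, b: bytes) -> bytes:
    """Model of ORPS: OR four 32-bit lanes independently."""
    la = struct.unpack("<4I", a)
    lb = struct.unpack("<4I", b)
    return struct.pack("<4I", *(p | q for p, q in zip(la, lb)))

# Four packed singles and a mask that sets the sign bit in every lane
values = struct.pack("<4f", 1.0, 2.5, 0.0, 1e10)
signs = struct.pack("<4I", *([0x80000000] * 4))

# Both models produce the same 16 bytes
print(por(values, signs) == orps(values, signs))
print(struct.unpack("<4f", orps(values, signs)))
```

The two helpers differ only in how they slice the register, which is the point: the bits that come out are identical, so any difference between the real instructions has to show up in timing or encoding size.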
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
237 cycles for 100 * por
237 cycles for 100 * orps
231 cycles for 100 * por
232 cycles for 100 * orps
235 cycles for 100 * por
235 cycles for 100 * orps
10 bytes for por
9 bytes for orps
Intel(R) Celeron(R) CPU 2.80GHz (SSE3)
239 cycles for 100 * por
240 cycles for 100 * orps
243 cycles for 100 * por
241 cycles for 100 * orps
239 cycles for 100 * por
239 cycles for 100 * orps
10 bytes for por
9 bytes for orps
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
172 cycles for 100 * por
222 cycles for 100 * orps
172 cycles for 100 * por
222 cycles for 100 * orps
170 cycles for 100 * por
222 cycles for 100 * orps
10 bytes for por
9 bytes for orps
Thanks, Yves & Marinus. So the Core iX shows a difference. Anybody with an AMD?
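To put a number on the difference, the i7-4930K output above can be averaged per instruction. A minimal Python sketch, assuming the report format stays `N cycles for 100 * name`; `summarize` and `LINE` are names made up here and are not part of the test program:

```python
import re
from collections import defaultdict
from statistics import mean

# One result line, e.g. "172 cycles for 100 * por"
LINE = re.compile(r"^\s*(\d+)\s+cycles for 100 \* (.+?)\s*$")

def summarize(report: str) -> dict:
    """Return the mean cycle count per tested instruction."""
    runs = defaultdict(list)
    for line in report.splitlines():
        m = LINE.match(line)
        if m:
            runs[m.group(2)].append(int(m.group(1)))
    return {name: mean(vals) for name, vals in runs.items()}

# Timings copied from the i7-4930K post above
i7_4930k = """\
172 cycles for 100 * por
222 cycles for 100 * orps
172 cycles for 100 * por
222 cycles for 100 * orps
170 cycles for 100 * por
222 cycles for 100 * orps
"""

avg = summarize(i7_4930k)
print(avg, "orps/por ratio:", round(avg["orps"] / avg["por"], 2))
```

Pasting one of the Celeron reports into the same function should give a ratio close to 1, since por and orps are within a couple of cycles of each other there.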
It's probably worth knowing the context that Intel had in mind when they wrote the comparison. This is the result on my i7 64-bit box.
Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz (SSE4)
208 cycles for 100 * por
208 cycles for 100 * orps
208 cycles for 100 * por
208 cycles for 100 * orps
208 cycles for 100 * por
208 cycles for 100 * orps
10 bytes for por
9 bytes for orps
--- ok ---
Hmmmm...
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
142 cycles for 100 * por
193 cycles for 100 * orps
166 cycles for 100 * por
201 cycles for 100 * orps
139 cycles for 100 * por
183 cycles for 100 * orps
10 bytes for por
9 bytes for orps
Attached another test, timing cmp reg16, -1 vs reg32, 65535
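The two comparisons only give the same answer if the upper 16 bits of the 32-bit register are zero, which is the case for a mask written by pmovmskb from an XMM register. A small Python model of the zero flag after each compare, with hypothetical helper names and no attempt to model timing:

```python
def zf_cmp_si_minus1(esi: int) -> bool:
    """ZF after `cmp si, -1`: only the low 16 bits are compared."""
    return (esi & 0xFFFF) == 0xFFFF

def zf_cmp_esi_65535(esi: int) -> bool:
    """ZF after `cmp esi, 65535`: all 32 bits are compared."""
    return (esi & 0xFFFFFFFF) == 0x0000FFFF

# Identical for every value whose upper half is clear
agree = all(zf_cmp_si_minus1(m) == zf_cmp_esi_65535(m) for m in range(0x10000))
print("agree on all 16-bit masks:", agree)

# They disagree once any upper bit is set
print(zf_cmp_si_minus1(0x0001FFFF), zf_cmp_esi_65535(0x0001FFFF))
```

So the choice between the two forms is about size and speed only, as long as nothing else writes the upper half of the register.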
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
285 cycles for 100 * cmp esi
286 cycles for 100 * cmp si
287 cycles for 100 * cmp dx
289 cycles for 100 * cmp esi
290 cycles for 100 * cmp si
287 cycles for 100 * cmp dx
287 cycles for 100 * cmp esi
288 cycles for 100 * cmp si
292 cycles for 100 * cmp dx
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
182 cycles for 100 * cmp esi
185 cycles for 100 * cmp si
182 cycles for 100 * cmp dx
181 cycles for 100 * cmp esi
185 cycles for 100 * cmp si
184 cycles for 100 * cmp dx
189 cycles for 100 * cmp esi
179 cycles for 100 * cmp si
183 cycles for 100 * cmp dx
4 bytes for cmp esi
5 bytes for cmp si
5 bytes for cmp dx
Prescott w/ HTT
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
347 cycles for 100 * cmp esi
340 cycles for 100 * cmp si
340 cycles for 100 * cmp dx
425 cycles for 100 * cmp esi
346 cycles for 100 * cmp si
342 cycles for 100 * cmp dx
340 cycles for 100 * cmp esi
347 cycles for 100 * cmp si
340 cycles for 100 * cmp dx
372 cycles for 100 * cmp esi
346 cycles for 100 * cmp si
340 cycles for 100 * cmp dx
376 cycles for 100 * cmp esi
341 cycles for 100 * cmp si
339 cycles for 100 * cmp dx
376 cycles for 100 * cmp esi
345 cycles for 100 * cmp si
340 cycles for 100 * cmp dx
347 cycles for 100 * cmp esi
345 cycles for 100 * cmp si
340 cycles for 100 * cmp dx
340 cycles for 100 * cmp esi
339 cycles for 100 * cmp si
339 cycles for 100 * cmp dx
347 cycles for 100 * cmp esi
341 cycles for 100 * cmp si
373 cycles for 100 * cmp dx
seems like the loop count is a little low :P
C:\Users\tester\Desktop>OrpsPor.exe
AMD A10-7850K APU with Radeon(TM) R7 Graphics (SSE4)
209 cycles for 100 * por
209 cycles for 100 * orps
209 cycles for 100 * por
210 cycles for 100 * orps
209 cycles for 100 * por
209 cycles for 100 * orps
C:\Users\tester\Desktop>CmpSi.exe
AMD A10-7850K APU with Radeon(TM) R7 Graphics (SSE4)
283 cycles for 100 * cmp esi
242 cycles for 100 * cmp si
244 cycles for 100 * cmp dx
281 cycles for 100 * cmp esi
239 cycles for 100 * cmp si
240 cycles for 100 * cmp dx
282 cycles for 100 * cmp esi
246 cycles for 100 * cmp si
244 cycles for 100 * cmp dx
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
223 cycles for 100 * cmp esi
223 cycles for 100 * cmp si
220 cycles for 100 * cmp dx
225 cycles for 100 * cmp esi
223 cycles for 100 * cmp si
221 cycles for 100 * cmp dx
225 cycles for 100 * cmp esi
224 cycles for 100 * cmp si
223 cycles for 100 * cmp dx
4 bytes for cmp esi
5 bytes for cmp si
5 bytes for cmp dx
Thanxalot :biggrin:
In case somebody is curious:
ciL1:
movups xmm0, oword ptr [ecx]
movapd xmm1, [edx]
movaps xmm2, xmm4
pcmpgtb xmm2, xmm1
por xmm0, xmm3 ; orps is shorter but marginally
por xmm1, xmm3 ; slower e.g. on i7-4930K
pcmpeqb xmm0, xmm1 ; compare packed bytes for equality
lea edx, [edx+16] ; increase position
lea ecx, [ecx+16] ; in memory
pmovmskb eax, xmm2 ; set byte mask in eax
pmovmskb esi, xmm0 ; set byte mask in esi
test eax, eax
jne @F
cmp si, -1
je ciL1
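For readers who do not have the SSE2 semantics in their head, here is a scalar Python model of the exit test at the end of that loop: pcmpeqb, pmovmskb, then the compare against -1. It is only a sketch. The por with xmm3 and the pcmpgtb against xmm4 are left out because the snippet does not show what those registers hold, and `block_matches` is a name invented for this example.

```python
def pcmpeqb(a: bytes, b: bytes) -> bytes:
    """Per byte: 0xFF where the bytes are equal, 0x00 where they differ."""
    return bytes(0xFF if x == y else 0x00 for x, y in zip(a, b))

def pmovmskb(x: bytes) -> int:
    """Collect the top bit of each of the 16 bytes; byte i lands in bit i."""
    return sum(((byte >> 7) & 1) << i for i, byte in enumerate(x))

def block_matches(a: bytes, b: bytes) -> bool:
    """True when `cmp si, -1` sets ZF, i.e. the loop takes `je ciL1`."""
    return pmovmskb(pcmpeqb(a, b)) == 0xFFFF

same = b"0123456789abcdef"
diff = b"0123456789abcdeX"   # last byte differs

print(block_matches(same, same))          # all 16 bytes equal
print(hex(pmovmskb(pcmpeqb(same, diff)))) # bit 15 is cleared by the mismatch
```

The mask has one bit per byte, so the loop keeps running only while all 16 bits are set.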
Try replacing the LEA with ADD, it should be faster on Intel hardware.
Quote from: hutch-- on September 27, 2014, 12:24:05 PM
Try replacing the LEA with ADD, it should be faster on Intel hardware.
To be tested, of course... :biggrin:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
453 cycles for 100 * lea
459 cycles for 100 * add
454 cycles for 100 * lea
458 cycles for 100 * add
453 cycles for 100 * lea
463 cycles for 100 * add
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
226 cycles for 100 * lea
252 cycles for 100 * add
235 cycles for 100 * lea
306 cycles for 100 * add
218 cycles for 100 * lea
256 cycles for 100 * add
Size is identical.
Seems to be a normal hardware variation, this is the result on this i7.
Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz (SSE4)
333 cycles for 100 * lea
334 cycles for 100 * add
337 cycles for 100 * lea
334 cycles for 100 * add
333 cycles for 100 * lea
334 cycles for 100 * add
24 bytes for lea
24 bytes for add
--- ok ---
Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz (SSE4)
182 cycles for 100 * por
185 cycles for 100 * orps
187 cycles for 100 * por
200 cycles for 100 * orps
185 cycles for 100 * por
184 cycles for 100 * orps
10 bytes for por
9 bytes for orps
--- ok ---
Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz (SSE4)
282 cycles for 100 * lea
283 cycles for 100 * add
281 cycles for 100 * lea
282 cycles for 100 * add
281 cycles for 100 * lea
283 cycles for 100 * add
24 bytes for lea
24 bytes for add
--- ok ---
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
267 cycles for 100 * lea
282 cycles for 100 * add
269 cycles for 100 * lea
280 cycles for 100 * add
265 cycles for 100 * lea
282 cycles for 100 * add
24 bytes for lea
24 bytes for add
Jochen,
the results from CmpSi:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
198 cycles for 100 * cmp esi
193 cycles for 100 * cmp si
193 cycles for 100 * cmp dx
193 cycles for 100 * cmp esi
198 cycles for 100 * cmp si
200 cycles for 100 * cmp dx
193 cycles for 100 * cmp esi
201 cycles for 100 * cmp si
191 cycles for 100 * cmp dx
4 bytes for cmp esi
5 bytes for cmp si
5 bytes for cmp dx
Some more to test?
Gunther
(http://jeffbaij.com/blog/images/deadhorse.gif)
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
555 cycles for 100 * lea
586 cycles for 100 * add
579 cycles for 100 * lea
570 cycles for 100 * add
554 cycles for 100 * lea
558 cycles for 100 * add
Thanks to everybody for your contribution to making MasmBasic the fastest 32-bit library on Earth ;-)
Quote from: jj2007 on September 28, 2014, 10:28:25 PM
Thanks to everybody for your contribution to making MasmBasic the fastest 32-bit library on Earth ;-)
Do not exaggerate like that.
Gunther