ORPS vs POR timings

Started by jj2007, September 27, 2014, 03:09:02 AM


jj2007

Little test?

Intel's IA-32 Software Developer's Manual gives this warning:

Quote: In this example: XORPS or PXOR can be used in place of XORPD
   and yield the same correct result. However, because of the type
   mismatch between the operand data type and the instruction data
   type, a latency penalty will be incurred due to implementations
   of the instructions at the microarchitecture level.

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
237     cycles for 100 * por
237     cycles for 100 * orps

231     cycles for 100 * por
232     cycles for 100 * orps

235     cycles for 100 * por
235     cycles for 100 * orps

10      bytes for por
9       bytes for orps
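
For anyone who wants to check the "same correct result" part of Intel's note: below is a minimal, untested sketch (not the attached test bed) that runs por and orps on the same 16 bytes and compares the outcomes. The labels srcBytes, orMask and differ and the message texts are made up for illustration; the build assumes the MASM32 SDK in its default location and an assembler that knows OWORD. The opcode bytes in the comments are the standard encodings; the 66h prefix on por is consistent with the one-byte size difference reported above.

```asm
include \masm32\include\masm32rt.inc    ; MASM32 SDK runtime: print and exit macros
.686p
.xmm

.data
srcBytes  db "Hello, MASM32!!!"         ; exactly 16 bytes of sample data (hypothetical)
orMask    db 16 dup (20h)               ; sets bit 5 in every byte (hypothetical mask)

.code
start:
      movups xmm3, oword ptr orMask     ; unaligned loads, so no align directive is needed
      movups xmm0, oword ptr srcBytes
      movaps xmm1, xmm0                 ; second copy of the same data
      por    xmm0, xmm3                 ; 66 0F EB C3  integer form, 4 bytes
      orps   xmm1, xmm3                 ; 0F 56 CB     float form, 3 bytes
      pcmpeqb xmm0, xmm1                ; FFh in every byte where both results match
      pmovmskb eax, xmm0                ; one mask bit per byte, in bits 0..15
      cmp ax, -1                        ; all 16 bits set?
      jne differ
      print "por and orps produced identical bits", 13, 10
      exit
differ:
      print "results differ", 13, 10
      exit
end start
```

The sketch only confirms that the bit patterns agree; the latency penalty Intel mentions is a timing effect and needs a test bed like the attached one to show up.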

TouEnMasm

Intel(R) Celeron(R) CPU 2.80GHz (SSE3)

239     cycles for 100 * por
240     cycles for 100 * orps

243     cycles for 100 * por
241     cycles for 100 * orps

239     cycles for 100 * por
239     cycles for 100 * orps

10      bytes for por
9       bytes for orps
Fa is a musical note to play with CL

Siekmanski

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

172     cycles for 100 * por
222     cycles for 100 * orps

172     cycles for 100 * por
222     cycles for 100 * orps

170     cycles for 100 * por
222     cycles for 100 * orps

10      bytes for por
9       bytes for orps
Creative coders use backward thinking techniques as a strategy.

jj2007

Thanks, Yves & Marinus. So the Core i7 shows a difference. Anybody with an AMD?

hutch--

It's probably worth knowing the context that Intel had in mind when they wrote the comparison. This is the result on my 64-bit i7 box.


Intel(R) Core(TM) i7 CPU         860  @ 2.80GHz (SSE4)

208     cycles for 100 * por
208     cycles for 100 * orps

208     cycles for 100 * por
208     cycles for 100 * orps

208     cycles for 100 * por
208     cycles for 100 * orps

10      bytes for por
9       bytes for orps


--- ok ---

jj2007

Hmmmm...
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

142     cycles for 100 * por
193     cycles for 100 * orps

166     cycles for 100 * por
201     cycles for 100 * orps

139     cycles for 100 * por
183     cycles for 100 * orps

10      bytes for por
9       bytes for orps


Attached is another test, timing cmp reg16, -1 against cmp reg32, 65535:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
285     cycles for 100 * cmp esi
286     cycles for 100 * cmp si
287     cycles for 100 * cmp dx

289     cycles for 100 * cmp esi
290     cycles for 100 * cmp si
287     cycles for 100 * cmp dx

287     cycles for 100 * cmp esi
288     cycles for 100 * cmp si
292     cycles for 100 * cmp dx


Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
182     cycles for 100 * cmp esi
185     cycles for 100 * cmp si
182     cycles for 100 * cmp dx

181     cycles for 100 * cmp esi
185     cycles for 100 * cmp si
184     cycles for 100 * cmp dx

189     cycles for 100 * cmp esi
179     cycles for 100 * cmp si
183     cycles for 100 * cmp dx

4       bytes for cmp esi
5       bytes for cmp si
5       bytes for cmp dx
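
Both forms test the same condition when the register was filled by pmovmskb from an xmm register, because the 16 mask bits land in bits 0..15 and the upper half of the destination is zeroed. A minimal, untested sketch of that equivalence, assuming the MASM32 SDK at its default path (using ebx as a counter and the message text are arbitrary choices):

```asm
include \masm32\include\masm32rt.inc    ; MASM32 SDK runtime: print, str$ and exit macros
.686p
.xmm

.code
start:
      pcmpeqb xmm0, xmm0                ; every byte of xmm0 becomes FFh
      pmovmskb esi, xmm0                ; esi = 0000FFFFh, upper 16 bits are zero
      xor ebx, ebx                      ; counts the comparisons that report "equal"
      cmp si, -1                        ; 16-bit form: low word all ones?
      jne @F
      inc ebx
@@:
      cmp esi, 65535                    ; 32-bit form: same test on the full register
      jne @F
      inc ebx
@@:
      print str$(ebx), " of 2 comparisons report all bytes equal", 13, 10
      exit
end start
```

If esi could carry anything in its upper half, only the 16-bit form would still be a valid "all 16 bytes equal" check.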

dedndave

prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

347     cycles for 100 * cmp esi
340     cycles for 100 * cmp si
340     cycles for 100 * cmp dx

425     cycles for 100 * cmp esi
346     cycles for 100 * cmp si
342     cycles for 100 * cmp dx

340     cycles for 100 * cmp esi
347     cycles for 100 * cmp si
340     cycles for 100 * cmp dx


372     cycles for 100 * cmp esi
346     cycles for 100 * cmp si
340     cycles for 100 * cmp dx

376     cycles for 100 * cmp esi
341     cycles for 100 * cmp si
339     cycles for 100 * cmp dx

376     cycles for 100 * cmp esi
345     cycles for 100 * cmp si
340     cycles for 100 * cmp dx


347     cycles for 100 * cmp esi
345     cycles for 100 * cmp si
340     cycles for 100 * cmp dx

340     cycles for 100 * cmp esi
339     cycles for 100 * cmp si
339     cycles for 100 * cmp dx

347     cycles for 100 * cmp esi
341     cycles for 100 * cmp si
373     cycles for 100 * cmp dx


seems like the loop count is a little low :P

sinsi


C:\Users\tester\Desktop>OrpsPor.exe
AMD A10-7850K APU with Radeon(TM) R7 Graphics   (SSE4)

209     cycles for 100 * por
209     cycles for 100 * orps

209     cycles for 100 * por
210     cycles for 100 * orps

209     cycles for 100 * por
209     cycles for 100 * orps

C:\Users\tester\Desktop>CmpSi.exe
AMD A10-7850K APU with Radeon(TM) R7 Graphics   (SSE4)

283     cycles for 100 * cmp esi
242     cycles for 100 * cmp si
244     cycles for 100 * cmp dx

281     cycles for 100 * cmp esi
239     cycles for 100 * cmp si
240     cycles for 100 * cmp dx

282     cycles for 100 * cmp esi
246     cycles for 100 * cmp si
244     cycles for 100 * cmp dx


Siekmanski

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

223     cycles for 100 * cmp esi
223     cycles for 100 * cmp si
220     cycles for 100 * cmp dx

225     cycles for 100 * cmp esi
223     cycles for 100 * cmp si
221     cycles for 100 * cmp dx

225     cycles for 100 * cmp esi
224     cycles for 100 * cmp si
223     cycles for 100 * cmp dx

4       bytes for cmp esi
5       bytes for cmp si
5       bytes for cmp dx

jj2007

Thanks a lot :biggrin:

In case somebody is curious:

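ciL1:
      movups xmm0, oword ptr [ecx]  ; 16 bytes from [ecx], unaligned load
      movapd xmm1, [edx]      ; 16 bytes from [edx], must be 16-byte aligned
      movaps xmm2, xmm4
      pcmpgtb xmm2, xmm1      ; signed byte compare: FFh where xmm4 > xmm1
      por xmm0, xmm3          ; orps is shorter but marginally
      por xmm1, xmm3          ; slower e.g. on i7-4930K
      pcmpeqb xmm0, xmm1      ; compare packed bytes for equality
      lea edx, [edx+16]       ; increase position
      lea ecx, [ecx+16]       ; in memory
      pmovmskb eax, xmm2      ; set byte mask in eax
      pmovmskb esi, xmm0      ; set byte mask in esi
      test eax, eax           ; did pcmpgtb flag any byte?
      jne @F                  ; yes: leave the loop
      cmp si, -1              ; all 16 bytes equal?
      je ciL1                 ; yes: continue with the next 16 bytes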

hutch--

Try replacing the LEA with ADD, it should be faster on Intel hardware.

jj2007

Quote from: hutch-- on September 27, 2014, 12:24:05 PM
Try replacing the LEA with ADD, it should be faster on Intel hardware.

To be tested, of course... :biggrin:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
453     cycles for 100 * lea
459     cycles for 100 * add

454     cycles for 100 * lea
458     cycles for 100 * add

453     cycles for 100 * lea
463     cycles for 100 * add


Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
226     cycles for 100 * lea
252     cycles for 100 * add

235     cycles for 100 * lea
306     cycles for 100 * add

218     cycles for 100 * lea
256     cycles for 100 * add


Size is identical.
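
The identical size follows from the encodings: lea edx, [edx+16] is 8D 52 10 and add edx, 16 is 83 C2 10, three bytes each. The functional difference is that lea leaves the flags alone while add rewrites them; in the loop posted earlier that does not matter, since test eax, eax comes after the pointer updates. A minimal, untested sketch of the flag behaviour, assuming the MASM32 SDK at its default path (the bit assignments in ebx are arbitrary):

```asm
include \masm32\include\masm32rt.inc    ; MASM32 SDK runtime: print, str$ and exit macros

.code
start:
      xor ebx, ebx                      ; bit 0: ZF survived lea, bit 1: ZF survived add
      mov edx, 100
      cmp edx, edx                      ; sets ZF = 1
      lea edx, [edx+16]                 ; 8D 52 10, flags untouched
      jnz @F                            ; not taken, ZF is still 1
      or ebx, 1
@@:
      cmp edx, edx                      ; sets ZF = 1 again
      add edx, 16                       ; 83 C2 10, result 132 is nonzero so ZF = 0
      jnz @F                            ; taken, ZF was overwritten by add
      or ebx, 2
@@:
      print str$(ebx), 13, 10           ; should print 1: only lea preserved ZF
      exit
end start
```

Which of the two is faster depends on the CPU, as the timings in this thread show, so the flag behaviour is the only difference that can be decided without measuring.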

hutch--

Seems to be normal hardware variation; this is the result on this i7.


Intel(R) Core(TM) i7 CPU         860  @ 2.80GHz (SSE4)

333     cycles for 100 * lea
334     cycles for 100 * add

337     cycles for 100 * lea
334     cycles for 100 * add

333     cycles for 100 * lea
334     cycles for 100 * add

24      bytes for lea
24      bytes for add


--- ok ---


guga

Intel(R) Core(TM) i7 CPU         870  @ 2.93GHz (SSE4)

182 cycles for 100 * por
185 cycles for 100 * orps

187 cycles for 100 * por
200 cycles for 100 * orps

185 cycles for 100 * por
184 cycles for 100 * orps

10 bytes for por
9 bytes for orps


--- ok ---





Intel(R) Core(TM) i7 CPU         870  @ 2.93GHz (SSE4)

282 cycles for 100 * lea
283 cycles for 100 * add

281 cycles for 100 * lea
282 cycles for 100 * add

281 cycles for 100 * lea
283 cycles for 100 * add

24 bytes for lea
24 bytes for add


--- ok ---

Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

Siekmanski

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

267     cycles for 100 * lea
282     cycles for 100 * add

269     cycles for 100 * lea
280     cycles for 100 * add

265     cycles for 100 * lea
282     cycles for 100 * add

24      bytes for lea
24      bytes for add