Timings for rep ret

Started by jj2007, July 05, 2023, 06:55:08 AM


jj2007

Spinoff from the repz ret thread started by fearless:

IntelCore(TM)(TM) i5-M CPU @ U @ 2.50 (SSE4)

540     cycles for 100 * repz ret eq
539     cycles for 100 * repz ret ne
539     cycles for 100 * simple ret eq
539     cycles for 100 * simple ret ne
377     cycles for 100 * repz ret eq no branch
377     cycles for 100 * repz ret ne no branch

541     cycles for 100 * repz ret eq
538     cycles for 100 * repz ret ne
539     cycles for 100 * simple ret eq
538     cycles for 100 * simple ret ne
377     cycles for 100 * repz ret eq no branch
380     cycles for 100 * repz ret ne no branch

567     cycles for 100 * repz ret eq
539     cycles for 100 * repz ret ne
539     cycles for 100 * simple ret eq
553     cycles for 100 * simple ret ne
377     cycles for 100 * repz ret eq no branch
377     cycles for 100 * repz ret ne no branch

540     cycles for 100 * repz ret eq
539     cycles for 100 * repz ret ne
539     cycles for 100 * simple ret eq
539     cycles for 100 * simple ret ne
377     cycles for 100 * repz ret eq no branch
377     cycles for 100 * repz ret ne no branch

0       = eax repz ret eq
1       = eax repz ret ne
0       = eax simple ret eq
1       = eax simple ret ne
0       = eax repz ret eq no branch
1       = eax repz ret ne no branch


RepRet proc
    cmp ecx, edx
    je equal
different:
    mov eax, 1
    jmp @F
equal:
    xor eax, eax
@@:    repz ret 
RepRet endp
...
RepRetNB proc
    xor eax, eax
    cmp ecx, edx
    setne al        ; 0 if equal, 1 if different, matching the eax values printed above
    repz ret 
RepRetNB endp


Yep, branches are no good for performance :thumbsup:

HSE

A little more curious here:

Intel(R)l(R) Cor) i3-101-10100 C 3.60GHz0GHz (SSE4)

416     cycles for 100 * rep ret eq
586     cycles for 100 * rep ret ne
535     cycles for 100 * simple ret eq
603     cycles for 100 * simple ret ne
260     cycles for 100 * rep ret eq no branch
266     cycles for 100 * rep ret ne no branch

426     cycles for 100 * rep ret eq
592     cycles for 100 * rep ret ne
531     cycles for 100 * simple ret eq
599     cycles for 100 * simple ret ne
259     cycles for 100 * rep ret eq no branch
265     cycles for 100 * rep ret ne no branch

425     cycles for 100 * rep ret eq
578     cycles for 100 * rep ret ne
519     cycles for 100 * simple ret eq
603     cycles for 100 * simple ret ne
264     cycles for 100 * rep ret eq no branch
265     cycles for 100 * rep ret ne no branch

427     cycles for 100 * rep ret eq
600     cycles for 100 * rep ret ne
544     cycles for 100 * simple ret eq
596     cycles for 100 * simple ret ne
260     cycles for 100 * rep ret eq no branch
265     cycles for 100 * rep ret ne no branch

0       = eax rep ret eq
1       = eax rep ret ne
0       = eax simple ret eq
1       = eax simple ret ne
0       = eax rep ret eq no branch
1       = eax rep ret ne no branch

-
Equations in Assembly: SmplMath

zedd151



Intel(R)l(R) Cor)2 Duo Cuo CPU  8400  @   @ 3.00 (SSE4)

697     cycles for 100 * rep ret eq
695     cycles for 100 * rep ret ne
696     cycles for 100 * simple ret eq
696     cycles for 100 * simple ret ne
497     cycles for 100 * rep ret eq no branch
497     cycles for 100 * rep ret ne no branch

696     cycles for 100 * rep ret eq
695     cycles for 100 * rep ret ne
696     cycles for 100 * simple ret eq
695     cycles for 100 * simple ret ne
498     cycles for 100 * rep ret eq no branch
496     cycles for 100 * rep ret ne no branch

698     cycles for 100 * rep ret eq
697     cycles for 100 * rep ret ne
696     cycles for 100 * simple ret eq
697     cycles for 100 * simple ret ne
497     cycles for 100 * rep ret eq no branch
498     cycles for 100 * rep ret ne no branch

702     cycles for 100 * rep ret eq
696     cycles for 100 * rep ret ne
696     cycles for 100 * simple ret eq
695     cycles for 100 * simple ret ne
496     cycles for 100 * rep ret eq no branch
497     cycles for 100 * rep ret ne no branch

0       = eax rep ret eq
1       = eax rep ret ne
0       = eax simple ret eq
1       = eax simple ret ne
0       = eax rep ret eq no branch
1       = eax rep ret ne no branch

--- ok ---

jj2007

It gets complicated, so I put a version 2 on top. The timings are unchanged, but:

- in the first version, ml.exe accepts repz ret but puts a simple retn there (a bug)
- in the second version, I used the db 0F3h, 0C3h syntax to get rep retn

I also tested db 0F2h, 0C3h alias repne ret but found no speed difference.
Some sites say it's an AMD only optimisation - anybody around with an AMD cpu?
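
For reference, the db approach can be sketched as a small standalone program. This is only a sketch: it assumes a default MASM32 SDK install (for masm32rt.inc), and RepRetRaw is a made-up name, not one from the attached source. It uses a branchless compare that returns 0 for equal inputs and 1 otherwise, as in the eax values of the test output.

```masm
include \masm32\include\masm32rt.inc   ; assumption: default MASM32 SDK path

.code

; Hypothetical proc name, for illustration only
RepRetRaw proc
    xor eax, eax
    cmp ecx, edx
    setne al            ; eax = 0 if ecx == edx, else 1
    db 0F3h, 0C3h       ; F3 = rep prefix, C3 = retn, emitted as raw bytes
RepRetRaw endp

start:
    mov ecx, 123
    mov edx, 123
    call RepRetRaw      ; equal inputs: eax = 0
    mov ecx, 123
    mov edx, 124
    call RepRetRaw      ; different inputs: eax = 1
    invoke ExitProcess, eax
end start
```

Swapping 0F3h for 0F2h should give the repne variant. Either way the assembler never sees a ret mnemonic, so it cannot silently replace it with a plain retn.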

zedd151



Intel(R)l(R) Cor)2 Duo Cuo CPU  8400  @   @ 3.00 (SSE4)


695     cycles for 100 * rep ret eq
699     cycles for 100 * rep ret ne
696     cycles for 100 * simple ret eq
695     cycles for 100 * simple ret ne
498     cycles for 100 * rep ret eq no branch
496     cycles for 100 * rep ret ne no branch


699     cycles for 100 * rep ret eq
700     cycles for 100 * rep ret ne
695     cycles for 100 * simple ret eq
695     cycles for 100 * simple ret ne
496     cycles for 100 * rep ret eq no branch
496     cycles for 100 * rep ret ne no branch


697     cycles for 100 * rep ret eq
696     cycles for 100 * rep ret ne
696     cycles for 100 * simple ret eq
696     cycles for 100 * simple ret ne
496     cycles for 100 * rep ret eq no branch
499     cycles for 100 * rep ret ne no branch


695     cycles for 100 * rep ret eq
696     cycles for 100 * rep ret ne
696     cycles for 100 * simple ret eq
695     cycles for 100 * simple ret ne
497     cycles for 100 * rep ret eq no branch
497     cycles for 100 * rep ret ne no branch


0       = eax rep ret eq
1       = eax rep ret ne
0       = eax simple ret eq
1       = eax simple ret ne
0       = eax rep ret eq no branch
1       = eax rep ret ne no branch


--- ok ---

About the same for this tired intel...

jj2007

From a FASM forum:
Quote
It would seem that when AMD designed the branch predictors they make a mistake when the branch led to a 'ret'. So to solve the problem, they say to put a 'rep' in front so that the branch predictor will work correctly and do predictions.

From a GCC forum:
Quote
AMD recommends to avoid the penalty by adding rep prefix instead of nop because it saves decode bandwidth.

Btw according to OllyDbg there is no repz retn instruction: it's either repnz retn or rep retn. StackOverflow disagrees :cool:

fearless

Microsoft Windows [Version 10.0.19045.3155]
(c) Microsoft Corporation. All rights reserved.

A:\Downloads\RepRetV2>repret
AMD RyzeRyzen 9 X 16-Cor-Core Prsor             ul (SSE4)

355     cycles for 100 * rep ret eq
354     cycles for 100 * rep ret ne
353     cycles for 100 * simple ret eq
353     cycles for 100 * simple ret ne
284     cycles for 100 * rep ret eq no branch
282     cycles for 100 * rep ret ne no branch

352     cycles for 100 * rep ret eq
354     cycles for 100 * rep ret ne
356     cycles for 100 * simple ret eq
356     cycles for 100 * simple ret ne
282     cycles for 100 * rep ret eq no branch
283     cycles for 100 * rep ret ne no branch

355     cycles for 100 * rep ret eq
356     cycles for 100 * rep ret ne
369     cycles for 100 * simple ret eq
354     cycles for 100 * simple ret ne
284     cycles for 100 * rep ret eq no branch
286     cycles for 100 * rep ret ne no branch

356     cycles for 100 * rep ret eq
357     cycles for 100 * rep ret ne
354     cycles for 100 * simple ret eq
354     cycles for 100 * simple ret ne
286     cycles for 100 * rep ret eq no branch
282     cycles for 100 * rep ret ne no branch

0       = eax rep ret eq
1       = eax rep ret ne
0       = eax simple ret eq
1       = eax simple ret ne
0       = eax rep ret eq no branch
1       = eax rep ret ne no branch

--- ok ---

jj2007

Thanks, fearless. So even on AMD it makes absolutely no difference. Good to know :thumbsup:

Caché GB

repz = repe ; What say OllyDbg?
Caché GB's 1 and 0-nly language:MASM

jj2007

Quote from: Caché GB on July 05, 2023, 11:29:54 AM
repz = repe ; What say OllyDbg?
Right. Try Google phrase searches, i.e. with quotes:
"repne retn"
"repnz retn"
"repe retn"
"repz retn"

db 0F3h, 0C3h     ; rep=repe=repz retn
db 0F2h, 0C3h     ; repne, repnz

Olly:
0040108E  \.  F2            repne                                    ; Unknown command
0040108F      C3            retn

0040108E  \.  F3:C3         rep retn

Oleh Yuschuk knew very well what he was doing 10 years ago ;-)

Caché GB

Quote from: jj2007 on July 05, 2023, 07:14:24 PM
Oleh Yuschuk knew very well what he was doing 10 years ago ;-)

So true.  :thumbsup:

HSE

JJ,
   there is a problem in your benchmark:
Quote
TestOH:
  mov ebx, AlgoLoops-1   ; loop e.g. 100x
  align 4
  .Repeat
       call Nonothing
       dec ebx
  .Until Sign?
  ret

You forgot that all the tests make a call.


Then a more realistic measurement:

Intel(R)l(R) Cor) i3-101-10100 C 3.60GHz0GHz (SSE4)

0       cycles for 100 * rep ret eq
241     cycles for 100 * rep ret ne
177     cycles for 100 * simple ret eq
260     cycles for 100 * simple ret ne
??      cycles for 100 * rep ret eq no branch
??      cycles for 100 * rep ret ne no branch

11      cycles for 100 * rep ret eq
256     cycles for 100 * rep ret ne
180     cycles for 100 * simple ret eq
261     cycles for 100 * simple ret ne
??      cycles for 100 * rep ret eq no branch
??      cycles for 100 * rep ret ne no branch

3       cycles for 100 * rep ret eq
257     cycles for 100 * rep ret ne
176     cycles for 100 * simple ret eq
261     cycles for 100 * simple ret ne
??      cycles for 100 * rep ret eq no branch
??      cycles for 100 * rep ret ne no branch

0       cycles for 100 * rep ret eq
257     cycles for 100 * rep ret ne
175     cycles for 100 * simple ret eq
264     cycles for 100 * simple ret ne
??      cycles for 100 * rep ret eq no branch
??      cycles for 100 * rep ret ne no branch

0       = eax rep ret eq
1       = eax rep ret ne
0       = eax simple ret eq
1       = eax simple ret ne
0       = eax rep ret eq no branch
1       = eax rep ret ne no branch

--- ok ---


Additionally, some tests have mov edx, ecx or lea edx, [ecx+1] that also must be corrected.  :thumbsup:

jj2007

Dear HSE,

I'm afraid you haven't fully understood the code. Check what TestA ... TestF are doing.

Yes, all tests have mov edx, ecx or lea edx, [ecx+1] - but there is nothing to be corrected, they serve a purpose.

FORTRANS

Hi,

   Three systems.  The processor labels seem scrambled?

Intel(R)l(R) Cor) i3-400-4005U C 1.70GHz0GHz (SSE4)

494   cycles for 100 * rep ret eq
757   cycles for 100 * rep ret ne
665   cycles for 100 * simple ret eq
765   cycles for 100 * simple ret ne
269   cycles for 100 * rep ret eq no branch
271   cycles for 100 * rep ret ne no branch

470   cycles for 100 * rep ret eq
764   cycles for 100 * rep ret ne
665   cycles for 100 * simple ret eq
765   cycles for 100 * simple ret ne
267   cycles for 100 * rep ret eq no branch
271   cycles for 100 * rep ret ne no branch

470   cycles for 100 * rep ret eq
601   cycles for 100 * rep ret ne
662   cycles for 100 * simple ret eq
724   cycles for 100 * simple ret ne
280   cycles for 100 * rep ret eq no branch
271   cycles for 100 * rep ret ne no branch

481   cycles for 100 * rep ret eq
766   cycles for 100 * rep ret ne
664   cycles for 100 * simple ret eq
765   cycles for 100 * simple ret ne
269   cycles for 100 * rep ret eq no branch
270   cycles for 100 * rep ret ne no branch

0   = eax rep ret eq
1   = eax rep ret ne
0   = eax simple ret eq
1   = eax simple ret ne
0   = eax rep ret eq no branch
1   = eax rep ret ne no branch

--- ok ---

Inte Pentiumtium(R) ocessor sor 1.70 (SSE2)

706   cycles for 100 * rep ret eq
698   cycles for 100 * rep ret ne
720   cycles for 100 * simple ret eq
700   cycles for 100 * simple ret ne
603   cycles for 100 * rep ret eq no branch
604   cycles for 100 * rep ret ne no branch

706   cycles for 100 * rep ret eq
705   cycles for 100 * rep ret ne
702   cycles for 100 * simple ret eq
705   cycles for 100 * simple ret ne
605   cycles for 100 * rep ret eq no branch
618   cycles for 100 * rep ret ne no branch

709   cycles for 100 * rep ret eq
703   cycles for 100 * rep ret ne
701   cycles for 100 * simple ret eq
702   cycles for 100 * simple ret ne
599   cycles for 100 * rep ret eq no branch
603   cycles for 100 * rep ret ne no branch

702   cycles for 100 * rep ret eq
702   cycles for 100 * rep ret ne
716   cycles for 100 * simple ret eq
699   cycles for 100 * simple ret ne
604   cycles for 100 * rep ret eq no branch
603   cycles for 100 * rep ret ne no branch

0   = eax rep ret eq
1   = eax rep ret ne
0   = eax simple ret eq
1   = eax simple ret ne
0   = eax rep ret eq no branch
1   = eax rep ret ne no branch

--- ok ---

Intel(R)l(R) Cor) i3-101-10110U @ 2.10GH10GHz (SSE4)

402   cycles for 100 * rep ret eq
500   cycles for 100 * rep ret ne
480   cycles for 100 * simple ret eq
497   cycles for 100 * simple ret ne
328   cycles for 100 * rep ret eq no branch
267   cycles for 100 * rep ret ne no branch

370   cycles for 100 * rep ret eq
504   cycles for 100 * rep ret ne
516   cycles for 100 * simple ret eq
519   cycles for 100 * simple ret ne
336   cycles for 100 * rep ret eq no branch
289   cycles for 100 * rep ret ne no branch

374   cycles for 100 * rep ret eq
509   cycles for 100 * rep ret ne
455   cycles for 100 * simple ret eq
506   cycles for 100 * simple ret ne
283   cycles for 100 * rep ret eq no branch
350   cycles for 100 * rep ret ne no branch

481   cycles for 100 * rep ret eq
535   cycles for 100 * rep ret ne
599   cycles for 100 * simple ret eq
487   cycles for 100 * simple ret ne
247   cycles for 100 * rep ret eq no branch
257   cycles for 100 * rep ret ne no branch

0   = eax rep ret eq
1   = eax rep ret ne
0   = eax simple ret eq
1   = eax simple ret ne
0   = eax rep ret eq no branch
1   = eax rep ret ne no branch

--- ok ---


Regards,

Steve N.

HSE

Quote from: jj2007 on July 06, 2023, 07:40:29 AM
Check what TestA ... TestF are doing.

Yes, they are making a call, and your reference (TestOH) doesn't make any call  :biggrin:

far call and far ret are very slow things.