Spinoff from the repz ret thread (https://masm32.com/board/index.php?topic=10873.0) started by fearless:
IntelCore(TM)(TM) i5-M CPU @ U @ 2.50 (SSE4)
540 cycles for 100 * repz ret eq
539 cycles for 100 * repz ret ne
539 cycles for 100 * simple ret eq
539 cycles for 100 * simple ret ne
377 cycles for 100 * repz ret eq no branch
377 cycles for 100 * repz ret ne no branch
541 cycles for 100 * repz ret eq
538 cycles for 100 * repz ret ne
539 cycles for 100 * simple ret eq
538 cycles for 100 * simple ret ne
377 cycles for 100 * repz ret eq no branch
380 cycles for 100 * repz ret ne no branch
567 cycles for 100 * repz ret eq
539 cycles for 100 * repz ret ne
539 cycles for 100 * simple ret eq
553 cycles for 100 * simple ret ne
377 cycles for 100 * repz ret eq no branch
377 cycles for 100 * repz ret ne no branch
540 cycles for 100 * repz ret eq
539 cycles for 100 * repz ret ne
539 cycles for 100 * simple ret eq
539 cycles for 100 * simple ret ne
377 cycles for 100 * repz ret eq no branch
377 cycles for 100 * repz ret ne no branch
0 = eax repz ret eq
1 = eax repz ret ne
0 = eax simple ret eq
1 = eax simple ret ne
0 = eax repz ret eq no branch
1 = eax repz ret ne no branch
RepRet proc
cmp ecx, edx
je equal
different:
mov eax, 1
jmp @F
equal:
xor eax, eax
@@: repz ret
RepRet endp
...
RepRetNB proc
xor eax, eax
cmp ecx, edx
setne al	; al = 1 only if ecx != edx, matching RepRet and the eax values printed above
repz ret
RepRetNB endp
Yep, branches are no good for performance :thumbsup:
A little more curious here:
Intel(R)l(R) Cor) i3-101-10100 C 3.60GHz0GHz (SSE4)
416 cycles for 100 * rep ret eq
586 cycles for 100 * rep ret ne
535 cycles for 100 * simple ret eq
603 cycles for 100 * simple ret ne
260 cycles for 100 * rep ret eq no branch
266 cycles for 100 * rep ret ne no branch
426 cycles for 100 * rep ret eq
592 cycles for 100 * rep ret ne
531 cycles for 100 * simple ret eq
599 cycles for 100 * simple ret ne
259 cycles for 100 * rep ret eq no branch
265 cycles for 100 * rep ret ne no branch
425 cycles for 100 * rep ret eq
578 cycles for 100 * rep ret ne
519 cycles for 100 * simple ret eq
603 cycles for 100 * simple ret ne
264 cycles for 100 * rep ret eq no branch
265 cycles for 100 * rep ret ne no branch
427 cycles for 100 * rep ret eq
600 cycles for 100 * rep ret ne
544 cycles for 100 * simple ret eq
596 cycles for 100 * simple ret ne
260 cycles for 100 * rep ret eq no branch
265 cycles for 100 * rep ret ne no branch
0 = eax rep ret eq
1 = eax rep ret ne
0 = eax simple ret eq
1 = eax simple ret ne
0 = eax rep ret eq no branch
1 = eax rep ret ne no branch
-
Intel(R)l(R) Cor)2 Duo Cuo CPU 8400 @ @ 3.00 (SSE4)
697 cycles for 100 * rep ret eq
695 cycles for 100 * rep ret ne
696 cycles for 100 * simple ret eq
696 cycles for 100 * simple ret ne
497 cycles for 100 * rep ret eq no branch
497 cycles for 100 * rep ret ne no branch
696 cycles for 100 * rep ret eq
695 cycles for 100 * rep ret ne
696 cycles for 100 * simple ret eq
695 cycles for 100 * simple ret ne
498 cycles for 100 * rep ret eq no branch
496 cycles for 100 * rep ret ne no branch
698 cycles for 100 * rep ret eq
697 cycles for 100 * rep ret ne
696 cycles for 100 * simple ret eq
697 cycles for 100 * simple ret ne
497 cycles for 100 * rep ret eq no branch
498 cycles for 100 * rep ret ne no branch
702 cycles for 100 * rep ret eq
696 cycles for 100 * rep ret ne
696 cycles for 100 * simple ret eq
695 cycles for 100 * simple ret ne
496 cycles for 100 * rep ret eq no branch
497 cycles for 100 * rep ret ne no branch
0 = eax rep ret eq
1 = eax rep ret ne
0 = eax simple ret eq
1 = eax simple ret ne
0 = eax rep ret eq no branch
1 = eax rep ret ne no branch
--- ok ---
It gets complicated, so I put a version 2 on top. The timings are unchanged, but:
- in the first version, ml.exe accepts repz ret but puts a simple retn there (a bug)
- in the second version, I used the db 0F3h, 0C3h syntax to get rep retn
I also tested db 0F2h, 0C3h alias repne ret but found no speed difference.
Some sites say it's an AMD only optimisation - anybody around with an AMD cpu?
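A minimal sketch of the version 2 workaround, assuming the MASM32 SDK in its default location (masm32rt.inc). The test values and the use of eax as the exit code are illustrative assumptions, not taken from the attached source. Since ml.exe assembles repz ret as a plain retn, the two bytes are emitted by hand. setne is used so that the result is 0 for equal inputs, as in the printed eax values:

```asm
include \masm32\include\masm32rt.inc	; assumption: MASM32 SDK, default path

.code
; Branchless compare with a hand-encoded two-byte return.
; Returns eax = 0 if ecx == edx, otherwise 1.
RepRetNB proc
  xor eax, eax
  cmp ecx, edx
  setne al
  db 0F3h, 0C3h		; F3 C3 = rep retn, emitted as raw bytes
RepRetNB endp

start:
  mov ecx, 123		; illustrative test values
  mov edx, 123
  call RepRetNB		; equal inputs, so eax = 0
  invoke ExitProcess, eax	; result doubles as the exit code
end start
```

Loading the executable in a disassembler should show F3 C3 at the end of the procedure instead of a lone C3.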
Intel(R)l(R) Cor)2 Duo Cuo CPU 8400 @ @ 3.00 (SSE4)
695 cycles for 100 * rep ret eq
699 cycles for 100 * rep ret ne
696 cycles for 100 * simple ret eq
695 cycles for 100 * simple ret ne
498 cycles for 100 * rep ret eq no branch
496 cycles for 100 * rep ret ne no branch
699 cycles for 100 * rep ret eq
700 cycles for 100 * rep ret ne
695 cycles for 100 * simple ret eq
695 cycles for 100 * simple ret ne
496 cycles for 100 * rep ret eq no branch
496 cycles for 100 * rep ret ne no branch
697 cycles for 100 * rep ret eq
696 cycles for 100 * rep ret ne
696 cycles for 100 * simple ret eq
696 cycles for 100 * simple ret ne
496 cycles for 100 * rep ret eq no branch
499 cycles for 100 * rep ret ne no branch
695 cycles for 100 * rep ret eq
696 cycles for 100 * rep ret ne
696 cycles for 100 * simple ret eq
695 cycles for 100 * simple ret ne
497 cycles for 100 * rep ret eq no branch
497 cycles for 100 * rep ret ne no branch
0 = eax rep ret eq
1 = eax rep ret ne
0 = eax simple ret eq
1 = eax simple ret ne
0 = eax rep ret eq no branch
1 = eax rep ret ne no branch
--- ok ---
About the same for this tired intel...
From a FASM forum (https://board.flatassembler.net/topic.php?t=11190):
Quote: It would seem that when AMD designed the branch predictors they make a mistake when the branch led to a 'ret'. So to solve the problem, they say to put a 'rep' in front so that the branch predictor will work correctly and do predictions.
From a GCC forum (https://gcc.gnu.org/legacy-ml/gcc-patches/2003-05/msg02117.html):
Quote: AMD recommends to avoid the penalty by adding rep prefix instead of nop because it saves decode bandwidth.
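Going by the two quotes, the case in question is a ret that is reached directly by a branch. A hedged sketch of such a routine follows, again assuming the MASM32 SDK at its default path. The procedure name CmpRet and the test values are made up for illustration:

```asm
include \masm32\include\masm32rt.inc	; assumption: MASM32 SDK, default path

.code
; Returns eax = 0 if ecx == edx, otherwise 1.
; The return is the direct target of a conditional jump,
; which is the situation the quotes describe.
CmpRet proc
  xor eax, eax
  cmp ecx, edx
  je done		; conditional branch straight to the return
  inc eax
done:
  db 0F3h, 0C3h		; rep retn in place of a one-byte retn (0C3h)
CmpRet endp

start:
  mov ecx, 1		; illustrative test values (different)
  mov edx, 2
  call CmpRet		; je not taken, so eax = 1
  invoke ExitProcess, eax
end start
```

Replacing the db line with a plain ret gives the one-byte variant for comparison. The timings posted in this thread suggest there is no measurable difference on the CPUs tested here.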
Btw according to OllyDbg there is no repz retn instruction: it's either repnz retn or rep retn. StackOverflow disagrees (https://stackoverflow.com/questions/39863255/repz-ret-why-all-the-hassle) :cool:
Microsoft Windows [Version 10.0.19045.3155]
(c) Microsoft Corporation. All rights reserved.
A:\Downloads\RepRetV2>repret
AMD RyzeRyzen 9 X 16-Cor-Core Prsor ul (SSE4)
355 cycles for 100 * rep ret eq
354 cycles for 100 * rep ret ne
353 cycles for 100 * simple ret eq
353 cycles for 100 * simple ret ne
284 cycles for 100 * rep ret eq no branch
282 cycles for 100 * rep ret ne no branch
352 cycles for 100 * rep ret eq
354 cycles for 100 * rep ret ne
356 cycles for 100 * simple ret eq
356 cycles for 100 * simple ret ne
282 cycles for 100 * rep ret eq no branch
283 cycles for 100 * rep ret ne no branch
355 cycles for 100 * rep ret eq
356 cycles for 100 * rep ret ne
369 cycles for 100 * simple ret eq
354 cycles for 100 * simple ret ne
284 cycles for 100 * rep ret eq no branch
286 cycles for 100 * rep ret ne no branch
356 cycles for 100 * rep ret eq
357 cycles for 100 * rep ret ne
354 cycles for 100 * simple ret eq
354 cycles for 100 * simple ret ne
286 cycles for 100 * rep ret eq no branch
282 cycles for 100 * rep ret ne no branch
0 = eax rep ret eq
1 = eax rep ret ne
0 = eax simple ret eq
1 = eax simple ret ne
0 = eax rep ret eq no branch
1 = eax rep ret ne no branch
--- ok ---
Thanks, fearless. So even on AMD it makes absolutely no difference. Good to know :thumbsup:
repz = repe ; What say OllyDbg?
Quote from: Caché GB on July 05, 2023, 11:29:54 AM
repz = repe ; What say OllyDbg?
Right. Try Google phrase searches, i.e. with quotes:
"repne retn"
"repnz retn"
"repe retn"
"repz retn"
db 0F3h, 0C3h ; rep=repe=repz retn
db 0F2h, 0C3h ; repne, repnz
Olly:
0040108E \. F2 repne ; Unknown command
0040108F C3 retn
0040108E \. F3:C3 rep retn
Oleh Yuschuk knew very well what he was doing 10 years ago ;-)
Quote from: jj2007 on July 05, 2023, 07:14:24 PM
Oleh Yuschuk knew very well what he was doing 10 years ago ;-)
So true. :thumbsup:
JJ,
there is a problem in your benchmark:
Quote:
TestOH:
mov ebx, AlgoLoops-1 ; loop e.g. 100x
align 4
.Repeat
call Nonothing
dec ebx
.Until Sign?
ret
You forgot that all the tests make a call.
Then a more realistic measurement:
Intel(R)l(R) Cor) i3-101-10100 C 3.60GHz0GHz (SSE4)
0 cycles for 100 * rep ret eq
241 cycles for 100 * rep ret ne
177 cycles for 100 * simple ret eq
260 cycles for 100 * simple ret ne
?? cycles for 100 * rep ret eq no branch
?? cycles for 100 * rep ret ne no branch
11 cycles for 100 * rep ret eq
256 cycles for 100 * rep ret ne
180 cycles for 100 * simple ret eq
261 cycles for 100 * simple ret ne
?? cycles for 100 * rep ret eq no branch
?? cycles for 100 * rep ret ne no branch
3 cycles for 100 * rep ret eq
257 cycles for 100 * rep ret ne
176 cycles for 100 * simple ret eq
261 cycles for 100 * simple ret ne
?? cycles for 100 * rep ret eq no branch
?? cycles for 100 * rep ret ne no branch
0 cycles for 100 * rep ret eq
257 cycles for 100 * rep ret ne
175 cycles for 100 * simple ret eq
264 cycles for 100 * simple ret ne
?? cycles for 100 * rep ret eq no branch
?? cycles for 100 * rep ret ne no branch
0 = eax rep ret eq
1 = eax rep ret ne
0 = eax simple ret eq
1 = eax simple ret ne
0 = eax rep ret eq no branch
1 = eax rep ret ne no branch
--- ok ---
Additionally, some tests have mov edx, ecx or lea edx, [ecx+1], which must also be corrected. :thumbsup:
Dear HSE,
I'm afraid you haven't fully understood the code. Check what TestA ... TestF are doing.
Yes, all tests have mov edx, ecx or lea edx, [ecx+1] - but there is nothing to be corrected, they serve a purpose.
Hi,
Three systems. The processor labels seem scrambled?
Intel(R)l(R) Cor) i3-400-4005U C 1.70GHz0GHz (SSE4)
494 cycles for 100 * rep ret eq
757 cycles for 100 * rep ret ne
665 cycles for 100 * simple ret eq
765 cycles for 100 * simple ret ne
269 cycles for 100 * rep ret eq no branch
271 cycles for 100 * rep ret ne no branch
470 cycles for 100 * rep ret eq
764 cycles for 100 * rep ret ne
665 cycles for 100 * simple ret eq
765 cycles for 100 * simple ret ne
267 cycles for 100 * rep ret eq no branch
271 cycles for 100 * rep ret ne no branch
470 cycles for 100 * rep ret eq
601 cycles for 100 * rep ret ne
662 cycles for 100 * simple ret eq
724 cycles for 100 * simple ret ne
280 cycles for 100 * rep ret eq no branch
271 cycles for 100 * rep ret ne no branch
481 cycles for 100 * rep ret eq
766 cycles for 100 * rep ret ne
664 cycles for 100 * simple ret eq
765 cycles for 100 * simple ret ne
269 cycles for 100 * rep ret eq no branch
270 cycles for 100 * rep ret ne no branch
0 = eax rep ret eq
1 = eax rep ret ne
0 = eax simple ret eq
1 = eax simple ret ne
0 = eax rep ret eq no branch
1 = eax rep ret ne no branch
--- ok ---
Inte Pentiumtium(R) ocessor sor 1.70 (SSE2)
706 cycles for 100 * rep ret eq
698 cycles for 100 * rep ret ne
720 cycles for 100 * simple ret eq
700 cycles for 100 * simple ret ne
603 cycles for 100 * rep ret eq no branch
604 cycles for 100 * rep ret ne no branch
706 cycles for 100 * rep ret eq
705 cycles for 100 * rep ret ne
702 cycles for 100 * simple ret eq
705 cycles for 100 * simple ret ne
605 cycles for 100 * rep ret eq no branch
618 cycles for 100 * rep ret ne no branch
709 cycles for 100 * rep ret eq
703 cycles for 100 * rep ret ne
701 cycles for 100 * simple ret eq
702 cycles for 100 * simple ret ne
599 cycles for 100 * rep ret eq no branch
603 cycles for 100 * rep ret ne no branch
702 cycles for 100 * rep ret eq
702 cycles for 100 * rep ret ne
716 cycles for 100 * simple ret eq
699 cycles for 100 * simple ret ne
604 cycles for 100 * rep ret eq no branch
603 cycles for 100 * rep ret ne no branch
0 = eax rep ret eq
1 = eax rep ret ne
0 = eax simple ret eq
1 = eax simple ret ne
0 = eax rep ret eq no branch
1 = eax rep ret ne no branch
--- ok ---
Intel(R)l(R) Cor) i3-101-10110U @ 2.10GH10GHz (SSE4)
402 cycles for 100 * rep ret eq
500 cycles for 100 * rep ret ne
480 cycles for 100 * simple ret eq
497 cycles for 100 * simple ret ne
328 cycles for 100 * rep ret eq no branch
267 cycles for 100 * rep ret ne no branch
370 cycles for 100 * rep ret eq
504 cycles for 100 * rep ret ne
516 cycles for 100 * simple ret eq
519 cycles for 100 * simple ret ne
336 cycles for 100 * rep ret eq no branch
289 cycles for 100 * rep ret ne no branch
374 cycles for 100 * rep ret eq
509 cycles for 100 * rep ret ne
455 cycles for 100 * simple ret eq
506 cycles for 100 * simple ret ne
283 cycles for 100 * rep ret eq no branch
350 cycles for 100 * rep ret ne no branch
481 cycles for 100 * rep ret eq
535 cycles for 100 * rep ret ne
599 cycles for 100 * simple ret eq
487 cycles for 100 * simple ret ne
247 cycles for 100 * rep ret eq no branch
257 cycles for 100 * rep ret ne no branch
0 = eax rep ret eq
1 = eax rep ret ne
0 = eax simple ret eq
1 = eax simple ret ne
0 = eax rep ret eq no branch
1 = eax rep ret ne no branch
--- ok ---
Regards,
Steve N.
Quote from: jj2007 on July 06, 2023, 07:40:29 AM
Check what TestA ... TestF are doing.
Yes, they are making a call, and your reference (TestOH) doesn't make any call :biggrin:
far call and far ret are very slow things.
Quote from: HSE on July 06, 2023, 09:17:04 AM
your reference (TestOH) doesn't make any call :biggrin:
You have a good point there, Hector, thanks :thup:
No call in TestOH (= test overhead) means the macros don't include the call+ret in the overhead that later gets subtracted. So the timings measure call+branch+x+retn :thumbsup:
I will have to add a switch to my template saying "has calls" or "no calls" :cool:
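One way such a switch could look, as a sketch only: HasCalls and EmptyProc are hypothetical names, not part of the actual template, and the MASM32 SDK is assumed at its default path. With HasCalls set to 1, the overhead loop pays for the same call+retn as the tested procedures, so subtracting it leaves only the code under test:

```asm
include \masm32\include\masm32rt.inc	; assumption: MASM32 SDK, default path

AlgoLoops = 100		; loop count, as in the timings above
HasCalls  = 1		; hypothetical switch: 1 = tested code is reached via call

.code
EmptyProc proc		; hypothetical helper: costs only call + retn
  ret
EmptyProc endp

TestOH proc		; overhead loop whose timing gets subtracted
  push ebx
  mov ebx, AlgoLoops-1
  align 4
  .Repeat
    if HasCalls
      call EmptyProc	; include call + retn in the measured overhead
    endif
    dec ebx
  .Until Sign?
  pop ebx
  ret
TestOH endp

start:
  call TestOH
  invoke ExitProcess, 0
end start
```

With HasCalls set to 0 the loop body shrinks to the dec/jump pair, which corresponds to the reference loop discussed above.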