Invoke, call, jump. Simple benchmark

daydreamer · June 26, 2025, 02:25:35 PM

Quote from: NoCforMe on June 25, 2025, 08:54:24 AM
Quote from: daydreamer on June 24, 2025, 11:14:15 PM
Quote from: NoCforMe on June 24, 2025, 09:00:25 PM
So it's better not to align a proc???
I thought that's how it would turn out with aligned vs. unaligned proc start, because the inner loop a few opcodes later ends up aligned.

Excellent point; hadn't thought of that.

It's not necessarily the proc's entry point that you want aligned:
it's whatever instruction that marks the start of a time-sensitive part of the proc, a loop or whatever.
So the alignment should probably be done inside the proc, not outside.

Might wanna start a thread comparing an inner loop under Align 64 with an unaligned loop, where the CPU constantly needs to reload cache lines.
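
A minimal sketch of what aligning inside the proc could look like, assuming a plain 32-bit MASM build. SumDwords and its parameters are made-up names for illustration, not code from the test program. The align sits right before the loop label, so the padding is executed once on the way in and the loop top itself starts on a 16-byte boundary.

```asm
.686
.model flat, stdcall
option casemap:none

.code

; Hypothetical example proc: returns in EAX the sum of `count` dwords at pSrc.
SumDwords proc uses esi pSrc:DWORD, count:DWORD
    mov esi, pSrc
    mov ecx, count
    xor eax, eax
    test ecx, ecx
    jz done             ; nothing to add, return 0
    align 16            ; pad here so the loop top lands on a 16-byte boundary
top:
    add eax, [esi]
    add esi, 4
    dec ecx
    jnz top
done:
    ret
SumDwords endp

end
```

Align 16 is what the default code segment normally accepts; a larger value such as 64 may need a custom segment declared with a matching alignment, so treat that part as something to check with your assembler.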

jj2007 · June 26, 2025, 07:00:50 PM

Quote from: jj2007 on June 25, 2025, 01:14:58 AM
Quote from: daydreamer on June 25, 2025, 12:54:02 AM
mysterious error, alignedinvoke.exe won't work
Just run it through Olly and tell me where it chokes. Btw no MasmBasic there, it's purest Masm32

Daydreamer, I am still waiting...

LordAdef · June 27, 2025, 05:04:23 AM

Quote from: NoCforMe on June 25, 2025, 08:54:24 AM
Quote from: daydreamer on June 24, 2025, 11:14:15 PM
Quote from: NoCforMe on June 24, 2025, 09:00:25 PM
So it's better not to align a proc???
I thought that's how it would turn out with aligned vs. unaligned proc start, because the inner loop a few opcodes later ends up aligned.

Excellent point; hadn't thought of that.

It's not necessarily the proc's entry point that you want aligned:
it's whatever instruction that marks the start of a time-sensitive part of the proc, a loop or whatever.
So the alignment should probably be done inside the proc, not outside.

Yep, interesting how this apparently obvious benchmark brings interesting results.

My _jmp macro aligns both the calling jump and the returning one.

The _jmp uses align 4, because that gave me the best results on average. Align 16 gave better results, but not always.

The other thing is how differently it behaves when we move the target address within the code, or make any other change to the program. It seems there's not much to be done other than benchmarking the code all the time or checking the binary.

While writing the test prog, I was moving routines around and sometimes I got "invoke" faster than "call"

Quick edit: I assumed NEAR jumps would give me faster results. On my machine it didn't make any difference. But... it should.

NoCforMe · June 27, 2025, 06:06:57 AM

Quote from: LordAdef on June 27, 2025, 05:04:23 AM
While writing the test prog, I was moving routines around and sometimes I got "invoke" faster than "call"

You don't seem to understand what invoke actually does.
It's an assembler directive that works like a macro, and it behaves differently depending on whether the thing being invoked has parameters or not:

If the subroutine has parameters (and a PROTOtype is defined for that routine), invoke pushes the arguments right to left (last argument first, under STDCALL or C), then does a CALL.
If the subroutine has no parameters, then invoke simply does a CALL.

So in the latter case there's no difference between invoke and CALL.

invoke is not a processor opcode.
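
To make the expansion concrete, here is a small sketch assuming 32-bit MASM with STDCALL as the default convention. AddTwo, NoArgs and Demo are hypothetical routines made up for this example. Each invoke is followed by the hand-written sequence it should be equivalent to.

```asm
.686
.model flat, stdcall
option casemap:none

; Hypothetical routines, for illustration only.
AddTwo  proto :DWORD, :DWORD
NoArgs  proto

.code

AddTwo proc a:DWORD, b:DWORD
    mov eax, a
    add eax, b
    ret                     ; epilogue ends with RET 8 under STDCALL
AddTwo endp

NoArgs proc
    xor eax, eax
    ret
NoArgs endp

Demo proc
    invoke AddTwo, 1, 2     ; assembler emits: push 2 / push 1 / call AddTwo

    push 2                  ; the same call written by hand:
    push 1                  ; arguments go on the stack last to first
    call AddTwo

    invoke NoArgs           ; no arguments: emits just call NoArgs
    call NoArgs             ; same instruction as the line above
    ret
Demo endp

end
```

Comparing the two forms in a listing file or a disassembler is an easy way to confirm this for your own build.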

TimoVJL · June 27, 2025, 07:11:52 PM

Does the CPU cache suffer from long jumps too?
Does a call flush the cache as well?

jj2007 · June 27, 2025, 10:42:23 PM

Quote from: NoCforMe on June 27, 2025, 06:06:57 AM
You don't seem to understand what invoke actually does.

I'm sure he does.

LordAdef · June 28, 2025, 01:25:10 AM

Quote from: jj2007 on June 27, 2025, 10:42:23 PM
Quote from: NoCforMe on June 27, 2025, 06:06:57 AM
You don't seem to understand what invoke actually does.

I'm sure he does.

Yep!

LordAdef · June 28, 2025, 01:25:46 AM

Quote from: TimoVJL on June 27, 2025, 07:11:52 PM
Does the CPU cache suffer from long jumps too?
Does a call flush the cache as well?

good questions

NoCforMe · June 28, 2025, 05:05:18 AM

Quote from: jj2007 on June 27, 2025, 10:42:23 PM
Quote from: NoCforMe on June 27, 2025, 06:06:57 AM
You don't seem to understand what invoke actually does.

I'm sure he does.

OK, but then why would he suppose there's any difference between INVOKE and CALL?

It's not as if the macro does any alignment or anything like that.

LordAdef · June 28, 2025, 08:13:25 AM

Quote from: NoCforMe on June 28, 2025, 05:05:18 AM
Quote from: jj2007 on June 27, 2025, 10:42:23 PM
Quote from: NoCforMe on June 27, 2025, 06:06:57 AM
You don't seem to understand what invoke actually does.

I'm sure he does.

OK, but then why would he suppose there's any difference between INVOKE and CALL?

It's not as if the macro does any alignment or anything like that.

Because it doesn't matter.
Because I quickly tested pushing arguments, but left that out.
Because, if you read my first post, a newcomer may have the misconception that call is always faster than invoke.
Because you may take the test example and, if you like, compare an aligned invoke passing one arg against an unaligned call, and see which one wins. Whatever.

The point of this thread is alignment.

NoCforMe · June 28, 2025, 08:33:26 AM

Quote from: LordAdef on June 28, 2025, 08:13:25 AM
Because, if you read my first post, a newcomer may have the misconception that call is always faster than invoke.

I read your first post again; it says nothing at all about call vs. invoke.
However, it's a very good point to get across.

daydreamer · July 02, 2025, 06:58:12 PM

Today, old 32-bit stack arguments vs. 64-bit fastcall, which puts data in registers, would be more interesting.
32-bit fastcall passing data in registers vs. 32-bit invoke pushing data would be the fairest test.

jj2007 · July 03, 2025, 12:41:36 AM

Quote from: daydreamer on July 02, 2025, 06:58:12 PM
32-bit fastcall passing data in registers vs. 32-bit invoke pushing data would be the fairest test.

Yep, you can save a cycle

AMD Athlon Gold 3150U with Radeon Graphics      (SSE4)

400    cycles for 100 * proc aligned 16
400    cycles for 100 * proc aligned 16+3
417    cycles for 100 * aligned push+pop
273    cycles for 100 * aligned reg32

405    cycles for 100 * proc aligned 16
409    cycles for 100 * proc aligned 16+3
426    cycles for 100 * aligned push+pop
276    cycles for 100 * aligned reg32

409    cycles for 100 * proc aligned 16
402    cycles for 100 * proc aligned 16+3
422    cycles for 100 * aligned push+pop
290    cycles for 100 * aligned reg32

403    cycles for 100 * proc aligned 16
406    cycles for 100 * proc aligned 16+3
426    cycles for 100 * aligned push+pop
278    cycles for 100 * aligned reg32

406    cycles for 100 * proc aligned 16
416    cycles for 100 * proc aligned 16+3
421    cycles for 100 * aligned push+pop
281    cycles for 100 * aligned reg32

15      bytes for proc aligned 16
19      bytes for proc aligned 16+3
24      bytes for aligned push+pop
20      bytes for aligned reg32

NoCforMe · July 03, 2025, 07:47:50 AM

Quote from: daydreamer on July 02, 2025, 06:58:12 PM
32-bit fastcall passing data in registers vs. 32-bit invoke pushing data would be the fairest test.

I transfer arguments to (and from) subroutines in registers all the time in my own code.
No need to follow someone else's ABI when it's your own personal code that you can do whatever the hell you like with.

Of course, I do follow the part of the Win32 ABI that requires you to respect the "sacred" registers (EBX, ESI, EDI, and EBP).
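
A rough sketch of the two styles side by side, assuming 32-bit MASM. The register assignment (first operand in EAX, second in EDX, result in EAX) is a private convention invented for this example, and AddRegs, AddStack and Demo are hypothetical names.

```asm
.686
.model flat, stdcall
option casemap:none

.code

; Hypothetical private convention: operands in EAX and EDX, result in EAX.
AddRegs proc
    add eax, edx
    ret
AddRegs endp

; The same work with stack arguments, as invoke passes them.
AddStack proc a:DWORD, b:DWORD
    mov eax, a
    add eax, b
    ret
AddStack endp

Demo proc uses ebx          ; EBX must be preserved, so let the assembler push/pop it
    mov ebx, 40
    mov eax, ebx
    mov edx, 2
    call AddRegs            ; no pushes and no stack frame in the callee

    invoke AddStack, ebx, 2 ; push 2 / push ebx / call AddStack
    ret
Demo endp

end
```

Whether the register version is actually faster depends on the CPU, as the timings posted in this thread suggest, so it is worth measuring rather than assuming.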

daydreamer · July 03, 2025, 05:11:13 PM

@NoCforMe
Passing through registers works best in your own code, especially if your coding style for your own PROCs is to keep real4/real8 variables in FPU or XMM registers.

Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)

322     cycles for 100 * proc aligned 16
267     cycles for 100 * proc aligned 16+3
392     cycles for 100 * aligned push+pop
392     cycles for 100 * aligned reg32

310     cycles for 100 * proc aligned 16
266     cycles for 100 * proc aligned 16+3
397     cycles for 100 * aligned push+pop
394     cycles for 100 * aligned reg32

308     cycles for 100 * proc aligned 16
269     cycles for 100 * proc aligned 16+3
408     cycles for 100 * aligned push+pop
392     cycles for 100 * aligned reg32

314     cycles for 100 * proc aligned 16
263     cycles for 100 * proc aligned 16+3
404     cycles for 100 * aligned push+pop
399     cycles for 100 * aligned reg32

308     cycles for 100 * proc aligned 16
267     cycles for 100 * proc aligned 16+3
395     cycles for 100 * aligned push+pop
391     cycles for 100 * aligned reg32

15      bytes for proc aligned 16
19      bytes for proc aligned 16+3
24      bytes for aligned push+pop
20      bytes for aligned reg32

