Zeroing local variables

Started by jj2007, April 02, 2022, 03:03:48 PM

jj2007

This is a comparison of ClearLocals and StackBuffer for allocating 800 kB on the stack; the nops 4 stands in for code that uses the buffer:

aa proc uses edi arg1
Local buffer[testbytes]:BYTE
  ClearLocals
  lea edi, buffer
  nops 4
  mov eax, [edi] ; return some result (here: zero)
  ret
aa endp

bb proc uses edi arg1
  mov edi, StackBuffer(testbytes)
  nops 4
  mov eax, [edi]
  StackBuffer()
  ret
bb endp


Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

7962    kCycles for 100 * ClearLocals (800kB)
6787    kCycles for 100 * StackBuffer(800kB, zeroed)
12      kCycles for 100 * ClearLocals 1kB plus StackBuffer(800k, no zeroing)

7948    kCycles for 100 * ClearLocals (800kB)
6798    kCycles for 100 * StackBuffer(800kB, zeroed)
13      kCycles for 100 * ClearLocals 1kB plus StackBuffer(800k, no zeroing)

7969    kCycles for 100 * ClearLocals (800kB)
6854    kCycles for 100 * StackBuffer(800kB, zeroed)
13      kCycles for 100 * ClearLocals 1kB plus StackBuffer(800k, no zeroing)

7938    kCycles for 100 * ClearLocals (800kB)
6855    kCycles for 100 * StackBuffer(800kB, zeroed)
13      kCycles for 100 * ClearLocals 1kB plus StackBuffer(800k, no zeroing)

71      bytes for ClearLocals (800kB)
35      bytes for StackBuffer(800kB, zeroed)
103     bytes for ClearLocals 1kB plus StackBuffer(800k, no zeroing)


The third case allocates the buffer on the stack without zeroing. Below 1800 bytes of local variables, on my machine ClearLocals is faster than StackBuffer().

LiaoMi

11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

2418    kCycles for 100 * ClearLocals (800kB)
2269    kCycles for 100 * StackBuffer(800kB, zeroed)
3       kCycles for 100 * ClearLocals 1kB plus StackBuffer(800k, no zeroing)
127     kCycles for 100 * Locals only (with probing)

2382    kCycles for 100 * ClearLocals (800kB)
2284    kCycles for 100 * StackBuffer(800kB, zeroed)
3       kCycles for 100 * ClearLocals 1kB plus StackBuffer(800k, no zeroing)
126     kCycles for 100 * Locals only (with probing)

2432    kCycles for 100 * ClearLocals (800kB)
2297    kCycles for 100 * StackBuffer(800kB, zeroed)
3       kCycles for 100 * ClearLocals 1kB plus StackBuffer(800k, no zeroing)
130     kCycles for 100 * Locals only (with probing)

2487    kCycles for 100 * ClearLocals (800kB)
2279    kCycles for 100 * StackBuffer(800kB, zeroed)
3       kCycles for 100 * ClearLocals 1kB plus StackBuffer(800k, no zeroing)
126     kCycles for 100 * Locals only (with probing)

71      bytes for ClearLocals (800kB)
35      bytes for StackBuffer(800kB, zeroed)
103     bytes for ClearLocals 1kB plus StackBuffer(800k, no zeroing)
51      bytes for Locals only (with probing)


--- ok ---

jj2007

Wow, that's a factor of 3 faster than my i5 :thumbsup:

TimoVJL

AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)

6845    kCycles for 100 * ClearLocals (800kB)
8823    kCycles for 100 * StackBuffer(800kB, zeroed)
18      kCycles for 100 * ClearLocals 1kB plus StackBuffer(800k, no zeroing)
667     kCycles for 100 * Locals only (with probing)

8667    kCycles for 100 * ClearLocals (800kB)
6484    kCycles for 100 * StackBuffer(800kB, zeroed)
26      kCycles for 100 * ClearLocals 1kB plus StackBuffer(800k, no zeroing)
913     kCycles for 100 * Locals only (with probing)

8497    kCycles for 100 * ClearLocals (800kB)
9141    kCycles for 100 * StackBuffer(800kB, zeroed)
18      kCycles for 100 * ClearLocals 1kB plus StackBuffer(800k, no zeroing)
681     kCycles for 100 * Locals only (with probing)

9024    kCycles for 100 * ClearLocals (800kB)
8301    kCycles for 100 * StackBuffer(800kB, zeroed)
25      kCycles for 100 * ClearLocals 1kB plus StackBuffer(800k, no zeroing)
671     kCycles for 100 * Locals only (with probing)

71      bytes for ClearLocals (800kB)
35      bytes for StackBuffer(800kB, zeroed)
103     bytes for ClearLocals 1kB plus StackBuffer(800k, no zeroing)
51      bytes for Locals only (with probing)
May the source be with you

Vortex

Intel(R) Core(TM) i7 CPU       M 620  @ 2.67GHz (SSE4)

10079   kCycles for 100 * ClearLocals (800kB)
9652    kCycles for 100 * StackBuffer(800kB, zeroed)
11      kCycles for 100 * ClearLocals 1kB plus StackBuffer(800k, no zeroing)
301     kCycles for 100 * Locals only (with probing)

9990    kCycles for 100 * ClearLocals (800kB)
9532    kCycles for 100 * StackBuffer(800kB, zeroed)
13      kCycles for 100 * ClearLocals 1kB plus StackBuffer(800k, no zeroing)
302     kCycles for 100 * Locals only (with probing)

9914    kCycles for 100 * ClearLocals (800kB)
9529    kCycles for 100 * StackBuffer(800kB, zeroed)
13      kCycles for 100 * ClearLocals 1kB plus StackBuffer(800k, no zeroing)
304     kCycles for 100 * Locals only (with probing)

10003   kCycles for 100 * ClearLocals (800kB)
9635    kCycles for 100 * StackBuffer(800kB, zeroed)
11      kCycles for 100 * ClearLocals 1kB plus StackBuffer(800k, no zeroing)
305     kCycles for 100 * Locals only (with probing)

71      bytes for ClearLocals (800kB)
35      bytes for StackBuffer(800kB, zeroed)
103     bytes for ClearLocals 1kB plus StackBuffer(800k, no zeroing)
51      bytes for Locals only (with probing)

jj2007

Thanks, Timo and Erol. So the AMD is happier with ClearLocals... strange. Under the hood it's rep stosd in both cases.
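
To illustrate the rep stosd approach mentioned above, here is a minimal sketch in plain MASM32. It is not the actual ClearLocals or StackBuffer source; ZeroDemo and BUFSIZE are hypothetical names chosen for this example, and the buffer is kept small so that no stack probing is needed:

```asm
; Sketch only: zero a local buffer with rep stosd (plain MASM32, no MasmBasic).
; ZeroDemo and BUFSIZE are made-up names for illustration.
.686
.model flat, stdcall
option casemap:none

BUFSIZE equ 1024              ; assumed to be a multiple of 4 for rep stosd

.code
ZeroDemo proc uses edi
  LOCAL buffer[BUFSIZE]:BYTE
  lea edi, buffer             ; destination: start of the local buffer
  xor eax, eax                ; value to store: zero
  mov ecx, BUFSIZE/4          ; number of DWORDs to write
  cld                         ; make sure stosd moves upwards
  rep stosd                   ; store eax at [edi], ecx times
  mov eax, dword ptr buffer   ; return the first DWORD (now zero)
  ret
ZeroDemo endp
end
```

The cost of such a loop grows with the buffer size, which fits the timings above: zeroing 800 kB takes thousands of kCycles, while the variant that zeroes only 1 kB takes a dozen or so.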

HSE

Hi JJ!

Why do you zero the locals?
Equations in Assembly: SmplMath

jj2007

Quote from: HSE on April 05, 2022, 08:52:49 AM
Why do you zero the locals?

Hi Hector,

It's convenient in many cases to rely on that: think of flags, counters, handles, etc. In the RichMasm source, for example, I count around 40 procs starting with ClearLocals.

I use StackBuffer() more sparingly and, rarely, the _Local macro, which initialises variables to a given value:

include \masm32\MasmBasic\MasmBasic.inc
.code
MyTest proc uses edi esi ebx arg1:DWORD, arg2, TheString
  LOCAL v1, v2, rc:RECT, buffer[100]:BYTE ; ordinary Locals first
  _Local v3=123, v4:REAL4=123.456
  _Local x$="Hello World", y$=TheString ; strings can be initialised, too
  ClearLocals ; first line after the LOCALs
  deb 1, "Perfect:", v1, v2, v3, v4, $x$, $TheString
  ret
MyTest endp
Init
  invoke MyTest, 123, 456, Chr$("String passed")
EndOfCode


ToutEnMasm had his own version called ZEROLOCALES back in 2009, but it was more complicated to use. My attempts started in 2008 as "call ClearLocals", and I did it simply because I was used to it: GfaBasic had that feature as early as 1987 :cool:

HSE

Quote from: jj2007 on April 05, 2022, 09:13:52 AM
I did it simply because I was used to it: GfaBasic had that feature as early as 1987 :cool:

That is :thumbsup:

To me, _Local looks more interesting :biggrin:

Thanks!!


daydreamer

Under the hood, C++ code that uses initialised local arrays copies the data into them. How does copy speed compare to zeroing speed?
my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

jj2007

When working on the ClearLocals macro, I stumbled over a little peculiarity: to benefit from assignments made inside the prologue macro, such as mybytes=localbytes, the macro that follows (in this case ClearLocals) needs one instruction before it gets access to mybytes. So I had to insert a nop at the top of ClearLocals. Out of curiosity, I added a benchmark for nops to the testbed; results are below. One nop costs 0.1 cycles, 10 nops cost 0.15 cycles per nop, and 100 nops arrive at 0.20 cycles per nop. That looks pretty odd, and I'd appreciate some timings :rolleyes:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

12019   cycles for 100 * ClearLocals (fast+bloated)
14413   cycles for 100 * ClearLocals (slow+compact)
10      cycles for 100 * one nop
146     cycles for 100 * 10 nops
1999    cycles for 100 * 100 nops

12022   cycles for 100 * ClearLocals (fast+bloated)
14384   cycles for 100 * ClearLocals (slow+compact)
??      cycles for 100 * one nop
146     cycles for 100 * 10 nops
1997    cycles for 100 * 100 nops

11918   cycles for 100 * ClearLocals (fast+bloated)
14349   cycles for 100 * ClearLocals (slow+compact)
10      cycles for 100 * one nop
145     cycles for 100 * 10 nops
1999    cycles for 100 * 100 nops

11939   cycles for 100 * ClearLocals (fast+bloated)
14532   cycles for 100 * ClearLocals (slow+compact)
10      cycles for 100 * one nop
144     cycles for 100 * 10 nops
1997    cycles for 100 * 100 nops

51      bytes for ClearLocals (fast+bloated)
39      bytes for ClearLocals (slow+compact)
1       bytes for one nop
10      bytes for 10 nops
100     bytes for 100 nops

hutch--


Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

7990    cycles for 100 * ClearLocals (fast+bloated)
8635    cycles for 100 * ClearLocals (slow+compact)
3       cycles for 100 * one nop
144     cycles for 100 * 10 nops
2041    cycles for 100 * 100 nops

7961    cycles for 100 * ClearLocals (fast+bloated)
8638    cycles for 100 * ClearLocals (slow+compact)
1       cycles for 100 * one nop
144     cycles for 100 * 10 nops
2041    cycles for 100 * 100 nops

7961    cycles for 100 * ClearLocals (fast+bloated)
8633    cycles for 100 * ClearLocals (slow+compact)
0       cycles for 100 * one nop
144     cycles for 100 * 10 nops
2039    cycles for 100 * 100 nops

7959    cycles for 100 * ClearLocals (fast+bloated)
8633    cycles for 100 * ClearLocals (slow+compact)
0       cycles for 100 * one nop
144     cycles for 100 * 10 nops
2039    cycles for 100 * 100 nops

51      bytes for ClearLocals (fast+bloated)
39      bytes for ClearLocals (slow+compact)
1       bytes for one nop
10      bytes for 10 nops
100     bytes for 100 nops

FORTRANS

Hi,

   Two systems. The first seems to scale nops sort of linearly.

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

84974 cycles for 100 * ClearLocals (fast+bloated)
89782 cycles for 100 * ClearLocals (slow+compact)
52 cycles for 100 * one nop
493 cycles for 100 * 10 nops
5010 cycles for 100 * 100 nops

83083 cycles for 100 * ClearLocals (fast+bloated)
89693 cycles for 100 * ClearLocals (slow+compact)
52 cycles for 100 * one nop
507 cycles for 100 * 10 nops
5357 cycles for 100 * 100 nops

82917 cycles for 100 * ClearLocals (fast+bloated)
90803 cycles for 100 * ClearLocals (slow+compact)
52 cycles for 100 * one nop
493 cycles for 100 * 10 nops
4999 cycles for 100 * 100 nops

82903 cycles for 100 * ClearLocals (fast+bloated)
91068 cycles for 100 * ClearLocals (slow+compact)
52 cycles for 100 * one nop
506 cycles for 100 * 10 nops
5360 cycles for 100 * 100 nops

51 bytes for ClearLocals (fast+bloated)
39 bytes for ClearLocals (slow+compact)
1 bytes for one nop
10 bytes for 10 nops
100 bytes for 100 nops


--- ok ---

Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)

9658 cycles for 100 * ClearLocals (fast+bloated)
10539 cycles for 100 * ClearLocals (slow+compact)
14 cycles for 100 * one nop
185 cycles for 100 * 10 nops
2482 cycles for 100 * 100 nops

11094 cycles for 100 * ClearLocals (fast+bloated)
10480 cycles for 100 * ClearLocals (slow+compact)
11 cycles for 100 * one nop
185 cycles for 100 * 10 nops
2495 cycles for 100 * 100 nops

9677 cycles for 100 * ClearLocals (fast+bloated)
10472 cycles for 100 * ClearLocals (slow+compact)
?? cycles for 100 * one nop
185 cycles for 100 * 10 nops
2482 cycles for 100 * 100 nops

9672 cycles for 100 * ClearLocals (fast+bloated)
10517 cycles for 100 * ClearLocals (slow+compact)
15 cycles for 100 * one nop
185 cycles for 100 * 10 nops
2482 cycles for 100 * 100 nops

51 bytes for ClearLocals (fast+bloated)
39 bytes for ClearLocals (slow+compact)
1 bytes for one nop
10 bytes for 10 nops
100 bytes for 100 nops


--- ok ---

jj2007

Quote from: FORTRANS on April 06, 2022, 06:27:52 AM
Two systems. The first seems to scale nops sort of linearly.

Yep, that's it. And the first one is a Banias 130nm cpu, the second one a much more recent Haswell 22nm cpu. So it seems that on modern cpus, you can happily ignore (speedwise) a single nop.

In particular, it doesn't make any difference if there is a nop before a call:
9       cycles for 100 * one nop
0       cycles for 100 * no nops
1998    cycles for 100 * 100 nops
378     cycles for 100 * one nop + call
378     cycles for 100 * no nop + call

9       cycles for 100 * one nop
0       cycles for 100 * no nops
1998    cycles for 100 * 100 nops
378     cycles for 100 * one nop + call
378     cycles for 100 * no nop + call


  align 4  ; nop+call
  .Repeat
   nop
   call dummy
   dec ebx
  .Until Sign?

  align 4  ; no nop+call
  .Repeat
   call dummy
   dec ebx
  .Until Sign?

Note that without the call, the nop takes 0.09 cycles; with the call, it gets completely absorbed.

hutch--

> Note that without the call, the nop takes 0.09 cycles; with the call, it gets completely absorbed.

That is pretty much the case. If I remember rightly, it's called shadowing, which means the 1-byte NOP hides in the space of a preceding instruction.