Zeroing local variables

jj2007 · April 02, 2022, 03:03:48 PM

This is a comparison of ClearLocals and StackBuffer for allocating 800k on the stack; the nops 4 stands in for code that uses the buffer:

Code Select

aa proc uses edi arg1
Local buffer[testbytes]:BYTE
  ClearLocals
  lea edi, buffer
  nops 4
  mov eax, [edi]		; return some result (here: zero)
  ret
aa endp

Code Select

bb proc uses edi arg1
  mov edi, StackBuffer(testbytes)
  nops 4
  mov eax, [edi]
  StackBuffer()
  ret
bb endp

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

7962    kCycles for 100 * ClearLocals (800kB)
6787    kCycles for 100 * StackBuffer(800kB, zeroed)
12      kCycles for 100 * ClearLocals 1kB plus StackBuffer(800k, no zeroing)

7948    kCycles for 100 * ClearLocals (800kB)
6798    kCycles for 100 * StackBuffer(800kB, zeroed)
13      kCycles for 100 * ClearLocals 1kB plus StackBuffer(800k, no zeroing)

7969    kCycles for 100 * ClearLocals (800kB)
6854    kCycles for 100 * StackBuffer(800kB, zeroed)
13      kCycles for 100 * ClearLocals 1kB plus StackBuffer(800k, no zeroing)

7938    kCycles for 100 * ClearLocals (800kB)
6855    kCycles for 100 * StackBuffer(800kB, zeroed)
13      kCycles for 100 * ClearLocals 1kB plus StackBuffer(800k, no zeroing)

71      bytes for ClearLocals (800kB)
35      bytes for StackBuffer(800kB, zeroed)
103     bytes for ClearLocals 1kB plus StackBuffer(800k, no zeroing)

The third case allocates the buffer on the stack without zeroing. Below 1800 bytes of local variables, on my machine ClearLocals is faster than StackBuffer().

LiaoMi · April 04, 2022, 02:46:36 AM

Code Select

11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

2418    kCycles for 100 * ClearLocals (800kB)
2269    kCycles for 100 * StackBuffer(800kB, zeroed)
3       kCycles for 100 * ClearLocals 1kB plus StackBuffer(800k, no zeroing)
127     kCycles for 100 * Locals only (with probing)

2382    kCycles for 100 * ClearLocals (800kB)
2284    kCycles for 100 * StackBuffer(800kB, zeroed)
3       kCycles for 100 * ClearLocals 1kB plus StackBuffer(800k, no zeroing)
126     kCycles for 100 * Locals only (with probing)

2432    kCycles for 100 * ClearLocals (800kB)
2297    kCycles for 100 * StackBuffer(800kB, zeroed)
3       kCycles for 100 * ClearLocals 1kB plus StackBuffer(800k, no zeroing)
130     kCycles for 100 * Locals only (with probing)

2487    kCycles for 100 * ClearLocals (800kB)
2279    kCycles for 100 * StackBuffer(800kB, zeroed)
3       kCycles for 100 * ClearLocals 1kB plus StackBuffer(800k, no zeroing)
126     kCycles for 100 * Locals only (with probing)

71      bytes for ClearLocals (800kB)
35      bytes for StackBuffer(800kB, zeroed)
103     bytes for ClearLocals 1kB plus StackBuffer(800k, no zeroing)
51      bytes for Locals only (with probing)


--- ok ---

jj2007 · April 04, 2022, 06:31:14 AM

Wow, that's a factor of 3 faster than my i5.

TimoVJL · April 04, 2022, 04:12:33 PM

Code Select

AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)

6845    kCycles for 100 * ClearLocals (800kB)
8823    kCycles for 100 * StackBuffer(800kB, zeroed)
18      kCycles for 100 * ClearLocals 1kB plus StackBuffer(800k, no zeroing)
667     kCycles for 100 * Locals only (with probing)

8667    kCycles for 100 * ClearLocals (800kB)
6484    kCycles for 100 * StackBuffer(800kB, zeroed)
26      kCycles for 100 * ClearLocals 1kB plus StackBuffer(800k, no zeroing)
913     kCycles for 100 * Locals only (with probing)

8497    kCycles for 100 * ClearLocals (800kB)
9141    kCycles for 100 * StackBuffer(800kB, zeroed)
18      kCycles for 100 * ClearLocals 1kB plus StackBuffer(800k, no zeroing)
681     kCycles for 100 * Locals only (with probing)

9024    kCycles for 100 * ClearLocals (800kB)
8301    kCycles for 100 * StackBuffer(800kB, zeroed)
25      kCycles for 100 * ClearLocals 1kB plus StackBuffer(800k, no zeroing)
671     kCycles for 100 * Locals only (with probing)

71      bytes for ClearLocals (800kB)
35      bytes for StackBuffer(800kB, zeroed)
103     bytes for ClearLocals 1kB plus StackBuffer(800k, no zeroing)
51      bytes for Locals only (with probing)

Vortex · April 05, 2022, 06:39:22 AM

Code Select

Intel(R) Core(TM) i7 CPU       M 620  @ 2.67GHz (SSE4)

10079   kCycles for 100 * ClearLocals (800kB)
9652    kCycles for 100 * StackBuffer(800kB, zeroed)
11      kCycles for 100 * ClearLocals 1kB plus StackBuffer(800k, no zeroing)
301     kCycles for 100 * Locals only (with probing)

9990    kCycles for 100 * ClearLocals (800kB)
9532    kCycles for 100 * StackBuffer(800kB, zeroed)
13      kCycles for 100 * ClearLocals 1kB plus StackBuffer(800k, no zeroing)
302     kCycles for 100 * Locals only (with probing)

9914    kCycles for 100 * ClearLocals (800kB)
9529    kCycles for 100 * StackBuffer(800kB, zeroed)
13      kCycles for 100 * ClearLocals 1kB plus StackBuffer(800k, no zeroing)
304     kCycles for 100 * Locals only (with probing)

10003   kCycles for 100 * ClearLocals (800kB)
9635    kCycles for 100 * StackBuffer(800kB, zeroed)
11      kCycles for 100 * ClearLocals 1kB plus StackBuffer(800k, no zeroing)
305     kCycles for 100 * Locals only (with probing)

71      bytes for ClearLocals (800kB)
35      bytes for StackBuffer(800kB, zeroed)
103     bytes for ClearLocals 1kB plus StackBuffer(800k, no zeroing)
51      bytes for Locals only (with probing)

jj2007 · April 05, 2022, 07:00:02 AM

Thanks, Timo and Erol. So the AMD is happier with ClearLocals... strange. Under the hood it's rep stosd in both cases.
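
For readers who have not seen the idiom, here is a minimal sketch of zeroing a local buffer with rep stosd. It assumes plain MASM32 (masm32rt.inc) rather than MasmBasic and is not the source of either macro; ZeroDemo and the 1 kB buffer size are made up for illustration. It also relies on the direction flag being clear, as the Windows calling conventions expect.

```asm
include \masm32\include\masm32rt.inc

.code
ZeroDemo proc uses edi
  LOCAL buffer[1024]:BYTE       ; hypothetical 1 kB local buffer
  lea edi, buffer               ; destination: start of the buffer
  xor eax, eax                  ; value to store: zero
  mov ecx, (sizeof buffer)/4    ; number of DWORDs to write
  rep stosd                     ; store eax to [edi], ecx times
  mov eax, dword ptr buffer     ; return the first DWORD (zero)
  ret
ZeroDemo endp

start:
  call ZeroDemo
  print str$(eax), 13, 10       ; show the returned DWORD
  exit
end start
```

A buffer as large as the 800 kB used in the timings would additionally need stack probing, which this sketch leaves out.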

HSE · April 05, 2022, 08:52:49 AM

Hi JJ!

What do you zero the locals for?

jj2007 · April 05, 2022, 09:13:52 AM

Quote from: HSE on April 05, 2022, 08:52:49 AM
What do you zero the locals for?

Hi Hector,

It's convenient in many cases to rely on that: think of flags, counters, handles, etc. In the RichMasm source, for example, I count around 40 procs starting with ClearLocals.

I use StackBuffer() more sparingly, and rarely the _Local macro which initialises variables to some value:

Code Select

include \masm32\MasmBasic\MasmBasic.inc
.code
MyTest proc uses edi esi ebx arg1:DWORD, arg2, TheString
  LOCAL v1, v2, rc:RECT, buffer[100]:BYTE			; ordinary Locals first
  _Local v3=123, v4:REAL4=123.456
  _Local x$="Hello World", y$=TheString				; strings can be initialised, too
  ClearLocals							; first line after the LOCALs
  deb 1, "Perfect:", v1, v2, v3, v4, $x$, $TheString
  ret
MyTest endp
Init
  invoke MyTest, 123, 456, Chr$("String passed")
EndOfCode

ToutEnMasm had his own version called ZEROLOCALES back in 2009, but it was more complicated to use. My attempts started in 2008 as "call ClearLocals", and I did it simply because I was used to it: GfaBasic had that feature as early as 1987.

HSE · April 05, 2022, 10:38:30 AM

Quote from: jj2007 on April 05, 2022, 09:13:52 AM
I did it simply because I was used to it: GfaBasic had that feature as early as 1987

That is

To me, _Local looks more interesting.

Thanks!!

daydreamer · April 05, 2022, 09:53:35 PM

Under the hood, C++ code that initialises local arrays copies the data into them. How does copy speed compare to zeroing speed?

jj2007 · April 06, 2022, 05:24:28 AM

When working with the ClearLocals macro, I stumbled over a little peculiarity: to benefit from assignments inside the prologue macro, such as mybytes=localbytes, the following macro (in this case ClearLocals) needs one instruction before it gets access to mybytes. So I had to insert a nop at the top of ClearLocals. Out of curiosity, I added a benchmark for nops to the testbed; results are below. One nop costs 0.1 cycles, 10 nops cost 0.15 cycles per nop, and 100 nops arrive at 0.20 cycles per nop. That looks pretty odd, and I'd appreciate some timings.

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

12019   cycles for 100 * ClearLocals (fast+bloated)
14413   cycles for 100 * ClearLocals (slow+compact)
10      cycles for 100 * one nop
146     cycles for 100 * 10 nops
1999    cycles for 100 * 100 nops

12022   cycles for 100 * ClearLocals (fast+bloated)
14384   cycles for 100 * ClearLocals (slow+compact)
??      cycles for 100 * one nop
146     cycles for 100 * 10 nops
1997    cycles for 100 * 100 nops

11918   cycles for 100 * ClearLocals (fast+bloated)
14349   cycles for 100 * ClearLocals (slow+compact)
10      cycles for 100 * one nop
145     cycles for 100 * 10 nops
1999    cycles for 100 * 100 nops

11939   cycles for 100 * ClearLocals (fast+bloated)
14532   cycles for 100 * ClearLocals (slow+compact)
10      cycles for 100 * one nop
144     cycles for 100 * 10 nops
1997    cycles for 100 * 100 nops

51      bytes for ClearLocals (fast+bloated)
39      bytes for ClearLocals (slow+compact)
1       bytes for one nop
10      bytes for 10 nops
100     bytes for 100 nops
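
The per-nop figures quoted above follow from dividing each total by the number of nops executed (100 repetitions times the nops per repetition), here using the first run in the table:

```latex
\begin{align*}
\text{1 nop:}    &\quad 10 / (100 \times 1) = 0.10 \text{ cycles per nop} \\
\text{10 nops:}  &\quad 146 / (100 \times 10) \approx 0.15 \text{ cycles per nop} \\
\text{100 nops:} &\quad 1999 / (100 \times 100) \approx 0.20 \text{ cycles per nop}
\end{align*}
```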

hutch-- · April 06, 2022, 06:02:43 AM

Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

7990 cycles for 100 * ClearLocals (fast+bloated)
8635 cycles for 100 * ClearLocals (slow+compact)
3 cycles for 100 * one nop
144 cycles for 100 * 10 nops
2041 cycles for 100 * 100 nops

7961 cycles for 100 * ClearLocals (fast+bloated)
8638 cycles for 100 * ClearLocals (slow+compact)
1 cycles for 100 * one nop
144 cycles for 100 * 10 nops
2041 cycles for 100 * 100 nops

7961 cycles for 100 * ClearLocals (fast+bloated)
8633 cycles for 100 * ClearLocals (slow+compact)
0 cycles for 100 * one nop
144 cycles for 100 * 10 nops
2039 cycles for 100 * 100 nops

7959 cycles for 100 * ClearLocals (fast+bloated)
8633 cycles for 100 * ClearLocals (slow+compact)
0 cycles for 100 * one nop
144 cycles for 100 * 10 nops
2039 cycles for 100 * 100 nops

51 bytes for ClearLocals (fast+bloated)
39 bytes for ClearLocals (slow+compact)
1 bytes for one nop
10 bytes for 10 nops
100 bytes for 100 nops

FORTRANS · April 06, 2022, 06:27:52 AM

Hi,

Two systems. The first seems to scale nops sort of linearly.

Code Select

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

84974	cycles for 100 * ClearLocals (fast+bloated)
89782	cycles for 100 * ClearLocals (slow+compact)
52	cycles for 100 * one nop
493	cycles for 100 * 10 nops
5010	cycles for 100 * 100 nops

83083	cycles for 100 * ClearLocals (fast+bloated)
89693	cycles for 100 * ClearLocals (slow+compact)
52	cycles for 100 * one nop
507	cycles for 100 * 10 nops
5357	cycles for 100 * 100 nops

82917	cycles for 100 * ClearLocals (fast+bloated)
90803	cycles for 100 * ClearLocals (slow+compact)
52	cycles for 100 * one nop
493	cycles for 100 * 10 nops
4999	cycles for 100 * 100 nops

82903	cycles for 100 * ClearLocals (fast+bloated)
91068	cycles for 100 * ClearLocals (slow+compact)
52	cycles for 100 * one nop
506	cycles for 100 * 10 nops
5360	cycles for 100 * 100 nops

51	bytes for ClearLocals (fast+bloated)
39	bytes for ClearLocals (slow+compact)
1	bytes for one nop
10	bytes for 10 nops
100	bytes for 100 nops


--- ok ---

Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)

9658	cycles for 100 * ClearLocals (fast+bloated)
10539	cycles for 100 * ClearLocals (slow+compact)
14	cycles for 100 * one nop
185	cycles for 100 * 10 nops
2482	cycles for 100 * 100 nops

11094	cycles for 100 * ClearLocals (fast+bloated)
10480	cycles for 100 * ClearLocals (slow+compact)
11	cycles for 100 * one nop
185	cycles for 100 * 10 nops
2495	cycles for 100 * 100 nops

9677	cycles for 100 * ClearLocals (fast+bloated)
10472	cycles for 100 * ClearLocals (slow+compact)
??	cycles for 100 * one nop
185	cycles for 100 * 10 nops
2482	cycles for 100 * 100 nops

9672	cycles for 100 * ClearLocals (fast+bloated)
10517	cycles for 100 * ClearLocals (slow+compact)
15	cycles for 100 * one nop
185	cycles for 100 * 10 nops
2482	cycles for 100 * 100 nops

51	bytes for ClearLocals (fast+bloated)
39	bytes for ClearLocals (slow+compact)
1	bytes for one nop
10	bytes for 10 nops
100	bytes for 100 nops


--- ok ---

jj2007 · April 06, 2022, 07:24:29 AM

Quote from: FORTRANS on April 06, 2022, 06:27:52 AM
Two systems. The first seems to scale nops sort of linearly.

Yep, that's it. And the first one is a Banias 130nm cpu, the second one a much more recent Haswell 22nm cpu. So it seems that on modern cpus, you can happily ignore (speedwise) a single nop.

In particular, it doesn't make any difference if there is a nop before a call:

Code Select

9       cycles for 100 * one nop
0       cycles for 100 * no nops
1998    cycles for 100 * 100 nops
378     cycles for 100 * one nop + call
378     cycles for 100 * no nop + call

9       cycles for 100 * one nop
0       cycles for 100 * no nops
1998    cycles for 100 * 100 nops
378     cycles for 100 * one nop + call
378     cycles for 100 * no nop + call

align 4 ; nop+call
.Repeat
   nop
   call dummy
   dec ebx
.Until Sign?

align 4 ; no nop+call
.Repeat
   call dummy
   dec ebx
.Until Sign?

Note that without the call, the nop takes 0.09 cycles; with the call, it gets completely absorbed.
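
The two loops above are fragments. Here is a minimal self-contained sketch of how the first one might be driven, assuming plain MASM32 (masm32rt.inc) and raw rdtsc rather than the testbed actually used for the timings. The empty body of dummy, the count of 100 iterations and the output text are assumptions. The raw delta includes rdtsc and loop overhead, so the numbers will not match the table.

```asm
include \masm32\include\masm32rt.inc
.686                            ; rdtsc needs .586 or higher

.code
dummy proc                      ; assumed: an empty procedure
  ret
dummy endp

start:
  mov ebx, 100-1                ; assumed: 100 iterations
  rdtsc
  mov esi, eax                  ; keep the low DWORD of the start count
  .Repeat
    nop
    call dummy
    dec ebx
  .Until Sign?
  rdtsc
  sub eax, esi                  ; elapsed TSC ticks, overhead included
  print str$(eax), " TSC ticks for 100 * one nop + call", 13, 10
  exit
end start
```

Removing the nop from the loop and rebuilding gives the second variant for comparison.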

hutch-- · April 06, 2022, 10:34:40 AM

> Note that without the call, the nop takes 0.09 cycles; with the call, it gets completely absorbed.

That is pretty much the case. I vaguely recall it's called shadowing, which means the 1-byte NOP hides in the space of a preceding instruction.
