Problem with LOCAL variables

Started by Gwyn, February 20, 2013, 08:23:57 AM


jj2007

Quote from: qWord on February 21, 2013, 05:57:11 AM
it can also be assumed that locals are always cached

Is that a valid argument, given that you have to write to them before you use them?

MichaelW

Even assuming that the processors don't provide preferential caching of the stack, the active stack would likely be cached because it's frequently accessed.
Well Microsoft, here's another nice mess you've gotten us into.

jj2007

Right, the garbage that is in your local variables is probably cached.

But the good values that you have in global variables are also probably cached, and you don't have to write to global memory each and every time before using them. So they are faster on average.

MichaelW

Good point, but consider that caching also works for write accesses. So for locals the initial write would likely be cached, where for the first of a localized group of globals the initial read would likely be uncached. Which is faster would depend on your global access patterns, but I agree that initialized globals are likely to be faster on average.
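To make the asymmetry concrete, a minimal sketch (illustrative names, assuming the usual masm32rt.inc environment; this is not part of the test code later in the thread):

.data
    gCounter dd 0                       ; hypothetical global
.code
demo proc hWnd:HWND
    LOCAL ps:PAINTSTRUCT
    invoke BeginPaint, hWnd, addr ps    ; first access to the local is a WRITE:
                                        ; the API fills ps, so the stale stack
                                        ; content is never read
    mov eax, gCounter                   ; first access to the global is a READ:
                                        ; if the line is not cached, the load misses
    invoke EndPaint, hWnd, addr ps
    ret
demo endp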
Well Microsoft, here's another nice mess you've gotten us into.

jj2007

Quote from: MichaelW on February 21, 2013, 12:15:26 PM
Good point, but consider that caching also works for write accesses. So for locals the initial write would likely be cached, where for the first of a group of globals the initial read would likely be uncached.

For a standard GUI app, my trusty Celeron's 32 KB L1 cache will be sufficient to keep all global variables in the cache. But all local variables, without exception, must be written not only to the cache but also to physical memory each and every time you want to use them, because what is cached of them is garbage.

qWord

Quote from: jj2007 on February 21, 2013, 12:31:12 PM
For a standard GUI app, my trusty Celeron's 32 KB L1 cache will be sufficient to keep all global variables in the cache.
... and all that stuff from the other processes and threads.

Quote from: jj2007 on February 21, 2013, 12:31:12 PM
But all local variables, without exception, must be written not only to the cache but also to physical memory each and every time you want to use them, because what is cached of them is garbage.
No, IIRC we commonly have a write-back cache for user-land memory, so a write-back to physical memory only occurs when the cache is full, some kind of synchronization applies, or the cache controller decides that the region is no longer needed.
MREAL macros - when you need floating point arithmetic while assembling!

jj2007

Quote from: qWord on February 21, 2013, 12:47:42 PM
Quote from: jj2007 on February 21, 2013, 12:31:12 PM
For a standard GUI app, my trusty Celeron's 32 KB L1 cache will be sufficient to keep all global variables in the cache.
... and all that stuff from the other processes and threads.

Basically, when you context switch, all of the memory addresses that the processor "remembers" in its cache effectively become useless.


Quote
Quote from: jj2007 on February 21, 2013, 12:31:12 PM
But all local variables, without exception, must be written not only to the cache but also to physical memory each and every time you want to use them, because what is cached of them is garbage.
No, IIRC we commonly have a write-back cache for user-land memory, so a write-back to physical memory only occurs when the cache is full, some kind of synchronization applies, or the cache controller decides that the region is no longer needed.

"a write back to phy. mem. only occurs if the catch cache is full" may be correct but is irrelevant. Your LOCAL rc:RECT in the WndProc may be at a different address every time you write to it. Remember that on entry to the WndProc, esp varies.

MichaelW

Quote from: jj2007 on February 21, 2013, 01:26:02 PM
Your LOCAL rc:RECT in the WndProc may be at a different address every time you write to it. Remember that on entry to the WndProc, esp varies.

A different address, but still likely a cached address.

Well Microsoft, here's another nice mess you've gotten us into.

qWord

Quote from: jj2007 on February 21, 2013, 01:26:02 PM
Basically, when you context switch, all of the memory addresses that the processor "remembers" in its cache effectively become useless.
That doesn't make much sense, because caches also work with physical addresses, not only with virtual addresses. Also, if you look in Intel's manuals, you will see that there are solutions to the problem of different address spaces.
Besides, that theory would make the large caches used nowadays useless, because it would mean the whole cache has to be flushed for each of the thousands of context switches per second.
MREAL macros - when you need floating point arithmetic while assembling!

dedndave

this is a bit beyond the campus   :P

but, i would think the internal cache is somehow shared between contexts
an external cache can be switched with the context by changing page table entries

MichaelW

#25
If the first use of a local is as a storage destination, as you would use a local PAINTSTRUCT for example, then there is no initialization penalty. This code compares the access times for globals and locals, with sequential and random access:

;==============================================================================
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm
;==============================================================================
;--------------------------------------------------------
; This is an assembly-time random number generator based
; on code by George Marsaglia:
;   #define znew  ((z=36969*(z&65535)+(z>>16))<<16)
;   #define wnew  ((w=18000*(w&65535)+(w>>16))&65535)
;   #define MWC   (znew+wnew)
;--------------------------------------------------------
@znew_seed@ = 362436069
@wnew_seed@ = 521288629

@rnd MACRO base:REQ
    LOCAL znew, wnew
    @znew_seed@ = 36969 * (@znew_seed@ AND 65535) + (@znew_seed@ SHR 16)
    znew = @znew_seed@ SHL 16
    @wnew_seed@ = 18000 * (@wnew_seed@ AND 65535) + (@wnew_seed@ SHR 16)
    wnew = @wnew_seed@ AND 65535
    EXITM <(znew + wnew) MOD base>
ENDM
;==============================================================================
ARRAY_SIZE equ 100
;==============================================================================
.data
    array1  dd ARRAY_SIZE dup(?)
.code
;==============================================================================

align 4
globals_seq proc
    lea esi, array1
    xor ecx, ecx
  @@:
    mov eax, [esi+ecx*4]
    inc ecx
    cmp ecx, ARRAY_SIZE
    jb  @B
    ret
globals_seq endp

align 4
locals_seq proc
    LOCAL array2[ARRAY_SIZE]:DWORD
    lea esi, array2
    xor ecx, ecx
  @@:
    mov eax, [esi+ecx*4]
    inc ecx
    cmp ecx, ARRAY_SIZE
    jb  @B
    ret
locals_seq endp

align 4
globals_rnd proc
    lea esi, array1
    REPEAT ARRAY_SIZE
        mov eax, @rnd(ARRAY_SIZE)
        mov eax, [esi+eax*4]
    ENDM
    ret
globals_rnd endp

align 4
locals_rnd proc
    LOCAL array2[ARRAY_SIZE]:DWORD
    lea esi, array2
    REPEAT ARRAY_SIZE
        mov eax, @rnd(ARRAY_SIZE)
        mov eax, [esi+eax*4]
    ENDM
    ret
locals_rnd endp

;==============================================================================
start:
;==============================================================================
    invoke GetCurrentProcess
    invoke SetProcessAffinityMask, eax, 1

    REPEAT 100
        mov eax, @rnd(ARRAY_SIZE)
        printf("%d\t",eax)
    ENDM
    printf("\n")

    invoke Sleep, 5000

    REPEAT 3

        counter_begin 10000000, HIGH_PRIORITY_CLASS
        counter_end
        printf("%d cycles, empty\n", eax)

        counter_begin 10000000, HIGH_PRIORITY_CLASS
            call globals_seq
        counter_end
        printf("%d cycles, globals_seq\n", eax)

        counter_begin 10000000, HIGH_PRIORITY_CLASS
            call locals_seq
        counter_end
        printf("%d cycles, locals_seq\n", eax)

        counter_begin 10000000, HIGH_PRIORITY_CLASS
            call globals_rnd
        counter_end
        printf("%d cycles, globals_rnd\n", eax)

        counter_begin 10000000, HIGH_PRIORITY_CLASS
            call locals_rnd
        counter_end
        printf("%d cycles, locals_rnd\n\n", eax)

    ENDM

    inkey
    exit
;==============================================================================
end start


Running on a P3 (Katmai):

0 cycles, empty
310 cycles, globals_seq
213 cycles, locals_seq
100 cycles, globals_rnd
102 cycles, locals_rnd

0 cycles, empty
311 cycles, globals_seq
213 cycles, locals_seq
100 cycles, globals_rnd
102 cycles, locals_rnd

0 cycles, empty
310 cycles, globals_seq
213 cycles, locals_seq
100 cycles, globals_rnd
102 cycles, locals_rnd


Running on a P4 (Northwood):

1 cycles, empty
219 cycles, globals_seq
221 cycles, locals_seq
88 cycles, globals_rnd
89 cycles, locals_rnd

1 cycles, empty
219 cycles, globals_seq
221 cycles, locals_seq
88 cycles, globals_rnd
96 cycles, locals_rnd

1 cycles, empty
219 cycles, globals_seq
221 cycles, locals_seq
88 cycles, globals_rnd
89 cycles, locals_rnd


For the P3, increasing the array size to 200 elements or dropping it to 30 elements increased the cycle count for the local array sequential access, putting it close to the count for the global array, I suspect because of caching effects.


Well Microsoft, here's another nice mess you've gotten us into.

jj2007

#26
I've looked at it from a different angle, comparing three cases (sketched below):
- local vars, which need to be initialised
- global vars, either static or assigned in the proc
- random stack, as in a normal WndProc
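A rough illustration of what the three cases mean (a sketch; jj2007's actual MasmBasic test bed is not posted in the thread):

.data
    gStatic   dd 123            ; global, static: initialised at assembly time
    gAssigned dd ?
.code
demo3 proc
    LOCAL lVal:DWORD
    mov lVal, 123               ; local: must be initialised before use
    mov eax, gStatic            ; global, static: read directly
    mov gAssigned, 123          ; global, assigned in the proc
    ret
demo3 endp

The "random stack" case presumably runs the same accesses after moving esp by a random amount first, mimicking the varying stack depth of a real WndProc.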

Results:
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 433/200 cycles

2081    cycles for 200 * local
979     cycles for 200 * global no init
1068    cycles for 200 * global w init
19630   cycles for 200 * local, random stack
18900   cycles for 200 * global w init, random stack
18689   cycles for 200 * global no init, random stack

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 355/200 cycles

1697    cycles for 200 * local
1059    cycles for 200 * global no init
1265    cycles for 200 * global w init
18163   cycles for 200 * local, random stack
16680   cycles for 200 * global w init, random stack
16893   cycles for 200 * global no init, random stack


The "random stack" results are distorted because nrandom is very slow. There is an option useMB=1 that switches to MasmBasic's fast Rand(). Example:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 356/200 cycles

1696    cycles for 200 * local
1058    cycles for 200 * global no init
1264    cycles for 200 * global w init
5281    cycles for 200 * local, random stack
4295    cycles for 200 * global w init, random stack
4731    cycles for 200 * global no init, random stack

qWord

#27
Quote from: MichaelW on February 21, 2013, 05:42:58 PM
This code compares the access times for globals and locals, with sequential and random access:
If I modify the test bed so that the Sleep is moved into the REPEAT loop, the loop count is ten, and the repetition count of the counter macro is one, I get the following results (Intel i7 3610QM):
Press any key to continue ...
94 50 70 51 17 51 90 81 74 73 92 11 25 99 90 40 31 23 25 64 64 57 99 27 18 40 59 0 49 3 23 85 30 7 13 86 10 12 31 49 71 1 30 58 71 24 40 46 50 84 84 26 1 9 81 29 10 80 49 82 58 18 38 59 61 64 31 42 18 20 27 11 21 74 73 6 42 33 98 28 60 57 62 88 89 5 76 75 1 29 47 91 95 24 35 38 17 63 32 38

-134 cycles, empty
546 cycles, globals_seq
186 cycles, locals_seq
906 cycles, globals_rnd
210 cycles, locals_rnd

-32 cycles, empty
938 cycles, globals_seq
389 cycles, locals_seq
1608 cycles, globals_rnd
453 cycles, locals_rnd

0 cycles, empty
1244 cycles, globals_seq
435 cycles, locals_seq
1288 cycles, globals_rnd
354 cycles, locals_rnd

7 cycles, empty
1334 cycles, globals_seq
561 cycles, locals_seq
1270 cycles, globals_rnd
338 cycles, locals_rnd

0 cycles, empty
904 cycles, globals_seq
577 cycles, locals_seq
1603 cycles, globals_rnd
373 cycles, locals_rnd

30 cycles, empty
1113 cycles, globals_seq
621 cycles, locals_seq
1247 cycles, globals_rnd
704 cycles, locals_rnd

-30 cycles, empty
1012 cycles, globals_seq
423 cycles, locals_seq
805 cycles, globals_rnd
522 cycles, locals_rnd

-44 cycles, empty
989 cycles, globals_seq
614 cycles, locals_seq
1270 cycles, globals_rnd
989 cycles, locals_rnd

-2 cycles, empty
952 cycles, globals_seq
550 cycles, locals_seq
858 cycles, globals_rnd
492 cycles, locals_rnd

-53 cycles, empty
948 cycles, globals_seq
573 cycles, locals_seq
1290 cycles, globals_rnd
522 cycles, locals_rnd

We should not mask cache misses by using high loop counts.

The results for the unmodified test:
94      50      70      51      17      51      90      81      74      73
92      11      25      99      90      40      31      23      25      64
64      57      99      27      18      40      59      0       49      3
23      85      30      7       13      86      10      12      31      49
71      1       30      58      71      24      40      46      50      84
84      26      1       9       81      29      10      80      49      82
58      18      38      59      61      64      31      42      18      20
27      11      21      74      73      6       42      33      98      28
60      57      62      88      89      5       76      75      1       29
47      91      95      24      35      38      17      63      32      38

-5 cycles, empty
112 cycles, globals_seq
112 cycles, locals_seq
33 cycles, globals_rnd
34 cycles, locals_rnd

0 cycles, empty
113 cycles, globals_seq
112 cycles, locals_seq
34 cycles, globals_rnd
34 cycles, locals_rnd

0 cycles, empty
112 cycles, globals_seq
112 cycles, locals_seq
33 cycles, globals_rnd
34 cycles, locals_rnd

Press any key to continue ...
MREAL macros - when you need floating point arithmetic while assembling!

dedndave

globals can be faster, because we often use absolute-direct addressing with them
    mov     eax,GlobalVar
not to mention, when a local is created, the routine has to adjust the stack - lol

passing the address of a global can also be faster - you push a constant
for a local, the assembler uses LEA, then PUSH EAX
PUSH EAX is fast, but the fact that you also have an LEA hurts
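A sketch of the difference (gRect and lRect are illustrative names; the expansions in the comments are what MASM's invoke/ADDR generate):

.data
    gRect RECT <>
.code
demo2 proc hWnd:HWND
    LOCAL lRect:RECT
    invoke GetClientRect, hWnd, ADDR gRect  ; expands to: push OFFSET gRect -
                                            ; the address is an assembly-time constant
    invoke GetClientRect, hWnd, ADDR lRect  ; expands to: lea eax, lRect
                                            ;             push eax -
                                            ; the address is computed at run time
    ret
demo2 endp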

MichaelW

#29
I think this thread may need to be split.

I used an assembly-time RNG so I could avoid an overhead count of 20-30 times the count for the array access.

The stack adjustment to make room for the locals is fast.
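For reference, the whole adjustment for the 100-DWORD local array in locals_seq is three prologue instructions (a sketch of what MASM generates from the LOCAL directive):

    push ebp            ; save the caller's frame pointer
    mov ebp, esp        ; establish the new frame
    sub esp, 400        ; reserve all 100 DWORD locals in a single subtract
    ; procedure body runs here
    mov esp, ebp        ; release the locals
    pop ebp
    ret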

And I failed to consider that the high loop count would "mask" cache misses, when the only thing I can see that could make locals faster is a reduction in cache misses.

This is a quick modification of the cycle count macros to allow control of the thread priority, the idea being to get consistent counts in only a small number of loops, hopefully only one. I cannot reasonably run at the highest possible priority on my P3, and on my P4 with HT I had to use a loop count of 100 to get even reasonable consistency. Perhaps the newer processors, with multiple physical cores, can do better.

  ; ----------------------------------------------------------------------
  ; These two macros perform the grunt work involved in measuring the
  ; processor clock cycle count for a block of code. These macros must
  ; be used in pairs, and the block of code must be placed in between
  ; the counter_begin and counter_end macro calls. The counter_end macro
  ; returns the clock cycle count for a single pass through the block of
  ; code, corrected for the test loop overhead, in EAX.
  ;
  ; These macros require a .586 or higher processor directive.
  ;
  ; The essential differences between these macros and the previous macros
  ; are that these save and restore the original priorities, and provide
  ; a way to control the thread priority. Control of the thread priority
  ; allows timing code at the highest possible priority by combining
  ; REALTIME_PRIORITY_CLASS with THREAD_PRIORITY_TIME_CRITICAL.
  ;
  ; Note that running at the higher priority settings on a single core
  ; processor involves some risk, as it will cause your process to
  ; preempt *all* other processes, including critical Windows processes.
  ; Using HIGH_PRIORITY_CLASS in combination with THREAD_PRIORITY_NORMAL
  ; should generally be safe.
  ; ----------------------------------------------------------------------

    counter_begin MACRO loopcount:REQ, process_priority:REQ, thread_priority
        LOCAL label

        IFNDEF __counter__qword__count__
          .data
          ALIGN 8             ;; Optimal alignment for QWORD
            __counter__qword__count__  dq 0
            __counter__loop__count__   dd 0
            __counter__loop__counter__ dd 0
            __process_priority_class__ dd 0
            __thread_priority__        dd 0
            __current_process__        dd 0
            __current_thread__         dd 0
          .code
        ENDIF

        mov __counter__loop__count__, loopcount
        invoke GetCurrentProcess
        mov __current_process__, eax
        invoke GetPriorityClass, __current_process__
        mov __process_priority_class__, eax
        invoke SetPriorityClass, __current_process__, process_priority
        IFNB <thread_priority>
            invoke GetCurrentThread
            mov __current_thread__, eax
            invoke GetThreadPriority, __current_thread__
            mov __thread_priority__, eax
            invoke SetThreadPriority, __current_thread__, thread_priority
        ENDIF
        xor eax, eax          ;; Use same CPUID input value for each call
        cpuid                 ;; Flush pipe & wait for pending ops to finish
        rdtsc                 ;; Read Time Stamp Counter

        push edx              ;; Preserve high-order 32 bits of start count
        push eax              ;; Preserve low-order 32 bits of start count
        mov   __counter__loop__counter__, loopcount
        xor eax, eax
        cpuid                 ;; Make sure loop setup instructions finish
      ALIGN 16                ;; Optimal loop alignment for P6
      @@:                     ;; Start an empty reference loop
        sub __counter__loop__counter__, 1
        jnz @B

        xor eax, eax
        cpuid                 ;; Make sure loop instructions finish
        rdtsc                 ;; Read end count
        pop ecx               ;; Recover low-order 32 bits of start count
        sub eax, ecx          ;; Low-order 32 bits of overhead count in EAX
        pop ecx               ;; Recover high-order 32 bits of start count
        sbb edx, ecx          ;; High-order 32 bits of overhead count in EDX
        push edx              ;; Preserve high-order 32 bits of overhead count
        push eax              ;; Preserve low-order 32 bits of overhead count

        xor eax, eax
        cpuid
        rdtsc
        push edx              ;; Preserve high-order 32 bits of start count
        push eax              ;; Preserve low-order 32 bits of start count
        mov   __counter__loop__counter__, loopcount
        xor eax, eax
        cpuid                 ;; Make sure loop setup instructions finish
      ALIGN 16                ;; Optimal loop alignment for P6
      label:                  ;; Start test loop
        __counter__loop__label__ equ <label>
    ENDM

    counter_end MACRO
        sub __counter__loop__counter__, 1
        jnz  __counter__loop__label__

        xor eax, eax
        cpuid                 ;; Make sure loop instructions finish
        rdtsc                 ;; Read end count
        pop ecx               ;; Recover low-order 32 bits of start count
        sub eax, ecx          ;; Low-order 32 bits of test count in EAX
        pop ecx               ;; Recover high-order 32 bits of start count
        sbb edx, ecx          ;; High-order 32 bits of test count in EDX
        pop ecx               ;; Recover low-order 32 bits of overhead count
        sub eax, ecx          ;; Low-order 32 bits of adjusted count in EAX
        pop ecx               ;; Recover high-order 32 bits of overhead count
        sbb edx, ecx          ;; High-order 32 bits of adjusted count in EDX

        mov DWORD PTR __counter__qword__count__, eax
        mov DWORD PTR __counter__qword__count__ + 4, edx

        invoke SetPriorityClass,__current_process__,__process_priority_class__
        ;; Note: thread_priority is not a parameter of this macro, so this
        ;; IFNB is always true and the original priority is always restored.
        IFNB <thread_priority>
            invoke SetThreadPriority, __current_thread__, __thread_priority__
        ENDIF

        finit
        fild __counter__qword__count__
        fild __counter__loop__count__
        fdiv
        fistp __counter__qword__count__

        mov eax, DWORD PTR __counter__qword__count__
    ENDM


I think altering the priority for each loop may not be the best way to do it, because there is no knowing what hoops Windows may be jumping through to do this. Perhaps altering the priority at startup would produce better results.
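Usage would look something like this (a sketch reusing globals_seq from the earlier test; REALTIME_PRIORITY_CLASS combined with THREAD_PRIORITY_TIME_CRITICAL is the "highest possible priority" combination described in the header comments):

    counter_begin 1, REALTIME_PRIORITY_CLASS, THREAD_PRIORITY_TIME_CRITICAL
        call globals_seq
    counter_end
    printf("%d cycles, globals_seq, single pass\n", eax)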

Edit:

Added the

mov __thread_priority__, eax

that I left out.



Well Microsoft, here's another nice mess you've gotten us into.