Before I ask my question, I should mention that I am a beginner in MASM and in programming altogether.
My problem, as the title says, is with local variables: they constantly get overwritten between messages in all of my small programs.
For example if i declare two local variables
LOCAL localx:DWORD
LOCAL localy:DWORD
then use them to store the client area size during the WM_SIZE message, the values in the locals will change (be overwritten) before the WM_PAINT message arrives, so the code ends up painting outside the client area.
Usually i would think that the problem is with me, but the same program runs as expected if i declare those variables as global.
Just changing code into this
.DATA
localx DWORD 0
localy DWORD 0
will make everything work fine. By loading the programs into OllyDbg, I made sure that the values originally stored are correct and that they get overwritten later.
As i said that happens with all of my programs, and i couldn't find any solution to it, except declaring global variables.
The question is of course why is this happening.
I would attach an example but the only one i currently have is using high resolution .BMP image in resource, and is 20MB large.
Hi Gwyn
it never happens with me. It is strange to me.
I think the problem may be in the register EBP
used to access those variables. Is it preserved ?
Strange is also what you said: that happens with all of my programs!
Another question: how do you store client area size during WM_SIZE message ? could you show what are you doing ?
Well i never used local variables in window procedures
Hi Gwyn,
Your results are normal. Nothing unexpected. Local variables are always local to their host procedure and the scope is not global. Uninitialized local variables will contain garbage values from the stack. You need to store your critical variables in the .data or .data? section.
A quick example : you can check Iczelion's Simple Bitmap tutorial :
.data?
hBitmap dd ?
.
.
WndProc proc hWnd:HWND, uMsg:UINT, wParam:WPARAM, lParam:LPARAM
LOCAL ps:PAINTSTRUCT
LOCAL hdc:HDC
LOCAL hMemDC:HDC
LOCAL rect:RECT
.if uMsg==WM_CREATE
invoke LoadBitmap,hInstance,IDB_MAIN
mov hBitmap,eax
hBitmap is stored in the uninitialized section .data? because there is no guarantee that the local variables in the procedure survive across consecutive calls.
http://win32assembly.programminghorizon.com/tut25.html
Quote from: Vortex on February 20, 2013, 08:58:53 AM
Hi Gwyn,
No guarantee for the local variables in the procedure to survive across consecutive calls.
I don't completely understand. So you are saying it's normal that values stored in local variables during the WM_SIZE message won't be the same in the WM_PAINT message? I thought local variables kept their values for the whole procedure (for example WndProc), and now you are saying that this only holds within a single call, not between messages.
Could you confirm that?
I did not know that. What confused me were the examples in Petzold's book, where he does exactly that, so I thought I was doing something wrong when it didn't work for me.
@ RuiLoureiro
Here is the code in the attachment. It's a simple mapping mode example where I change the mapping mode to MM_ISOTROPIC and display a bitmap (2048×3072) from the resource in the center of the client area.
I don't know what Petzold book you are referring to, but in C for a local variable to be preserved between calls you have to declare it with the storage class specifier static, which causes the compiler to store it in the uninitialized (BSS) data section, but continue to limit the scope of the variable to within the procedure.
#include <windows.h>
#include <conio.h>
#include <stdio.h>
int test1( void )
{
int i;
int r = i;
i = 123;
return r;
}
int test2( void )
{
static int i;
int r = i;
i = 123;
return r;
}
int main( void )
{
printf("%d\n", test1());
printf("%d\n", test1());
printf("%d\n", test2());
printf("%d\n", test2());
getch();
}
200084144
4198483
0
123
_r$ = -8 ; size = 4
_i$ = -4 ; size = 4
_test1 PROC NEAR
; File c:\program files\microsoft visual c++ toolkit 2003\my\static\test.c
; Line 5
push ebp
mov ebp, esp
sub esp, 8
; Line 7
mov eax, DWORD PTR _i$[ebp]
mov DWORD PTR _r$[ebp], eax
; Line 8
mov DWORD PTR _i$[ebp], 123 ; 0000007bH
; Line 9
mov eax, DWORD PTR _r$[ebp]
; Line 10
mov esp, ebp
pop ebp
ret 0
_test1 ENDP
. . .
_BSS SEGMENT
?i@?1??test2@@9@9 DD 01H DUP (?) ; `test2'::`2'::i
; Function compile flags: /Odt
_BSS ENDS
_TEXT SEGMENT
_r$ = -4 ; size = 4
_test2 PROC NEAR
; Line 12
push ebp
mov ebp, esp
push ecx
; Line 14
mov eax, DWORD PTR ?i@?1??test2@@9@9
mov DWORD PTR _r$[ebp], eax
; Line 15
mov DWORD PTR ?i@?1??test2@@9@9, 123 ; 0000007bH
; Line 16
mov eax, DWORD PTR _r$[ebp]
; Line 17
mov esp, ebp
pop ebp
ret 0
_test2 ENDP
With MASM you have to put it in the data section, and accept a global scope.
local variable contents are volatile
they only remain valid for the instance of a single call to the routine
when the WM_SIZE message is received, that is one instance
when the WM_PAINT message is received, that is another instance
each message is a separate instance
this can be overcome by using global variables (in the .DATA? or .DATA section)
in this particular case, you may not need to store the information
when you call the BeginPaint function, it fills a PAINTSTRUCT structure
part of that structure is a RECT rectangle structure that describes what part of the client area needs to be drawn
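A minimal C sketch of the point above. The message IDs MSG_SIZE and MSG_PAINT and the handler name wnd_proc_static are hypothetical stand-ins (they do not come from the thread or from the Windows headers), and the handler only keeps a width to stay short. The value is held in a static variable, which is the C counterpart of placing it in .DATA?, so it is still there when the next message arrives:

```c
/* Hypothetical message IDs standing in for WM_SIZE and WM_PAINT. */
enum { MSG_SIZE = 1, MSG_PAINT = 2 };

/* Hypothetical handler: returns the width the paint step would use,
   or -1 for any message that does not paint. */
int wnd_proc_static(int msg, int width)
{
    /* One fixed slot in the data section, kept between calls.
       A plain "int client_width;" would be a fresh stack slot
       with an unspecified value on every call. */
    static int client_width;

    if (msg == MSG_SIZE) {
        client_width = width;   /* remembered for later messages */
        return -1;
    }
    if (msg == MSG_PAINT)
        return client_width;    /* value stored by the last MSG_SIZE */
    return -1;
}
```

Calling wnd_proc_static(MSG_SIZE, 640) and then wnd_proc_static(MSG_PAINT, 0) returns 640 from the second call, because both calls share the same static slot.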
Quote from: MichaelW on February 20, 2013, 11:39:10 AM
I don't know what Petzold book you are referring to, but in C for a local variable to be preserved between calls you have to declare it with the storage class specifier static, which causes the compiler to store it in the uninitialized (BSS) data section, but continue to limit the scope of the variable to within the procedure.
I got it now. Yes, he did use static for those declarations, but because I don't know C I didn't know there was any difference between the two. I thought they were regular local variables, which is why I naively used regular local variables in MASM and tried to do the same.
Thanks everybody for their posts, i can say i understand now how to properly use local variables.
Gwyn,
«Thanks everybody for their posts,
i can say i understand now how to properly use local variables»
. So i think your problem is solved.
Meanwhile i want to say this:
1. I use local variables very very rarely;
This is why i have no problems with them!
2. I never use it in window procedures.
And if we want to use it there
we can use inside one message and
not from one to another;
There is no problem with LOCAL ps:PAINTSTRUCT
because it is memory to be used by
BeginPaint/EndPaint only when uMsg==WM_PAINT.
3. Local variables should be initialized.
They are not 0 ! They are something !
4. Why to declare it static if i can define
it in the data section ?
. Your problem is with LOCAL pos:POINT.
When the system calls WndProc with uMsg==WM_SIZE
you save pos.y and pos.x. But when the system
comes again with uMsg==WM_PAINT, pos.y and pos.x
hold some value that has nothing to do with the previous one.
Quote
I use local variables very very rarely;
This is why i have no problems with them!
a lot of C programmers tell us to always use LOCALS, never use GLOBALS - lol
they are both tools - use the right tool for the job
Quote
I never use it in window procedures.
i try to avoid using LOCAL's in WndProc, as well
the way i do it is....
.if uMsg==WM_PAINT
INVOKE PaintProc,hWnd
xor eax,eax
then, i put the LOCAL's in the PaintProc
Quote
Why to declare it static if i can define it in the data section ?
i thought that's what a "static" variable (C term) is in assembler
Quote from: RuiLoureiro on February 21, 2013, 05:02:27 AM
1. I use local variables very very rarely;
that is very very unwise! Beside the straightforward scope, it can also be assumed that locals are always cached.
All in all, it is very simple: if you need to save global state, use global variables*; otherwise use locals.
Quote from: RuiLoureiro on February 21, 2013, 05:02:27 AM
4. Why to declare it static if i can define
it in the data section ?
static == variable in data section with local scope.(?)
* assuming a single threaded environment.
Dave,
«a lot of C programmers tell us to always use LOCALS, never use GLOBALS»
. generally i dont need it, no LOCALS, no GLOBALS
qWord,
« that is very very unwise!
Beside the straightforward scope, it can also be assumed that locals are always cached.»
. To get what a local variable does inside a proc
I try to use other tricks. Meanwhile it is not easy to write local
variables when we want to use esp to access the stack (not ebp).
Rui, this may be related to that other program being 230 kb :biggrin:
but, i suspect you just have some data declared in the .DATA section that could be in .DATA?
Dave, Yes i have a lot of data in .data section.
And yes some of them could be in .data?. One day i
will do that work ! ;)
Quote
Meanwhile it is not easy to write local variables when we want to use esp to access the stack (not ebp).
With a little care, it's possible.
Quote from: Vortex on February 21, 2013, 06:25:54 AM
Quote
Meanwhile it is not easy to write local variables when we want to use esp to access the stack (not ebp).
With a little care, it's possible.
Yes i know and i do
Quote from: qWord on February 21, 2013, 05:57:11 AM
it can also be assumed that locals are always cached
Is that a valid argument, given that you have to write to them before you use them?
Even assuming that the processors don't provide preferential caching of the stack, the active stack would likely be cached because it's frequently accessed.
Right, the garbage that is in your local variables is probably cached.
But the good values that you have in global variables are also probably cached, and you don't have to write to the global memory each and every time before using them. So they are faster on average.
Good point, but consider that caching also works for write accesses. So for locals the initial write would likely be cached, where for the first of a localized group of globals the initial read would likely be uncached. Which is faster would depend on your global access patterns, but I agree that initialized globals are likely to be faster on average.
Quote from: MichaelW on February 21, 2013, 12:15:26 PM
Good point, but consider that caching also works for write accesses. So for locals the initial write would likely be cached, where for the first of a group of globals the initial read would likely be uncached.
For a standard GUI app, my trusty Celeron's L1 cache of 32kb will be sufficient to keep all global variables in the cache. But all local variables, with no exception, must be written not only to the cache but also to physical memory each and every time you want to use them because what is cached of them is garbage.
Quote from: jj2007 on February 21, 2013, 12:31:12 PM
For a standard GUI app, my trusty Celeron's L1 cache of 32kb will be sufficient to keep all global variables in the cache.
... and all that stuff from the other processes and threads.
Quote from: jj2007 on February 21, 2013, 12:31:12 PM
But all local variables, with no exception, must be written not only to the cache but also to physical memory each and every time you want to use them because what is cached of them is garbage.
no, IIRC we commonly have write-back cache for user-land memory, thus a write back to phy. mem. only occurs if the cache is full, some kind of synchronization applies or the cache control decides that the region is no longer needed.
Quote from: qWord on February 21, 2013, 12:47:42 PM
Quote from: jj2007 on February 21, 2013, 12:31:12 PM
For a standard GUI app, my trusty Celeron's L1 cache of 32kb will be sufficient to keep all global variables in the cache.
... and all that stuff from the other processes and threads.
Basically, when you context switch, all of the memory addresses that the processor "remembers" in it's cache effectively become useless.
(http://stackoverflow.com/questions/5440128/thread-context-switch-vs-process-context-switch)
Quote
Quote from: jj2007 on February 21, 2013, 12:31:12 PM
But all local variables, with no exception, must be written not only to the cache but also to physical memory each and every time you want to use them because what is cached of them is garbage.
no, IIRC we commonly have write-back cache for user-land memory, thus a write back to phy. mem. only occurs if the cache is full, some kind of synchronization applies or the cache control decides that the region is no longer needed.
"a write back to phy. mem. only occurs if the cache is full" may be correct but is irrelevant. Your LOCAL rc:RECT in the WndProc may be at a different address every time you write to it. Remember that on entry to the WndProc, esp varies.
Quote from: jj2007 on February 21, 2013, 01:26:02 PM
Your LOCAL rc:RECT in the WndProc may be at a different address every time you write to it. Remember that on entry to the WndProc, esp varies.
A different address, but still likely a cached address.
Quote from: jj2007 on February 21, 2013, 01:26:02 PM
Basically, when you context switch, all of the memory addresses that the processor "remembers" in it's cache effectively become useless. (http://stackoverflow.com/questions/5440128/thread-context-switch-vs-process-context-switch)
That does not make much sense, because caches also work with physical addresses and not only with virtual addresses. Also, looking in Intel's manuals, you will see that there are solutions for the problem of different address spaces.
Besides, that theory would make the large caches used nowadays useless, because it would mean the whole cache needs to be copied for each of the thousands of context switches per second.
this is a bit beyond the campus :P
but, i would think the internal cache is somehow shared between contexts
an external cache can be switched with the context by changing page table entries
If the first use for a local is as a storage destination, as you would use a local PAINTSTRUCT for example, then there is no initialization penalty. This code compares the access times for globals and locals, sequential access and random access:
;==============================================================================
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm
;==============================================================================
;--------------------------------------------------------
; This is an assembly-time random number generator based
; on code by George Marsaglia:
; #define znew ((z=36969*(z&65535)+(z>>16))<<16)
; #define wnew ((w=18000*(w&65535)+(w>>16))&65535)
; #define MWC (znew+wnew)
;--------------------------------------------------------
@znew_seed@ = 362436069
@wnew_seed@ = 521288629
@rnd MACRO base:REQ
LOCAL znew, wnew
@znew_seed@ = 36969 * (@znew_seed@ AND 65535) + (@znew_seed@ SHR 16)
znew = @znew_seed@ SHL 16
@wnew_seed@ = 18000 * (@wnew_seed@ AND 65535) + (@wnew_seed@ SHR 16)
wnew = @wnew_seed@ AND 65535
EXITM <(znew + wnew) MOD base>
ENDM
;==============================================================================
ARRAY_SIZE equ 100
;==============================================================================
.data
array1 dd ARRAY_SIZE dup(?)
.code
;==============================================================================
align 4
globals_seq proc
lea esi, array1
xor ecx, ecx
@@:
mov eax, [esi+ecx*4]
inc ecx
cmp ecx, ARRAY_SIZE
jb @B
ret
globals_seq endp
align 4
locals_seq proc
LOCAL array2[ARRAY_SIZE]:DWORD
lea esi, array2
xor ecx, ecx
@@:
mov eax, [esi+ecx*4]
inc ecx
cmp ecx, ARRAY_SIZE
jb @B
ret
locals_seq endp
align 4
globals_rnd proc
lea esi, array1
REPEAT ARRAY_SIZE
mov eax, @rnd(ARRAY_SIZE)
mov eax, [esi+eax*4]
ENDM
ret
globals_rnd endp
align 4
locals_rnd proc
LOCAL array2[ARRAY_SIZE]:DWORD
lea esi, array2 ; base register must match the [esi+eax*4] loads below
REPEAT ARRAY_SIZE
mov eax, @rnd(ARRAY_SIZE)
mov eax, [esi+eax*4]
ENDM
ret
locals_rnd endp
;==============================================================================
start:
;==============================================================================
invoke GetCurrentProcess
invoke SetProcessAffinityMask, eax, 1
REPEAT 100
mov eax, @rnd(ARRAY_SIZE)
printf("%d\t",eax)
ENDM
printf("\n")
invoke Sleep, 5000
REPEAT 3
counter_begin 10000000, HIGH_PRIORITY_CLASS
counter_end
printf("%d cycles, empty\n", eax)
counter_begin 10000000, HIGH_PRIORITY_CLASS
call globals_seq
counter_end
printf("%d cycles, globals_seq\n", eax)
counter_begin 10000000, HIGH_PRIORITY_CLASS
call locals_seq
counter_end
printf("%d cycles, locals_seq\n", eax)
counter_begin 10000000, HIGH_PRIORITY_CLASS
call globals_rnd
counter_end
printf("%d cycles, globals_rnd\n", eax)
counter_begin 10000000, HIGH_PRIORITY_CLASS
call locals_rnd
counter_end
printf("%d cycles, locals_rnd\n\n", eax)
ENDM
inkey
exit
;==============================================================================
end start
Running on a P3 (Katmai):
0 cycles, empty
310 cycles, globals_seq
213 cycles, locals_seq
100 cycles, globals_rnd
102 cycles, locals_rnd
0 cycles, empty
311 cycles, globals_seq
213 cycles, locals_seq
100 cycles, globals_rnd
102 cycles, locals_rnd
0 cycles, empty
310 cycles, globals_seq
213 cycles, locals_seq
100 cycles, globals_rnd
102 cycles, locals_rnd
Running on a P4 (Northwood):
1 cycles, empty
219 cycles, globals_seq
221 cycles, locals_seq
88 cycles, globals_rnd
89 cycles, locals_rnd
1 cycles, empty
219 cycles, globals_seq
221 cycles, locals_seq
88 cycles, globals_rnd
96 cycles, locals_rnd
1 cycles, empty
219 cycles, globals_seq
221 cycles, locals_seq
88 cycles, globals_rnd
89 cycles, locals_rnd
For the P3, increasing the array size to 200 elements or dropping it to 30 elements increased the cycle count for the local array sequential access, putting it close to the count for the global array, I suspect because of caching effects.
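For readers who want to try the generator outside the assembler, here is a minimal run-time sketch in C of the Marsaglia multiply-with-carry step that the @rnd macro in the listing is based on. The type and function names (mwc_state, mwc_init, mwc_next, mwc_index) are made up for this sketch, and it assumes plain unsigned 32-bit arithmetic. It is not claimed to reproduce the numbers printed by the test program, since the macro's seeds advance at assembly time with every expansion and MASM's expression arithmetic may differ.

```c
#include <stdint.h>

/* State of the multiply-with-carry generator: two 32-bit words. */
typedef struct {
    uint32_t z;
    uint32_t w;
} mwc_state;

/* Start from the same seed constants the macro uses. */
mwc_state mwc_init(void)
{
    mwc_state s = { 362436069u, 521288629u };
    return s;
}

/* One step, following the #define lines quoted in the listing:
     znew = (z = 36969*(z&65535) + (z>>16)) << 16
     wnew = (w = 18000*(w&65535) + (w>>16)) & 65535
     MWC  = znew + wnew
   All arithmetic wraps modulo 2^32. */
uint32_t mwc_next(mwc_state *s)
{
    s->z = 36969u * (s->z & 65535u) + (s->z >> 16);
    s->w = 18000u * (s->w & 65535u) + (s->w >> 16);
    return (s->z << 16) + (s->w & 65535u);
}

/* Index in the range 0 .. base-1, in the spirit of @rnd(base).
   base must be nonzero. */
uint32_t mwc_index(mwc_state *s, uint32_t base)
{
    return mwc_next(s) % base;
}
```

A caller would create one state with mwc_init() and then call mwc_index(&state, ARRAY_SIZE) once per access. Unlike the macro, this costs cycles at run time, which is the overhead the assembly-time version was meant to avoid.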
I've looked at it from a different angle:
- local vars, need to be initialised
- global vars, either static or assigned in proc
- random stack as in a normal WndProc
Results:
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 433/200 cycles
2081 cycles for 200 * local
979 cycles for 200 * global no init
1068 cycles for 200 * global w init
19630 cycles for 200 * local, random stack
18900 cycles for 200 * global w init, random stack
18689 cycles for 200 * global no init, random stack
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 355/200 cycles
1697 cycles for 200 * local
1059 cycles for 200 * global no init
1265 cycles for 200 * global w init
18163 cycles for 200 * local, random stack
16680 cycles for 200 * global w init, random stack
16893 cycles for 200 * global no init, random stack
The "random stack" results are distorted because nrandom is very slow. There is an option useMB=1 that switches to MasmBasic's fast Rand(). Example:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 356/200 cycles
1696 cycles for 200 * local
1058 cycles for 200 * global no init
1264 cycles for 200 * global w init
5281 cycles for 200 * local, random stack
4295 cycles for 200 * global w init, random stack
4731 cycles for 200 * global no init, random stack
Quote from: MichaelW on February 21, 2013, 05:42:58 PM
This code compares the access times for globals and locals, sequential access and random access:
If I modify the test bed so that the Sleep is moved into the REPEAT loop, the loop count is ten and the repetition count of the counter macro is one, I get the following results (Intel i7 3610QM):
Press any key to continue ...
94 50 70 51 17 51 90 81 74 73 92 11 25 99 90 40 31 23 25 64 64 57 99 27 18 40 59 0 49 3 23 85 30 7 13 86 10 12 31 49 71 1 30 58 71 24 40 46 50 84 84 26 1 9 81 29 10 80 49 82 58 18 38 59 61 64 31 42 18 20 27 11 21 74 73 6 42 33 98 28 60 57 62 88 89 5 76 75 1 29 47 91 95 24 35 38 17 63 32 38
-134 cycles, empty
546 cycles, globals_seq
186 cycles, locals_seq
906 cycles, globals_rnd
210 cycles, locals_rnd
-32 cycles, empty
938 cycles, globals_seq
389 cycles, locals_seq
1608 cycles, globals_rnd
453 cycles, locals_rnd
0 cycles, empty
1244 cycles, globals_seq
435 cycles, locals_seq
1288 cycles, globals_rnd
354 cycles, locals_rnd
7 cycles, empty
1334 cycles, globals_seq
561 cycles, locals_seq
1270 cycles, globals_rnd
338 cycles, locals_rnd
0 cycles, empty
904 cycles, globals_seq
577 cycles, locals_seq
1603 cycles, globals_rnd
373 cycles, locals_rnd
30 cycles, empty
1113 cycles, globals_seq
621 cycles, locals_seq
1247 cycles, globals_rnd
704 cycles, locals_rnd
-30 cycles, empty
1012 cycles, globals_seq
423 cycles, locals_seq
805 cycles, globals_rnd
522 cycles, locals_rnd
-44 cycles, empty
989 cycles, globals_seq
614 cycles, locals_seq
1270 cycles, globals_rnd
989 cycles, locals_rnd
-2 cycles, empty
952 cycles, globals_seq
550 cycles, locals_seq
858 cycles, globals_rnd
492 cycles, locals_rnd
-53 cycles, empty
948 cycles, globals_seq
573 cycles, locals_seq
1290 cycles, globals_rnd
522 cycles, locals_rnd
We should not blend out cache misses by using high loop counts.
the result for the unmodified test:
94 50 70 51 17 51 90 81 74 73
92 11 25 99 90 40 31 23 25 64
64 57 99 27 18 40 59 0 49 3
23 85 30 7 13 86 10 12 31 49
71 1 30 58 71 24 40 46 50 84
84 26 1 9 81 29 10 80 49 82
58 18 38 59 61 64 31 42 18 20
27 11 21 74 73 6 42 33 98 28
60 57 62 88 89 5 76 75 1 29
47 91 95 24 35 38 17 63 32 38
-5 cycles, empty
112 cycles, globals_seq
112 cycles, locals_seq
33 cycles, globals_rnd
34 cycles, locals_rnd
0 cycles, empty
113 cycles, globals_seq
112 cycles, locals_seq
34 cycles, globals_rnd
34 cycles, locals_rnd
0 cycles, empty
112 cycles, globals_seq
112 cycles, locals_seq
33 cycles, globals_rnd
34 cycles, locals_rnd
Press any key to continue ...
globals can be faster, because we often use absolute-direct addressing with them
mov eax,GlobalVar
not to mention, when a local is created, the routine has to adjust the stack - lol
passing the address of a global can also be faster - you push a constant
for a local, the assembler uses LEA, then PUSH EAX
PUSH EAX is fast, but the fact that you also have an LEA hurts
I think this thread may need to be split.
I used an assembly-time RNG so I could avoid an overhead count of 20-30 times the count for the array access.
The stack adjustment to make room for the locals is fast.
And I failed to consider that the high loop count would "blend out" cache misses, when the only thing I can see that could make locals faster is a reduction in cache misses.
This is a quick modification of the cycle count macros to allow control of the thread priority, the idea being to get consistent counts in only a small number of loops, and hopefully only one loop. I cannot reasonably run at the highest possible priority on my P3, and on my P4 w HT I had to use a loop count of 100 to get even reasonable consistency. Perhaps the newer processors, with multiple physical cores, can do better.
; ----------------------------------------------------------------------
; These two macros perform the grunt work involved in measuring the
; processor clock cycle count for a block of code. These macros must
; be used in pairs, and the block of code must be placed in between
; the counter_begin and counter_end macro calls. The counter_end macro
; returns the clock cycle count for a single pass through the block of
; code, corrected for the test loop overhead, in EAX.
;
; These macros require a .586 or higher processor directive.
;
; The essential differences between these macros and the previous macros
; are that these save and restore the original priorities, and provide
; a way to control the thread priority. Control of the thread priority
; allows timing code at the highest possible priority by combining
; REALTIME_PRIORITY_CLASS with THREAD_PRIORITY_TIME_CRITICAL.
;
; Note that running at the higher priority settings on a single core
; processor involves some risk, as it will cause your process to
; preempt *all* other processes, including critical Windows processes.
; Using HIGH_PRIORITY_CLASS in combination with THREAD_PRIORITY_NORMAL
; should generally be safe.
; ----------------------------------------------------------------------
counter_begin MACRO loopcount:REQ, process_priority:REQ, thread_priority
LOCAL label
IFNDEF __counter__qword__count__
.data
ALIGN 8 ;; Optimal alignment for QWORD
__counter__qword__count__ dq 0
__counter__loop__count__ dd 0
__counter__loop__counter__ dd 0
__process_priority_class__ dd 0
__thread_priority__ dd 0
__current_process__ dd 0
__current_thread__ dd 0
.code
ENDIF
mov __counter__loop__count__, loopcount
invoke GetCurrentProcess
mov __current_process__, eax
invoke GetPriorityClass, __current_process__
mov __process_priority_class__, eax
invoke SetPriorityClass, __current_process__, process_priority
IFNB <thread_priority>
invoke GetCurrentThread
mov __current_thread__, eax
invoke GetThreadPriority, __current_thread__
mov __thread_priority__, eax
invoke SetThreadPriority, __current_thread__, thread_priority
ENDIF
xor eax, eax ;; Use same CPUID input value for each call
cpuid ;; Flush pipe & wait for pending ops to finish
rdtsc ;; Read Time Stamp Counter
push edx ;; Preserve high-order 32 bits of start count
push eax ;; Preserve low-order 32 bits of start count
mov __counter__loop__counter__, loopcount
xor eax, eax
cpuid ;; Make sure loop setup instructions finish
ALIGN 16 ;; Optimal loop alignment for P6
@@: ;; Start an empty reference loop
sub __counter__loop__counter__, 1
jnz @B
xor eax, eax
cpuid ;; Make sure loop instructions finish
rdtsc ;; Read end count
pop ecx ;; Recover low-order 32 bits of start count
sub eax, ecx ;; Low-order 32 bits of overhead count in EAX
pop ecx ;; Recover high-order 32 bits of start count
sbb edx, ecx ;; High-order 32 bits of overhead count in EDX
push edx ;; Preserve high-order 32 bits of overhead count
push eax ;; Preserve low-order 32 bits of overhead count
xor eax, eax
cpuid
rdtsc
push edx ;; Preserve high-order 32 bits of start count
push eax ;; Preserve low-order 32 bits of start count
mov __counter__loop__counter__, loopcount
xor eax, eax
cpuid ;; Make sure loop setup instructions finish
ALIGN 16 ;; Optimal loop alignment for P6
label: ;; Start test loop
__counter__loop__label__ equ <label>
ENDM
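counter_end MACRO thread_priority ;; Optional: pass any value here if counter_begin set a thread priority, so the IFNB below restores it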
sub __counter__loop__counter__, 1
jnz __counter__loop__label__
xor eax, eax
cpuid ;; Make sure loop instructions finish
rdtsc ;; Read end count
pop ecx ;; Recover low-order 32 bits of start count
sub eax, ecx ;; Low-order 32 bits of test count in EAX
pop ecx ;; Recover high-order 32 bits of start count
sbb edx, ecx ;; High-order 32 bits of test count in EDX
pop ecx ;; Recover low-order 32 bits of overhead count
sub eax, ecx ;; Low-order 32 bits of adjusted count in EAX
pop ecx ;; Recover high-order 32 bits of overhead count
sbb edx, ecx ;; High-order 32 bits of adjusted count in EDX
mov DWORD PTR __counter__qword__count__, eax
mov DWORD PTR __counter__qword__count__ + 4, edx
invoke SetPriorityClass,__current_process__,__process_priority_class__
IFNB <thread_priority>
invoke SetThreadPriority, __current_thread__, __thread_priority__
ENDIF
finit
fild __counter__qword__count__
fild __counter__loop__count__
fdiv
fistp __counter__qword__count__
mov eax, DWORD PTR __counter__qword__count__
ENDM
I think altering the priority for each loop may not be the best way to do it, because there is no knowing what hoops Windows may be jumping through to do this. Perhaps altering the priority at startup would produce better results.
Edit:
Added the:
mov __thread_priority__, eax
That I left out.
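The arithmetic behind the two macros can be sketched in C, without the priority handling or the CPUID/RDTSC details. The names tick_source, code_under_test, ticks_for_loop and cycles_per_pass are hypothetical, and the tick source is left to the caller (for example a wrapper around the time stamp counter). The sketch also assumes C integer division, which truncates toward zero, where the macro divides on the FPU and stores the result with FISTP.

```c
#include <stdint.h>

/* Hypothetical callback types: a tick counter and the code being timed. */
typedef uint64_t (*tick_source)(void);
typedef void (*code_under_test)(void);

/* Ticks taken by running body() loops times.
   Passing a null body gives the empty reference loop. */
uint64_t ticks_for_loop(tick_source now, code_under_test body, uint32_t loops)
{
    uint64_t start = now();
    for (uint32_t i = 0; i < loops; i++) {
        if (body)
            body();
    }
    return now() - start;
}

/* Per-pass count corrected for loop overhead:
   (test ticks - reference ticks) / loop count.
   The result is signed because the reference loop can measure
   longer than the test loop. loops must be nonzero. */
int64_t cycles_per_pass(uint64_t test_ticks, uint64_t overhead_ticks,
                        uint32_t loops)
{
    int64_t adjusted = (int64_t)test_ticks - (int64_t)overhead_ticks;
    return adjusted / (int64_t)loops;
}
```

With a loop count of 1, a reference loop that happens to measure longer than the test loop gives a negative result, which may be where readings such as "-134 cycles, empty" in the one-shot runs come from.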
Quote from: MichaelW on February 22, 2013, 04:31:22 AM
And I failed to consider that the high loop count would "blend out" cache misses, when the only thing I can see that could make locals faster is a reduction in cache misses.
After the first round everything is cached for all variants. From this point on we are measuring the plain execution time of instructions and not the memory access - Isn't this what we want to measure when proving the postulation "it can also be assumed that locals are always cached"?
Looking at my result with your (unmodified) test, we can see that at least on my i7 there is no difference between the global and local variant (the same as for your P4). That's why I think we should make one-shot measurements.