Author Topic: Code location sensitivity of timings  (Read 36172 times)

nidud

  • Member
  • *****
  • Posts: 2311
    • https://github.com/nidud/asmc
Code location sensitivity of timings
« on: July 12, 2014, 09:15:44 PM »
This subject is from the old forum

When code is executed it is cached more like the file cache system so code/files are reused to improve performance. This is a problem when using clock timings on file access/read/copy, and also when similar algorithms of code are compared.

I don’t know exactly how this cache system works but it could be illustrated like this:
Code: [Select]
SSE16 macro x
.data
info_&x&  db "memcpy SSE 16",0
.code
test_&x& proc uses esi edi dst, src, count
mov edi,dst
mov esi,src
mov ecx,count
and ecx,-16
jz tail
     @@:
sub ecx,16
movdqu  xmm0,[esi+ecx]
movdqu  [edi+ecx],xmm0
jnz @B
mov ecx,count
movdqu  xmm0,[esi+ecx-16]
movdqu  [edi+ecx-16],xmm0
  toend:
mov eax,dst
ret
   tail:
mov ecx,count
rep stosb
jmp toend
db 99 dup(90h)
test_&x& endp
size_&x& equ $ - test_&x&
endm

SSE16 0
SSE16 1
SSE16 2
SSE16 3
SSE16 4
SSE16 5

which gives this (random) result:
Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)
---------------------------------------------------
1232712 cycles - 2048..4096  (164) A memcpy SSE 16
877928  cycles - 2048..4096  (164) A memcpy SSE 16
1232112 cycles - 2048..4096  (164) A memcpy SSE 16
878896  cycles - 2048..4096  (164) A memcpy SSE 16
1233148 cycles - 2048..4096  (164) A memcpy SSE 16
879296  cycles - 2048..4096  (164) A memcpy SSE 16

1233667 cycles - 2048..4096  (164) U memcpy SSE 16
1016403 cycles - 2048..4096  (164) U memcpy SSE 16
1234898 cycles - 2048..4096  (164) U memcpy SSE 16
1016440 cycles - 2048..4096  (164) U memcpy SSE 16
1231145 cycles - 2048..4096  (164) U memcpy SSE 16
1019004 cycles - 2048..4096  (164) U memcpy SSE 16

However, the result repeat itself (on the same CPU/OS), so the pattern is more consistent than random. The pattern could also be manipulated by changing the size of the proc or insert random sized code above it, so what to do?

Disable the cache during the test would be nice, if you know how, which I don't, so the simplest solution seems to be overflowing the system with garbage between tests and not have any code above the first one.

The actual size and scope of the cache may differ from various OS/CPU’s but from the result at least 512 byte. Newer systems may also be larger and more selective.

I wrote a new test-bed where a 4096 byte buffer was added to each proc to try flushing the code cache between them:
Code: [Select]
jmp toend
db 99 dup(90h)
test_&x& endp
size_&x& equ $ - test_&x&
fill_&x& db PROCSIZE - size_&x& dup(0)
endm

The result now is slow and more equal so it seems to have an effect
Code: [Select]
1229362 cycles - 2048..4096  (164) A memcpy SSE 16
1234968 cycles - 2048..4096  (164) A memcpy SSE 16
1231486 cycles - 2048..4096  (164) A memcpy SSE 16
1229533 cycles - 2048..4096  (164) A memcpy SSE 16
1235589 cycles - 2048..4096  (164) A memcpy SSE 16
1229830 cycles - 2048..4096  (164) A memcpy SSE 16

1234907 cycles - 2048..4096  (164) U memcpy SSE 16
1232686 cycles - 2048..4096  (164) U memcpy SSE 16
1231493 cycles - 2048..4096  (164) U memcpy SSE 16
1235533 cycles - 2048..4096  (164) U memcpy SSE 16
1231223 cycles - 2048..4096  (164) U memcpy SSE 16
1231564 cycles - 2048..4096  (164) U memcpy SSE 16

Intel seems more selective with regards to what ends up in the cache and don't seems to fall,  at least to the same extent, for the dup(0) trap.
Code: [Select]
Intel(R) Core(TM) i3-2367M CPU @ 1.40GHz (SSE4)
------------------------------------------------------
1137020 cycles - 2048..4096  (164) A memcpy SSE 16
1143048 cycles - 2048..4096  (164) A memcpy SSE 16
1155096 cycles - 2048..4096  (164) A memcpy SSE 16
910107  cycles - 2048..4096  (164) A memcpy SSE 16
924623  cycles - 2048..4096  (164) A memcpy SSE 16
865810  cycles - 2048..4096  (164) A memcpy SSE 16

980038  cycles - 2048..4096  (164) U memcpy SSE 16
925320  cycles - 2048..4096  (164) U memcpy SSE 16
1021239 cycles - 2048..4096  (164) U memcpy SSE 16
881953  cycles - 2048..4096  (164) U memcpy SSE 16
965462  cycles - 2048..4096  (164) U memcpy SSE 16
850145  cycles - 2048..4096  (164) U memcpy SSE 16
...
978347  cycles - 2048..4096  (164) A memcpy SSE 16
892772  cycles - 2048..4096  (164) A memcpy SSE 16
927040  cycles - 2048..4096  (164) A memcpy SSE 16
917530  cycles - 2048..4096  (164) A memcpy SSE 16
913893  cycles - 2048..4096  (164) A memcpy SSE 16
947837  cycles - 2048..4096  (164) A memcpy SSE 16

932098  cycles - 2048..4096  (164) U memcpy SSE 16
925095  cycles - 2048..4096  (164) U memcpy SSE 16
914343  cycles - 2048..4096  (164) U memcpy SSE 16
924735  cycles - 2048..4096  (164) U memcpy SSE 16
920416  cycles - 2048..4096  (164) U memcpy SSE 16
922230  cycles - 2048..4096  (164) U memcpy SSE 16

The test works well for the AMD in this case, but it should be possible to write a StuffCache proc of some size with "strange" upcode to fill up the cache with some unrelated code. Calling this proc before each test may then clear (or replace) the cache content.

dedndave

  • Member
  • *****
  • Posts: 8829
  • Still using Abacus 2.0
    • DednDave
Re: Code location sensitivity of timings
« Reply #1 on: July 12, 2014, 10:06:49 PM »
i suppose you could use VirtualProtect to allow writes in the .CODE section
then, copy the code under test into a common code address space before executing it

http://msdn.microsoft.com/en-us/library/windows/desktop/aa366898%28v=vs.85%29.aspx

to help speed testing up a little....
you could use different counter_begin loop count values for the short and long tests
i try to select a loop count that yields about 0.5 seconds per pass

dedndave

  • Member
  • *****
  • Posts: 8829
  • Still using Abacus 2.0
    • DednDave
Re: Code location sensitivity of timings
« Reply #2 on: July 12, 2014, 10:11:45 PM »
prescott w/htt - xp sp3
unaligned
Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
------------------------------------------------------
4925905 cycles - 2048..4096  (164) A memcpy SSE 16
4933332 cycles - 2048..4096  (164) A memcpy SSE 16
4953203 cycles - 2048..4096  (164) A memcpy SSE 16
4963909 cycles - 2048..4096  (164) A memcpy SSE 16
4923198 cycles - 2048..4096  (164) A memcpy SSE 16
4941277 cycles - 2048..4096  (164) A memcpy SSE 16

11502669        cycles - 2048..4096  (164) U memcpy SSE 16
11487135        cycles - 2048..4096  (164) U memcpy SSE 16
11564951        cycles - 2048..4096  (164) U memcpy SSE 16
11570118        cycles - 2048..4096  (164) U memcpy SSE 16
11497558        cycles - 2048..4096  (164) U memcpy SSE 16
11526087        cycles - 2048..4096  (164) U memcpy SSE 16

aligned
Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
------------------------------------------------------
4935114 cycles - 2048..4096  (164) A memcpy SSE 16
4934727 cycles - 2048..4096  (164) A memcpy SSE 16
4942523 cycles - 2048..4096  (164) A memcpy SSE 16
4924574 cycles - 2048..4096  (164) A memcpy SSE 16
4924658 cycles - 2048..4096  (164) A memcpy SSE 16
4937763 cycles - 2048..4096  (164) A memcpy SSE 16

11490869        cycles - 2048..4096  (164) U memcpy SSE 16
11481780        cycles - 2048..4096  (164) U memcpy SSE 16
11616596        cycles - 2048..4096  (164) U memcpy SSE 16
11530420        cycles - 2048..4096  (164) U memcpy SSE 16
11488318        cycles - 2048..4096  (164) U memcpy SSE 16
11504392        cycles - 2048..4096  (164) U memcpy SSE 16

nidud

  • Member
  • *****
  • Posts: 2311
    • https://github.com/nidud/asmc
Re: Code location sensitivity of timings
« Reply #3 on: July 12, 2014, 11:04:51 PM »
Your CPU seems to be free of this problem so that makes the test result more reliable. My CPU are a bit moody.

copy the code under test into a common code address space before executing it

brilliant: a location problem solved by relocation  :lol:

I used the data segment for the code and that seems to work just fine
Code: [Select]
.data
proc_x  db PAGESIZE dup(?)
...
lea edi,proc_x
lea esi,test_&x&
mov ecx,size_&x&
rep movsb
invoke  Sleep, 200
counter_begin 1000, HIGH_PRIORITY_CLASS
mov edi,l1
.repeat
push edi
push offset source
push a2
lea eax,proc_x
call eax
; invoke  test_&x&,a2,addr source,edi

dedndave

  • Member
  • *****
  • Posts: 8829
  • Still using Abacus 2.0
    • DednDave
Re: Code location sensitivity of timings
« Reply #4 on: July 12, 2014, 11:21:46 PM »
hmmm - that seems wrong
i thought you had to be in a code section to assemble instructions
guess i've never tried it - lol

but - there is nothing to prevent you from putting a proc in the code section and copying it to another address
so long as you allow writes into the affected pages with VirtualProtect and PAGE_EXECUTE_READWRITE

http://msdn.microsoft.com/en-us/library/windows/desktop/aa366786%28v=vs.85%29.aspx

nidud

  • Member
  • *****
  • Posts: 2311
    • https://github.com/nidud/asmc
Re: Code location sensitivity of timings
« Reply #5 on: July 13, 2014, 02:28:37 AM »
I don't think it's a problem using the data segment
Code: [Select]
.data
db "I'm in the _DATA segment ;-)",13,10,0
hello_dave proc
invoke crt_printf,$ - 31
ret
hello_dave endp
.code
start:
call hello_dave

I tried the strlen test and it worked fine
that is, this work:
Code: [Select]
test_0  proc uses edi string:dword
invoke  crt_strlen,string
ret
test_0  endp
but this failed:
Code: [Select]
test_5 proc string:dword
void len(string)
ret
test_5  endp
suspect putting it in the code segment will be the same though

dedndave

  • Member
  • *****
  • Posts: 8829
  • Still using Abacus 2.0
    • DednDave
Re: Code location sensitivity of timings
« Reply #6 on: July 13, 2014, 03:03:15 AM »
i hadn't thought about that
but - most win32 CALL's are near relative
so, you'd have to translate the target addresses

but - for testing algorithm code that doesn't make any calls, like loops, etc,
the branch addresses are relative, but the target moves with the code   :P

if you wanted to use calls or invokes in movable code,
you could store the branch address and use CALL DWORD PTR lpfnFunction
or MOV EAX,Function and CALL EAX

nidud

  • Member
  • *****
  • Posts: 2311
    • https://github.com/nidud/asmc
Re: Code location sensitivity of timings
« Reply #7 on: July 13, 2014, 04:10:22 AM »
well, here is a cute one
Code: [Select]
ifdef MASMBASIC
 include \masm32\MasmBasic\MasmBasic.inc
else
 include \masm32\include\masm32rt.inc
endif
.686
.xmm
include \masm32\macros\timers.asm

PAGESIZE equ 4096

.data?
align 16
proc_x  db PAGESIZE*4 dup(?)
error dd ?

.data
immed equ 1
mem dd ?
mem16 dw ?,?
mem8 db ?,?,?,?

info_0  db "prolog",0
.code
test_0  proc
local char[1]:byte
ret
test_0  endp
size_0  equ $ - test_0

UPCODE macro x,asm:VARARG
.data
info_&x&  db "&asm&",0
.code
test_&x& proc
local char[1]:byte
repeat 1000
asm
endm
@@: ret
test_&x&  endp
size_&x&  equ $ - test_&x&
endm

UPCODE 1,mov eax,immed
UPCODE 2,mov eax,mem
UPCODE 3,mov ax,mem16
UPCODE 4,mov al,mem8
UPCODE 5,movzx eax,mem16
UPCODE 6,movsx eax,mem8
UPCODE 7,push eax
UPCODE 8,pop eax
UPCODE 9,push mem
UPCODE A,pop mem
UPCODE B,lea eax,mem
UPCODE C,lea eax,[eax+1]
UPCODE D,add eax,immed
UPCODE E,sub eax,immed
UPCODE F,add eax,mem
UPCODE G,sub eax,mem
UPCODE H,@@: jmp @F
UPCODE I,@@: jz @F
UPCODE J,@@: jnz @F
UPCODE K,test eax,eax
UPCODE L,test eax,mem
UPCODE M,cmp eax,eax
UPCODE N,cmp eax,mem

procs equ <for x,<0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F,G,H,I,J,K,L,M,N>>

;***********************************************************************************************

; get the cycle count for each algo

test_algo macro x,l1
push esi
push edi
push ebx
lea edi,proc_x
lea esi,test_&x&
mov ecx,size_&x&
rep movsb
invoke  Sleep, 200
counter_begin 100, HIGH_PRIORITY_CLASS
mov edi,l1
lea ebx,proc_x
.repeat
call ebx
dec edi
.until  !edi
counter_end
printf("%d\tcycles - %d (%4d)  %s\n",eax,l1,size_&x&,addr info_&x&)
pop ebx
pop edi
pop esi
endm

validate_x macro x ; test if the algo actually works..
invoke test_&x& ; and spin-up the CPU..
endm

;***********************************************************************************************

main proc
procs
    validate_x x
    cmp error,0
    jne toend
endm
procs
    test_algo x,32
endm
toend:
ret
main endp

;***********************************************************************************************

ShowCpu proc ; mode:DWORD
COMMENT @ Usage:
  push 0, call ShowCpu  ; simple, no printing, just returns SSE level
  push 1, call ShowCpu  ; prints the brand string and returns SSE level@
  pushad
  sub esp, 80 ; create a buffer for the brand string
  mov edi, esp ; point edi to it
  xor ebp, ebp
  .Repeat
lea eax, [ebp+80000002h]
db 0Fh, 0A2h ; cpuid 80000002h-80000004h
stosd
mov eax, ebx
stosd
mov eax, ecx
stosd
mov eax, edx
stosd
inc ebp
  .Until ebp>=3
  push 1
  pop eax
  db 0Fh, 0A2h ; cpuid 1
  xor ebx, ebx ; CpuSSE
  xor esi, esi ; add zero plus the carry flag
  bt edx, 25 ; edx bit 25, SSE1
  adc ebx, esi
  bt edx, 26 ; edx bit 26, SSE2
  adc ebx, esi
  bt ecx, esi ; ecx bit 0, SSE3
  adc ebx, esi
  bt ecx, 9 ; ecx bit 9, SSE4
  adc ebx, esi
  dec dword ptr [esp+4+32+80] ; dec mode in stack
  .if Zero?
mov edi, esp ; restore pointer to brand string
.Repeat
.Break .if byte ptr [edi]!=32 ; mode was 1, so show a string but skip leading blanks
inc edi
.Until 0
.if byte ptr [edi]<32
print chr$("pre-P4")
.else
print edi ; CpuBrand
.endif
.if ebx
print chr$(32, 40, "SSE") ; info on SSE level, 40=(
print str$(ebx), 41, 13, 10 ; 41=)
.endif
  .endif
  add esp, 80 ; discard brand buffer (after printing!)
  mov [esp+32-4], ebx ; move ebx into eax stack position - returns eax to main for further use
  ifdef MbBufferInit
call MbBufferInit
  endif
  popad
  ret 4
ShowCpu endp

start:
print " ", 13, 10
push 1
call ShowCpu ; print brand string and SSE level
print "------------------------------------------------------", 13, 10
call main
inkey chr$("--- ok ---", 13)
exit

end start

result:
Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)
--------------------------------------------
266     cycles - 32 (  10)  prolog
10916   cycles - 32 (5010)  mov eax,immed
16301   cycles - 32 (5010)  mov eax,mem
32171   cycles - 32 (6010)  mov ax,mem16
32164   cycles - 32 (5010)  mov al,mem8
16316   cycles - 32 (7010)  movzx eax,mem16
16317   cycles - 32 (7010)  movsx eax,mem8
24145   cycles - 32 (1010)  push eax
16353   cycles - 32 (1010)  pop eax
36611   cycles - 32 (6010)  push mem
36377   cycles - 32 (6010)  pop mem
10900   cycles - 32 (6010)  lea eax,mem
32152   cycles - 32 (3010)  lea eax,[eax+1]
32146   cycles - 32 (3010)  add eax,immed
32150   cycles - 32 (3010)  sub eax,immed
32170   cycles - 32 (6010)  add eax,mem
32171   cycles - 32 (6010)  sub eax,mem
345857  cycles - 32 (2010)  @@: jmp @F
10844   cycles - 32 (2010)  @@: jz @F
346143  cycles - 32 (2010)  @@: jnz @F
10839   cycles - 32 (2010)  test eax,eax
16311   cycles - 32 (6010)  test eax,mem
10839   cycles - 32 (2010)  cmp eax,eax
16311   cycles - 32 (6010)  cmp eax,mem

nidud

  • Member
  • *****
  • Posts: 2311
    • https://github.com/nidud/asmc
Re: Code location sensitivity of timings
« Reply #8 on: July 14, 2014, 09:39:03 PM »
Instruction Timing

I parsed instructions from the Intel 8086 Family Architecture and tried to calibrate the clock cycle calculation to 1000 for one instruction, and use this as a base to compare different CPU's.

Code: [Select]
mov cloop,1
calibrate:
mov eax,cloop
mov count,eax
counter_begin TMCOUNT, HIGH_PRIORITY_CLASS
     l2:
lea eax,proc_x
call eax ; Instruction MOV EAX,1
dec count
jnz l2
mov eax,cloop
mov count,eax
counter_end
cmp eax,1000
jae @F
add cloop,1
jmp calibrate

Code: [Select]
Instruction Clock Cycle Calculation

1. AMD Athlon(tm) II X2 245 Processor
2. Intel(R) Core(TM) i3-2367M CPU @ 1.40GHz

Instr.  Operands      Size  CPU1  CPU2
------------------------------------------------------
proc        [0]   0.1   0.1
AAA        [1]  14.1  10.3
AAD        [2]  14.0   4.5
AAM        [2]  36.5  54.8
AAS        [1]  14.1  10.3
ADC reg,reg        [2]   2.7   5.5
ADC mem,reg        [6]  19.1  20.0
ADC reg,mem        [6]   2.8   5.5
ADC reg,imm        [3]   2.8   5.4
ADC mem,imm        [7]  18.2  20.5
ADC acc,imm        [3]   2.8   5.5
ADD reg,reg        [2]   2.7   2.5
ADD mem,reg        [6]  19.1  16.1
ADD reg,mem        [6]   2.8   2.7
ADD reg,imm        [3]   2.8   2.6
ADD mem,imm        [7]  18.2  16.4
ADD acc,imm        [3]   2.8   2.7
AND reg,reg        [2]   2.7   2.5
AND mem,reg        [6]  19.1  17.8
AND reg,mem        [6]   2.8   2.8
AND reg,imm        [3]   2.8   2.6
AND mem,imm        [7]  18.2  16.5
AND acc,imm        [3]   2.8   2.5
BSF reg,reg        [3]  11.2   6.9
BSF reg,mem        [7]   8.5   6.9
BSR reg,reg        [3]  11.2   6.9
BSR reg,mem        [7]   8.5   7.2
BSWAP reg        [2]   2.7   2.8
BT reg,imm        [4]   1.0   1.6
BT mem,imm        [8]   1.6   1.6
BT reg,reg        [3]   1.1   1.5
BT mem,reg        [7]   5.7  14.4
BTC reg,imm        [4]   5.5   2.5
BTC mem,imm        [8]  19.4  16.7
BTC reg,reg        [3]   5.5   2.6
BTC mem,reg        [7]  14.0  17.4
BTR reg,imm        [4]   5.5   2.6
BTR mem,imm        [8]  19.4  16.5
BTR reg,reg        [3]   5.5   2.5
BTR mem,reg        [7]  22.1  17.4
BTS reg,imm        [4]   5.5   2.7
BTS mem,imm        [8]  19.4  16.6
BTS reg,reg        [3]   5.5   2.9
BTS mem,reg        [7]  22.1  17.5
CALL        [5]  51.8  12.0
CBW        [2]   2.7   2.6
CDQ        [1]   2.8   2.9
CLC        [1]   1.1   0.9
CLD        [1]   2.8  11.5
CMC        [1]   2.8   2.6
CMP reg,reg        [2]   1.0   1.1
CMP mem,reg        [6]   1.6   1.5
CMP reg,mem        [6]   1.6   1.8
CMP reg,imm        [3]   1.1   1.1
CMP mem,imm        [7]   1.5   2.9
CMP acc,imm        [3]   1.1   1.1
CMP r16,imm        [4]   1.0   1.0
CMP m16,imm        [8]   1.6   2.9
CMP r16,r16        [3]   1.1   1.1
CMP m16,r16        [7]   1.5   1.5
CMPSB        [1]   8.5  11.4
CMPSW        [2]   8.6  11.4
CMPSD        [1]   8.5  11.4
CMPXCHG reg,acc        [3]   5.7  14.4
CMPXCHG mem,acc        [7]  20.5  20.7
CWD        [2]   2.7   3.0
CWDE        [1]   2.8   2.9
DAA        [1]  16.8  11.9
DAS        [1]  19.7  11.9
DEC reg        [1]   2.8   2.5
DEC r16        [2]   2.7   2.6
DEC r08        [2]   2.7   2.5
DEC mem        [6]  19.1  16.5
DIV reg        [2]  50.6  52.6
DIV r08        [2]  45.0  49.3
DIV r16        [3]  50.5  55.7
DIV mem        [6]  53.3  52.4
DIV m08        [6]  44.9  49.0
DIV m16        [7]  53.3  55.5
ENTER        [4]  33.8  22.7
IDIV reg        [2]  61.8  53.3
IDIV r08        [2]  47.7  52.3
IDIV r16        [3]  61.7  55.9
IDIV mem        [6]  61.7  52.5
IDIV m08        [6]  53.4  49.3
IDIV m16        [7]  61.8  56.0
IMUL reg        [2]   8.4  12.0
IMUL r08        [2]   8.4   6.9
IMUL r16        [3]   8.4  11.7
IMUL mem        [6]   8.4  12.3
IMUL m08        [6]   8.4   6.9
IMUL m16        [7]   8.4  11.9
IMUL reg,acc        [3]   8.4   8.1
IMUL reg,acc,imm    [3]   2.9   2.8
INC reg        [1]   2.8   2.6
INC r16        [2]   2.7   2.6
INC r08        [2]   2.7   2.5
INC mem        [6]  20.0  16.4
JCXZ        [3]   2.0   3.4
JECXZ        [2]   2.1   7.0
JMP        [2]  30.5   8.1
JNZ        [2]  30.1  10.7
JZ        [2]   1.0   8.0
LAHF        [1]   5.7   2.9
LEA acc,mem        [6]   1.0   1.4
LEA reg,mem        [6]   1.0   1.4
LEA acc,[acc]      [2]   2.7   2.5
LEA reg,[reg]      [2]   2.7   2.5
LEA acc,[acc+imm]  [3]   2.8   2.5
LEA reg,[reg+imm]  [3]   2.8   2.5
LEAVE*        [5]  36.6  25.6
LODSB        [1]   5.7   3.5
LODSW        [2]   5.7   3.5
LODSD        [1]   5.7   2.9
LOOP        [2]  36.8  16.5
LOOPZ        [2]   8.5  19.0
LOOPNZ        [2]  41.0  24.5
MOV acc,imm        [5]   1.1   1.0
MOV reg,imm        [5]   1.1   1.1
MOV reg,mem        [6]   1.5   1.5
MOV r16,m16        [7]   2.8   2.6
MOVSB        [1]   8.5  11.6
MOVSW        [2]   8.5  11.5
MOVSD        [1]   8.5  11.4
MOVSX acc,al        [3]   2.8   2.5
MOVSX acc,m16        [7]   1.5   1.5
MOVSX reg,cl        [3]   2.8   2.6
MOVSX reg,m16        [7]   1.5   1.7
MOVZX acc,al        [3]   2.8   2.5
MOVZX acc,m16        [7]   1.5   1.5
MOVZX reg,cl        [3]   2.8   2.5
MOVZX reg,m16        [7]   1.5   1.8
MUL reg        [2]   8.4  12.0
MUL r16        [3]   8.4  11.5
MUL r08        [2]   8.4   6.8
MUL mem        [6]   8.4  12.3
NEG acc        [2]   2.7   2.6
NEG reg        [2]   2.7   2.5
NEG mem        [6]  19.1  16.2
NOP        [1]   1.0   0.9
NOT reg        [2]   2.7   2.8
NOT mem        [6]  19.1  16.5
OR reg,reg        [2]   2.7   2.5
OR mem,reg        [6]  19.1  16.1
OR reg,mem        [6]   2.8   2.6
OR reg,imm        [3]   2.8   2.6
OR mem,imm        [7]  18.3  16.4
OR acc,imm        [3]   2.8   2.5
PUSH acc        [1]   2.2   2.9
PUSH reg        [1]   2.2   2.9
PUSH mem        [6]   3.3   2.9
PUSHAD        [1]  16.9  22.8
PUSHFD        [1]   8.5   3.0
POP acc        [1]   1.6   1.6
POP reg        [1]   1.6   1.5
POP mem        [6]   3.3   3.1
POPAD*        [2]  32.3  45.2
POPFD*        [2]  38.9  64.7
RCL acc,1        [2]   2.7   5.7
RCL mem,1        [6]  19.1  21.8
RCL reg,cl        [2]  11.3  17.0
RCL mem,cl        [6]  20.5  28.2
RCL reg,16        [3]   8.5  17.0
RCL mem,16        [7]  23.3  28.2
RCR acc,1        [2]   2.7   5.8
RCR mem,1        [6]  19.2  21.9
RCR reg,cl        [2]   8.5  14.2
RCR mem,cl        [6]  21.5  25.7
RCR reg,16        [3]   8.5  14.2
RCR mem,16        [7]  21.6  25.9
ROL acc,1        [2]   2.7   3.0
ROL mem,1        [6]  19.1  17.9
ROL reg,cl        [2]   2.7   5.6
ROL mem,cl        [6]  19.1  16.5
ROL reg,16        [3]   2.8   2.9
ROL mem,16        [7]  18.2  18.2
ROR acc,1        [2]   2.7   2.9
ROR mem,1        [6]  19.1  18.1
ROR reg,cl        [2]   2.7   5.5
ROR mem,cl        [6]  19.1  16.6
ROR reg,16        [3]   2.8   2.9
ROR mem,16        [7]  18.2  18.5
SAHF        [1]   1.1   4.7
SAR acc,1        [2]   2.7   2.5
SAR mem,1        [6]  19.1  16.4
SAR reg,cl        [2]   2.7   5.5
SAR mem,cl        [6]  19.1  16.6
SAR reg,16        [3]   2.8   2.6
SAR mem,16        [7]  18.2  16.5
SBB reg,reg        [2]   2.7   5.5
SBB mem,reg        [6]  19.1  19.7
SBB reg,mem        [6]   2.8   5.4
SBB reg,imm        [3]   2.8   5.4
SBB acc,imm        [3]   2.8   5.7
SCASB        [1]   7.1   3.5
SCASW        [2]   6.7   3.5
SCASD        [1]   7.1   3.5
SETNB al        [3]   2.8   2.5
SETNB m08        [7]   2.8   3.5
SETB al        [3]   2.8   2.6
SETB m08        [7]   2.8   2.9
SETBE al        [3]   2.8   3.1
SETBE m08        [7]   2.8   2.9
SETZ al        [3]   2.8   2.9
SETZ m08        [7]   2.8   2.9
SETNZ al        [3]   2.8   2.6
SETNZ m08        [7]   2.8   2.9
SETL al        [3]   2.8   2.6
SETL m08        [7]   2.8   2.9
SETNL al        [3]   2.8   2.6
SETNL m08        [7]   2.8   2.9
SETLE al        [3]   2.8   2.8
SETLE m08        [7]   2.8   2.9
SETG al        [3]   2.8   2.5
SETG m08        [7]   2.8   2.9
SETS al        [3]   2.8   2.6
SETS m08        [7]   2.8   2.9
SETNS al        [3]   2.8   2.6
SETNS m08        [7]   2.8   2.9
SETC al        [3]   2.8   2.8
SETC m08        [7]   2.8   2.9
SETNC al        [3]   2.8   2.5
SETNC m08        [7]   2.8   2.9
SETO al        [3]   2.8   2.6
SETO m08        [7]   2.8   2.9
SETNO al        [3]   2.8   2.5
SETNO m08        [7]   2.8   2.9
SETP al        [3]   2.8   2.7
SETP m08        [7]   2.8   2.9
SETPE al        [3]   2.8   2.6
SETPE m08        [7]   2.8   2.9
SETNP al        [3]   2.8   2.8
SETNP m08        [7]   2.8   2.9
SETPO al        [3]   2.8   2.5
SETPO m08        [7]   2.8   2.9
SHL acc,1        [2]   2.7   2.6
SHL mem,1        [6]  19.1  16.6
SHL reg,cl        [2]   2.7   5.6
SHL mem,cl        [6]  19.1  16.7
SHL reg,16        [3]   2.8   2.6
SHL mem,16        [7]  18.2  16.5
SHR acc,1        [2]   2.7   2.6
SHR mem,1        [6]  19.1  16.4
SHR reg,cl        [2]   2.7   5.5
SHR mem,cl        [6]  19.1  16.6
SHR reg,16        [3]   2.8   2.5
SHR mem,16        [7]  18.2  16.4
SHLD acc,reg,16     [4]   8.5   2.5
SHLD mem,reg,16     [8]  18.3  16.7
SHLD acc,reg,cl     [3]   8.5   5.5
SHLD mem,reg,cl     [7]  19.4  16.5
SHRD acc,reg,16     [4]   8.5   2.5
SHRD mem,reg,16     [8]  18.3  19.2
SHRD acc,reg,cl     [3]   8.5   6.6
SHRD mem,reg,cl     [7]  19.4  19.9
SMSW ax        [3]   5.7  28.2
SMSW m16        [7]   5.7  25.5
STC        [1]   1.1   1.0
STD        [1]   5.6  13.0
STOSB        [1]   5.7   3.0
STOSW        [2]   5.7   3.0
STOSD        [1]   5.7   2.9
SUB reg,reg        [2]   1.0   1.2
SUB mem,reg        [6]  19.1  16.1
SUB reg,mem        [6]   2.8   2.6
SUB reg,imm        [3]   2.8   2.6
SUB acc,imm        [3]   2.8   2.8
TEST reg,reg        [2]   1.0   1.0
TEST mem,reg        [6]   1.6   1.5
TEST reg,mem        [6]   1.6   1.5
TEST reg,imm        [6]   1.0   1.1
TEST acc,imm        [5]   1.1   1.1
XCHG reg,reg        [2]   2.9   5.9
XCHG reg,mem        [6]  44.9  64.7
XCHG mem,reg        [6]  45.7  64.8
XOR reg,reg        [2]   1.1   0.8
XOR mem,reg        [6]  19.9  16.0
XOR reg,mem        [6]   2.8   2.9
XOR reg,imm        [3]   2.8   2.6
XOR mem,imm        [7]  18.2  16.2
XOR acc,imm        [3]   2.8   2.6

* LEAVE = LEAVE - ENTER
* POPAD = POPAD - PUSHAD
* POPFD = POPFD - PUSHFD

nidud

  • Member
  • *****
  • Posts: 2311
    • https://github.com/nidud/asmc
Re: Code location sensitivity of timings
« Reply #9 on: July 14, 2014, 10:09:22 PM »
here is the source code for the instruction calculation
however, you need ASM and JWLINK to build

dedndave

  • Member
  • *****
  • Posts: 8829
  • Still using Abacus 2.0
    • DednDave
Re: Code location sensitivity of timings
« Reply #10 on: July 15, 2014, 02:10:35 AM »
hard to test individual instructions
so much relies on the surrounding code

nidud

  • Member
  • *****
  • Posts: 2311
    • https://github.com/nidud/asmc
Re: Code location sensitivity of timings
« Reply #11 on: July 15, 2014, 05:46:29 AM »
hard to test individual instructions
it's actually very easy, in fact the easiest ting to test

        counter_begin 1000, HIGH_PRIORITY_CLASS
        repeat 1000
        lea eax,[eax]
        endm
        counter_end


knowing the approximate time difference between instructions is also useful

        mov     eax,[ebx]   ; 1
        mov     [ebx],ecx   ; 2
        mov     ecx,eax     ; 3 cycles
        ...
        xchg    ecx,[ebx]   ; 50 cycles

this is the old document I have used for this

Agner Fog have a Lists of instruction latencies
Code: [Select]
MOV     r,r     1   1  1/3 ALU
MOV     r,i     1   1  1/3 ALU
MOV     r8,m8   1   4  1/2 ALU, AGU
...
DIV     r8/m8   32  24  23 ALU
DIV     r16/m16 47  24  23 ALU
DIV     r32/m32 79  40  40 ALU
IDIV    r8      41  17  17 ALU
IDIV    r16     56  25  25 ALU
IDIV    r32     88  41  41 ALU
IDIV    m8      42  17  17 ALU
IDIV    m16     57  25  25 ALU
IDIV    m32     89  41  41 ALU
seems from this the test may not be that far off

Quote
so much relies on the surrounding code
that's true, hence the problem with the test
location, alignment, and the issue of caches

nidud

  • Member
  • *****
  • Posts: 2311
    • https://github.com/nidud/asmc
Re: Code location sensitivity of timings
« Reply #12 on: July 16, 2014, 08:15:11 AM »
I created a new testbed where the code location problem is fixed by having a fixed location for the function (as Dave pointed out). And (as Dave pointed out) the cache problem could be solved by not linking in the functions to test. This is solved by using the –bin switch in JWASM to create a binary of the function.

nidud

  • Member
  • *****
  • Posts: 2311
    • https://github.com/nidud/asmc
Re: Code location sensitivity of timings
« Reply #13 on: July 20, 2014, 11:43:43 PM »
It turns out the data section also have an impact on the result. The code is executed in the data? section so I now (again  :P) use the method suggested by Dave and load the functions in the code section.

In addition to this I flush the code buffer for each new function loaded using FlushInstructionCache.

readit proc uses ebx esi edi fname
   invoke GetCurrentProcess
   invoke FlushInstructionCache,eax,addr proc_x,size_p


I tested some string functions using the strlen approach.

Code: [Select]
strcpy  proc uses edi esi dst, src
mov edi,dst
mov esi,src
mov eax,[esi] ; test the first 4 byte for zero..
lea ecx,[eax-01010101H]
not eax
and ecx,eax
and ecx,80808080h
jnz tail
if 1
;
; at this point there is at least 4 byte in src
;
mov ecx,esi ; align pointer to 4
and ecx,3
mov eax,[esi] ; copy the first 4 bytes
mov [edi],eax
add edi,ecx ; set new offset (+0..3)
add esi,ecx
mov eax,[esi] ; test new offset 0..3
lea ecx,[eax-01010101H]
not eax
and ecx,eax
and ecx,80808080h
jnz @1
endif
align 4 ; align main loop
@@: mov eax,[esi] ; copy aligned DWORD's
mov [edi],eax
add edi,4
add esi,4
mov eax,[esi]
lea ecx,[eax-01010101H]
not eax
and ecx,eax
and ecx,80808080h
jz @B
;
; we now have at least 4 byte in the buffer
;
@1: bsf ecx,ecx
shr ecx,3
mov eax,[esi+ecx-3] ; copy 1..4 byte with overlap
mov [edi+ecx-3],eax
align 4
toend:  mov eax,dst
ret
align 4
tail: mov eax,[esi] ; for strings < 4 byte..
mov [edi],al ; 1893/1505/1602 - 3
test al,al
jz toend
mov [edi],ax
test ah,ah
jz toend
shr eax,16
mov [edi+2],al
test al,al
jz toend
mov [edi+2],ax
jmp toend
strcpy  endp
here is the SSE version using the same approach
Code: [Select]
SSE_strcpy  proc uses esi edi dst, src
mov edi,dst
mov esi,src
;
; test the first 16 byte for zero..
;
xorps xmm1,xmm1 ; clear xmm1 for compare
movdqu  xmm0,[esi] ; get 16 byte from source
pcmpeqb xmm0,xmm1 ; compare bytes
pmovmskb ecx,xmm0 ; get result
test ecx,ecx
jnz tail
if 0
mov ecx,esi ; align pointer to 16
and ecx,15
movdqu  xmm0,[esi] ; copy 16 byte
movdqu  [edi],xmm0 ;
add edi,ecx ; set new offset
add esi,ecx
pcmpeqb xmm0,xmm1 ; compare bytes
pmovmskb ecx,xmm0 ; get result
test ecx,ecx
jnz @Z
endif
;
; main loop
;
align 4
@@: movdqu  xmm0,[esi] ; copy
movdqu  [edi],xmm0 ;
add esi,16
add edi,16
pcmpeqb xmm0,xmm1 ; test for zero
pmovmskb ecx,xmm0 ;
test ecx,ecx
jz @B
@Z: bsf ecx,ecx
movdqu  xmm0,[esi+ecx-16+1]
movdqu  [edi+ecx-16+1],xmm0
align 4
toend:
mov eax,dst
ret
align 4 ; for strings < 16 byte..
tail:
if 0
bsf ecx,ecx
mov [edi],ch
jz toend
align 4
@@: mov al,[esi]
mov [edi],al
inc edi
inc esi
dec ecx
jnz @B
mov [edi],cl
jmp toend
endif
if 0
bsf ecx,ecx
inc ecx
@@: mov al,[esi+ecx-1]
mov [edi+ecx-1],al
dec ecx
jnz @B
jmp toend
endif
if 0 ; 4387 - small..
bsf ecx,ecx
inc ecx
rep movsb
jmp toend
endif
if 1 ; 1709 - large..
repeat  3
mov eax,[esi]
mov [edi],al
test al,al
jz toend
mov [edi],ax
test ah,ah
jz toend
shr eax,16
mov [edi+2],al
test al,al
jz toend
mov [edi+2],ax
test ah,ah
jz toend
add esi,4
add edi,4
endm
mov eax,[esi]
mov [edi],al
test al,al
jz toend
mov [edi],ax
test ah,ah
jz toend
shr eax,16
mov [edi+2],al
test al,al
jz toend
mov [edi+2],ax
jmp toend
endif
SSE_strcpy endp

The first run is aligned strings, second unaligned, and the last a small for the tail.

AMD Athlon(tm) II X2 245 Processor (SSE3)
STRCHR-----------------------------------
597240  cycles - 10 (  0) 0: crt_strcpy
1462096 cycles - 10 ( 37) 1: x
544466  cycles - 10 (151) 2: x
168543  cycles - 10 (251) 3: SSE2

616369  cycles - 10 (  0) 0: crt_strcpy
2062799 cycles - 10 ( 37) 1: x
561664  cycles - 10 (151) 2: x
236658  cycles - 10 (251) 3: SSE2

1850    cycles - 99 (  0) 0: crt_strcpy
5161    cycles - 99 ( 37) 1: x
1603    cycles - 99 (151) 2: x
1993    cycles - 99 (251) 3: SSE2


here is a version of strchr that use a DWORD as argument
the char needs to be populated from ‘c’ to ‘cccc’
Code: [Select]
strchr proc uses ebx string, char
mov eax,string
mov ebx,char
not ebx
align 4
lup: mov edx,[eax]
lea ecx,[edx-01010101H]
not edx
and ecx,edx
and ecx,80808080h
jnz tail
add eax,4
sub edx,ebx
lea ecx,[edx-01010101H]
not edx
and ecx,edx
and ecx,80808080h
jz lup
bsf ecx,ecx
shr ecx,3
lea eax,[ecx+eax-4]
align 4
toend:  ret
align 4
tail: movzx ebx,byte ptr char
cmp [eax],bx
je toend
inc eax
cmp [eax],bx
je toend
inc eax
cmp [eax],bx
je toend
inc eax
cmp [eax],bx
je toend
xor eax,eax
jmp toend
strchr endp

timeings for strchr:
82497   cycles - (  0) 0: crt_strchr
91200   cycles - ( 29) 1: x
53679   cycles - (119) 2: 'c'
51788   cycles - (107) 3: 'cccc'

82932   cycles - (  0) 0: crt_strchr
91066   cycles - ( 29) 1: x
59540   cycles - (119) 2: 'c'
54366   cycles - (107) 3: 'cccc'

43756   cycles - (  0) 0: crt_strchr
32192   cycles - ( 29) 1: x
17620   cycles - (119) 2: 'c'
16582   cycles - (107) 3: 'cccc'


timeings for strrchr:
329384  cycles - 10 (  0) 0: crt_strrchr
330506  cycles - 10 ( 40) 1: strrchr
330109  cycles - 10 ( 40) 2: x
109653  cycles - 10 ( 66) 3: x
95977   cycles - 10 ( 72) 4: aligned
123257  cycles - 10 (112) 5: SSE

328457  cycles - 10 (  0) 0: crt_strrchr
328311  cycles - 10 ( 40) 1: strrchr
329352  cycles - 10 ( 40) 2: x
119861  cycles - 10 ( 66) 3: x
102823  cycles - 10 ( 72) 4: aligned
123251  cycles - 10 (112) 5: SSE

27515   cycles - 500 (  0) 0: crt_strrchr
26019   cycles - 500 ( 40) 1: strrchr
26019   cycles - 500 ( 40) 2: x
12508   cycles - 500 ( 66) 3: x
11510   cycles - 500 ( 72) 4: aligned
12007   cycles - 500 (112) 5: SSE


the last run is just a single "c" repeated 500 times

this is a new approach to strstr
Code: [Select]
strstr  proc uses edx ebx dst, src
mov eax,src
mov ebx,[eax]
mov bh,bl
mov eax,ebx
shl ebx,16
mov bx,ax
mov eax,dst
not ebx
align 4
lup: mov edx,[eax]
lea ecx,[edx-01010101H]
not edx
and ecx,edx
and ecx,80808080h
jnz tail
add eax,4
sub edx,ebx
lea ecx,[edx-01010101H]
not edx
and ecx,edx
and ecx,80808080h
jz lup
bsf ecx,ecx
shr ecx,3
lea eax,[ecx+eax-4]
xor ecx,ecx
mov edx,src
mov dst,eax
inc eax
inc edx
align 4
lup2: xor cl,[edx]
jz found
sub cl,[eax]
jnz lup
inc eax
inc edx
jmp lup2
align 4
found:  mov eax,dst
align 4
toend:  ret
align 4
tail: shr ecx,8
jz @F
not ebx
cmp [eax],bl
je @3
shr ecx,8
jz @F
inc eax
cmp [eax],bl
je @3
shr ecx,8
jz @F
inc eax
cmp [eax],bl
je @3
@@: xor eax,eax
jmp toend
@3: xor ecx,ecx
mov edx,src
mov dst,eax
inc eax
inc edx
align 4
lup3: xor cl,[edx]
jz found
sub cl,[eax]
jnz toend
inc eax
inc edx
jmp lup3
strstr  endp

time for strstr (the above is 5)
537610  cycles -  10 (  0) 0: crt_strstr
918998  cycles -  10 (  0) 1: InString(1,dst,src) - 1 + dst
553673  cycles -  10 ( 46) 2: strstr
586127  cycles -  10 ( 57) 3: x
309336  cycles -  10 (148) 4: x
207128  cycles -  10 (176) 5: x

533403  cycles -  10 (  0) 0: crt_strstr
938340  cycles -  10 (  0) 1: InString(1,dst,src) - 1 + dst
554530  cycles -  10 ( 46) 2: strstr
584338  cycles -  10 ( 57) 3: x
322246  cycles -  10 (148) 4: x
215947  cycles -  10 (176) 5: x

14019   cycles - 500 (  0) 0: crt_strstr
27515   cycles - 500 (  0) 1: InString(1,dst,src) - 1 + dst
13264   cycles - 500 ( 46) 2: strstr
14518   cycles - 500 ( 57) 3: x
14518   cycles - 500 (148) 4: x
20519   cycles - 500 (176) 5: x

dedndave

  • Member
  • *****
  • Posts: 8829
  • Still using Abacus 2.0
    • DednDave
Re: Code location sensitivity of timings
« Reply #14 on: July 21, 2014, 04:40:44 AM »
those tests run way too fast to get reliable readings

for best results:
1) bind to a single core
2) wait about 750 mS before performing any tests - this allows the system to settle
3) adjust individual loop counts so that each test pass takes about 0.5 seconds

Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
STRCHR-------------------------------------------
170724  cycles - (  0) 0: crt_strchr
172046  cycles - ( 29) 1: x
70540   cycles - (119) 2: 'c'
62934   cycles - (107) 3: 'cccc'

170343  cycles - (  0) 0: crt_strchr
167063  cycles - ( 29) 1: x
89640   cycles - (119) 2: 'c'
82154   cycles - (107) 3: 'cccc'

240068  cycles - (  0) 0: crt_strchr
87783   cycles - ( 29) 1: x
25549   cycles - (119) 2: 'c'
23995   cycles - (107) 3: 'cccc'
--- ok ---

H:\nidudString\string\strchr => strchr

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
STRCHR-------------------------------------------
209073  cycles - (  0) 0: crt_strchr
221060  cycles - ( 29) 1: x
80107   cycles - (119) 2: 'c'
83854   cycles - (107) 3: 'cccc'

198648  cycles - (  0) 0: crt_strchr
211263  cycles - ( 29) 1: x
106992  cycles - (119) 2: 'c'
96168   cycles - (107) 3: 'cccc'

253182  cycles - (  0) 0: crt_strchr
84552   cycles - ( 29) 1: x
26531   cycles - (119) 2: 'c'
36056   cycles - (107) 3: 'cccc'

they're all over the place   :P