The MASM Forum

General => The Laboratory => Topic started by: nidud on July 12, 2014, 09:15:44 PM

Title: Code location sensitivity of timings
Post by: nidud on July 12, 2014, 09:15:44 PM
This subject is from the old forum (http://www.masmforum.com/board/index.php?topic=11454.msg90807#msg90807)

When code is executed it is cached, much like files in the file cache, so that recently used code is reused to improve performance. This is a problem when timing file access/read/copy with clock counts, and also when similar algorithms are compared against each other.

I don’t know exactly how this cache system works but it could be illustrated like this:
Code: [Select]
SSE16 macro x
.data
info_&x&  db "memcpy SSE 16",0
.code
test_&x& proc uses esi edi dst, src, count
mov edi,dst
mov esi,src
mov ecx,count
and ecx,-16
jz tail
     @@:
sub ecx,16
movdqu  xmm0,[esi+ecx]
movdqu  [edi+ecx],xmm0
jnz @B
mov ecx,count
movdqu  xmm0,[esi+ecx-16]
movdqu  [edi+ecx-16],xmm0
  toend:
mov eax,dst
ret
   tail:
mov ecx,count
rep movsb
jmp toend
db 99 dup(90h)
test_&x& endp
size_&x& equ $ - test_&x&
endm

SSE16 0
SSE16 1
SSE16 2
SSE16 3
SSE16 4
SSE16 5

which gives this (random) result:
Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)
---------------------------------------------------
1232712 cycles - 2048..4096  (164) A memcpy SSE 16
877928  cycles - 2048..4096  (164) A memcpy SSE 16
1232112 cycles - 2048..4096  (164) A memcpy SSE 16
878896  cycles - 2048..4096  (164) A memcpy SSE 16
1233148 cycles - 2048..4096  (164) A memcpy SSE 16
879296  cycles - 2048..4096  (164) A memcpy SSE 16

1233667 cycles - 2048..4096  (164) U memcpy SSE 16
1016403 cycles - 2048..4096  (164) U memcpy SSE 16
1234898 cycles - 2048..4096  (164) U memcpy SSE 16
1016440 cycles - 2048..4096  (164) U memcpy SSE 16
1231145 cycles - 2048..4096  (164) U memcpy SSE 16
1019004 cycles - 2048..4096  (164) U memcpy SSE 16

However, the results repeat themselves (on the same CPU/OS), so the pattern is consistent rather than random. The pattern can also be manipulated by changing the size of the proc or by inserting code of arbitrary size above it, so what to do?

Disabling the cache during the test would be nice, if you know how, which I don't. The simplest solution seems to be flooding the system with garbage between tests and not having any code above the first one.

The actual size and scope of the cache may differ between OS/CPUs, but judging from the results it is at least 512 bytes. On newer systems it may also be larger and more selective.

I wrote a new test-bed where a 4096 byte buffer was added to each proc to try flushing the code cache between them:
Code: [Select]
jmp toend
db 99 dup(90h)
test_&x& endp
size_&x& equ $ - test_&x&
fill_&x& db PROCSIZE - size_&x& dup(0)
endm

The results are now slow but more uniform, so it seems to have an effect:
Code: [Select]
1229362 cycles - 2048..4096  (164) A memcpy SSE 16
1234968 cycles - 2048..4096  (164) A memcpy SSE 16
1231486 cycles - 2048..4096  (164) A memcpy SSE 16
1229533 cycles - 2048..4096  (164) A memcpy SSE 16
1235589 cycles - 2048..4096  (164) A memcpy SSE 16
1229830 cycles - 2048..4096  (164) A memcpy SSE 16

1234907 cycles - 2048..4096  (164) U memcpy SSE 16
1232686 cycles - 2048..4096  (164) U memcpy SSE 16
1231493 cycles - 2048..4096  (164) U memcpy SSE 16
1235533 cycles - 2048..4096  (164) U memcpy SSE 16
1231223 cycles - 2048..4096  (164) U memcpy SSE 16
1231564 cycles - 2048..4096  (164) U memcpy SSE 16

Intel seems more selective with regard to what ends up in the cache and doesn't seem to fall for the dup(0) trap, at least not to the same extent.
Code: [Select]
Intel(R) Core(TM) i3-2367M CPU @ 1.40GHz (SSE4)
------------------------------------------------------
1137020 cycles - 2048..4096  (164) A memcpy SSE 16
1143048 cycles - 2048..4096  (164) A memcpy SSE 16
1155096 cycles - 2048..4096  (164) A memcpy SSE 16
910107  cycles - 2048..4096  (164) A memcpy SSE 16
924623  cycles - 2048..4096  (164) A memcpy SSE 16
865810  cycles - 2048..4096  (164) A memcpy SSE 16

980038  cycles - 2048..4096  (164) U memcpy SSE 16
925320  cycles - 2048..4096  (164) U memcpy SSE 16
1021239 cycles - 2048..4096  (164) U memcpy SSE 16
881953  cycles - 2048..4096  (164) U memcpy SSE 16
965462  cycles - 2048..4096  (164) U memcpy SSE 16
850145  cycles - 2048..4096  (164) U memcpy SSE 16
...
978347  cycles - 2048..4096  (164) A memcpy SSE 16
892772  cycles - 2048..4096  (164) A memcpy SSE 16
927040  cycles - 2048..4096  (164) A memcpy SSE 16
917530  cycles - 2048..4096  (164) A memcpy SSE 16
913893  cycles - 2048..4096  (164) A memcpy SSE 16
947837  cycles - 2048..4096  (164) A memcpy SSE 16

932098  cycles - 2048..4096  (164) U memcpy SSE 16
925095  cycles - 2048..4096  (164) U memcpy SSE 16
914343  cycles - 2048..4096  (164) U memcpy SSE 16
924735  cycles - 2048..4096  (164) U memcpy SSE 16
920416  cycles - 2048..4096  (164) U memcpy SSE 16
922230  cycles - 2048..4096  (164) U memcpy SSE 16

The test works well for the AMD in this case, but it should be possible to write a StuffCache proc of some size with "strange" opcodes to fill up the cache with unrelated code. Calling this proc before each test may then clear (or replace) the cache content.
Title: Re: Code location sensitivity of timings
Post by: dedndave on July 12, 2014, 10:06:49 PM
i suppose you could use VirtualProtect to allow writes in the .CODE section
then, copy the code under test into a common code address space before executing it

http://msdn.microsoft.com/en-us/library/windows/desktop/aa366898%28v=vs.85%29.aspx

to help speed testing up a little....
you could use different counter_begin loop count values for the short and long tests
i try to select a loop count that yields about 0.5 seconds per pass
Title: Re: Code location sensitivity of timings
Post by: dedndave on July 12, 2014, 10:11:45 PM
prescott w/htt - xp sp3
unaligned
Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
------------------------------------------------------
4925905 cycles - 2048..4096  (164) A memcpy SSE 16
4933332 cycles - 2048..4096  (164) A memcpy SSE 16
4953203 cycles - 2048..4096  (164) A memcpy SSE 16
4963909 cycles - 2048..4096  (164) A memcpy SSE 16
4923198 cycles - 2048..4096  (164) A memcpy SSE 16
4941277 cycles - 2048..4096  (164) A memcpy SSE 16

11502669        cycles - 2048..4096  (164) U memcpy SSE 16
11487135        cycles - 2048..4096  (164) U memcpy SSE 16
11564951        cycles - 2048..4096  (164) U memcpy SSE 16
11570118        cycles - 2048..4096  (164) U memcpy SSE 16
11497558        cycles - 2048..4096  (164) U memcpy SSE 16
11526087        cycles - 2048..4096  (164) U memcpy SSE 16

aligned
Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
------------------------------------------------------
4935114 cycles - 2048..4096  (164) A memcpy SSE 16
4934727 cycles - 2048..4096  (164) A memcpy SSE 16
4942523 cycles - 2048..4096  (164) A memcpy SSE 16
4924574 cycles - 2048..4096  (164) A memcpy SSE 16
4924658 cycles - 2048..4096  (164) A memcpy SSE 16
4937763 cycles - 2048..4096  (164) A memcpy SSE 16

11490869        cycles - 2048..4096  (164) U memcpy SSE 16
11481780        cycles - 2048..4096  (164) U memcpy SSE 16
11616596        cycles - 2048..4096  (164) U memcpy SSE 16
11530420        cycles - 2048..4096  (164) U memcpy SSE 16
11488318        cycles - 2048..4096  (164) U memcpy SSE 16
11504392        cycles - 2048..4096  (164) U memcpy SSE 16
Title: Re: Code location sensitivity of timings
Post by: nidud on July 12, 2014, 11:04:51 PM
Your CPU seems to be free of this problem, which makes the test results more reliable. My CPU is a bit moody.

Quote
copy the code under test into a common code address space before executing it

brilliant: a location problem solved by relocation  :lol:

I used the data segment for the code and that seems to work just fine
Code: [Select]
.data
proc_x  db PAGESIZE dup(?)
...
lea edi,proc_x
lea esi,test_&x&
mov ecx,size_&x&
rep movsb
invoke  Sleep, 200
counter_begin 1000, HIGH_PRIORITY_CLASS
mov edi,l1
.repeat
push edi
push offset source
push a2
lea eax,proc_x
call eax
; invoke  test_&x&,a2,addr source,edi
Title: Re: Code location sensitivity of timings
Post by: dedndave on July 12, 2014, 11:21:46 PM
hmmm - that seems wrong
i thought you had to be in a code section to assemble instructions
guess i've never tried it - lol

but - there is nothing to prevent you from putting a proc in the code section and copying it to another address
so long as you allow writes into the affected pages with VirtualProtect and PAGE_EXECUTE_READWRITE

http://msdn.microsoft.com/en-us/library/windows/desktop/aa366786%28v=vs.85%29.aspx
Title: Re: Code location sensitivity of timings
Post by: nidud on July 13, 2014, 02:28:37 AM
I don't think it's a problem using the data segment
Code: [Select]
.data
db "I'm in the _DATA segment ;-)",13,10,0
hello_dave proc
invoke crt_printf,$ - 31
ret
hello_dave endp
.code
start:
call hello_dave

I tried the strlen test and it worked fine
that is, this works:
Code: [Select]
test_0  proc uses edi string:dword
invoke  crt_strlen,string
ret
test_0  endp
but this failed:
Code: [Select]
test_5 proc string:dword
void len(string)
ret
test_5  endp
suspect putting it in the code segment will be the same though
Title: Re: Code location sensitivity of timings
Post by: dedndave on July 13, 2014, 03:03:15 AM
i hadn't thought about that
but - most win32 CALL's are near relative
so, you'd have to translate the target addresses

but - for testing algorithm code that doesn't make any calls, like loops, etc,
the branch addresses are relative, but the target moves with the code   :P

if you wanted to use calls or invokes in movable code,
you could store the branch address and use CALL DWORD PTR lpfnFunction
or MOV EAX,Function and CALL EAX
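
To illustrate the relative-call point: a near CALL (opcode E8) stores a 32-bit displacement measured from the end of the 5-byte instruction, so a copied call only reaches the same target if its displacement is adjusted by the distance the code moved. The C sketch below models just that arithmetic; the function names and the example addresses are made up for illustration and are not from the test-bed.

```c
#include <stdint.h>

/* Length of a near relative CALL: opcode E8 plus a 32-bit displacement. */
#define CALL_REL32_LEN 5u

/* Displacement to encode for a CALL located at call_addr that should reach target. */
int32_t rel32_for(uint32_t call_addr, uint32_t target)
{
    /* Unsigned arithmetic wraps; the cast reinterprets the result as signed. */
    return (int32_t)(target - (call_addr + CALL_REL32_LEN));
}

/* Displacement after the CALL is moved from old_addr to new_addr, same target. */
int32_t rel32_after_move(int32_t old_disp, uint32_t old_addr, uint32_t new_addr)
{
    return (int32_t)((uint32_t)old_disp + old_addr - new_addr);
}
```

Branches inside the moved proc need no such fix, since source and target move together; an indirect call through a register or a stored pointer avoids the adjustment as well.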
Title: Re: Code location sensitivity of timings
Post by: nidud on July 13, 2014, 04:10:22 AM
well, here is a cute one
Code: [Select]
ifdef MASMBASIC
 include \masm32\MasmBasic\MasmBasic.inc
else
 include \masm32\include\masm32rt.inc
endif
.686
.xmm
include \masm32\macros\timers.asm

PAGESIZE equ 4096

.data?
align 16
proc_x  db PAGESIZE*4 dup(?)
error dd ?

.data
immed equ 1
mem dd ?
mem16 dw ?,?
mem8 db ?,?,?,?

info_0  db "prolog",0
.code
test_0  proc
local char[1]:byte
ret
test_0  endp
size_0  equ $ - test_0

UPCODE macro x,asm:VARARG
.data
info_&x&  db "&asm&",0
.code
test_&x& proc
local char[1]:byte
repeat 1000
asm
endm
@@: ret
test_&x&  endp
size_&x&  equ $ - test_&x&
endm

UPCODE 1,mov eax,immed
UPCODE 2,mov eax,mem
UPCODE 3,mov ax,mem16
UPCODE 4,mov al,mem8
UPCODE 5,movzx eax,mem16
UPCODE 6,movsx eax,mem8
UPCODE 7,push eax
UPCODE 8,pop eax
UPCODE 9,push mem
UPCODE A,pop mem
UPCODE B,lea eax,mem
UPCODE C,lea eax,[eax+1]
UPCODE D,add eax,immed
UPCODE E,sub eax,immed
UPCODE F,add eax,mem
UPCODE G,sub eax,mem
UPCODE H,@@: jmp @F
UPCODE I,@@: jz @F
UPCODE J,@@: jnz @F
UPCODE K,test eax,eax
UPCODE L,test eax,mem
UPCODE M,cmp eax,eax
UPCODE N,cmp eax,mem

procs equ <for x,<0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F,G,H,I,J,K,L,M,N>>

;***********************************************************************************************

; get the cycle count for each algo

test_algo macro x,l1
push esi
push edi
push ebx
lea edi,proc_x
lea esi,test_&x&
mov ecx,size_&x&
rep movsb
invoke  Sleep, 200
counter_begin 100, HIGH_PRIORITY_CLASS
mov edi,l1
lea ebx,proc_x
.repeat
call ebx
dec edi
.until  !edi
counter_end
printf("%d\tcycles - %d (%4d)  %s\n",eax,l1,size_&x&,addr info_&x&)
pop ebx
pop edi
pop esi
endm

validate_x macro x ; test if the algo actually works..
invoke test_&x& ; and spin-up the CPU..
endm

;***********************************************************************************************

main proc
procs
    validate_x x
    cmp error,0
    jne toend
endm
procs
    test_algo x,32
endm
toend:
ret
main endp

;***********************************************************************************************

ShowCpu proc ; mode:DWORD
COMMENT @ Usage:
  push 0, call ShowCpu  ; simple, no printing, just returns SSE level
  push 1, call ShowCpu  ; prints the brand string and returns SSE level@
  pushad
  sub esp, 80 ; create a buffer for the brand string
  mov edi, esp ; point edi to it
  xor ebp, ebp
  .Repeat
lea eax, [ebp+80000002h]
db 0Fh, 0A2h ; cpuid 80000002h-80000004h
stosd
mov eax, ebx
stosd
mov eax, ecx
stosd
mov eax, edx
stosd
inc ebp
  .Until ebp>=3
  push 1
  pop eax
  db 0Fh, 0A2h ; cpuid 1
  xor ebx, ebx ; CpuSSE
  xor esi, esi ; add zero plus the carry flag
  bt edx, 25 ; edx bit 25, SSE1
  adc ebx, esi
  bt edx, 26 ; edx bit 26, SSE2
  adc ebx, esi
  bt ecx, esi ; ecx bit 0, SSE3
  adc ebx, esi
  bt ecx, 9 ; ecx bit 9, SSSE3 (shown as SSE level 4)
  adc ebx, esi
  dec dword ptr [esp+4+32+80] ; dec mode in stack
  .if Zero?
mov edi, esp ; restore pointer to brand string
.Repeat
.Break .if byte ptr [edi]!=32 ; mode was 1, so show a string but skip leading blanks
inc edi
.Until 0
.if byte ptr [edi]<32
print chr$("pre-P4")
.else
print edi ; CpuBrand
.endif
.if ebx
print chr$(32, 40, "SSE") ; info on SSE level, 40=(
print str$(ebx), 41, 13, 10 ; 41=)
.endif
  .endif
  add esp, 80 ; discard brand buffer (after printing!)
  mov [esp+32-4], ebx ; move ebx into eax stack position - returns eax to main for further use
  ifdef MbBufferInit
call MbBufferInit
  endif
  popad
  ret 4
ShowCpu endp

start:
print " ", 13, 10
push 1
call ShowCpu ; print brand string and SSE level
print "------------------------------------------------------", 13, 10
call main
inkey chr$("--- ok ---", 13)
exit

end start

result:
Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)
--------------------------------------------
266     cycles - 32 (  10)  prolog
10916   cycles - 32 (5010)  mov eax,immed
16301   cycles - 32 (5010)  mov eax,mem
32171   cycles - 32 (6010)  mov ax,mem16
32164   cycles - 32 (5010)  mov al,mem8
16316   cycles - 32 (7010)  movzx eax,mem16
16317   cycles - 32 (7010)  movsx eax,mem8
24145   cycles - 32 (1010)  push eax
16353   cycles - 32 (1010)  pop eax
36611   cycles - 32 (6010)  push mem
36377   cycles - 32 (6010)  pop mem
10900   cycles - 32 (6010)  lea eax,mem
32152   cycles - 32 (3010)  lea eax,[eax+1]
32146   cycles - 32 (3010)  add eax,immed
32150   cycles - 32 (3010)  sub eax,immed
32170   cycles - 32 (6010)  add eax,mem
32171   cycles - 32 (6010)  sub eax,mem
345857  cycles - 32 (2010)  @@: jmp @F
10844   cycles - 32 (2010)  @@: jz @F
346143  cycles - 32 (2010)  @@: jnz @F
10839   cycles - 32 (2010)  test eax,eax
16311   cycles - 32 (6010)  test eax,mem
10839   cycles - 32 (2010)  cmp eax,eax
16311   cycles - 32 (6010)  cmp eax,mem
Title: Re: Code location sensitivity of timings
Post by: nidud on July 14, 2014, 09:39:03 PM
Instruction Timing

I parsed instructions from the Intel 8086 Family Architecture and calibrated the clock cycle calculation to 1000 for one instruction, to use this as a base for comparing different CPUs.

Code: [Select]
mov cloop,1
calibrate:
mov eax,cloop
mov count,eax
counter_begin TMCOUNT, HIGH_PRIORITY_CLASS
     l2:
lea eax,proc_x
call eax ; Instruction MOV EAX,1
dec count
jnz l2
mov eax,cloop
mov count,eax
counter_end
cmp eax,1000
jae @F
add cloop,1
jmp calibrate
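
As far as the loop above shows, the calibration raises the loop count one step at a time until the reference instruction block costs at least 1000 cycles. A minimal C model of that search follows; `measure_fn`, `calibrate` and `fake_measure` are hypothetical names, and the fake timer stands in for counter_begin/counter_end purely so the sketch has something deterministic to call.

```c
#include <stdint.h>

/* Hypothetical callback: returns the cycle count measured for `loops` calls. */
typedef uint32_t (*measure_fn)(uint32_t loops);

/* Smallest loop count (starting at 1) whose measured cost reaches `target`
   cycles, or 0 if `max_loops` is exceeded. */
uint32_t calibrate(measure_fn measure, uint32_t target, uint32_t max_loops)
{
    uint32_t loops;
    for (loops = 1; loops <= max_loops; ++loops) {
        if (measure(loops) >= target)
            return loops;
    }
    return 0;
}

/* Stand-in for a real timer: 37 cycles per call plus 5 cycles of overhead. */
uint32_t fake_measure(uint32_t loops)
{
    return 37u * loops + 5u;
}
```

The upper bound is an addition for safety; the assembly loop above has no such limit.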

Code: [Select]
Instruction Clock Cycle Calculation

1. AMD Athlon(tm) II X2 245 Processor
2. Intel(R) Core(TM) i3-2367M CPU @ 1.40GHz

Instr.  Operands      Size  CPU1  CPU2
------------------------------------------------------
proc        [0]   0.1   0.1
AAA        [1]  14.1  10.3
AAD        [2]  14.0   4.5
AAM        [2]  36.5  54.8
AAS        [1]  14.1  10.3
ADC reg,reg        [2]   2.7   5.5
ADC mem,reg        [6]  19.1  20.0
ADC reg,mem        [6]   2.8   5.5
ADC reg,imm        [3]   2.8   5.4
ADC mem,imm        [7]  18.2  20.5
ADC acc,imm        [3]   2.8   5.5
ADD reg,reg        [2]   2.7   2.5
ADD mem,reg        [6]  19.1  16.1
ADD reg,mem        [6]   2.8   2.7
ADD reg,imm        [3]   2.8   2.6
ADD mem,imm        [7]  18.2  16.4
ADD acc,imm        [3]   2.8   2.7
AND reg,reg        [2]   2.7   2.5
AND mem,reg        [6]  19.1  17.8
AND reg,mem        [6]   2.8   2.8
AND reg,imm        [3]   2.8   2.6
AND mem,imm        [7]  18.2  16.5
AND acc,imm        [3]   2.8   2.5
BSF reg,reg        [3]  11.2   6.9
BSF reg,mem        [7]   8.5   6.9
BSR reg,reg        [3]  11.2   6.9
BSR reg,mem        [7]   8.5   7.2
BSWAP reg        [2]   2.7   2.8
BT reg,imm        [4]   1.0   1.6
BT mem,imm        [8]   1.6   1.6
BT reg,reg        [3]   1.1   1.5
BT mem,reg        [7]   5.7  14.4
BTC reg,imm        [4]   5.5   2.5
BTC mem,imm        [8]  19.4  16.7
BTC reg,reg        [3]   5.5   2.6
BTC mem,reg        [7]  14.0  17.4
BTR reg,imm        [4]   5.5   2.6
BTR mem,imm        [8]  19.4  16.5
BTR reg,reg        [3]   5.5   2.5
BTR mem,reg        [7]  22.1  17.4
BTS reg,imm        [4]   5.5   2.7
BTS mem,imm        [8]  19.4  16.6
BTS reg,reg        [3]   5.5   2.9
BTS mem,reg        [7]  22.1  17.5
CALL        [5]  51.8  12.0
CBW        [2]   2.7   2.6
CDQ        [1]   2.8   2.9
CLC        [1]   1.1   0.9
CLD        [1]   2.8  11.5
CMC        [1]   2.8   2.6
CMP reg,reg        [2]   1.0   1.1
CMP mem,reg        [6]   1.6   1.5
CMP reg,mem        [6]   1.6   1.8
CMP reg,imm        [3]   1.1   1.1
CMP mem,imm        [7]   1.5   2.9
CMP acc,imm        [3]   1.1   1.1
CMP r16,imm        [4]   1.0   1.0
CMP m16,imm        [8]   1.6   2.9
CMP r16,r16        [3]   1.1   1.1
CMP m16,r16        [7]   1.5   1.5
CMPSB        [1]   8.5  11.4
CMPSW        [2]   8.6  11.4
CMPSD        [1]   8.5  11.4
CMPXCHG reg,acc        [3]   5.7  14.4
CMPXCHG mem,acc        [7]  20.5  20.7
CWD        [2]   2.7   3.0
CWDE        [1]   2.8   2.9
DAA        [1]  16.8  11.9
DAS        [1]  19.7  11.9
DEC reg        [1]   2.8   2.5
DEC r16        [2]   2.7   2.6
DEC r08        [2]   2.7   2.5
DEC mem        [6]  19.1  16.5
DIV reg        [2]  50.6  52.6
DIV r08        [2]  45.0  49.3
DIV r16        [3]  50.5  55.7
DIV mem        [6]  53.3  52.4
DIV m08        [6]  44.9  49.0
DIV m16        [7]  53.3  55.5
ENTER        [4]  33.8  22.7
IDIV reg        [2]  61.8  53.3
IDIV r08        [2]  47.7  52.3
IDIV r16        [3]  61.7  55.9
IDIV mem        [6]  61.7  52.5
IDIV m08        [6]  53.4  49.3
IDIV m16        [7]  61.8  56.0
IMUL reg        [2]   8.4  12.0
IMUL r08        [2]   8.4   6.9
IMUL r16        [3]   8.4  11.7
IMUL mem        [6]   8.4  12.3
IMUL m08        [6]   8.4   6.9
IMUL m16        [7]   8.4  11.9
IMUL reg,acc        [3]   8.4   8.1
IMUL reg,acc,imm    [3]   2.9   2.8
INC reg        [1]   2.8   2.6
INC r16        [2]   2.7   2.6
INC r08        [2]   2.7   2.5
INC mem        [6]  20.0  16.4
JCXZ        [3]   2.0   3.4
JECXZ        [2]   2.1   7.0
JMP        [2]  30.5   8.1
JNZ        [2]  30.1  10.7
JZ        [2]   1.0   8.0
LAHF        [1]   5.7   2.9
LEA acc,mem        [6]   1.0   1.4
LEA reg,mem        [6]   1.0   1.4
LEA acc,[acc]      [2]   2.7   2.5
LEA reg,[reg]      [2]   2.7   2.5
LEA acc,[acc+imm]  [3]   2.8   2.5
LEA reg,[reg+imm]  [3]   2.8   2.5
LEAVE*        [5]  36.6  25.6
LODSB        [1]   5.7   3.5
LODSW        [2]   5.7   3.5
LODSD        [1]   5.7   2.9
LOOP        [2]  36.8  16.5
LOOPZ        [2]   8.5  19.0
LOOPNZ        [2]  41.0  24.5
MOV acc,imm        [5]   1.1   1.0
MOV reg,imm        [5]   1.1   1.1
MOV reg,mem        [6]   1.5   1.5
MOV r16,m16        [7]   2.8   2.6
MOVSB        [1]   8.5  11.6
MOVSW        [2]   8.5  11.5
MOVSD        [1]   8.5  11.4
MOVSX acc,al        [3]   2.8   2.5
MOVSX acc,m16        [7]   1.5   1.5
MOVSX reg,cl        [3]   2.8   2.6
MOVSX reg,m16        [7]   1.5   1.7
MOVZX acc,al        [3]   2.8   2.5
MOVZX acc,m16        [7]   1.5   1.5
MOVZX reg,cl        [3]   2.8   2.5
MOVZX reg,m16        [7]   1.5   1.8
MUL reg        [2]   8.4  12.0
MUL r16        [3]   8.4  11.5
MUL r08        [2]   8.4   6.8
MUL mem        [6]   8.4  12.3
NEG acc        [2]   2.7   2.6
NEG reg        [2]   2.7   2.5
NEG mem        [6]  19.1  16.2
NOP        [1]   1.0   0.9
NOT reg        [2]   2.7   2.8
NOT mem        [6]  19.1  16.5
OR reg,reg        [2]   2.7   2.5
OR mem,reg        [6]  19.1  16.1
OR reg,mem        [6]   2.8   2.6
OR reg,imm        [3]   2.8   2.6
OR mem,imm        [7]  18.3  16.4
OR acc,imm        [3]   2.8   2.5
PUSH acc        [1]   2.2   2.9
PUSH reg        [1]   2.2   2.9
PUSH mem        [6]   3.3   2.9
PUSHAD        [1]  16.9  22.8
PUSHFD        [1]   8.5   3.0
POP acc        [1]   1.6   1.6
POP reg        [1]   1.6   1.5
POP mem        [6]   3.3   3.1
POPAD*        [2]  32.3  45.2
POPFD*        [2]  38.9  64.7
RCL acc,1        [2]   2.7   5.7
RCL mem,1        [6]  19.1  21.8
RCL reg,cl        [2]  11.3  17.0
RCL mem,cl        [6]  20.5  28.2
RCL reg,16        [3]   8.5  17.0
RCL mem,16        [7]  23.3  28.2
RCR acc,1        [2]   2.7   5.8
RCR mem,1        [6]  19.2  21.9
RCR reg,cl        [2]   8.5  14.2
RCR mem,cl        [6]  21.5  25.7
RCR reg,16        [3]   8.5  14.2
RCR mem,16        [7]  21.6  25.9
ROL acc,1        [2]   2.7   3.0
ROL mem,1        [6]  19.1  17.9
ROL reg,cl        [2]   2.7   5.6
ROL mem,cl        [6]  19.1  16.5
ROL reg,16        [3]   2.8   2.9
ROL mem,16        [7]  18.2  18.2
ROR acc,1        [2]   2.7   2.9
ROR mem,1        [6]  19.1  18.1
ROR reg,cl        [2]   2.7   5.5
ROR mem,cl        [6]  19.1  16.6
ROR reg,16        [3]   2.8   2.9
ROR mem,16        [7]  18.2  18.5
SAHF        [1]   1.1   4.7
SAR acc,1        [2]   2.7   2.5
SAR mem,1        [6]  19.1  16.4
SAR reg,cl        [2]   2.7   5.5
SAR mem,cl        [6]  19.1  16.6
SAR reg,16        [3]   2.8   2.6
SAR mem,16        [7]  18.2  16.5
SBB reg,reg        [2]   2.7   5.5
SBB mem,reg        [6]  19.1  19.7
SBB reg,mem        [6]   2.8   5.4
SBB reg,imm        [3]   2.8   5.4
SBB acc,imm        [3]   2.8   5.7
SCASB        [1]   7.1   3.5
SCASW        [2]   6.7   3.5
SCASD        [1]   7.1   3.5
SETNB al        [3]   2.8   2.5
SETNB m08        [7]   2.8   3.5
SETB al        [3]   2.8   2.6
SETB m08        [7]   2.8   2.9
SETBE al        [3]   2.8   3.1
SETBE m08        [7]   2.8   2.9
SETZ al        [3]   2.8   2.9
SETZ m08        [7]   2.8   2.9
SETNZ al        [3]   2.8   2.6
SETNZ m08        [7]   2.8   2.9
SETL al        [3]   2.8   2.6
SETL m08        [7]   2.8   2.9
SETNL al        [3]   2.8   2.6
SETNL m08        [7]   2.8   2.9
SETLE al        [3]   2.8   2.8
SETLE m08        [7]   2.8   2.9
SETG al        [3]   2.8   2.5
SETG m08        [7]   2.8   2.9
SETS al        [3]   2.8   2.6
SETS m08        [7]   2.8   2.9
SETNS al        [3]   2.8   2.6
SETNS m08        [7]   2.8   2.9
SETC al        [3]   2.8   2.8
SETC m08        [7]   2.8   2.9
SETNC al        [3]   2.8   2.5
SETNC m08        [7]   2.8   2.9
SETO al        [3]   2.8   2.6
SETO m08        [7]   2.8   2.9
SETNO al        [3]   2.8   2.5
SETNO m08        [7]   2.8   2.9
SETP al        [3]   2.8   2.7
SETP m08        [7]   2.8   2.9
SETPE al        [3]   2.8   2.6
SETPE m08        [7]   2.8   2.9
SETNP al        [3]   2.8   2.8
SETNP m08        [7]   2.8   2.9
SETPO al        [3]   2.8   2.5
SETPO m08        [7]   2.8   2.9
SHL acc,1        [2]   2.7   2.6
SHL mem,1        [6]  19.1  16.6
SHL reg,cl        [2]   2.7   5.6
SHL mem,cl        [6]  19.1  16.7
SHL reg,16        [3]   2.8   2.6
SHL mem,16        [7]  18.2  16.5
SHR acc,1        [2]   2.7   2.6
SHR mem,1        [6]  19.1  16.4
SHR reg,cl        [2]   2.7   5.5
SHR mem,cl        [6]  19.1  16.6
SHR reg,16        [3]   2.8   2.5
SHR mem,16        [7]  18.2  16.4
SHLD acc,reg,16     [4]   8.5   2.5
SHLD mem,reg,16     [8]  18.3  16.7
SHLD acc,reg,cl     [3]   8.5   5.5
SHLD mem,reg,cl     [7]  19.4  16.5
SHRD acc,reg,16     [4]   8.5   2.5
SHRD mem,reg,16     [8]  18.3  19.2
SHRD acc,reg,cl     [3]   8.5   6.6
SHRD mem,reg,cl     [7]  19.4  19.9
SMSW ax        [3]   5.7  28.2
SMSW m16        [7]   5.7  25.5
STC        [1]   1.1   1.0
STD        [1]   5.6  13.0
STOSB        [1]   5.7   3.0
STOSW        [2]   5.7   3.0
STOSD        [1]   5.7   2.9
SUB reg,reg        [2]   1.0   1.2
SUB mem,reg        [6]  19.1  16.1
SUB reg,mem        [6]   2.8   2.6
SUB reg,imm        [3]   2.8   2.6
SUB acc,imm        [3]   2.8   2.8
TEST reg,reg        [2]   1.0   1.0
TEST mem,reg        [6]   1.6   1.5
TEST reg,mem        [6]   1.6   1.5
TEST reg,imm        [6]   1.0   1.1
TEST acc,imm        [5]   1.1   1.1
XCHG reg,reg        [2]   2.9   5.9
XCHG reg,mem        [6]  44.9  64.7
XCHG mem,reg        [6]  45.7  64.8
XOR reg,reg        [2]   1.1   0.8
XOR mem,reg        [6]  19.9  16.0
XOR reg,mem        [6]   2.8   2.9
XOR reg,imm        [3]   2.8   2.6
XOR mem,imm        [7]  18.2  16.2
XOR acc,imm        [3]   2.8   2.6

* LEAVE = LEAVE - ENTER
* POPAD = POPAD - PUSHAD
* POPFD = POPFD - PUSHFD
Title: Re: Code location sensitivity of timings
Post by: nidud on July 14, 2014, 10:09:22 PM
here is the source code for the instruction calculation
however, you need ASM (http://masm32.com/board/index.php?topic=902.msg29945#msg29945) and JWLINK (http://sourceforge.net/projects/jwlink/) to build
Title: Re: Code location sensitivity of timings
Post by: dedndave on July 15, 2014, 02:10:35 AM
hard to test individual instructions
so much relies on the surrounding code
Title: Re: Code location sensitivity of timings
Post by: nidud on July 15, 2014, 05:46:29 AM
Quote
hard to test individual instructions
it's actually very easy, in fact the easiest thing to test

        counter_begin 1000, HIGH_PRIORITY_CLASS
        repeat 1000
        lea eax,[eax]
        endm
        counter_end


knowing the approximate time difference between instructions is also useful

        mov     eax,[ebx]   ; 1
        mov     [ebx],ecx   ; 2
        mov     ecx,eax     ; 3 cycles
        ...
        xchg    ecx,[ebx]   ; 50 cycles

this is the old document (http://web.itu.edu.tr/kesgin/mul06/intel/index.html) I have used for this

Agner Fog has lists of instruction latencies (http://www.agner.org/optimize/instruction_tables.pdf)
Code: [Select]
MOV     r,r     1   1  1/3 ALU
MOV     r,i     1   1  1/3 ALU
MOV     r8,m8   1   4  1/2 ALU, AGU
...
DIV     r8/m8   32  24  23 ALU
DIV     r16/m16 47  24  23 ALU
DIV     r32/m32 79  40  40 ALU
IDIV    r8      41  17  17 ALU
IDIV    r16     56  25  25 ALU
IDIV    r32     88  41  41 ALU
IDIV    m8      42  17  17 ALU
IDIV    m16     57  25  25 ALU
IDIV    m32     89  41  41 ALU
seems from this the test may not be that far off

Quote
so much relies on the surrounding code
that's true, hence the problem with the test
location, alignment, and the issue of caches
Title: Re: Code location sensitivity of timings
Post by: nidud on July 16, 2014, 08:15:11 AM
I created a new testbed where the code location problem is fixed by having a fixed location for the function (as Dave pointed out). Also (as Dave pointed out), the cache problem could be solved by not linking in the functions to test. This is done by using the -bin switch in JWASM to create a binary of each function.
Title: Re: Code location sensitivity of timings
Post by: nidud on July 20, 2014, 11:43:43 PM
It turns out the data section also has an impact on the result. The code was executed in the .data? section, so I now (again  :P) use the method suggested by Dave and load the functions into the code section.

In addition to this I flush the code buffer for each new function loaded using FlushInstructionCache (http://msdn.microsoft.com/en-us/library/windows/desktop/ms679350(v=vs.85).aspx).

readit proc uses ebx esi edi fname
   invoke GetCurrentProcess
   invoke FlushInstructionCache,eax,addr proc_x,size_p


I tested some string functions using the strlen approach.

Code: [Select]
strcpy  proc uses edi esi dst, src
mov edi,dst
mov esi,src
mov eax,[esi] ; test the first 4 byte for zero..
lea ecx,[eax-01010101H]
not eax
and ecx,eax
and ecx,80808080h
jnz tail
if 1
;
; at this point there is at least 4 byte in src
;
mov ecx,esi ; align pointer to 4
and ecx,3
mov eax,[esi] ; copy the first 4 bytes
mov [edi],eax
add edi,ecx ; set new offset (+0..3)
add esi,ecx
mov eax,[esi] ; test new offset 0..3
lea ecx,[eax-01010101H]
not eax
and ecx,eax
and ecx,80808080h
jnz @1
endif
align 4 ; align main loop
@@: mov eax,[esi] ; copy aligned DWORD's
mov [edi],eax
add edi,4
add esi,4
mov eax,[esi]
lea ecx,[eax-01010101H]
not eax
and ecx,eax
and ecx,80808080h
jz @B
;
; we now have at least 4 byte in the buffer
;
@1: bsf ecx,ecx
shr ecx,3
mov eax,[esi+ecx-3] ; copy 1..4 byte with overlap
mov [edi+ecx-3],eax
align 4
toend:  mov eax,dst
ret
align 4
tail: mov eax,[esi] ; for strings < 4 byte..
mov [edi],al ; 1893/1505/1602 - 3
test al,al
jz toend
mov [edi],ax
test ah,ah
jz toend
shr eax,16
mov [edi+2],al
test al,al
jz toend
mov [edi+2],ax
jmp toend
strcpy  endp
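
The zero test repeated in the listing above (lea ecx,[eax-01010101H] / not eax / and ecx,eax / and ecx,80808080h, then bsf and shr 3) can be modelled in C. This is only a sketch of the bit trick, not of the whole strcpy; the function names are made up, and the shift loop is a portable stand-in for BSF.

```c
#include <stdint.h>

/* Nonzero if any byte of w is zero. Flags above the first zero byte may be
   spurious, but the lowest set flag always marks the first zero byte. */
uint32_t zero_byte_mask(uint32_t w)
{
    return (w - 0x01010101u) & ~w & 0x80808080u;
}

/* Index (0..3, little-endian) of the first zero byte in w, or -1 if none. */
int first_zero_byte(uint32_t w)
{
    uint32_t mask = zero_byte_mask(w);
    int bit = 0;
    if (mask == 0)
        return -1;
    while ((mask & 1u) == 0) {   /* portable stand-in for BSF */
        mask >>= 1;
        ++bit;
    }
    return bit >> 3;             /* same as SHR ECX,3 in the listing */
}
```

Because only the lowest flag is trusted, the scan has to take the first set bit, which is what BSF does.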
here is the SSE version using the same approach
Code: [Select]
SSE_strcpy  proc uses esi edi dst, src
mov edi,dst
mov esi,src
;
; test the first 16 byte for zero..
;
xorps xmm1,xmm1 ; clear xmm1 for compare
movdqu  xmm0,[esi] ; get 16 byte from source
pcmpeqb xmm0,xmm1 ; compare bytes
pmovmskb ecx,xmm0 ; get result
test ecx,ecx
jnz tail
if 0
mov ecx,esi ; align pointer to 16
and ecx,15
movdqu  xmm0,[esi] ; copy 16 byte
movdqu  [edi],xmm0 ;
add edi,ecx ; set new offset
add esi,ecx
pcmpeqb xmm0,xmm1 ; compare bytes
pmovmskb ecx,xmm0 ; get result
test ecx,ecx
jnz @Z
endif
;
; main loop
;
align 4
@@: movdqu  xmm0,[esi] ; copy
movdqu  [edi],xmm0 ;
add esi,16
add edi,16
pcmpeqb xmm0,xmm1 ; test for zero
pmovmskb ecx,xmm0 ;
test ecx,ecx
jz @B
@Z: bsf ecx,ecx
movdqu  xmm0,[esi+ecx-16+1]
movdqu  [edi+ecx-16+1],xmm0
align 4
toend:
mov eax,dst
ret
align 4 ; for strings < 16 byte..
tail:
if 0
bsf ecx,ecx
mov [edi],ch
jz toend
align 4
@@: mov al,[esi]
mov [edi],al
inc edi
inc esi
dec ecx
jnz @B
mov [edi],cl
jmp toend
endif
if 0
bsf ecx,ecx
inc ecx
@@: mov al,[esi+ecx-1]
mov [edi+ecx-1],al
dec ecx
jnz @B
jmp toend
endif
if 0 ; 4387 - small..
bsf ecx,ecx
inc ecx
rep movsb
jmp toend
endif
if 1 ; 1709 - large..
repeat  3
mov eax,[esi]
mov [edi],al
test al,al
jz toend
mov [edi],ax
test ah,ah
jz toend
shr eax,16
mov [edi+2],al
test al,al
jz toend
mov [edi+2],ax
test ah,ah
jz toend
add esi,4
add edi,4
endm
mov eax,[esi]
mov [edi],al
test al,al
jz toend
mov [edi],ax
test ah,ah
jz toend
shr eax,16
mov [edi+2],al
test al,al
jz toend
mov [edi+2],ax
jmp toend
endif
SSE_strcpy endp

The first run uses aligned strings, the second unaligned strings, and the last a small string to exercise the tail.

AMD Athlon(tm) II X2 245 Processor (SSE3)
STRCHR-----------------------------------
597240  cycles - 10 (  0) 0: crt_strcpy
1462096 cycles - 10 ( 37) 1: x
544466  cycles - 10 (151) 2: x
168543  cycles - 10 (251) 3: SSE2

616369  cycles - 10 (  0) 0: crt_strcpy
2062799 cycles - 10 ( 37) 1: x
561664  cycles - 10 (151) 2: x
236658  cycles - 10 (251) 3: SSE2

1850    cycles - 99 (  0) 0: crt_strcpy
5161    cycles - 99 ( 37) 1: x
1603    cycles - 99 (151) 2: x
1993    cycles - 99 (251) 3: SSE2


here is a version of strchr that uses a DWORD as argument
the char needs to be populated from 'c' to 'cccc'
Code: [Select]
strchr proc uses ebx string, char
mov eax,string
mov ebx,char
not ebx
align 4
lup: mov edx,[eax]
lea ecx,[edx-01010101H]
not edx
and ecx,edx
and ecx,80808080h
jnz tail
add eax,4
sub edx,ebx
lea ecx,[edx-01010101H]
not edx
and ecx,edx
and ecx,80808080h
jz lup
bsf ecx,ecx
shr ecx,3
lea eax,[ecx+eax-4]
align 4
toend:  ret
align 4
tail: movzx ebx,byte ptr char
cmp [eax],bl ; compare a single byte
je toend
inc eax
cmp [eax],bl
je toend
inc eax
cmp [eax],bl
je toend
inc eax
cmp [eax],bl
je toend
xor eax,eax
jmp toend
strchr endp
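
The 'c' to 'cccc' step mentioned for this version can be sketched in C as a multiply by 01010101h. The listing itself combines NOT and SUB to compare a word against the populated char; the sketch below swaps in XOR, a common way of turning a matching byte into a zero byte before applying the same zero test. Function names are made up for illustration.

```c
#include <stdint.h>

/* Replicate one byte into all four byte lanes: 'c' -> 'cccc'. */
uint32_t broadcast_byte(unsigned char c)
{
    return (uint32_t)c * 0x01010101u;
}

/* Index (0..3, little-endian) of the first byte of w equal to c, or -1. */
int find_byte_in_word(uint32_t w, unsigned char c)
{
    uint32_t x = w ^ broadcast_byte(c);            /* matching bytes become 0 */
    uint32_t mask = (x - 0x01010101u) & ~x & 0x80808080u;
    int index = 0;
    if (mask == 0)
        return -1;
    while ((mask & 0x80u) == 0) {                  /* step one byte lane at a time */
        mask >>= 8;
        ++index;
    }
    return index;
}
```

Doing the populate step once in the caller keeps it out of the timed function, which is presumably why the 'cccc' variant takes the DWORD directly.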

timings for strchr:
82497   cycles - (  0) 0: crt_strchr
91200   cycles - ( 29) 1: x
53679   cycles - (119) 2: 'c'
51788   cycles - (107) 3: 'cccc'

82932   cycles - (  0) 0: crt_strchr
91066   cycles - ( 29) 1: x
59540   cycles - (119) 2: 'c'
54366   cycles - (107) 3: 'cccc'

43756   cycles - (  0) 0: crt_strchr
32192   cycles - ( 29) 1: x
17620   cycles - (119) 2: 'c'
16582   cycles - (107) 3: 'cccc'


timings for strrchr:
329384  cycles - 10 (  0) 0: crt_strrchr
330506  cycles - 10 ( 40) 1: strrchr
330109  cycles - 10 ( 40) 2: x
109653  cycles - 10 ( 66) 3: x
95977   cycles - 10 ( 72) 4: aligned
123257  cycles - 10 (112) 5: SSE

328457  cycles - 10 (  0) 0: crt_strrchr
328311  cycles - 10 ( 40) 1: strrchr
329352  cycles - 10 ( 40) 2: x
119861  cycles - 10 ( 66) 3: x
102823  cycles - 10 ( 72) 4: aligned
123251  cycles - 10 (112) 5: SSE

27515   cycles - 500 (  0) 0: crt_strrchr
26019   cycles - 500 ( 40) 1: strrchr
26019   cycles - 500 ( 40) 2: x
12508   cycles - 500 ( 66) 3: x
11510   cycles - 500 ( 72) 4: aligned
12007   cycles - 500 (112) 5: SSE


the last run is just a single "c" repeated 500 times

this is a new approach to strstr
Code: [Select]
strstr  proc uses edx ebx dst, src
mov eax,src
mov ebx,[eax]
mov bh,bl
mov eax,ebx
shl ebx,16
mov bx,ax
mov eax,dst
not ebx
align 4
lup: mov edx,[eax]
lea ecx,[edx-01010101H]
not edx
and ecx,edx
and ecx,80808080h
jnz tail
add eax,4
sub edx,ebx
lea ecx,[edx-01010101H]
not edx
and ecx,edx
and ecx,80808080h
jz lup
bsf ecx,ecx
shr ecx,3
lea eax,[ecx+eax-4]
xor ecx,ecx
mov edx,src
mov dst,eax
inc eax
inc edx
align 4
lup2: xor cl,[edx]
jz found
sub cl,[eax]
jnz lup
inc eax
inc edx
jmp lup2
align 4
found:  mov eax,dst
align 4
toend:  ret
align 4
tail: shr ecx,8
jz @F
not ebx
cmp [eax],bl
je @3
shr ecx,8
jz @F
inc eax
cmp [eax],bl
je @3
shr ecx,8
jz @F
inc eax
cmp [eax],bl
je @3
@@: xor eax,eax
jmp toend
@3: xor ecx,ecx
mov edx,src
mov dst,eax
inc eax
inc edx
align 4
lup3: xor cl,[edx]
jz found
sub cl,[eax]
jnz toend
inc eax
inc edx
jmp lup3
strstr  endp

timings for strstr (the version above is number 5)
537610  cycles -  10 (  0) 0: crt_strstr
918998  cycles -  10 (  0) 1: InString(1,dst,src) - 1 + dst
553673  cycles -  10 ( 46) 2: strstr
586127  cycles -  10 ( 57) 3: x
309336  cycles -  10 (148) 4: x
207128  cycles -  10 (176) 5: x

533403  cycles -  10 (  0) 0: crt_strstr
938340  cycles -  10 (  0) 1: InString(1,dst,src) - 1 + dst
554530  cycles -  10 ( 46) 2: strstr
584338  cycles -  10 ( 57) 3: x
322246  cycles -  10 (148) 4: x
215947  cycles -  10 (176) 5: x

14019   cycles - 500 (  0) 0: crt_strstr
27515   cycles - 500 (  0) 1: InString(1,dst,src) - 1 + dst
13264   cycles - 500 ( 46) 2: strstr
14518   cycles - 500 ( 57) 3: x
14518   cycles - 500 (148) 4: x
20519   cycles - 500 (176) 5: x
Title: Re: Code location sensitivity of timings
Post by: dedndave on July 21, 2014, 04:40:44 AM
those tests run way too fast to get reliable readings

for best results:
1) bind to a single core
2) wait about 750 ms before performing any tests - this allows the system to settle
3) adjust individual loop counts so that each test pass takes about 0.5 seconds
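
Step 3 can be sketched in C as below. On Windows, steps 1 and 2 would typically be a call to SetProcessAffinityMask followed by Sleep(750), but that is an assumption about the setup and not something the test program is known to do. The helper name and its parameters are hypothetical.

```c
#include <stdint.h>

/* Given a short trial run (trial_loops iterations that took trial_ticks
   timer ticks), estimate how many iterations fill target_ticks.
   Hypothetical helper; not part of the test program in this thread. */
uint64_t scale_loop_count(uint64_t trial_loops, uint64_t trial_ticks,
                          uint64_t target_ticks)
{
    if (trial_ticks == 0)              /* timer too coarse: grow tenfold */
        return trial_loops * 10;
    uint64_t n = trial_loops * target_ticks / trial_ticks;
    return n ? n : 1;                  /* always run at least once */
}
```

With a millisecond timer, a trial of 1000 loops that took 2 ms scales to 250000 loops for a 500 ms pass.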

Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
STRCHR-------------------------------------------
170724  cycles - (  0) 0: crt_strchr
172046  cycles - ( 29) 1: x
70540   cycles - (119) 2: 'c'
62934   cycles - (107) 3: 'cccc'

170343  cycles - (  0) 0: crt_strchr
167063  cycles - ( 29) 1: x
89640   cycles - (119) 2: 'c'
82154   cycles - (107) 3: 'cccc'

240068  cycles - (  0) 0: crt_strchr
87783   cycles - ( 29) 1: x
25549   cycles - (119) 2: 'c'
23995   cycles - (107) 3: 'cccc'
--- ok ---

H:\nidudString\string\strchr => strchr

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
STRCHR-------------------------------------------
209073  cycles - (  0) 0: crt_strchr
221060  cycles - ( 29) 1: x
80107   cycles - (119) 2: 'c'
83854   cycles - (107) 3: 'cccc'

198648  cycles - (  0) 0: crt_strchr
211263  cycles - ( 29) 1: x
106992  cycles - (119) 2: 'c'
96168   cycles - (107) 3: 'cccc'

253182  cycles - (  0) 0: crt_strchr
84552   cycles - ( 29) 1: x
26531   cycles - (119) 2: 'c'
36056   cycles - (107) 3: 'cccc'

they're all over the place   :P
Title: Re: Code location sensitivity of timings
Post by: nidud on July 21, 2014, 06:19:30 AM
Quote
they're all over the place   :P

Before I made the changes my machine was all over the place and yours was stable using the same test   :P

Now the AMD I use produces the same result every time, with no deviation, with or without the delay, so that's good if you are in a hurry. The test is also constructed this way: there are three runs with different loop counts and input.

An Intel machine produces this result:
Code: [Select]
Intel(R) Core(TM) i3 CPU         540  @ 3.07GHz (SSE4)
---------------------------------------------
88029   cycles - (  0) 0: crt_strchr
83409   cycles - ( 29) 1: x
50004   cycles - (119) 2: 'c'
49914   cycles - (107) 3: 'cccc'

84574   cycles - (  0) 0: crt_strchr
83375   cycles - ( 29) 1: x
50586   cycles - (119) 2: 'c'
50408   cycles - (107) 3: 'cccc'

109988  cycles - (  0) 0: crt_strchr
22161   cycles - ( 29) 1: x
19388   cycles - (119) 2: 'c'
16840   cycles - (107) 3: 'cccc'

--2--

93149   cycles - (  0) 0: crt_strchr
83171   cycles - ( 29) 1: x
49828   cycles - (119) 2: 'c'
49987   cycles - (107) 3: 'cccc'

84252   cycles - (  0) 0: crt_strchr
83229   cycles - ( 29) 1: x
50892   cycles - (119) 2: 'c'
50861   cycles - (107) 3: 'cccc'

110783  cycles - (  0) 0: crt_strchr
22027   cycles - ( 29) 1: x
19821   cycles - (119) 2: 'c'
16775   cycles - (107) 3: 'cccc'

--3--

100491  cycles - (  0) 0: crt_strchr
83365   cycles - ( 29) 1: x
50163   cycles - (119) 2: 'c'
50950   cycles - (107) 3: 'cccc'

84368   cycles - (  0) 0: crt_strchr
82911   cycles - ( 29) 1: x
50919   cycles - (119) 2: 'c'
52683   cycles - (107) 3: 'cccc'

110577  cycles - (  0) 0: crt_strchr
22136   cycles - ( 29) 1: x
19384   cycles - (119) 2: 'c'
16812   cycles - (107) 3: 'cccc'

If you organize the result you get this:

88029   cycles - (  0) 0: crt_strchr
93149   cycles - (  0) 0: crt_strchr
100491  cycles - (  0) 0: crt_strchr

83409   cycles - ( 29) 1: x
83171   cycles - ( 29) 1: x
83365   cycles - ( 29) 1: x

50004   cycles - (119) 2: 'c'
49828   cycles - (119) 2: 'c'
50163   cycles - (119) 2: 'c'

49914   cycles - (107) 3: 'cccc'
49987   cycles - (107) 3: 'cccc'
50950   cycles - (107) 3: 'cccc'

--unaligned--

84574   cycles - (  0) 0: crt_strchr
84252   cycles - (  0) 0: crt_strchr
84368   cycles - (  0) 0: crt_strchr

83375   cycles - ( 29) 1: x
83229   cycles - ( 29) 1: x
82911   cycles - ( 29) 1: x

50586   cycles - (119) 2: 'c'
50892   cycles - (119) 2: 'c'
50919   cycles - (119) 2: 'c'

50408   cycles - (107) 3: 'cccc'
50861   cycles - (107) 3: 'cccc'
52683   cycles - (107) 3: 'cccc'

--short test--

109988  cycles - (  0) 0: crt_strchr
110783  cycles - (  0) 0: crt_strchr
110577  cycles - (  0) 0: crt_strchr

22161   cycles - ( 29) 1: x
22027   cycles - ( 29) 1: x
22136   cycles - ( 29) 1: x

19388   cycles - (119) 2: 'c'
19821   cycles - (119) 2: 'c'
19384   cycles - (119) 2: 'c'

16840   cycles - (107) 3: 'cccc'
16775   cycles - (107) 3: 'cccc'
16812   cycles - (107) 3: 'cccc'
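
To put a number on how stable the grouped results are, one option is the spread between the fastest and slowest pass. The sketch below is an illustration only; the function name is made up and it is not part of the test program.

```c
#include <stdint.h>
#include <stddef.h>

/* Spread of repeated timings in parts per thousand:
   (max - min) * 1000 / min. Returns 0 for an empty set. */
uint32_t spread_permille(const uint32_t *cycles, size_t n)
{
    if (n == 0) return 0;
    uint32_t lo = cycles[0], hi = cycles[0];
    for (size_t i = 1; i < n; i++) {
        if (cycles[i] < lo) lo = cycles[i];
        if (cycles[i] > hi) hi = cycles[i];
    }
    if (lo == 0) return 0;
    return (uint32_t)((uint64_t)(hi - lo) * 1000u / lo);
}
```

Applied to the aligned runs listed above, crt_strchr (88029, 93149, 100491) spreads by about 141 per mille, while routine 1 (83409, 83171, 83365) stays at about 2 per mille.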



Title: Re: Code location sensitivity of timings
Post by: nidud on July 23, 2014, 01:21:19 AM
some more string probing..

I rewrote most of the string functions and need some timings

here is the time for strcpy:

AMD Athlon(tm) II X2 245 Processor (SSE3)
-------------------------------------------
-- aligned strings --
605184  cycles -  10 (  0) 0: crt_strcpy
1490896 cycles -  10 ( 37) 1: movsb
587922  cycles -  10 (118) 2: aligned
554282  cycles -  10 ( 85) 3: unaligned
171751  cycles -  10 (188) 4: SSE aligned
172081  cycles -  10 (141) 5: SSE unaligned
-- unaligned strings --
607417  cycles -  10 (  0) 0: crt_strcpy
1489108 cycles -  10 ( 37) 1: movsb
583246  cycles -  10 (118) 2: aligned
631376  cycles -  10 ( 85) 3: unaligned
171912  cycles -  10 (188) 4: SSE aligned
233758  cycles -  10 (141) 5: SSE unaligned
-- short strings --
124021  cycles - 800 (  0) 0: crt_strcpy
337044  cycles - 800 ( 37) 1: movsb
116820  cycles - 800 (118) 2: aligned
116822  cycles - 800 ( 85) 3: unaligned
45622   cycles - 800 (188) 4: SSE aligned
46020   cycles - 800 (141) 5: SSE unaligned
Title: Re: Code location sensitivity of timings
Post by: nidud on July 23, 2014, 01:28:20 AM
this is the strlen version
Code: [Select]
strlen  proc uses edx string
mov eax,string
align 4
@@: mov edx,[eax]
add eax,4
lea ecx,[edx-01010101h]
not edx
and ecx,edx
and ecx,80808080h
jz @B
bsf ecx,ecx
shr ecx,3
sub eax,string
lea eax,[eax+ecx-4]
mov ecx,eax
ret
strlen  endp

and the SSE2 version:
Code: [Select]
strlen  proc string
mov ecx,string
xorps xmm1,xmm1 ; SSE (xorps predates SSE2)
align 4
@@: movdqu  xmm0,[ecx] ; SSE2
pcmpeqb xmm0,xmm1 ; SSE2
pmovmskb eax,xmm0 ; SSE2
add ecx,16 ;
test eax,eax
jz @B
bsf eax,eax
sub ecx,string
lea eax,[eax+ecx-16]
mov ecx,eax
ret
strlen  endp
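
For readers who prefer C, the dword version of strlen works roughly like the sketch below. It is an illustration with a made-up function name, and it shares one property with the assembly: it may read up to 3 bytes past the terminator, so the buffer is assumed to be padded.

```c
#include <stdint.h>
#include <stddef.h>

/* Word-at-a-time strlen sketch. Like the assembly, it may read up to
   3 bytes past the terminator, so the buffer must be padded. */
size_t strlen_dword(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    for (;;) {
        /* Assemble the dword in little-endian order, as x86 loads it. */
        uint32_t x = (uint32_t)p[0] | (uint32_t)p[1] << 8 |
                     (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
        uint32_t m = (x - 0x01010101u) & ~x & 0x80808080u;
        if (m) {
            /* Lowest marker bit = first zero byte (the bsf / shr 3 step). */
            size_t i = 0;
            while (!(m & 0x80u)) { m >>= 8; i++; }
            return (size_t)(p - (const unsigned char *)s) + i;
        }
        p += 4;
    }
}
```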

here is the time for strlen:
AMD Athlon(tm) II X2 245 Processor (SSE3)
-----------------------------------------
127029  cycles - 1000 (  0) 0: crt_strchr
102667  cycles - 1000 ( 51) 1: unaligned
123454  cycles - 1000 ( 91) 2: aligned
51428   cycles - 1000 ( 47) 3: SSE

128706  cycles - 1000 (  0) 0: crt_strchr
102644  cycles - 1000 ( 51) 1: unaligned
123452  cycles - 1000 ( 91) 2: aligned
50759   cycles - 1000 ( 47) 3: SSE

127429  cycles - 1000 (  0) 0: crt_strchr
102647  cycles - 1000 ( 51) 1: unaligned
124072  cycles - 1000 ( 91) 2: aligned
50757   cycles - 1000 ( 47) 3: SSE
Title: Re: Code location sensitivity of timings
Post by: nidud on July 23, 2014, 01:55:38 AM
some fixup on the strchr function:
Code: [Select]
strchr proc uses edx ebx string, char
mov eax,char
mov ah,al
mov ebx,eax
shl ebx,16
mov bx,ax
not ebx
mov eax,string
align 4
lup: mov edx,[eax]
lea ecx,[edx-01010101H]
not edx
and ecx,edx
and ecx,80808080h
jnz tail
add eax,4
sub edx,ebx
lea ecx,[edx-01010101H]
not edx
and ecx,edx
and ecx,80808080h
jz lup
sub eax,4
align 2
tail: mov ecx,[eax]
test cl,cl
jz null
not ebx
cmp bl,cl
je toend
test ch,ch
jz null
inc eax
cmp bl,ch
je toend
shr ecx,16
test cl,cl
jz null
inc eax
cmp cl,bl
je toend
align 2
null: xor eax,eax
align 2
toend:  ret
strchr endp


AMD Athlon(tm) II X2 245 Processor (SSE3)
-----------------------------------------
-- aligned strings --
330270  cycles - (  0) 0: crt_strchr
364594  cycles - ( 29) 1: x
213070  cycles - (105) 2: 'c'
212955  cycles - (112) 9: no bsf
-- unaligned strings --
332545  cycles - (  0) 0: crt_strchr
362595  cycles - ( 29) 1: x
237024  cycles - (105) 2: 'c'
237257  cycles - (112) 9: no bsf
-- short strings --
43523   cycles - (  0) 0: crt_strchr
31017   cycles - ( 29) 1: x
16518   cycles - (105) 2: 'c'
17020   cycles - (112) 9: no bsf
Title: Re: Code location sensitivity of timings
Post by: nidud on July 23, 2014, 02:08:28 AM
the strstr function also failed, so here is an updated version

AMD Athlon(tm) II X2 245 Processor (SSE3)
-------------------------------------------
-- aligned strings --
533061  cycles -  10 (  0) 0: crt_strstr
918530  cycles -  10 (  0) 1: InString(1,dst,src) - 1 + dst
668698  cycles -  10 ( 46) 2: strstr
565355  cycles -  10 ( 57) 3: x
308053  cycles -  10 (150) 4: x
205373  cycles -  10 (176) 5: x
-- unaligned strings --
533096  cycles -  10 (  0) 0: crt_strstr
938277  cycles -  10 (  0) 1: InString(1,dst,src) - 1 + dst
666364  cycles -  10 ( 46) 2: strstr
565143  cycles -  10 ( 57) 3: x
321082  cycles -  10 (150) 4: x
215754  cycles -  10 (176) 5: x
-- short strings --
86064   cycles - 500 (  0) 0: crt_strstr
192023  cycles - 500 (  0) 1: InString(1,dst,src) - 1 + dst
113705  cycles - 500 ( 46) 2: strstr
89017   cycles - 500 ( 57) 3: x
73041   cycles - 500 (150) 4: x
67541   cycles - 500 (176) 5: x
Title: Re: Code location sensitivity of timings
Post by: nidud on July 23, 2014, 03:41:08 AM
So, how to implement the SSE functions into the library?

I'm now using functions like SetFilePointerEx (http://msdn.microsoft.com/en-us/library/windows/desktop/aa365542(v=vs.85).aspx):
Quote
Requirements
Minimum supported client
Windows XP [desktop apps | Windows Store apps]
Minimum supported server
Windows Server 2003 [desktop apps | Windows Store apps]

The SSE level used is SSE2 so how common is this combination?

The implementation is currently to include both versions. I use the GetSSELevel (http://masm32.com/board/index.php?topic=3373.msg35658#msg35658) function to set a sselevel variable on startup. Each module will then auto install on demand:
Code: [Select]
.data
p_strlen dd strlen
.code
Install:
.if sselevel & SSE_SSE2
    mov p_strlen,SSE_strlen
.endif
ret

pragma_init Install,41

end

and the header file:
Code: [Select]
ifdef __SSE__
pr1 typedef proto :dword
externdef p_strlen:ptr pr1
strlen equ <p_strlen>
else
strlen proto :ptr byte
endif

Both of the functions will then have to be included in the binary in order to compensate for the missing SSE functions, so what is the norm?
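
Seen from the C side, the install-on-demand scheme is a function pointer that starts at the fallback and is switched once at startup. The sketch below only illustrates that idea: p_strlen is the name from the post, while SSE_SSE2's value, strlen_base, strlen_sse2 and install_strlen are assumptions, and the "SSE2" body is a plain stand-in.

```c
#include <stddef.h>

#define SSE_SSE2 0x04u   /* assumed bit value; the real constant is not shown */

typedef size_t (*strlen_fn)(const char *);

/* Fallback routine. */
size_t strlen_base(const char *s)
{
    size_t n = 0;
    while (s[n]) n++;
    return n;
}

/* Stand-in for the SSE2 routine: same contract, different body. */
size_t strlen_sse2(const char *s)
{
    const char *p = s;
    while (*p) p++;
    return (size_t)(p - s);
}

/* Starts at the fallback, like "p_strlen dd strlen". */
strlen_fn p_strlen = strlen_base;

/* Counterpart of the Install label: switch only if SSE2 is present. */
void install_strlen(unsigned sselevel)
{
    if (sselevel & SSE_SSE2)
        p_strlen = strlen_sse2;
}
```

Callers always go through p_strlen, so the cost is one indirect call per use and both bodies stay in the binary.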
Title: Re: Code location sensitivity of timings
Post by: jj2007 on July 23, 2014, 05:36:38 AM
Minimum supported client
Windows XP

The SSE level used is SSE2 so how common is this combination?

It may hurt the feelings of some fans of old hard- and software, but writing code for >=(SSE2 & Win XP) should be OK for 99% of the users.

There is a poll on SSE support here (http://www.insanelymac.com/forum/topic/35109-sse2-vs-sse3-the-poll/): "I'm still waiting for SSE support :) (5 votes [2.45%])"

That was 2006, 8 years ago ;)
Title: Re: Code location sensitivity of timings
Post by: nidud on July 23, 2014, 06:43:29 AM
just implement it then I guess

is it normal to test the SSE level and exit if not present?
Title: Re: Code location sensitivity of timings
Post by: dedndave on July 23, 2014, 07:48:38 AM
...or provide fallback routines
you can run a little startup init routine - detect SSE support level - and fill in addresses of PROC's
i am working on something along that line at the moment

these define TYPE's for up to 6 dword parms - you can extend it easily
Code: [Select]
_FUNC00  TYPEDEF PROTO
_FUNC04  TYPEDEF PROTO :DWORD
_FUNC08  TYPEDEF PROTO :DWORD,:DWORD
_FUNC12  TYPEDEF PROTO :DWORD,:DWORD,:DWORD
_FUNC16  TYPEDEF PROTO :DWORD,:DWORD,:DWORD,:DWORD
_FUNC20  TYPEDEF PROTO :DWORD,:DWORD,:DWORD,:DWORD,:DWORD
_FUNC24  TYPEDEF PROTO :DWORD,:DWORD,:DWORD,:DWORD,:DWORD,:DWORD

_PFUNC00 TYPEDEF Ptr _FUNC00
_PFUNC04 TYPEDEF Ptr _FUNC04
_PFUNC08 TYPEDEF Ptr _FUNC08
_PFUNC12 TYPEDEF Ptr _FUNC12
_PFUNC16 TYPEDEF Ptr _FUNC16
_PFUNC20 TYPEDEF Ptr _FUNC20
_PFUNC24 TYPEDEF Ptr _FUNC24

then, i am using a structure with function pointers in it
Code: [Select]
_FUNC STRUCT
  lpfnFunc1  _PFUNC04 ?    ;this function has 1 dword arg
  lpfnFunc2  _PFUNC12 ?    ;this function has 3 dword args
_FUNC ENDS

and, in the .DATA? section...
Code: [Select]
_Func _FUNC <>
so, you set _Func.lpfnFunc1 and _Func.lpfnFunc2 to point at appropriate routines for the supported SSE level
then.....
Code: [Select]
    INVOKE  _Func.lpfnFunc1,arg1
    INVOKE  _Func.lpfnFunc2,arg1,arg2,arg3

;or

    push    edi
    mov     edi,offset _Func
    INVOKE  [edi]._FUNC.lpfnFunc1,arg1
    INVOKE  [edi]._FUNC.lpfnFunc2,arg1,arg2,arg3
    pop     edi

another way to go would be to put all the routines for each support level into a DLL
then, at init, load the DLL that is appropriate for the machine
the routines can then all have the same names
Title: Re: Code location sensitivity of timings
Post by: dedndave on July 23, 2014, 07:53:25 AM
most people probably have at least SSE3
however, we can look at the forum members, alone, and find a few machines
some that probably support only MMX or SSE(1)

i bought this machine in 2005 - it supports SSE3, which was a new thing at the time
so - it's almost 10 years old
Title: Re: Code location sensitivity of timings
Post by: Gunther on July 23, 2014, 09:03:40 AM
i bought this machine in 2005 - it supports SSE3, which was a new thing at the time
so - it's almost 10 years old

SSE3 was introduced in February 2004 with the Prescott revision of the Pentium 4 processor.

Gunther
Title: Re: Code location sensitivity of timings
Post by: nidud on July 23, 2014, 09:26:21 AM
Quote
you can run a little startup init routine - detect SSE support level - and fill in addresses of PROC's
I use modular libraries with startup modules:
Code: [Select]
.486
.model  flat
option  casemap:none

public  _cstart_

ifdef __WCC__
C0_main equ <main_>
extrn main_:abs
else
C0_main equ <main>
main proto c
endif

exit proto stdcall :dword
Initialize proto stdcall :dword, :dword

_INIT segment dword flat public 'INIT'
_INIT ENDS
_IEND segment dword flat public 'INIT'
_IEND ENDS

.code

_cstart_:
mov edx,offset _INIT
mov eax,offset _IEND
invoke  Initialize,edx,eax
call C0_main
invoke  exit,eax
end _cstart_

so the startup routine executes the procs listed in the _INIT segment
Code: [Select]
pragma_init macro pp, priority
_INIT segment dword flat public 'INIT'
dd pp
dd priority
_INIT ends
endm

The SSE init code is then
Code: [Select]
;
; Stolen from Dave at the end of the road (dedndave)
;
; http://masm32.com/board/index.php?topic=3373.msg35658#msg35658
;
include math.inc
include io.inc
include stdlib.inc

public  sselevel

.data
sselevel dd 0

error db "CPU error: Need SSE2 level",13,10
size_m equ $ - offset error

.code
.586

Install:

    pushfd
    pop     eax
    mov     ecx,200000h
    mov     edx,eax
    xor     eax,ecx
    push    eax
    popfd
    pushfd
    pop     eax
    xor     eax,edx
    and     eax,ecx
    .if !ZERO?
push ebx
xor eax,eax
cpuid
.if eax
    .if ah == 5
xor eax,eax
    .else
mov eax,1
cpuid
xor eax,eax
bt ecx,20       ;SSE4.2
rcl eax,1       ;into bit 6
bt ecx,19       ;SSE4.1
rcl eax,1       ;into bit 5
bt ecx,9       ;SSSE3
rcl eax,1       ;into bit 4
bt ecx,0       ;SSE3
rcl eax,1       ;into bit 3
bt edx,26       ;SSE2
rcl eax,1       ;into bit 2
bt edx,25       ;SSE
rcl eax,1       ;into bit 1
bt edx,23       ;MMX
rcl eax,1       ;into bit 0
mov sselevel,eax
    .endif
.endif
pop ebx
    .endif
    .if !(eax & SSE_SSE2)
invoke GetStdHandle,STD_OUTPUT_HANDLE
push eax
mov edx,esp
invoke WriteFile,eax,addr error,size_m,edx,0
pop eax
invoke ExitProcess,0
    .endif
    ret

pragma_init Install,4

end
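
The bit packing done by the bt/rcl chain can be written out in C as below, assuming the ECX and EDX values come from CPUID leaf 1. The flag names are hypothetical; the bit positions follow the comments in the assembly (MMX in bit 0 up to SSE4.2 in bit 6).

```c
#include <stdint.h>

/* Hypothetical flag names for the packed level value. */
enum {
    LVL_MMX   = 1 << 0, LVL_SSE   = 1 << 1, LVL_SSE2  = 1 << 2,
    LVL_SSE3  = 1 << 3, LVL_SSSE3 = 1 << 4, LVL_SSE41 = 1 << 5,
    LVL_SSE42 = 1 << 6
};

/* ecx and edx are the values CPUID leaf 1 returned in ECX and EDX. */
uint32_t pack_sse_level(uint32_t ecx, uint32_t edx)
{
    uint32_t level = 0;
    if (edx & (1u << 23)) level |= LVL_MMX;
    if (edx & (1u << 25)) level |= LVL_SSE;
    if (edx & (1u << 26)) level |= LVL_SSE2;
    if (ecx & (1u << 0))  level |= LVL_SSE3;
    if (ecx & (1u << 9))  level |= LVL_SSSE3;
    if (ecx & (1u << 19)) level |= LVL_SSE41;
    if (ecx & (1u << 20)) level |= LVL_SSE42;
    return level;
}
```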

and the modules will then install as needed
Code: [Select]
.code

strlen  proc uses edx string:ptr byte
mov eax,string
align 4
@@: mov edx,[eax]
add eax,4
lea ecx,[edx-01010101h]
not edx
and ecx,edx
and ecx,80808080h
jz @B
bsf ecx,ecx
shr ecx,3
sub eax,string
lea eax,[eax+ecx-4]
mov ecx,eax
ret
strlen  endp

ifdef __SSE__

public  p_strlen

.data
p_strlen dd strlen

.code
.686
.xmm


SSE_strlen proc string:ptr byte
mov ecx,string
xorps xmm1,xmm1 ; SSE (xorps predates SSE2)
align 4
@@: movdqu  xmm0,[ecx] ; SSE2
pcmpeqb xmm0,xmm1 ; SSE2
pmovmskb eax,xmm0 ; SSE2
add ecx,16 ;
test eax,eax
jz @B
bsf eax,eax
sub ecx,string
lea eax,[eax+ecx-16]
mov ecx,eax
ret
SSE_strlen endp

Install:
.if sselevel & SSE_SSE2
    mov p_strlen,SSE_strlen
.endif
ret

pragma_init Install,41

endif

END

The entry with priority 4 will then be called before the one with priority 41.

In this way only the functions actually used are linked in and called by the Initialize() function. The exit() function also uses Initialize(), with an _EXIT segment (.map file):

_INIT             INIT         AUTO        00428c04        00000050
_IEND             INIT         AUTO        00428c54        00000000
_EXIT             EXIT         AUTO        00428c54        00000010
_EEND             EXIT         AUTO        00428c64        00000000
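
The source of Initialize() is not shown, so the following is only a guess at the idea: walk a table of (proc, priority) pairs and call them in ascending priority order, so that 4 runs before 41. All names are hypothetical, and the two example entries exist only to record their call order.

```c
#include <stddef.h>

typedef void (*init_fn)(void);

struct init_entry {          /* mirrors one "dd pp / dd priority" pair */
    init_fn  fn;
    unsigned priority;
};

/* Call every entry in ascending priority order. Repeated minimum search
   leaves the table itself unmodified. Handles at most 64 entries. */
void run_init_table(const struct init_entry *tab, size_t n)
{
    unsigned char done[64] = {0};
    if (n > 64) n = 64;
    for (size_t k = 0; k < n; k++) {
        size_t best = n;
        for (size_t i = 0; i < n; i++)
            if (!done[i] && (best == n || tab[i].priority < tab[best].priority))
                best = i;
        done[best] = 1;
        tab[best].fn();
    }
}

/* Example entries: each appends a letter to a log when called. */
char   init_log[8];
size_t init_log_len;
void init_sse(void)    { init_log[init_log_len++] = 'A'; }  /* priority 4  */
void init_strlen(void) { init_log[init_log_len++] = 'B'; }  /* priority 41 */
```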

Quote
then, i am using a structure with function pointers in it

That will be a bit complicated. I have C files calling strlen and other functions hundreds of times, so the proc name has to stay the same. Then you only need to change the header and recompile.

//int strlen(char *);
int (*p_strlen)(char *);
#define strlen p_strlen


It is also possible to "code it" directly

//int strlen(char *);
int (*strlen)(char *);
...
;strlen   proto :dword
externdef   strlen:ptr pr1


Code: [Select]
.data
strlen  dd strlen1
public  strlen
.code
strlen1:
mov eax,[esp+4]
...
ret 4
strlen2:
mov ecx,[esp+4]
...
ret 4
Install:
.if sselevel & SSE_SSE2
    mov strlen,strlen2
.endif
ret

pragma_init Install,41

Quote
SSE3 was introduced in February 2004 with the Prescott revision of the Pentium 4 processor.

The simplest way will be to just implement it with an ifdef in the function body. That way I may recompile the old source if needed. Think I will keep the SSE level test though.
Title: Re: Code location sensitivity of timings
Post by: nidud on July 24, 2014, 03:45:40 AM
here is the memcpy test
larger blocks are faster but the tail bytes are a problem
using movsb on the tail is painfully slow

Code: [Select]
memcpy  proc uses esi edi dst, src, count
mov edi,dst
mov esi,src
mov ecx,count
and ecx,-32
jz tail
test edi,11B ; aligned ?
jnz align4
align 4
lup: sub ecx,32
movdqu  xmm0,[esi+ecx]
movdqu  xmm1,[esi+ecx+16]
movdqu  [edi+ecx],xmm0
movdqu  [edi+ecx+16],xmm1
jnz lup
mov ecx,count
movdqu  xmm0,[esi+ecx-32]
movdqu  xmm1,[esi+ecx-16]
movdqu  [edi+ecx-32],xmm0
movdqu  [edi+ecx-16],xmm1
align 2
toend:  mov eax,dst
ret
align 4
align4:
mov eax,edi ; align 16
neg eax
and eax,1111B
movdqu  xmm0,[esi] ; copy 16 byte
movdqu  [edi],xmm0
add edi,eax ; set new offset
add esi,eax
jmp lup
align 4
tail: xor ecx,count
jz toend
test ecx,-2
jz @1
test ecx,-4
jz @2
test ecx,-8
jz @4
test ecx,-16
jz @8
movdqu  xmm0,[esi] ; 31 byte
movdqu  [edi],xmm0 ; |16...|
movdqu  xmm0,[esi+ecx-16] ; |...16|
movdqu  [edi+ecx-16],xmm0
jmp toend
align 4
@8: movq xmm0,[esi] ; 15 byte
movq [edi],xmm0 ; |8...|
movq xmm0,[esi+ecx-8] ; |...8|
movq [edi+ecx-8],xmm0
jmp toend
align 4
@4: mov eax,[esi]
mov [edi],eax
mov eax,[esi+ecx-4]
mov [edi+ecx-4],eax
jmp toend
align 4
@2: mov ax,[esi] ; 2..3 byte
mov [edi],ax ; |2..|
mov ax,[esi+ecx-2] ; |..2|
mov [edi+ecx-2],ax
jmp toend
align 4
@1: mov al,[esi]
mov [edi],al
jmp toend
memcpy  endp
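
The tail handling above relies on two moves that may overlap: one anchored at the start of the block and one anchored at the end. A minimal C sketch of the same idea, with a made-up function name and assuming src and dst do not overlap each other:

```c
#include <string.h>
#include <stddef.h>

/* Copy n < 32 bytes with at most two fixed-size moves per size class.
   The head and tail moves may overlap, which is harmless because they
   write the same bytes. src and dst must not overlap each other. */
void copy_small(unsigned char *dst, const unsigned char *src, size_t n)
{
    if (n >= 16) {
        memcpy(dst, src, 16);
        memcpy(dst + n - 16, src + n - 16, 16);
    } else if (n >= 8) {
        memcpy(dst, src, 8);
        memcpy(dst + n - 8, src + n - 8, 8);
    } else if (n >= 4) {
        memcpy(dst, src, 4);
        memcpy(dst + n - 4, src + n - 4, 4);
    } else if (n >= 2) {
        memcpy(dst, src, 2);
        memcpy(dst + n - 2, src + n - 2, 2);
    } else if (n == 1) {
        dst[0] = src[0];
    }
}
```

Each size class needs no loop and no byte-wise remainder, which is the point of the approach for short blocks.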

here is the time
AMD Athlon(tm) II X2 245 Processor (SSE3)
------------------------------------------
-- aligned strings --
225399  cycles -  10 (  0) 0: crt_memcpy
222840  cycles -  10 (145) 1: 16
220338  cycles -  10 (209) 3: 32
222193  cycles -  10 (309) 5: 64
222769  cycles -  10 (186) 2: 16 aligned
219131  cycles -  10 (249) 4: 32 aligned
223450  cycles -  10 (349) 6: 64 aligned
223475  cycles -  10 (145) 7: 64 movsb
-- unaligned strings --
233090  cycles -  10 (  0) 0: crt_memcpy
311348  cycles -  10 (145) 1: 16
302111  cycles -  10 (209) 3: 32
319732  cycles -  10 (309) 5: 64
233073  cycles -  10 (186) 2: 16 aligned
215451  cycles -  10 (249) 4: 32 aligned
224974  cycles -  10 (349) 6: 64 aligned
319863  cycles -  10 (145) 7: 64 movsb
-- short strings 15 --
209267  cycles - 8000 (  0) 0: crt_memcpy
171989  cycles - 8000 (145) 1: 16
146799  cycles - 8000 (209) 3: 32
151210  cycles - 8000 (309) 5: 64
147463  cycles - 8000 (186) 2: 16 aligned
148036  cycles - 8000 (249) 4: 32 aligned
142108  cycles - 8000 (349) 6: 64 aligned
420999  cycles - 8000 (145) 7: 64 movsb

Title: Re: Code location sensitivity of timings
Post by: jj2007 on July 24, 2014, 04:32:46 AM
Quote
SSE3 was introduced in February 2004 with the Prescott revision of the Pentium 4 processor.

SSE2 was introduced in November 2000 with the P4 Willamette. In general, it's absolutely sufficient (try your luck, make Instr_() faster with SSE7.8... (http://masm32.com/board/index.php?topic=3408.msg36297#msg36297)); in particular, pcmpeqb and pmovmskb are important improvements.
Title: Re: Code location sensitivity of timings
Post by: nidud on July 25, 2014, 04:29:11 AM
with regard to memcpy there seems little gain using SSE

Code: [Select]
memcpy  proc dst, src, count
push esi
push edi
mov edi,[esp+12]
mov esi,[esp+16]
mov ecx,[esp+20]
test ecx,-16
jz @F
mov eax,[esi]
mov [edi],eax
mov eax,edi
neg eax
and eax,11B
add edi,eax
add esi,eax
sub ecx,eax
@@: rep movsb
mov eax,[esp+12]
pop edi
pop esi
ret 12
memcpy  endp
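
The neg/and pair computes how many bytes are missing up to the next aligned address. In C, with a hypothetical helper name and assuming the alignment is a power of two:

```c
#include <stdint.h>

/* Bytes needed to advance addr to the next multiple of align
   (a power of two): the C form of "neg eax / and eax,align-1". */
uint32_t bytes_to_align(uint32_t addr, uint32_t align)
{
    return (0u - addr) & (align - 1u);
}
```

The first unaligned dword is copied unconditionally, and both pointers then skip ahead by this amount, so a few bytes are simply written twice.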


-- aligned strings --
889418  cycles -  10 (  0) 0: crt_memcpy
891309  cycles -  10 ( 48) 1: movsb
854402  cycles -  10 (182) 2: SSE
-- unaligned strings --
923670  cycles -  10 (  0) 0: crt_memcpy
924396  cycles -  10 ( 48) 1: movsb
881774  cycles -  10 (182) 2: SSE
-- short strings --
805432  cycles - 8000 (  0) 0: crt_memcpy
1306044 cycles - 8000 ( 48) 1: movsb
520039  cycles - 8000 (182) 2: SSE


using MOVSD helps on the short strings
Code: [Select]
add esi,eax
sub ecx,eax ;--
mov eax,ecx
shr ecx,2
rep movsd
and eax,11B
mov ecx,eax ;--
@@: rep movsb

-- short strings --
805432  cycles - 8000 (  0) 0: crt_memcpy
819735  cycles - 8000 ( 62) 1: movsb
520039  cycles - 8000 (182) 2: SSE


conclusion:
- on newer CPUs MOVSB is faster than moving blocks
- on older CPUs MOVSB gets faster with size
- SSE may be faster depending on CPU


String compare

Code: [Select]
strcmp  proc uses edi esi dst, src
mov edi,dst
mov esi,src
lea edi,[edi-4]
lea esi,[esi-4]
align 16 ; align main loop
@@: lea edi,[edi+4]
lea esi,[esi+4]
mov eax,[esi]
lea ecx,[eax-01010101H]
not eax
and ecx,eax
and ecx,80808080h
jnz @F
not eax
cmp eax,[edi]
je @B
align 4
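@@: mov al,[edi]
mov ah,[esi]
inc edi
inc esi
cmp al,ah ; compare first, so a string that ends early counts as a difference
jne @F
test al,al
jnz @B
xor eax,eax ; equal up to and including the terminator
ret
@@: sbb al,al ; carry from the cmp: -1 if below, 0 if above
sbb al,-1 ; -1 stays -1, 0 becomes 1
movsx eax,al
ret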
strcmp endp

Code: [Select]
strcmp proc uses esi edi s1, s2
mov edi,s1
mov esi,s2
xorps xmm2,xmm2 ; clear xmm2 for zero test
align 16
@@: movdqu  xmm0,[esi]
movdqu  xmm1,[edi]
pcmpeqb xmm1,xmm0 ; compare
pmovmskb eax,xmm1
pcmpeqb xmm0,xmm2 ; test for zero
pmovmskb ecx,xmm0
lea edi,[edi+16]
lea esi,[esi+16]
not ax
or ecx,eax
jz @B
bsf ecx,ecx
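mov al,[edi+ecx-16]
cmp al,[esi+ecx-16] ; compare before the zero test, so s1 ending early is not reported as equal
je @F ; equal bytes here means both are the terminator, so al = 0
sbb al,al
sbb al,-1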
@@: movsx eax,al
ret
strcmp endp
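
Before timing variants like these, it helps to check them against a plain reference, especially for the case where one string is a prefix of the other. The sketch below is such a yardstick with a made-up name; it returns -1, 0 or 1 and compares bytes as unsigned values, which is an assumption about the intended result convention.

```c
/* Reference comparison returning -1, 0 or 1, with bytes compared as
   unsigned values. Meant only as a yardstick for checking faster
   versions, not as a fast routine. */
int ref_strcmp(const char *a, const char *b)
{
    const unsigned char *p = (const unsigned char *)a;
    const unsigned char *q = (const unsigned char *)b;
    while (*p && *p == *q) { p++; q++; }
    return (*p > *q) - (*p < *q);
}
```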


AMD Athlon(tm) II X2 245 Processor (SSE3)
------------------------------------------
-- large strings: 4096 byte --
822761  cycles - 100 (  0) 0: crt_strcmp
1646116 cycles - 100 ( 40) 1: strcmp
415725  cycles - 100 ( 91) 2: x
107534  cycles - 100 ( 81) 3: SSE
-- small strings: 64 byte --
163135  cycles - 999 (  0) 0: crt_strcmp
306028  cycles - 999 ( 40) 1: strcmp
109462  cycles - 999 ( 91) 2: x
48633   cycles - 999 ( 81) 3: SSE



Code: [Select]
stricmp  proc uses edi esi dst, src
mov edi,dst
mov esi,src
lea edi,[edi-4]
lea esi,[esi-4]
align 16 ; align main loop
@@: lea edi,[edi+4]
lea esi,[esi+4]
mov eax,[esi]
lea ecx,[eax-01010101H]
not eax
and ecx,eax
and ecx,80808080h
jnz @F
mov eax,[esi]
cmp eax,[edi]
je @B
mov ecx,[edi]
or eax,20202020h
or ecx,20202020h
cmp eax,ecx
je @B
align 4
@@: mov al,[edi]
mov ah,[esi]
inc edi
inc esi
test al,al
jz @F
cmp al,ah
je @B
or ax,2020h
cmp al,ah
je @B
sbb al,al
sbb al,-1
@@: movsx eax,al
ret
stricmp endp

Code: [Select]
stricmp proc uses esi edi s1, s2
mov edi,s1
mov esi,s2
xorps xmm2,xmm2 ; clear xmm2 for zero test
mov eax,20202020h
movd xmm3,eax
pshufd  xmm3,xmm3,0 ; populate 20h for case
align 16
@@: movdqu  xmm0,[esi]
movdqu  xmm1,[edi]
movdqa  xmm4,xmm0
pcmpeqb xmm4,xmm2 ; test for zero
pmovmskb ecx,xmm4
orps xmm0,xmm3 ; A..Z to a..z
orps xmm1,xmm3
pcmpeqb xmm1,xmm0 ; compare
pmovmskb eax,xmm1
lea edi,[edi+16]
lea esi,[esi+16]
not ax
or ecx,eax
jz @B
bsf ecx,ecx
mov al,[edi+ecx-16]
test al,al
jz @F
cmp al,[esi+ecx-16]
sbb al,al
sbb al,-1
@@: movsx eax,al
ret
stricmp endp


-- large strings: 4096 byte --
1647384 cycles - 100 (  0) 0: crt__stricmp
1646490 cycles - 100 ( 72) 1: stricmp
414651  cycles - 100 (119) 2: x
158892  cycles - 100 (107) 3: SSE
-- small strings: 64 byte --
339452  cycles - 999 (  0) 0: crt__stricmp
298264  cycles - 999 ( 72) 1: stricmp
105344  cycles - 999 (119) 2: x
60089   cycles - 999 (107) 3: SSE
Title: Re: Code location sensitivity of timings
Post by: jj2007 on July 25, 2014, 06:33:53 AM
with regard to memcpy there seems little gain using SSE
...
conclusion:
- on newer CPUs MOVSB is faster than moving blocks
- on older CPUs MOVSB gets faster with size
- SSE may be faster depending on CPU

Or, in short: Everything is more complicated than you think. (http://www.masmforum.com/board/index.php?topic=11454.msg87622#msg87622)
Title: Re: Code location sensitivity of timings
Post by: nidud on July 25, 2014, 07:19:55 AM
 :biggrin:

Yes, it’s possible to complicate things I guess, and the link you provide includes a lot of complicated issues but few constructive conclusions to the problem at hand.

My conclusion is that moving memory is a hardware issue which improves in newer hardware. If you look at the latest version of MEMCPY.ASM provided by Intel you basically see (how and) what they are working on in the hardware:

Code: [Select]
; See if Enhanced Fast Strings is supported.
; ENFSTRG supported?
bt __favor, __FAVOR_ENFSTRG
jnc CopyUpSSE2Check ; no jump
;
; use Enhanced Fast Strings
rep movsb
jmp TrailUp0 ; Done
CopyUpSSE2Check:
;
; Next, see if we can use a "fast" copy SSE2 routine
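
How the __favor flags are filled is not shown in the excerpt. The sketch below assumes the Enhanced Fast Strings flag corresponds to the ERMS bit (CPUID leaf 7, subleaf 0, EBX bit 9) and picks a copy path in the same order as the excerpt. The enum and function names are hypothetical.

```c
#include <stdint.h>

enum copy_path { COPY_REP_MOVSB, COPY_SSE2, COPY_FALLBACK };

/* leaf7_ebx: EBX from CPUID leaf 7, subleaf 0 (bit 9 = ERMS).
   leaf1_edx: EDX from CPUID leaf 1 (bit 26 = SSE2). */
enum copy_path choose_copy_path(uint32_t leaf7_ebx, uint32_t leaf1_edx)
{
    if (leaf7_ebx & (1u << 9))  return COPY_REP_MOVSB;
    if (leaf1_edx & (1u << 26)) return COPY_SSE2;
    return COPY_FALLBACK;
}
```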
Title: Re: Code location sensitivity of timings
Post by: jj2007 on July 25, 2014, 06:01:49 PM
:biggrin:

Yes, it’s possible to complicate things I guess, and the link you provide includes a lot of complicated issues but few constructive conclusions to the problem at hand.

The table there looked different for each and every CPU we tested (try yourself the latest version (http://masm32.com/board/index.php?topic=1971.msg20618#msg20618)). So the choice was either choosing an algo that provided reasonable speed for most of them, or going the stony road of checking which CPU family and branching to a specialised one. For MasmBasic's MbCopy, rep movsd made the race. It is pretty fast on all CPUs, and it rocks for large copies (and that is where speed matters...). Good to see that Intel keeps pushing this line, too. I wouldn't use rep movsb for the whole copy, though, as many of the not-so-recent CPUs are very slow with the byte variant of movs.

      push ecx
      shr ecx, 2           ; divide count by 4
      rep movsd            ; copy DWORD size blocks
      pop ecx              ; Reload byte count
      and ecx, 3           ; get the rest
      rep movsb            ; copy the rest
      xchg eax, edi        ; for CAT$, return a pointer to the end of the destination;
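
In C terms, the split above is count/4 dword moves followed by count%4 byte moves. A small sketch with a made-up function name, returning the end of the destination as the xchg does:

```c
#include <string.h>
#include <stddef.h>

/* Copy count bytes as count/4 dword moves plus count%4 byte moves and
   return a pointer to the end of the destination. */
unsigned char *copy_dwords_then_bytes(unsigned char *dst,
                                      const unsigned char *src,
                                      size_t count)
{
    size_t dwords = count >> 2;    /* shr ecx,2 */
    size_t rest   = count & 3;     /* and ecx,3 */
    for (size_t i = 0; i < dwords; i++) {
        memcpy(dst, src, 4);       /* one movsd step */
        dst += 4;
        src += 4;
    }
    for (size_t i = 0; i < rest; i++)
        *dst++ = *src++;           /* one movsb step */
    return dst;
}
```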
Title: Re: Code location sensitivity of timings
Post by: sinsi on July 25, 2014, 08:10:40 PM
How much of a difference would RAM speed make? DDR3 speeds can be 1333/1600/1866/2133/2400.
Title: Re: Code location sensitivity of timings
Post by: nidud on July 25, 2014, 10:21:23 PM
For MasmBasic's MbCopy, rep movsd made the race. It is pretty fast on all CPUs, and it rocks for large copies (and that is where speed matters...).

      push ecx
      shr ecx, 2           ; divide count by 4
      rep movsd            ; copy DWORD size blocks
      pop ecx              ; Reload byte count
      and ecx, 3           ; get the rest
      rep movsb            ; copy the rest

1:
Code: [Select]
memcpy  proc uses esi edi dst, src, count
mov edi,dst
mov esi,src
mov ecx,count

mov eax,ecx
shr ecx,2
align 4
rep movsd
and eax,11B
mov ecx,eax
rep movsb

mov eax,dst
ret
memcpy  endp

2:
Code: [Select]
push ecx
shr ecx,2
align 4
rep movsd
pop ecx
and ecx,11B
rep movsb

3:
Code: [Select]
align 4
rep movsb


AMD Athlon(tm) II X2 245 Processor (SSE3)
----------------------------------------------
-- aligned strings --
1082814     cycles -  10 (  0) 0: crt_memcpy
1075836     cycles -  10 ( 38) 1: movsd - mov eax,ecx
1079960     cycles -  10 ( 37) 2: movsd - push ecx
1075643     cycles -  10 ( 27) 3: movsb
-- unaligned strings --
1923314     cycles -  10 (  0) 0: crt_memcpy
1959236     cycles -  10 ( 38) 1: movsd - mov eax,ecx
1957439     cycles -  10 ( 37) 2: movsd - push ecx
8000102     cycles -  10 ( 27) 3: movsb
-- short strings 15 --
210825     cycles - 8000 (  0) 0: crt_memcpy
320028     cycles - 8000 ( 38) 1: movsd - mov eax,ecx
344116     cycles - 8000 ( 37) 2: movsd - push ecx
312027     cycles - 8000 ( 27) 3: movsb
-- short strings 271 --
1614638     cycles - 8000 (  0) 0: crt_memcpy
1676166     cycles - 8000 ( 38) 1: movsd - mov eax,ecx
1705396     cycles - 8000 ( 37) 2: movsd - push ecx
3633202     cycles - 8000 ( 27) 3: movsb
-- short strings 2014 --
5307541     cycles - 4000 (  0) 0: crt_memcpy
5823146     cycles - 4000 ( 38) 1: movsd - mov eax,ecx
5827790     cycles - 4000 ( 37) 2: movsd - push ecx
12299823    cycles - 4000 ( 27) 3: movsb
Title: Re: Code location sensitivity of timings
Post by: Gunther on July 25, 2014, 10:56:33 PM
Hi sinsi,

your memcpy application brings:
Code: [Select]
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
----------------------------------------------
-- aligned strings --
498064    cycles -  10 (  0) 0: crt_memcpy
890775    cycles -  10 ( 38) 1: movsd - mov eax,ecx
892888    cycles -  10 ( 37) 2: movsd - push ecx
353318    cycles -  10 ( 27) 3: movsb
-- unaligned strings --
1006514   cycles -  10 (  0) 0: crt_memcpy
1033525   cycles -  10 ( 38) 1: movsd - mov eax,ecx
1033580   cycles -  10 ( 37) 2: movsd - push ecx
377061    cycles -  10 ( 27) 3: movsb
-- short strings 15 --
175505    cycles - 8000 (  0) 0: crt_memcpy
335538    cycles - 8000 ( 38) 1: movsd - mov eax,ecx
344226    cycles - 8000 ( 37) 2: movsd - push ecx
291953    cycles - 8000 ( 27) 3: movsb
-- short strings 271 --
1033175   cycles - 8000 (  0) 0: crt_memcpy
952811    cycles - 8000 ( 38) 1: movsd - mov eax,ecx
959677    cycles - 8000 ( 37) 2: movsd - push ecx
566948    cycles - 8000 ( 27) 3: movsb
-- short strings 2014 --
3224879   cycles - 4000 (  0) 0: crt_memcpy
3153708   cycles - 4000 ( 38) 1: movsd - mov eax,ecx
3151176   cycles - 4000 ( 37) 2: movsd - push ecx
930276    cycles - 4000 ( 27) 3: movsb
--- ok ---

Gunther
Title: Re: Code location sensitivity of timings
Post by: nidud on July 25, 2014, 11:41:54 PM
well, there is the future for you right there  :biggrin:

good stuff (and hardware)  :t

I was thinking of using BT sselevel,? with the result from Dave's function above. However, I have SSE3, and the SSE function seems faster at that level, so the test may then be SSE3 and below for these functions. If you could also try this one to see whether MOVSB is also faster than SSE, that would be good.

In this test I align EDI on all proc's:
Code: [Select]
memcpy  proc uses esi edi dst, src, count
mov edi,dst
mov esi,src
mov ecx,count

test ecx,-4
jz @F
mov eax,[esi]
mov [edi],eax
mov eax,edi
neg eax
and eax,11B
add edi,eax
add esi,eax
sub ecx,eax

...

align 16
@@: rep movsb

mov eax,dst
ret
memcpy  endp
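
The alignment step in the listing above can be modelled in C. This is only a minimal sketch of the steps that are actually shown (the part elided with "..." is not reconstructed); the function name copy_align_dst is hypothetical, and source and destination are assumed not to overlap.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C model of the steps shown in the listing: copy the first
   4 bytes unaligned, advance both pointers until the destination is
   4-byte aligned, then copy the rest byte by byte (the rep movsb step). */
static void *copy_align_dst(void *dst, const void *src, size_t count)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    if (count >= 4) {                                 /* test ecx,-4 / jz @F */
        memcpy(d, s, 4);                              /* mov eax,[esi] / mov [edi],eax */
        size_t skip = (size_t)(0 - (uintptr_t)d) & 3; /* neg eax / and eax,11B */
        d += skip;                                    /* add edi,eax */
        s += skip;                                    /* add esi,eax */
        count -= skip;                                /* sub ecx,eax */
    }
    while (count--)                                   /* rep movsb */
        *d++ = *s++;
    return dst;
}
```

The skipped bytes (0 to 3) are already covered by the unaligned 4-byte head copy, which is why the pointers can simply be advanced.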

and added a SSE proc:
Code: [Select]
memcpy  proc uses esi edi dst, src, count
mov edi,dst
mov esi,src
mov ecx,count
;
; need 16 byte for overlap..
;
test ecx,-16
jz tail
;
; fix tail bytes and aligned bytes
;
movdqu  xmm0,[esi]
movdqu  [edi],xmm0
movdqu  xmm0,[esi+ecx-16]
movdqu  [edi+ecx-16],xmm0
;
; align EDI 16
;
mov eax,edi
neg eax
and eax,1111B
add edi,eax
add esi,eax
and ecx,-16

align 16
lup: sub ecx,16
movdqu  xmm0,[esi+ecx]  ; do  copy
movdqa  [edi+ecx],xmm0  ; aligned move
jnz lup
align 16
toend:
mov eax,dst
ret
align 4
tail: test ecx,ecx
jz toend
test ecx,-2
jz @1
test ecx,-4
jz @2
test ecx,-8
jz @4
movq xmm0,[esi] ; move 8..15 byte
movq [edi],xmm0 ; |8...|
movq xmm0,[esi+ecx-8] ; |...8|
movq [edi+ecx-8],xmm0
jmp toend
align 4
@4: mov eax,[esi]
mov [edi],eax
mov eax,[esi+ecx-4]
mov [edi+ecx-4],eax
jmp toend
align 4
@2: mov eax,[esi]
mov [edi],ax
shr eax,16
mov [edi+ecx-1],al
jmp toend
align 4
@1: mov al,[esi]
mov [edi],al
jmp toend
memcpy  endp

and now I get this result:
AMD Athlon(tm) II X2 245 Processor (SSE3)
----------------------------------------------
-- aligned strings --
1075537     cycles -  10 (  0) 0: crt_memcpy
1075123     cycles -  10 ( 75) 1: movsd - mov eax,ecx
1075128     cycles -  10 ( 75) 2: movsd - push ecx
1076079     cycles -  10 ( 59) 3: movsb
846397     cycles -  10 (182) 4: SSE
-- unaligned strings --
1111510     cycles -  10 (  0) 0: crt_memcpy
1106994     cycles -  10 ( 75) 1: movsd - mov eax,ecx
1110818     cycles -  10 ( 75) 2: movsd - push ecx
1108074     cycles -  10 ( 59) 3: movsb
852071     cycles -  10 (182) 4: SSE
-- short strings 15 --
200777     cycles - 8000 (  0) 0: crt_memcpy
312027     cycles - 8000 ( 75) 1: movsd - mov eax,ecx
328222     cycles - 8000 ( 75) 2: movsd - push ecx
288027     cycles - 8000 ( 59) 3: movsb
112031     cycles - 8000 (182) 4: SSE
-- short strings 271 --
1304626     cycles - 8000 (  0) 0: crt_memcpy
1304307     cycles - 8000 ( 75) 1: movsd - mov eax,ecx
1338701     cycles - 8000 ( 75) 2: movsd - push ecx
1568333     cycles - 8000 ( 59) 3: movsb
498757     cycles - 8000 (182) 4: SSE
-- short strings 2014 --
2439537     cycles - 4000 (  0) 0: crt_memcpy
2458600     cycles - 4000 ( 75) 1: movsd - mov eax,ecx
2465365     cycles - 4000 ( 75) 2: movsd - push ecx
2448965     cycles - 4000 ( 59) 3: movsb
1139728     cycles - 4000 (182) 4: SSE

Title: Re: Code location sensitivity of timings
Post by: RuiLoureiro on July 26, 2014, 01:14:05 AM
Seems to be better

x1=2051543 /884593  = 2.3191942509153927 (8000) ~2.32
x2=5844930 /2504176= 2.3340731641865428 (4000) ~2.33

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)

-----------------------------------------------------
-- aligned strings --
1188974   cycles -  10 (  0) 0: crt_memcpy
1097640   cycles -  10 ( 75) 1: movsd - mov eax,ecx
1103251   cycles -  10 ( 75) 2: movsd - push ecx
1102906   cycles -  10 ( 59) 3: movsb
1310185   cycles -  10 (182) 4: SSE
-- unaligned strings --
2595543   cycles -  10 (  0) 0: crt_memcpy
2620959   cycles -  10 ( 75) 1: movsd - mov eax,ecx
2611443   cycles -  10 ( 75) 2: movsd - push ecx
7866087   cycles -  10 ( 59) 3: movsb
1358767   cycles -  10 (182) 4: SSE
-- short strings 15 --
 343706    cycles - 8000 (  0) 0: crt_memcpy
 789893    cycles - 8000 ( 75) 1: movsd - mov eax,ecx
 808747    cycles - 8000 ( 75) 2: movsd - push ecx
2039809   cycles - 8000 ( 59) 3: movsb
 237595    cycles - 8000 (182) 4: SSE
-- short strings 271 --
2051543   cycles - 8000 (  0) 0: crt_memcpy
2096801   cycles - 8000 ( 75) 1: movsd - mov eax,ecx
2083586   cycles - 8000 ( 75) 2: movsd - push ecx
7495329   cycles - 8000 ( 59) 3: movsb
 884593    cycles - 8000 (182) 4: SSE
-- short strings 2014 --
  5844930   cycles - 4000 (  0) 0: crt_memcpy
  6057324   cycles - 4000 ( 75) 1: movsd - mov eax,ecx
  5890555   cycles - 4000 ( 75) 2: movsd - push ecx
22533778  cycles - 4000 ( 59) 3: movsb
  2504176   cycles - 4000 (182) 4: SSE
--- ok ---
Title: Re: Code location sensitivity of timings
Post by: nidud on July 26, 2014, 01:46:55 AM
I will assume the tipping point is then at level 4.1

from Dave's function the bits will be like this:
Code: [Select]
SSEBT_MMX equ 0
SSEBT_SSE equ 1
SSEBT_SSE2 equ 2
SSEBT_SSE3 equ 3
SSEBT_SSSE3 equ 4
SSEBT_SSE41 equ 5
SSEBT_SSE42 equ 6

and the copy function will then be like this:
Code: [Select]
bt sselevel,SSEBT_SSE41
jnc @F

mov eax,edi
align 16
rep movsb
pop edi
pop esi
ret 12

align 4
@@: ; SSE2 copy..

as implemented now it will exit if < SSE2
and using level above needs testing
Title: Re: Code location sensitivity of timings
Post by: jj2007 on July 26, 2014, 01:52:14 AM
Code: [Select]
align 16
rep movsb

Check if the align is really needed. In the worst case, 15 bytes of code are inserted there. One common trick is to insert the bytes needed before the entry into the proc.
Title: Re: Code location sensitivity of timings
Post by: dedndave on July 26, 2014, 02:15:52 AM
nidud - i hope you're using the one in this post

http://masm32.com/board/index.php?topic=3373.msg35658#msg35658 (http://masm32.com/board/index.php?topic=3373.msg35658#msg35658)

Code: [Select]
;EAX return bits:
;0 = MMX
;1 = SSE
;2 = SSE2
;3 = SSE3
;4 = SSSE3
;5 = SSE4.1
;6 = SSE4.2

i would define the EQUates this way...

Code: [Select]
SSE_MMX    equ 1
SSE_SSE    equ 2
SSE_SSE2   equ 4
SSE_SSE3   equ 8
SSE_SSSE3  equ 10h
SSE_SSE41  equ 20h
SSE_SSE42  equ 40h

Code: [Select]
    call    GetSseLevel
    test    al,SSE_SSE3
    jnz     sse3_supported

the EQUates you have would be ok for BT, i suppose   :P
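
The difference between the two styles of equates (bit index for BT, bit mask for TEST) can be sketched in C. The helper names has_level_bt and has_level_test and the SSE_MASK macro are hypothetical; the bit numbers are the ones listed in this thread.

```c
#include <stdint.h>

/* Bit indexes, as used with BT (values from the thread). */
enum {
    SSEBT_MMX, SSEBT_SSE, SSEBT_SSE2, SSEBT_SSE3,
    SSEBT_SSSE3, SSEBT_SSE41, SSEBT_SSE42
};

/* Bit mask for a given index, as used with TEST. */
#define SSE_MASK(bit) (1u << (bit))

/* Equivalent of: bt level,bit / jc supported */
static int has_level_bt(uint32_t level, unsigned bit)
{
    return (int)((level >> bit) & 1u);
}

/* Equivalent of: test level,mask / jnz supported */
static int has_level_test(uint32_t level, uint32_t mask)
{
    return (level & mask) != 0;
}
```

Passing a bit index where a mask is expected silently tests the wrong bits: with level = 7 (MMX, SSE, SSE2 only), has_level_test(7, SSEBT_SSE3) is true because the index 3 is read as the mask 0011b, while has_level_bt(7, SSEBT_SSE3) is false.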
Title: Re: Code location sensitivity of timings
Post by: hutch-- on July 26, 2014, 02:55:30 AM
Here is a test piece that uses 4 copies of the same simple byte-intensive algo; it runs the 4 versions and times each one. The idea was to test whether the identical algo in 4 locations produced any difference in timing, but on my old quad they are almost perfectly identical, even with multiple runs.

I get this result.


File length = 977426

828 ms
828 ms
828 ms
828 ms
Press any key to continue ...
Title: Re: Code location sensitivity of timings
Post by: nidud on July 26, 2014, 02:57:09 AM
Quote
Check if the align is really needed
I normally tune them from the list file in the end

00000000         memcpy  proc dst, src, count
00000000  56            push   esi
00000001  57            push   edi
00000002  8B7C240C         mov   edi,[esp+12]
00000006  8B742410         mov   esi,[esp+16]
0000000A  8B4C2414         mov   ecx,[esp+20]
            ifndef __SSE_
0000000E  F7C1F8FFFFFF         test   ecx,-8
00000014  7412            jz   @F
00000016  8B06            mov   eax,[esi]
00000018  8907            mov   [edi],eax
0000001A  8BC7            mov   eax,edi
0000001C  F7D8            neg   eax
0000001E  83E003         and   eax,11B
00000021  90            nop
00000022  03F8            add   edi,eax
00000024  03F0            add   esi,eax
00000026  2BC8            sub   ecx,eax
00000028         @@:
00000028  F3A4            rep   movsb
0000002A  8B44240C         mov   eax,[esp+12]
0000002E  5F            pop   edi
0000002F  5E            pop   esi
00000030  C20C00         ret   12
            endif
00000033         memcpy  endp


Quote
nidud - i hope you're using the one in this post
I'm using this one:
http://masm32.com/board/index.php?topic=3396.msg36278#msg36278

Quote
i would define the EQUates this way...
I define them this way  :biggrin:
Code: [Select]
SSE_MMX equ 00000001B
SSE_SSE equ 00000010B
SSE_SSE2 equ 00000100B
SSE_SSE3 equ 00001000B
SSE_SSSE3 equ 00010000B
SSE_SSE41 equ 00100000B
SSE_SSE42 equ 01000000B

Quote
the EQUates you have would be ok for BT, i suppose   :P

BT is the fastest opcode there is, methinks   :P
Title: Re: Code location sensitivity of timings
Post by: dedndave on July 26, 2014, 03:03:03 AM
ok - that one does not preserve EBX - but, it's probably ok, in this case
just so you are aware, CPUID destroys the contents of EBX   :t
Title: Re: Code location sensitivity of timings
Post by: nidud on July 26, 2014, 03:30:28 AM
when I run the program I get this:

1344 ms
1344 ms
1343 ms
2016 ms
...
1343 ms
1344 ms
1344 ms
2015 ms
...
1344 ms
1359 ms
1344 ms
2016 ms


So the "code buffer" in this case manages to hold about 3.5 of the procs, but in the middle of the last proc it needs to read in more code to execute. If I increase the size of each proc by inserting db 64 dup(90h) in each of them, I get this result:

2031 ms
1344 ms
1344 ms
2015 ms
...
2016 ms
1359 ms
1344 ms
2016 ms
...
2015 ms
1360 ms
1344 ms
2015 ms

so now it manages to read 2.5 of the procs...
Title: Re: Code location sensitivity of timings
Post by: RuiLoureiro on July 26, 2014, 03:38:14 AM
File length = 977412

1484 ms
1453 ms
1516 ms
1547 ms
Press any key to continue ...
Title: Re: Code location sensitivity of timings
Post by: RuiLoureiro on July 26, 2014, 03:40:43 AM
1344 ms
1344 ms
1343 ms
2016 ms...

1343 ms
1344 ms
1344 ms
2015 ms...

1344 ms
1359 ms
1344 ms
2016 ms
If we remove the worst case ...
Title: Re: Code location sensitivity of timings
Post by: RuiLoureiro on July 26, 2014, 04:08:58 AM
If i am not wrong, you are using 2 counters:
           First  counter     = 1000
           Second counter = count (=4000,etc.)

You get the result only when
the first counter is 0 (counter_end).
So the result has something to do with the execution
of this:
Code: [Select]
   mov edi,count
   mov ebx,esp
   .while edi
       pushargs
       call esi
       mov esp,ebx
       dec edi
Is there any particular reason for this ?
Quote
   counter_begin 1000, HIGH_PRIORITY_CLASS
   mov edi,count
   mov ebx,esp
   .while edi
       pushargs
       call esi
       mov esp,ebx
       dec edi
   .endw
   counter_end
Title: Re: Code location sensitivity of timings
Post by: jj2007 on July 26, 2014, 05:09:09 AM
Quote
Check if the align is really needed
I normally tune them from the list file in the end

What I intended is that rep movsX may not need ANY alignment, simply because it isn't a loop at this level (at micro code level, it is of course a loop).

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (MMX, SSE, SSE2, SSE3)
movsd align 16  10476 µs
movsd align 3   10456 µs
movsd align 13  10347 µs
movsb align 16  10510 µs
movsb align 3   10503 µs
movsb align 13  10407 µs

movsd align 16  10514 µs
movsd align 3   10469 µs
movsd align 13  10516 µs
movsb align 16  10455 µs
movsb align 3   10515 µs
movsb align 13  10502 µs

movsd align 16  10526 µs
movsd align 3   10455 µs
movsd align 13  10469 µs
movsb align 16  10360 µs
movsb align 3   10485 µs
movsb align 13  10456 µs


Sample:
test4a proc uses esi edi ecx
  align 16
  nops 3
  rep movsb
  ret
test4a endp


Interesting, though, that movsb is indeed equally fast on my trusty old Celeron, at least for a 10 MB string.
Title: Re: Code location sensitivity of timings
Post by: nidud on July 26, 2014, 05:43:01 AM
Is there any particular reason for this ?

not really, no

The macro can only be called with EDI, ESI, or EBX or an immediate value. I think I just ran out of regs once and inserted a loop. The count for small functions is also rather high, so it's just a way of skipping zeros I guess.

What I intended is that rep movsX may not need ANY alignment, simply because it isn't a loop at this level (at micro code level, it is of course a loop).

that seems to be correct
not sure why, but I assumed it had an effect for some reason
Title: Re: Code location sensitivity of timings
Post by: Gunther on July 26, 2014, 08:50:17 AM
That's the result by memcpy.exe by 1234.zip:
Code: [Select]
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
----------------------------------------------
-- aligned strings --
498952    cycles -  10 (  0) 0: crt_memcpy
898756    cycles -  10 ( 75) 1: movsd - mov eax,ecx
903577    cycles -  10 ( 75) 2: movsd - push ecx
354813    cycles -  10 ( 59) 3: movsb
487954    cycles -  10 (182) 4: SSE
-- unaligned strings --
494936    cycles -  10 (  0) 0: crt_memcpy
895940    cycles -  10 ( 75) 1: movsd - mov eax,ecx
895968    cycles -  10 ( 75) 2: movsd - push ecx
373553    cycles -  10 ( 59) 3: movsb
491344    cycles -  10 (182) 4: SSE
-- short strings 15 --
175961    cycles - 8000 (  0) 0: crt_memcpy
361324    cycles - 8000 ( 75) 1: movsd - mov eax,ecx
361586    cycles - 8000 ( 75) 2: movsd - push ecx
313550    cycles - 8000 ( 59) 3: movsb
92719     cycles - 8000 (182) 4: SSE
-- short strings 271 --
841879    cycles - 8000 (  0) 0: crt_memcpy
780741    cycles - 8000 ( 75) 1: movsd - mov eax,ecx
806939    cycles - 8000 ( 75) 2: movsd - push ecx
623419    cycles - 8000 ( 59) 3: movsb
275466    cycles - 8000 (182) 4: SSE
-- short strings 2014 --
1002628   cycles - 4000 (  0) 0: crt_memcpy
2239737   cycles - 4000 ( 75) 1: movsd - mov eax,ecx
2226209   cycles - 4000 ( 75) 2: movsd - push ecx
962207    cycles - 4000 ( 59) 3: movsb
972245    cycles - 4000 (182) 4: SSE
--- ok ---

Gunther
Title: Re: Code location sensitivity of timings
Post by: RuiLoureiro on July 26, 2014, 09:02:01 AM
Quote
The macro can only be called by EDI, ESI, or EBX or an immediate value.
I think I just ran out of regs once and inserted a loop.
The count for small functions is also rather high so it's
just a way of skipping zeros I guess.
    I think you are talking about this macro:
   
        counter_begin MACRO loopcount:REQ, priority
   or  counter_end

    If it is, we cannot use EBX because cpuid destroys EBX

I modified counter_begin -written by MichaelW- to this:
(COUNTERLOOPS=1000 or 10000 or 100000 or ...)
Code: [Select]
; this macro uses EDI inside = length from kIni to kEnd
; we need to define an array to save the means.
; we need to define _LoopCount,_MaxLength...etc. in .DATA
BEGIN_COUNTER_CYCLE_HIGH_PRIORITY_CLASS MACRO   kIni, kEnd
                                        LOCAL   labelA,labelB

                mov     _LoopCount, COUNTERLOOPS
                mov     _MaxLength, kEnd
                mov     edi, kIni
                ;mov     _MinLength, edi         ;; not used yet
                mov     _MeanValue, 0           ;; mean is 0

                invoke  GetCurrentProcess
                invoke  SetPriorityClass, eax, HIGH_PRIORITY_CLASS

    labelA:                                         ;; Begin test loop
   
                BEGIN_LOOP_TEST equ <labelA>
           
                xor     eax, eax        ;; Use same CPUID input value for each call
                cpuid                   ;; Flush pipe & wait for pending ops to finish
                rdtsc                   ;; Read Time Stamp Counter

                push    edx             ;; Preserve high-order 32 bits of start count
                push    eax             ;; Preserve low-order 32 bits of start count
           
                mov     _LoopCounter, COUNTERLOOPS
                xor     eax, eax
                cpuid                   ;; Make sure loop setup instructions finish
          ALIGN 16                      ;; Optimal loop alignment for P6
          @@:                           ;; Start an empty reference loop
                sub     _LoopCounter, 1
                jnz     short @B

                xor     eax, eax
                cpuid                   ;; Make sure loop instructions finish
                rdtsc                   ;; Read end count
                pop     ecx             ;; Recover low-order 32 bits of start count
                sub     eax, ecx        ;; Low-order 32 bits of overhead count in EAX
                pop     ecx             ;; Recover high-order 32 bits of start count
                sbb     edx, ecx        ;; High-order 32 bits of overhead count in EDX
                push    edx             ;; Preserve high-order 32 bits of overhead count
                push    eax             ;; Preserve low-order 32 bits of overhead count

                xor     eax, eax
                cpuid
                rdtsc
                push    edx             ;; Preserve high-order 32 bits of start count
                push    eax             ;; Preserve low-order 32 bits of start count
                ;;-------------------------------------
                ;;              Start
                ;;-------------------------------------
                mov         _LoopCounter, COUNTERLOOPS
                xor         eax, eax
                cpuid                   ;; Make sure loop setup instructions finish
    ALIGN 16                            ;; Optimal loop alignment for P6
    labelB:                             ;; Start test loop
                START_LOOP_TEST equ <labelB>
ENDM
; ------------------------------------------------------------------------
END_COUNTER_CYCLE       MACRO  arg
                        LOCAL  $tmpstr$

                sub         _LoopCounter, 1
                jnz         START_LOOP_TEST                ;; goto labelB
                ;;---------------------------
                ;;   stop this count
                ;;---------------------------
                xor         eax, eax
                cpuid                       ;; Make sure loop instructions finish
                rdtsc                       ;; Read end count
                pop         ecx             ;; Recover low-order 32 bits of start count
                sub         eax, ecx        ;; Low-order 32 bits of test count in EAX
                pop         ecx             ;; Recover high-order 32 bits of start count
                sbb         edx, ecx        ;; High-order 32 bits of test count in EDX
                pop         ecx             ;; Recover low-order 32 bits of overhead count
                sub         eax, ecx        ;; Low-order 32 bits of adjusted count in EAX
                pop         ecx             ;; Recover high-order 32 bits of overhead count
                sbb         edx, ecx        ;; High-order 32 bits of adjusted count in EDX

                mov         DWORD PTR _CounterQword, eax
                mov         DWORD PTR _CounterQword + 4, edx
                finit
                fild        _CounterQword
                fild        _LoopCount
                fdiv
                fistp       _CounterQword

                mov         ebx, dword ptr _CounterQword
               
                ;---------------------------------------------------
                ;               print cycles
                ;---------------------------------------------------
                add         ebx, _MeanValue
                mov         _MeanValue, ebx

                add         edi, 1
                cmp         edi, _MaxLength
                jbe         BEGIN_LOOP_TEST                      ;; goto labelA

                invoke      GetCurrentProcess
                invoke      SetPriorityClass, eax, NORMAL_PRIORITY_CLASS
               
                ; --------------------------------------------------
                ;          Save mean and print mean               
                ; --------------------------------------------------
                invoke      SaveMeans, ebx          ;; save it in one array
                                                       ;; one after another
               
                ;---------------------------------------------------
                print       str$(ebx)                       
                $tmpstr$    CATSTR <chr$(">, <arg>, <",13,10)>       
                print       $tmpstr$
                ;---------------------------------------------------                 
ENDM

Code: [Select]
.data
ALIGN 8                         ;; Optimal alignment for QWORD
_CounterQword   dq 0
_LoopCount      dd 0
_LoopCounter    dd 0                                   

_MinLength      dd 0
_MaxLength      dd 0
_MeanValue      dd 0
;------------------------------
ALIGN   4
                dd 0                ; <<<--- start with 0   
_TblTiming0     dd 600 dup (?)
.code
SaveMeans       proc        kMean:DWORD                   
                mov         eax, kMean
                mov         edx, offset _TblTiming0                   
                mov         ecx, [edx-4]            ; number of means
                mov         [edx+ecx*4], eax                   
                add         ecx, 1
                mov         [edx-4], ecx
                ret
SaveMeans       endp
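
The arithmetic done by END_COUNTER_CYCLE and SaveMeans can be sketched in C. This is a simplified model, not a port: the tick counts are passed in as plain numbers instead of being read with RDTSC, the names adjusted_mean_cycles, MeanTable and save_mean are hypothetical, rounding is half-up instead of the FPU rounding mode, a negative net count is clamped to 0, and the bounds check in save_mean is an addition that the macro version does not have.

```c
#include <stdint.h>

#define MAX_MEANS 600   /* same size as _TblTiming0 */

/* (test ticks - empty-loop overhead ticks) / loop count, rounded half-up. */
static uint64_t adjusted_mean_cycles(uint64_t test_ticks,
                                     uint64_t overhead_ticks,
                                     uint32_t loops)
{
    if (loops == 0 || test_ticks <= overhead_ticks)
        return 0;                           /* clamp: assumption of this sketch */
    uint64_t net = test_ticks - overhead_ticks;
    return (net + loops / 2) / loops;
}

/* Table with its element count stored in front, like the dword at
   _TblTiming0 - 4 in the listing. */
typedef struct {
    uint32_t count;
    uint32_t value[MAX_MEANS];
} MeanTable;

/* Append one mean; returns 0 when the table is full. */
static int save_mean(MeanTable *t, uint32_t mean)
{
    if (t->count >= MAX_MEANS)
        return 0;
    t->value[t->count++] = mean;
    return 1;
}
```

For example, 101000 ticks for the test loop, 1000 ticks for the empty reference loop and 1000 iterations give a mean of 100 cycles per iteration.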
Title: Re: Code location sensitivity of timings
Post by: nidud on July 27, 2014, 02:10:39 AM
I will assume the tipping point is then at level 4.1

this appears to be false:

Intel(R) Core(TM) i3 CPU    540  @ 3.07GHz (SSE4)
----------------------------------------------
-- aligned strings --
689962     cycles -  10 (  0) 0: crt_memcpy
1434759     cycles -  10 ( 75) 1: movsd - mov eax,ecx
1430928     cycles -  10 ( 75) 2: movsd - push ecx
3170836     cycles -  10 ( 59) 3: movsb
686200     cycles -  10 (182) 4: SSE
-- unaligned strings --
676937     cycles -  10 (  0) 0: crt_memcpy
1430499     cycles -  10 ( 75) 1: movsd - mov eax,ecx
1430349     cycles -  10 ( 75) 2: movsd - push ecx
3157179     cycles -  10 ( 59) 3: movsb
670373     cycles -  10 (182) 4: SSE
-- short strings 15 --
200367     cycles - 8000 (  0) 0: crt_memcpy
448189     cycles - 8000 ( 75) 1: movsd - mov eax,ecx
440419     cycles - 8000 ( 75) 2: movsd - push ecx
752747     cycles - 8000 ( 59) 3: movsb
152267     cycles - 8000 (182) 4: SSE
-- short strings 271 --
1473090     cycles - 8000 (  0) 0: crt_memcpy
1281263     cycles - 8000 ( 75) 1: movsd - mov eax,ecx
1328604     cycles - 8000 ( 75) 2: movsd - push ecx
3323304     cycles - 8000 ( 59) 3: movsb
344338     cycles - 8000 (182) 4: SSE
-- short strings 2014 --
1901915     cycles - 4000 (  0) 0: crt_memcpy
4364200     cycles - 4000 ( 75) 1: movsd - mov eax,ecx
4363512     cycles - 4000 ( 75) 2: movsd - push ecx
8643318     cycles - 4000 ( 59) 3: movsb
1110447     cycles - 4000 (182) 4: SSE


I wrote a test program to get the version:
SSE4.2 supported

and then used Gunther's AVX test:

      AVX check
      ---------

The CPU doesn't support AVX.
The Operating System hasn't enabled XSETBV/XGETBV instructions.
Operating System doesn't support YMM state.


so the tipping point must be AVX...
Code: [Select]
bt sselevel,SSEBT_AVX
jnc @F

rep movsb
ret

@@: ; SSE2 copy..

I added some bits to Dave's test:
Code: [Select]
SSE_XGETBV equ 00010000000B
SSE_AVX equ 00100000000B
SSE_AVX2 equ 01000000000B
SSE_AVXOS equ 10000000000B

SSEBT_XGETBV equ 7
SSEBT_AVX equ 8
SSEBT_AVX2 equ 9
SSEBT_AVXOS equ 10

    pushfd
    pop     eax
    mov     ecx,200000h
    mov     edx,eax
    xor     eax,ecx
    push    eax
    popfd
    pushfd
    pop     eax
    xor     eax,edx
    and     eax,ecx
    push    ebx
    .if !ZERO?
xor eax,eax
cpuid
.if eax
    .if ah == 5
xor eax,eax
    .else
mov eax,7
xor ecx, ecx
cpuid ; check AVX2 support
xor eax,eax
bt ebx,5 ; AVX2
rcl eax,1 ; into bit 9
push eax
mov eax,1
cpuid
pop eax
bt ecx,28 ; AVX support by CPU
rcl eax,1 ; into bit 8
bt ecx,27 ; XGETBV supported
rcl eax,1 ; into bit 7
bt ecx,20 ; SSE4.2
rcl eax,1 ; into bit 6
bt ecx,19 ; SSE4.1
rcl eax,1 ; into bit 5
bt ecx,9 ; SSSE3
rcl eax,1 ; into bit 4
bt ecx,0 ; SSE3
rcl eax,1 ; into bit 3
bt edx,26 ; SSE2
rcl eax,1 ; into bit 2
bt edx,25 ; SSE
rcl eax,1 ; into bit 1
bt edx,23 ; MMX
rcl eax,1 ; into bit 0
    .endif
.endif
    .endif
    bt eax,SSEBT_XGETBV
    jnc  @F
    push eax
    xor  ecx,ecx
    xgetbv
    and  eax,6 ; AVX support by OS?
    cmp  eax,6 ; need both XMM (bit 1) and YMM (bit 2) state
    pop  eax
    jnz @F
    or eax,SSE_AVXOS
@@:
    pop ebx
    ret
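
The bit packing done by the routine above can be sketched in C. To keep the sketch portable it does not execute CPUID or XGETBV; the register values are passed in, and the function name pack_level is hypothetical. The CPUID bit positions are the documented ones (leaf 1 EDX: 23 MMX, 25 SSE, 26 SSE2; leaf 1 ECX: 0 SSE3, 9 SSSE3, 19 SSE4.1, 20 SSE4.2, 27 OSXSAVE, 28 AVX; leaf 7 EBX: 5 AVX2), and the OS is taken to support AVX only when both bit 1 and bit 2 of XCR0 are set.

```c
#include <stdint.h>

/* Mask values as defined in the thread. */
enum {
    SSE_MMX    = 0x001, SSE_SSE    = 0x002, SSE_SSE2   = 0x004,
    SSE_SSE3   = 0x008, SSE_SSSE3  = 0x010, SSE_SSE41  = 0x020,
    SSE_SSE42  = 0x040, SSE_XGETBV = 0x080, SSE_AVX    = 0x100,
    SSE_AVX2   = 0x200, SSE_AVXOS  = 0x400
};

/* Build the level mask from raw CPUID results. xcr0 is only looked at
   when the OSXSAVE bit says XGETBV may be executed. */
static uint32_t pack_level(uint32_t leaf1_ecx, uint32_t leaf1_edx,
                           uint32_t leaf7_ebx, uint64_t xcr0)
{
    uint32_t lv = 0;

    if (leaf1_edx & (1u << 23)) lv |= SSE_MMX;
    if (leaf1_edx & (1u << 25)) lv |= SSE_SSE;
    if (leaf1_edx & (1u << 26)) lv |= SSE_SSE2;
    if (leaf1_ecx & (1u << 0))  lv |= SSE_SSE3;
    if (leaf1_ecx & (1u << 9))  lv |= SSE_SSSE3;
    if (leaf1_ecx & (1u << 19)) lv |= SSE_SSE41;
    if (leaf1_ecx & (1u << 20)) lv |= SSE_SSE42;
    if (leaf1_ecx & (1u << 27)) lv |= SSE_XGETBV;
    if (leaf1_ecx & (1u << 28)) lv |= SSE_AVX;
    if (leaf7_ebx & (1u << 5))  lv |= SSE_AVX2;

    /* XMM state (bit 1) and YMM state (bit 2) must both be enabled. */
    if ((lv & SSE_XGETBV) && (xcr0 & 6) == 6)
        lv |= SSE_AVXOS;
    return lv;
}
```

A caller that wants to use AVX code would normally require both SSE_AVX and SSE_AVXOS, since the CPU bit alone does not say that the OS saves the YMM registers.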
Title: Re: Code location sensitivity of timings
Post by: Gunther on July 27, 2014, 02:54:54 AM
Hi nidud,

I added some bits to Dave's test:

there's nothing attached.

Gunther
Title: Re: Code location sensitivity of timings
Post by: nidud on July 27, 2014, 04:48:41 AM
there's nothing attached.

here is a test that is supposed to auto-detect AVX
if it works it should select MOVSB on your machine
Title: Re: Code location sensitivity of timings
Post by: nidud on July 27, 2014, 05:17:38 AM
here is the strrchr test

Code: [Select]
strrchr proc string, char

push edx
mov edx,[esp+4+4]
movzx eax,byte ptr [esp+8+4]
mov ah,al
mov ecx,eax
shl eax,16
add eax,ecx
movd xmm2,eax
xorps xmm1,xmm1 ; clear xmm1 for compare
pshufd  xmm2,xmm2,0 ; populate char in xmm2
mov eax,edx ; keep string in EDX

align 4
lupz: movdqu  xmm0,[eax] ; get length of string
pcmpeqb xmm0,xmm1 ; compare
pmovmskb ecx,xmm0 ; get result
add eax,16
test ecx,ecx
jz lupz
bsf ecx,ecx ; set pointer to end - 16
lea eax,[eax+ecx-32]

align 4
lupc: movdqu  xmm0,[eax] ; scan in reverse for char
pcmpeqb xmm0,xmm2 ; compare
pmovmskb ecx,xmm0 ; get result
test ecx,ecx
jnz found
cmp eax,edx
jbe not_found
sub eax,16
jmp lupc
align 4
found:
bsr ecx,ecx
lea eax,[eax+ecx]
cmp eax,edx
jae toend
align 4
not_found:
xor eax,eax
align 4
toend:
pop edx
ret 8
strrchr endp


AMD Athlon(tm) II X2 245 Processor (SSE3)
-----------------------------------------
-- aligned strings --
493286  cycles - 10 (  0) 0: crt_strrchr
493067  cycles - 10 ( 40) 1: strrchr
215681  cycles - 10 (154) 2: x
44873   cycles - 10 (112) 3: SSE
-- unaligned strings --
496108  cycles - 10 (  0) 0: crt_strrchr
497215  cycles - 10 ( 40) 1: strrchr
217452  cycles - 10 (154) 2: x
48437   cycles - 10 (112) 3: SSE
-- small strings 128 --
155550  cycles - 500 (  0) 0: crt_strrchr
154105  cycles - 500 ( 40) 1: strrchr
65529   cycles - 500 (154) 2: x
25553   cycles - 500 (112) 3: SSE
-- small strings 1 --
27531   cycles - 500 (  0) 0: crt_strrchr
26022   cycles - 500 ( 40) 1: strrchr
9514    cycles - 500 (154) 2: x
18074   cycles - 500 (112) 3: SSE
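
The two phases of the SSE strrchr above (find the terminator, then scan backwards for the character) can be written as a scalar C model. This is only a sketch of the control flow, one byte at a time instead of 16; the name strrchr_model is hypothetical, and the sketch follows the standard strrchr convention of treating the terminator as part of the string.

```c
#include <stddef.h>

/* Scalar model: phase 1 finds the end, phase 2 scans in reverse. */
static char *strrchr_model(const char *s, int ch)
{
    const char *end = s;

    while (*end)                 /* phase 1: the lupz loop */
        end++;

    /* phase 2: the lupc loop, from the terminator back to the start */
    for (const char *p = end; ; --p) {
        if (*p == (char)ch)
            return (char *)p;    /* last occurrence */
        if (p == s)
            break;               /* reached the start: not found */
    }
    return NULL;
}
```

The backward scan stops at the first hit, which is why the SSE version is cheap when the character sits near the end of a long string.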

Title: Re: Code location sensitivity of timings
Post by: jj2007 on July 27, 2014, 08:06:15 AM
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
-----------------------------------------------
-- aligned strings --
995933  cycles - 10 (  0) 0: crt_strrchr
995891  cycles - 10 ( 40) 1: strrchr
273823  cycles - 10 (154) 2: x
94668   cycles - 10 (112) 3: SSE
-- unaligned strings --
996477  cycles - 10 (  0) 0: crt_strrchr
997094  cycles - 10 ( 40) 1: strrchr
298219  cycles - 10 (154) 2: x
121529  cycles - 10 (112) 3: SSE
-- small strings 128 --
324263  cycles - 500 (  0) 0: crt_strrchr
323710  cycles - 500 ( 40) 1: strrchr
84786   cycles - 500 (154) 2: x
34915   cycles - 500 (112) 3: SSE
-- small strings 1 --
67914   cycles - 500 (  0) 0: crt_strrchr
67286   cycles - 500 ( 40) 1: strrchr
12595   cycles - 500 (154) 2: x
16622   cycles - 500 (112) 3: SSE
Title: Re: Code location sensitivity of timings
Post by: Gunther on July 27, 2014, 10:40:16 AM
Hi nidud,

here's the output of auto.zip:
Code: [Select]
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (AVX)
----------------------------------------------
-- aligned strings --
491469    cycles -  10 (  0) 0: crt_memcpy
889651    cycles -  10 ( 63) 1: movsd - mov eax,ecx
887273    cycles -  10 ( 63) 2: movsd - push ecx
355080    cycles -  10 ( 51) 3: movsb
487046    cycles -  10 (182) 4: SSE
355990    cycles -  10 (  0) 5: auto
-- unaligned strings --
490269    cycles -  10 (  0) 0: crt_memcpy
886259    cycles -  10 ( 63) 1: movsd - mov eax,ecx
886778    cycles -  10 ( 63) 2: movsd - push ecx
372520    cycles -  10 ( 51) 3: movsb
491780    cycles -  10 (182) 4: SSE
378881    cycles -  10 (  0) 5: auto
-- short strings 15 --
174897    cycles - 8000 (  0) 0: crt_memcpy
349626    cycles - 8000 ( 63) 1: movsd - mov eax,ecx
343812    cycles - 8000 ( 63) 2: movsd - push ecx
307384    cycles - 8000 ( 51) 3: movsb
98073     cycles - 8000 (182) 4: SSE
293479    cycles - 8000 (  0) 5: auto
-- short strings 271 --
832627    cycles - 8000 (  0) 0: crt_memcpy
773797    cycles - 8000 ( 63) 1: movsd - mov eax,ecx
764418    cycles - 8000 ( 63) 2: movsd - push ecx
586580    cycles - 8000 ( 51) 3: movsb
279676    cycles - 8000 (182) 4: SSE
557134    cycles - 8000 (  0) 5: auto
-- short strings 2014 --
998188    cycles - 4000 (  0) 0: crt_memcpy
2198740   cycles - 4000 ( 63) 1: movsd - mov eax,ecx
2195833   cycles - 4000 ( 63) 2: movsd - push ecx
935710    cycles - 4000 ( 51) 3: movsb
961563    cycles - 4000 (182) 4: SSE
906474    cycles - 4000 (  0) 5: auto
--- ok ---

Gunther
Title: Re: Code location sensitivity of timings
Post by: nidud on July 27, 2014, 11:12:01 PM
ok, that's almost perfect   :P

the speed of short strings seems to break even between 1000 and 2000 bytes
Code: [Select]
memcpy  proc uses esi edi dst, src, count
mov edi,dst
mov esi,src
mov ecx,count
mov eax,edi ; return value

cmp ecx,1500 ; use SSE on short strings
jb SSE2
bt sselevel,SSEBT_AVX
jnc SSE2

rep movsb
ret

align 4
SSE2:
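
The dispatch at the top of this memcpy can be sketched in C as a pure decision function. The names choose_copy_path, PATH_SSE2, PATH_REP_MOVSB and MOVSB_MIN_BYTES are hypothetical; the 1500-byte threshold and the AVX bit index 8 are the values used in this thread, and the threshold is an observation from these particular machines rather than a general constant.

```c
#include <stddef.h>
#include <stdint.h>

enum copy_path { PATH_SSE2, PATH_REP_MOVSB };

#define SSEBT_AVX        8      /* bit index from the thread */
#define MOVSB_MIN_BYTES  1500   /* break-even guess from the timings */

/* Mirrors: cmp ecx,1500 / jb SSE2 / bt sselevel,SSEBT_AVX / jnc SSE2 */
static enum copy_path choose_copy_path(size_t count, uint32_t sselevel)
{
    if (count < MOVSB_MIN_BYTES)
        return PATH_SSE2;                    /* short copies: SSE wins */
    if (((sselevel >> SSEBT_AVX) & 1u) == 0)
        return PATH_SSE2;                    /* no AVX-class CPU */
    return PATH_REP_MOVSB;
}
```

Checking the size first keeps the common short-copy case down to one compare and one branch.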
Title: Re: Code location sensitivity of timings
Post by: nidud on August 11, 2014, 08:46:40 PM
I implemented some of the SSE functions in the library and did some testing. Most (if not all) of them failed the regression test, so a few adjustments needed to be made. The problem with copying memory in this case was overlapping strings and the alignment of the head and tail bytes. Using 16-byte blocks may create an overlap of up to 31 bytes for alignment and tail bytes, not 15 as assumed. The handling of the tail bytes also needed a fixup:
Code: [Select]
; wrong:
movq xmm0,[esi] ; move 8..15 byte
movq [edi],xmm0 ; |8...|
movq xmm0,[esi+ecx-8] ; |...8|
movq [edi+ecx-8],xmm0
; correct:
movq xmm0,[esi] ; move 8..15 byte
movq xmm1,[esi+ecx-8] ; |...8|
movq [edi],xmm0 ; |8...|
movq [edi+ecx-8],xmm1
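
The reason the second ordering works for overlapping buffers can be shown with a small C sketch: both 8-byte pieces are loaded before anything is stored, so the second load can no longer see bytes that the first store has changed. The function name copy_8_to_15 is hypothetical, and the two uint64_t locals stand in for xmm0 and xmm1.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes, 8 <= n <= 15, with two overlapping 8-byte moves.
   dst and src may overlap in either direction. */
static void copy_8_to_15(unsigned char *dst, const unsigned char *src, size_t n)
{
    uint64_t head, tail;

    memcpy(&head, src, 8);            /* movq xmm0,[esi]       */
    memcpy(&tail, src + n - 8, 8);    /* movq xmm1,[esi+ecx-8] */
    memcpy(dst, &head, 8);            /* movq [edi],xmm0       */
    memcpy(dst + n - 8, &tail, 8);    /* movq [edi+ecx-8],xmm1 */
}
```

With the load, store, load, store order, a call such as copy(m+1, m) overwrites part of the source tail before it has been read.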

However, the solution was more or less as fast as the first one, and now it also handles overlapping copies (m,m+1) and (m+1,m). memmove is now equ <memcpy>, and this is the version I ended up using:
Code: [Select]
OPTION PROLOGUE:NONE, EPILOGUE:NONE

ifdef __SSE__

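strcpy  proc dst:ptr byte, src:ptr byte
mov ecx,esp
push [ecx] ; duplicate return address: 2 args become 3 slots
mov eax,[ecx+4] ; dst
mov [ecx],eax ; -> memcpy arg 1
mov eax,[ecx+8] ; src
mov [ecx+4],eax ; -> memcpy arg 2
xorps xmm1,xmm1 ; zero for compare
@@: movdqu  xmm0,[eax] ; scan 16 bytes at a time for the terminator
pcmpeqb xmm0,xmm1
pmovmskb ecx,xmm0
add eax,16
test ecx,ecx
jz @B
bsf ecx,ecx ; index of the zero byte in the block
sub eax,[esp+8] ; minus src
lea eax,[eax+ecx-15] ; length including the terminator
mov [esp+12],eax ; -> memcpy arg 3 (count)
strcpy  endp ; no ret: falls through into memcpy below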

endif

memcpy  proc dst, src, count
push esi
push edi
mov edi,[esp+12]
mov esi,[esp+16]
mov ecx,[esp+20]
ifdef __SSE__
 ifdef __AVX__
cmp ecx,1500
jb SSE2
bt sselevel,SSEBT_AVX
jnc SSE2
mov eax,edi
cmp eax,esi
ja @F
rep movsb
pop edi
pop esi
ret 12
align 4
 @@: lea esi,[esi+ecx-1]
lea edi,[edi+ecx-1]
std
rep movsb
cld
pop edi
pop esi
ret 12
align 4
 SSE2:
 endif
movdqu  xmm2,[esi] ; save align bytes
test ecx,-32 ; need 31 byte for overlap..
jz tail
push edx
movdqu  xmm3,[esi+ecx-16] ; save tail bytes
mov eax,esi ; align ESI 16
neg eax
and eax,1111B
mov edx,esi
sub edx,edi
cmp edx,ecx
mov edx,ecx ; save count in EDX
jbe overlapped
sub ecx,eax
add esi,eax
xchg edi,eax
add edi,eax ; return address to EAX
and ecx,-16 ; align ECX 16
align 4
@@: sub ecx,16
movdqa  xmm0,[esi+ecx]
movdqu  [edi+ecx],xmm0
jnz @B
movdqu  [eax],xmm2 ; fix tail and aligned bytes
movdqu  [eax+edx-16],xmm3
pop edx
pop edi
pop esi
ret 12
align 4
overlapped:
sub ecx,eax
and ecx,-16 ; align ECX 16
add eax,ecx
add esi,eax
xchg edi,eax
add edi,eax ; return address to EAX
neg ecx
align 4
@@: movdqa  xmm0,[esi+ecx]
movdqu  [edi+ecx],xmm0
add ecx,16
jnz @B
movdqu  [eax],xmm2 ; fix tail and aligned bytes
movdqu  [eax+edx-16],xmm3
pop edx
pop edi
pop esi
ret 12
align 4
tail: test ecx,ecx
jz toend ; 0
test ecx,-2
jz @1 ; 1
test ecx,-4
jz @2 ; 2..3
test ecx,-8
jz @4 ; 4..7
test ecx,-16
jz @8 ; 8..15
movdqu  xmm1,[esi+ecx-16]
movdqu  [edi],xmm2 ; 16..31
movdqu  [edi+ecx-16],xmm1
align 4
toend:  mov eax,edi
pop edi
pop esi
ret 12
align 4
@8: movq xmm1,[esi+ecx-8]
movq [edi],xmm2 ; 8..15
movq [edi+ecx-8],xmm1
jmp toend
align 4
@4: mov eax,[esi]
mov esi,[esi+ecx-4]
mov [edi],eax
mov [edi+ecx-4],esi
jmp toend
align 4
@2: mov eax,[esi]
mov [edi],ax
cmp ecx,3
jb toend
shr eax,16
mov [edi+2],al
jmp toend
align 4
@1: mov al,[esi]
mov [edi],al
jmp toend
else
mov eax,edi
cmp eax,esi
ja @F
rep movsb
pop edi
pop esi
ret 12
align 4
 @@: lea esi,[esi+ecx-1]
lea edi,[edi+ecx-1]
std
rep movsb
cld
pop edi
pop esi
ret 12
endif
memcpy  endp

This enables a simple way of inserting or exchanging text in a buffer:
Code: [Select]
strcpy(head+length,tail) ; make room for new string
memcpy(head,string,length) ; insert new string

For search and replace ("%PATH%", "C:\MASM32\BIN"), head is the result of strstri(), tail is head+sizeof("%PATH%"), and head+length is > tail, so source and destination overlap.
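
A rough C sketch of that two-step replace, assuming standard memmove/memcpy instead of the routines above. The function name replace_at and the capacity check are additions for the example and are not part of the library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Replace old_len bytes at buf+pos with the new_len bytes of repl.
   cap is the buffer capacity; returns 0 on success, -1 if the result
   would not fit. memmove is used because the regions overlap. */
static int replace_at(char *buf, size_t cap, size_t pos, size_t old_len,
                      const char *repl, size_t new_len)
{
    size_t len = strlen(buf);
    assert(pos + old_len <= len);
    size_t tail_len = len - (pos + old_len) + 1;   /* includes the terminator */
    if (len - old_len + new_len + 1 > cap)
        return -1;
    /* step 1: strcpy(head+length, tail), make room for the new string */
    memmove(buf + pos + new_len, buf + pos + old_len, tail_len);
    /* step 2: memcpy(head, string, length), insert the new string */
    memcpy(buf + pos, repl, new_len);
    return 0;
}
```

The first step is the one that needs an overlap-safe copy; the second never overlaps because the replacement text lives in a separate buffer.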

Code: [Select]
OPTION PROLOGUE:NONE, EPILOGUE:NONE

strstri proc dst:ptr byte, src:ptr byte

mov eax,esp
push edx
push ebx
push edi
mov edx,[eax+4]
mov ebx,[eax+8]

movzx eax,byte ptr [ebx]
or al,20h
mov ah,al
mov ecx,eax
shl eax,16
add eax,ecx
movd xmm2,eax
pshufd  xmm2,xmm2,0 ; populate char
mov eax,20202020h
movd xmm3,eax
pshufd  xmm3,xmm3,0 ; populate 20h for case
xorps xmm4,xmm4 ; clear xmm4 for zero compare

align 4
lup: movdqu  xmm0,[edx]
movdqa  xmm1,xmm0
pcmpeqb xmm1,xmm4 ; test for zero
pmovmskb ecx,xmm1
orps xmm0,xmm3 ; remove case..
pcmpeqb xmm0,xmm2 ; test for char
pmovmskb eax,xmm0
lea edx,[edx+16]
or ecx,eax
jz lup

bsf ecx,ecx
lea edx,[edx+ecx-16]
cmp byte ptr [edx],0
je not_found

xor ecx,ecx
lea eax,[ebx+1]
mov edi,edx
inc edx

align 4
lup2: xor cl,[eax]
jz found
mov ch,[edx]
or cx,2020h
sub cl,ch
jnz lup
inc eax
inc edx
jmp lup2

align 4
not_found:
xor eax,eax
jmp toend

align 4
found:  mov ecx,eax
mov eax,edi
sub ecx,ebx
test eax,eax

align 4
toend:
pop edi
pop ebx
pop edx
ret 8
strstri  endp

I added a System Information box to show the SSE level using Dave's test, and on my AMD I get this information:
Code: [Select]
  │        Streaming SIMD Extensions: [x] SSE    [ ] SSE4.1 │
  │                                   [x] SSE2   [ ] SSE4.2 │
  │                                   [x] SSE3   [ ] AVX    │
  │        [ ] AVX supported by OS    [ ] SSSE3  [ ] AVX2   │
  └─────────────────────────────────────────────────────────┘

However, on an Intel i3 I get this:
Code: [Select]
  │        Streaming SIMD Extensions: [x] SSE    [x] SSE4.1 │
  │                                   [x] SSE2   [ ] SSE4.2 │
  │                                   [ ] SSE3   [ ] AVX    │
  │        [ ] AVX supported by OS    [ ] SSSE3  [ ] AVX2   │
  └─────────────────────────────────────────────────────────┘

Is this possible? To have SSE4.1 and not SSE3?

Note: SSE and SSE2 are pre-set since the program will exit if SSE2 is not present, so this bit must be set by the test.
Title: Re: Code location sensitivity of timings
Post by: Gunther on August 12, 2014, 01:58:59 AM
Hi nidud,

Is this possible ? to have SSE4.1 and not SSE3 ?

Note: SSE and SSE2 are pre-set since the program will exit if SSE2 is not present, so this bit must be set by the test.

I think not. Did you try that (http://masm32.com/board/index.php?topic=1418.msg14444#msg14444)? It should show you the available instruction sets.

Gunther
Title: Re: Code location sensitivity of timings
Post by: nidud on August 12, 2014, 03:01:17 AM
I tested a laptop with this result:
Quote
                Supported Features
                ==================

Vendor String: GenuineIntel
Brand  String: Intel(R)Core(TM)i3-2367MCPU@1.40GHz

                Instruction Sets
                ----------------

MMX  SSE  SSE2  SSE3  SSSE3  SSE4.1  SSE4.2  AVX

                Supported Special Instructions
                ------------------------------

Conditional Moves
FXSAVE and FXSTOR
XSAVE and XSTOR for processor extended state management.
POPCNT

and my test now shows this result:
Code: [Select]
  │        Streaming SIMD Extensions: [x] SSE    [ ] SSE4.1 │
  │                                   [x] SSE2   [ ] SSE4.2 │
  │                                   [ ] SSE3   [ ] AVX    │
  │        [x] AVX supported by OS    [ ] SSSE3  [ ] AVX2   │
  └─────────────────────────────────────────────────────────┘

And here is the problem  :lol:
Code: [Select]
    mov cl,'x'
    mov eax,sselevel
    .if eax & SSE_AVXOS
mov [edx+906],cl
    .elseif eax & SSE_AVX2
mov [edx+982],cl
    .elseif eax & SSE_AVX
mov [edx+856],cl
    .elseif eax & SSE_SSE42
mov [edx+730],cl
    .elseif eax & SSE_SSE41
mov [edx+604],cl
    .elseif eax & SSE_SSSE3
mov [edx+960],cl
    .elseif eax & SSE_SSE3
mov [edx+834],cl
    .endif
Title: Re: Code location sensitivity of timings
Post by: Gunther on August 12, 2014, 08:49:00 AM
Hi nidud,

you can trust my instruction detecting application. Your laptop supports in any case SSE3 and SSSE3 and it supports AVX. You can test that with that tool (http://masm32.com/board/index.php?topic=3227.msg35958#msg35958), if you've at least Windows 7 with SP1 installed. The glitch must be in your code. Do you test the right bits?

Gunther
Title: Re: Code location sensitivity of timings
Post by: nidud on August 12, 2014, 09:27:19 AM
The glitch must be in your code. Do you test the right bits?

I did, but I used an .elseif in the code, so only one 'x' was set  :P
This is the code that inserts the 'x' in the dialog:
Code: [Select]
    mov cl,'x'
    mov eax,sselevel
    .if eax & SSE_AVXOS
mov [edx+906],cl
    .elseif eax & SSE_AVX2
mov [edx+982],cl
    .elseif eax & SSE_AVX
mov [edx+856],cl
    .elseif eax & SSE_SSE42
mov [edx+730],cl
    .elseif eax & SSE_SSE41
mov [edx+604],cl
    .elseif eax & SSE_SSSE3
mov [edx+960],cl
    .elseif eax & SSE_SSE3
mov [edx+834],cl
    .endif

The correct code is:
Code: [Select]
    mov cl,'x'
    mov eax,sselevel
    .if eax & SSE_AVXOS
mov [edx+906],cl
    .endif
    .if eax & SSE_AVX2
mov [edx+982],cl
    .endif
    .if eax & SSE_AVX
mov [edx+856],cl
    .endif
    .if eax & SSE_SSE42
mov [edx+730],cl
    .endif
    .if eax & SSE_SSE41
mov [edx+604],cl
    .endif
    .if eax & SSE_SSSE3
mov [edx+960],cl
    .endif
    .if eax & SSE_SSE3
mov [edx+834],cl
    .endif

So the test works   :t
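
For reference, a small C sketch of decoding the feature bits one by one, with each bit tested on its own as in the corrected code. The F_* values are made up for the example (the real SSE_* constants are not shown here). The bit positions are the documented CPUID ones: leaf 1 ECX bits 0, 9, 19, 20, 28 and leaf 7 (subleaf 0) EBX bit 5. The OS-support check for AVX (OSXSAVE plus XGETBV) is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag values for this sketch only. */
enum {
    F_SSE3  = 1 << 0, F_SSSE3 = 1 << 1, F_SSE41 = 1 << 2,
    F_SSE42 = 1 << 3, F_AVX   = 1 << 4, F_AVX2  = 1 << 5
};

/* Decode feature bits from CPUID leaf 1 (ECX) and leaf 7, subleaf 0 (EBX).
   Independent ifs, not an else-if chain, so several flags can be set. */
static unsigned decode_features(uint32_t leaf1_ecx, uint32_t leaf7_ebx)
{
    unsigned f = 0;
    if (leaf1_ecx & (1u << 0))  f |= F_SSE3;
    if (leaf1_ecx & (1u << 9))  f |= F_SSSE3;
    if (leaf1_ecx & (1u << 19)) f |= F_SSE41;
    if (leaf1_ecx & (1u << 20)) f |= F_SSE42;
    if (leaf1_ecx & (1u << 28)) f |= F_AVX;
    if (leaf7_ebx & (1u << 5))  f |= F_AVX2;
    return f;
}
```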
Title: Re: Code location sensitivity of timings
Post by: MichaelW on August 12, 2014, 10:00:55 AM
Hi Gunther,

My Pentium G3220 does not support AVX, but the results from your instruction set detection tool:
Code: [Select]
Supported by Processor and installed Operating System:
------------------------------------------------------

     MMX, CMOV and FCOMI, SSE, SSE2, SSE3, SSSE3, SSE4.1,
     POPCNT, SSE4.2

     featurenumber = 13

Appear to match the Intel specs:

http://ark.intel.com/products/77773
Title: Re: Code location sensitivity of timings
Post by: Gunther on August 12, 2014, 11:02:35 AM
Hi Michael,

Appear to match the Intel specs:

http://ark.intel.com/products/77773

I hope so. I've written the procedure using the Intel documents as a basis.

Gunther
Title: Re: Code location sensitivity of timings
Post by: nidud on August 18, 2014, 09:39:19 PM
here are the timings for memchr

note:
- XOR ECX,EBX is used to zero out the matching bytes
- SUB ECX,EBX, as used before, fails (a borrow can spill into the neighbouring byte)

Code: [Select]
memchr proc dst, char, count
mov eax,esp
push edx
push edi
mov edx,[eax+4]
push ebx
mov edi,[eax+12]
movzx ebx,byte ptr [eax+8]
mov bh,bl
mov ecx,ebx
shl ebx,16
add ebx,ecx ; populate char
add edi,edx ; limit
align 4
lup: cmp edx,edi
jae fail
mov ecx,[edx]
add edx,4
xor ecx,ebx
lea eax,[ecx-01010101h]
not ecx
and eax,ecx
and eax,80808080h
jz lup
bsf eax,eax
shr eax,3
lea eax,[eax+edx-4]
cmp eax,edi
jb toend
fail: xor eax,eax
toend:  pop ebx
pop edi
pop edx
ret 12
memchr endp
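
The zero-byte test in the loop, and the reason XOR works where SUB does not, can be sketched in C. The names has_zero_byte and memchr_swar are made up for the example. Unlike the assembly, this version finds the exact position with a byte loop, so it never reads past count and does not depend on byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if any byte of v is zero:
   lea eax,[ecx-01010101h] / not ecx / and eax,ecx / and eax,80808080h */
static uint32_t has_zero_byte(uint32_t v)
{
    return (v - 0x01010101u) & ~v & 0x80808080u;
}

static const void *memchr_swar(const void *buf, int ch, size_t count)
{
    const unsigned char *p = buf;
    uint32_t pattern = (uint32_t)(unsigned char)ch * 0x01010101u;
    size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        uint32_t w;
        memcpy(&w, p + i, 4);
        if (has_zero_byte(w ^ pattern))  /* XOR zeroes each matching byte */
            break;
    }
    for (; i < count; i++)               /* exact position, or the tail */
        if (p[i] == (unsigned char)ch)
            return p + i;
    return NULL;
}
```

With w = 41414140h and pattern 41414141h, the XOR gives 00000001h (three zero bytes), while the subtraction gives FFFFFFFFh and the three matches are lost.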

the SSE version:
Code: [Select]
memchr  proc dst, char, count
movzx eax,byte ptr [esp+8]
mov ah,al
mov ecx,eax
shl eax,16
add eax,ecx
movd xmm0,eax
pshufd  xmm0,xmm0,0 ; populate char
push edx
mov edx,[esp+8] ; buffer
mov ecx,[esp+16] ; length
add ecx,edx ; limit
align 4
lup: cmp edx,ecx
jae fail
movdqu  xmm1,[edx]
pcmpeqb xmm1,xmm0 ; test for char
pmovmskb eax,xmm1
add edx,16
test eax,eax
jz lup
bsf eax,eax
lea eax,[eax+edx-16]
cmp eax,ecx
jb toend
fail: xor eax,eax ; return ZERO
toend:  pop edx ; use EAX,ECX
ret 12
memchr  endp

Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)
--------------------------------------------
-- aligned strings --
165830    cycles -  40 (  0) 0: crt_memchr
330444    cycles -  40 ( 31) 1: memchr
166583    cycles -  40 ( 78) 2: x
33046     cycles -  40 ( 76) 3: SSE
-- unaligned strings --
167060    cycles -  40 (  0) 0: crt_memchr
329403    cycles -  40 ( 31) 1: memchr
166044    cycles -  40 ( 78) 2: x
36679     cycles -  40 ( 76) 3: SSE
-- short strings --
160039    cycles - 5000 (  0) 0: crt_memchr
326120    cycles - 5000 ( 31) 1: memchr
189276    cycles - 5000 ( 78) 2: x
123239    cycles - 5000 ( 76) 3: SSE
Title: Re: Code location sensitivity of timings
Post by: jj2007 on August 18, 2014, 10:36:29 PM
Code: [Select]
  movzx eax, byte ptr [esp+8]
  if 1
imul eax, eax, 01010101h ; 4 bytes shorter, faster
  else
mov ah, al
mov ecx, eax
shl eax, 16
add eax, ecx
  endif
  movd xmm0, eax
  pshufd xmm0, xmm0, 0 ; populate char
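
A quick C check that the two ways of filling a dword with one byte give the same value (function names are made up for the example). Whether the multiply is actually faster will depend on the CPU.

```c
#include <assert.h>
#include <stdint.h>

/* Shift-and-add version: mov ah,al / mov ecx,eax / shl eax,16 / add eax,ecx */
static uint32_t broadcast_shift(uint8_t c)
{
    uint32_t lo = (uint32_t)c | ((uint32_t)c << 8);
    return (lo << 16) + lo;
}

/* Multiply version: imul eax,eax,01010101h */
static uint32_t broadcast_mul(uint8_t c)
{
    return (uint32_t)c * 0x01010101u;
}
```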
Title: Re: Code location sensitivity of timings
Post by: nidud on August 18, 2014, 11:40:39 PM
Code: [Select]
  movzx eax, byte ptr [esp+8]
  if 1
imul eax, eax, 01010101h ; 4 bytes shorter, faster
  else
mov ah, al
mov ecx, eax
shl eax, 16
add eax, ecx
  endif
  movd xmm0, eax
  pshufd xmm0, xmm0, 0 ; populate char
thanks, I'll give that a try

here are the timings for the search functions

the buffer and the search string may include zeros
so the length has to be passed for both

I label them memstr and memstri
Code: [Select]
memstr  proc s1, l1, s2, l2
mov eax,esp
push ebp
push esi
        mov     esi,[eax+8]
push edi
        mov     edi,[eax+4]
push ebx
        mov     ebx,[eax+12]
mov ebp,[eax+16]
add esi,edi ; limit 1
movzx eax,byte ptr [ebx]
mov ah,al
mov ecx,eax
shl eax,16
add eax,ecx
movd xmm2,eax
pshufd  xmm2,xmm2,0 ; populate char in xmm2
        push    edx
        add     ebp,ebx         ; limit 2
align 16
lup: cmp edi,esi
jae fail
movdqu  xmm0,[edi]
pcmpeqb xmm0,xmm2 ; strchr loop
pmovmskb eax,xmm0
add edi,16
test eax,eax
jz lup
bsf eax,eax
lea edi,[eax+edi-16]
cmp edi,esi
ja fail
mov ecx,ebx
mov edx,edi
inc edi
align 4
@@: cmp ecx,ebp
ja lup
movdqu  xmm0,[ecx]
movdqu  xmm1,[edx]
pcmpeqb xmm1,xmm0 ; compare
pmovmskb eax,xmm1
add edx,16
add ecx,16
test eax,eax
jz @B
not eax
bsf eax,eax
lea ecx,[eax+ecx-16]
cmp ecx,ebp
jb lup
mov eax,edi
dec eax
jmp toend
align 4
fail: xor eax,eax
align 4
toend:  pop edx
pop ebx
pop edi
pop esi
pop ebp
ret 16
memstr  endp
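
The two-stage idea (scan for the first byte, then compare the rest) can be sketched in portable C with memchr and memcmp. This is a plain reference sketch with a made-up name, not a translation of the SSE code, and it only scans positions where the search string still fits.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Find the first occurrence of s2[0..l2) inside s1[0..l1).
   Both buffers may contain zero bytes, so the lengths are explicit. */
static const void *memstr_sketch(const void *s1, size_t l1,
                                 const void *s2, size_t l2)
{
    const unsigned char *p = s1;
    const unsigned char *needle = s2;
    if (l2 == 0)
        return s1;
    while (l1 >= l2) {
        /* stage 1: look for the first byte among the valid start positions */
        const unsigned char *hit = memchr(p, needle[0], l1 - l2 + 1);
        if (hit == NULL)
            return NULL;
        /* stage 2: compare the whole search string at the candidate */
        if (memcmp(hit, needle, l2) == 0)
            return hit;
        l1 -= (size_t)(hit - p) + 1;
        p = hit + 1;
    }
    return NULL;
}
```

Something like this could serve as the reference in a regression test for the SSE versions.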

Code: [Select]
memstri proc s1, l1, s2, l2
        mov     eax,esp
        push    ebp
        push    esi
        mov     esi,[eax+8]
        push    edi
        mov     edi,[eax+4]
        push    ebx
        mov     ebx,[eax+12]
        mov     ebp,[eax+16]
movzx eax,byte ptr [ebx]
or al,20h
mov ah,al
mov ecx,eax
shl eax,16
add eax,ecx
movd xmm2,eax
pshufd  xmm2,xmm2,0 ; populate char in xmm2
mov eax,20202020h
movd xmm3,eax
pshufd  xmm3,xmm3,0 ; populate case
push edx
lea esi,[esi+edi]   ; limit 1
lea ebp,[ebp+ebx]   ; limit 2
align 16
lup: cmp edi,esi
ja fail
movdqu  xmm0,[edi]
orps xmm0,xmm3
pcmpeqb xmm0,xmm2 ; strchr loop
pmovmskb eax,xmm0
add edi,16
test eax,eax
jz lup
bsf eax,eax
lea edi,[eax+edi-16]
cmp edi,esi
ja fail
mov ecx,ebx
mov edx,edi
inc edi
align 4
@@: cmp ecx,ebp
ja lup
movdqu  xmm0,[ecx]
movdqu  xmm1,[edx]
orps xmm0,xmm3
orps xmm1,xmm3
align 4
pcmpeqb xmm1,xmm0 ; compare
pmovmskb eax,xmm1
add edx,16
add ecx,16
test eax,eax
jz @B
not eax
bsf eax,eax
lea ecx,[eax+ecx-16]
cmp ecx,ebp
jb lup
mov eax,edi
dec eax
jmp toend
align 4
fail: xor eax,eax
align 4
toend:  pop edx
pop ebx
pop edi
pop esi
pop ebp
ret 16
memstri endp
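
One thing to keep in mind with the OR 20h case folding: it sets bit 5 on every byte, not only on letters, so a few non-letter pairs also compare equal. A small C illustration, assuming ASCII; the function names are made up for the example, and fold_letters is just one possible stricter variant.

```c
#include <assert.h>

/* Fold by setting bit 5 on every byte, as orps with 20202020h does. */
static unsigned char fold_or20(unsigned char c)
{
    return (unsigned char)(c | 0x20);
}

/* Stricter variant: only fold 'A'..'Z'. */
static unsigned char fold_letters(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c | 0x20) : c;
}
```

With the simple fold, '@' matches '`' and '[' matches '{'. For file names and paths this may not matter much.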

there is no reference implementation for these functions
so a scasb/cmpsb version is used in the first test
Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)
-----------------------------------------
-- aligned strings --
432873    cycles -  40 ( 65) 1: memstr
211434    cycles -  40 (131) 2: x
92919     cycles -  40 (148) 3: SSE
-- unaligned strings --
433122    cycles -  40 ( 65) 1: memstr
226451    cycles -  40 (131) 2: x
96972     cycles -  40 (148) 3: SSE
-- short strings 22 --
198281    cycles - 2000 ( 65) 1: memstr
130811    cycles - 2000 (131) 2: x
110509    cycles - 2000 (148) 3: SSE

memstri:
Code: [Select]
-- aligned strings --
218602    cycles -  40 (143) 2: x
102205    cycles -  40 (176) 3: SSE
-- unaligned strings --
234142    cycles -  40 (143) 2: x
104991    cycles -  40 (176) 3: SSE
-- short strings 22 --
140906    cycles - 2000 (143) 2: x
117958    cycles - 2000 (176) 3: SSE
Title: Re: Code location sensitivity of timings
Post by: jj2007 on August 19, 2014, 01:40:54 AM
Variants of memchr:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)

43778   cycles for 100 * memchr scasb
4474    cycles for 100 * memchr SSE2a
5608    cycles for 100 * memchr SSE2b

43994   cycles for 100 * memchr scasb
4497    cycles for 100 * memchr SSE2a
5602    cycles for 100 * memchr SSE2b

44044   cycles for 100 * memchr scasb
4474    cycles for 100 * memchr SSE2a
5598    cycles for 100 * memchr SSE2b

36      bytes for memchr scasb
88      bytes for memchr SSE2a
92      bytes for memchr SSE2b


Could look much different on other CPUs, as movlps speeds it up a lot on my CPU.
Title: Re: Code location sensitivity of timings
Post by: MichaelW on August 19, 2014, 02:27:49 AM
My processor is near (or at) bottom-end today (retail box, $79).
Code: [Select]
Intel(R) Pentium(R) CPU G3220 @ 3.00GHz (SSE4)

24909   cycles for 100 * memchr scasb
2864    cycles for 100 * memchr SSE2a
2399    cycles for 100 * memchr SSE2b

24934   cycles for 100 * memchr scasb
2882    cycles for 100 * memchr SSE2a
2366    cycles for 100 * memchr SSE2b

24923   cycles for 100 * memchr scasb
2886    cycles for 100 * memchr SSE2a
2418    cycles for 100 * memchr SSE2b

36      bytes for memchr scasb
88      bytes for memchr SSE2a
92      bytes for memchr SSE2b

96      = eax memchr scasb
96      = eax memchr SSE2a
96      = eax memchr SSE2b
Title: Re: Code location sensitivity of timings
Post by: nidud on August 19, 2014, 03:12:24 AM
Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)

22050   cycles for 100 * memchr scasb
7107    cycles for 100 * memchr SSE2a
4388    cycles for 100 * memchr SSE2b

21988   cycles for 100 * memchr scasb
7105    cycles for 100 * memchr SSE2a
4355    cycles for 100 * memchr SSE2b

22311   cycles for 100 * memchr scasb
7078    cycles for 100 * memchr SSE2a
4351    cycles for 100 * memchr SSE2b

36      bytes for memchr scasb
88      bytes for memchr SSE2a
92      bytes for memchr SSE2b

96      = eax memchr scasb
96      = eax memchr SSE2a
96      = eax memchr SSE2b

maybe scasb is faster on newer CPUs with AVX?
Title: Re: Code location sensitivity of timings
Post by: jj2007 on August 19, 2014, 03:39:31 AM
Thanks. As I suspected, the movlps/movhps pair is good only for my trusty Celeron  :(
Here is one more, with movups instead:
Code: [Select]
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
43821   cycles for 100 * memchr scasb
4477    cycles for 100 * memchr SSE2 lps/hps
5556    cycles for 100 * memchr SSE2 nidud
5205    cycles for 100 * memchr SSE2 ups

43778   cycles for 100 * memchr scasb
4476    cycles for 100 * memchr SSE2 lps/hps
5606    cycles for 100 * memchr SSE2 nidud
5206    cycles for 100 * memchr SSE2 ups

43762   cycles for 100 * memchr scasb
4482    cycles for 100 * memchr SSE2 lps/hps
5607    cycles for 100 * memchr SSE2 nidud
5200    cycles for 100 * memchr SSE2 ups

36      bytes for memchr scasb
88      bytes for memchr SSE2 lps/hps
92      bytes for memchr SSE2 nidud
84      bytes for memchr SSE2 ups
Title: Re: Code location sensitivity of timings
Post by: MichaelW on August 19, 2014, 04:07:35 AM
Code: [Select]
Intel(R) Pentium(R) CPU G3220 @ 3.00GHz (SSE4)

24916   cycles for 100 * memchr scasb
2889    cycles for 100 * memchr SSE2 lps/hps
2422    cycles for 100 * memchr SSE2 nidud
2351    cycles for 100 * memchr SSE2 ups

24927   cycles for 100 * memchr scasb
2890    cycles for 100 * memchr SSE2 lps/hps
2469    cycles for 100 * memchr SSE2 nidud
2342    cycles for 100 * memchr SSE2 ups

24921   cycles for 100 * memchr scasb
2885    cycles for 100 * memchr SSE2 lps/hps
2405    cycles for 100 * memchr SSE2 nidud
2351    cycles for 100 * memchr SSE2 ups

36      bytes for memchr scasb
88      bytes for memchr SSE2 lps/hps
92      bytes for memchr SSE2 nidud
84      bytes for memchr SSE2 ups
Title: Re: Code location sensitivity of timings
Post by: nidud on August 19, 2014, 05:06:01 AM
I made an extended version of the timer
to test specific sections of code

in this case I tested file search
the target is 38000 files in 4300 directories

this is a console window (TrueType font)
Code: [Select]
Timings for FileSearch:
 0:   345674274 - main
 1:     4159852 - memstri(38000)
 2:    80710236 - print directory(4300)
 3:   129571088 - open/read/close(38000)

this is in full screen mode (system font)
Code: [Select]
Timings for FileSearch:
 0:   248125882 - total
 1:     4042188 - search(38000)
 2:    16447826 - output(4300)
 3:   126074116 - IO
Title: Re: Code location sensitivity of timings
Post by: Gunther on August 19, 2014, 05:08:29 AM
Jochen,

your timings:
Code: [Select]
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

21892   cycles for 100 * memchr scasb
3007    cycles for 100 * memchr SSE2 lps/hps
2690    cycles for 100 * memchr SSE2 nidud
2500    cycles for 100 * memchr SSE2 ups

21951   cycles for 100 * memchr scasb
2981    cycles for 100 * memchr SSE2 lps/hps
2721    cycles for 100 * memchr SSE2 nidud
6211    cycles for 100 * memchr SSE2 ups

21827   cycles for 100 * memchr scasb
3003    cycles for 100 * memchr SSE2 lps/hps
2510    cycles for 100 * memchr SSE2 nidud
2721    cycles for 100 * memchr SSE2 ups

36      bytes for memchr scasb
88      bytes for memchr SSE2 lps/hps
92      bytes for memchr SSE2 nidud
84      bytes for memchr SSE2 ups

--- ok ---

Gunther
Title: Re: Code location sensitivity of timings
Post by: nidud on March 22, 2015, 02:17:44 AM
Some notes about the SIMD functions in this thread.

As pointed out by rrr in this post (http://masm32.com/board/index.php?topic=4067.msg43127#msg43127), reading ahead more than one byte may read into a protected memory page and fault.

The situation will be like this:
Code: [Select]
[PAGE_NOACCESS][H -- current page -- T][PAGE_NOACCESS]
Reading above the Tail byte or below the Head byte will fail in this case. If the pointer is the address of T and the function starts by reading 4 bytes it will overlap the protected area by 3 bytes.

A page is aligned to 4096 (1000h) bytes, so the offset of T is xFFF and the offset of H is x000. The fix is then to align the pointer to the chunk size. With a chunk size of 4 the first read will be at xFFC, and with a chunk size of 16 at xFF0.

In the last case up to 15 bytes in front of the pointer need to be masked out of the first test, so this creates an overhead in all these functions: the pointer has to be aligned to the chunk size before the fun begins.
Code: [Select]
mov ecx,eax
and ecx,16-1 ; unaligned bytes - 0..15
and eax,-16 ; align 16
or edx,-1 ; create mask
shl edx,cl
pxor xmm0,xmm0 ; populate zero in xmm0
pcmpeqb xmm0,[eax] ; check for zero
pmovmskb ecx,xmm0
and ecx,edx ; remove bytes in front
jnz done
pxor xmm0,xmm0
main_loop:
add eax,16
...
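
The align-and-mask arithmetic can be modelled in C like this. It is only a model under stated assumptions: base is treated as 16-byte aligned, the buffer size is a multiple of 16, and zero_mask16 stands in for pcmpeqb + pmovmskb. All names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit i of the result is set when chunk[i] == 0. */
static unsigned zero_mask16(const unsigned char *chunk)
{
    unsigned m = 0;
    for (int i = 0; i < 16; i++)
        if (chunk[i] == 0)
            m |= 1u << i;
    return m;
}

/* strlen of the string at base + start, reading only whole aligned chunks. */
static size_t strlen_chunked(const unsigned char *base, size_t start)
{
    size_t chunk = start & ~(size_t)15;       /* and eax,-16   */
    unsigned skip = (unsigned)(start & 15);   /* and ecx,16-1  */
    /* or edx,-1 / shl edx,cl / and ecx,edx: drop the bytes in front */
    unsigned m = zero_mask16(base + chunk) & (0xFFFFu << skip) & 0xFFFFu;
    while (m == 0) {
        chunk += 16;
        m = zero_mask16(base + chunk);
    }
    unsigned bit = 0;
    while (!(m & 1u)) { m >>= 1; bit++; }     /* bsf */
    return chunk + bit - start;
}
```

Since every chunk starts at a multiple of 16, a chunk can never straddle a 4096-byte page boundary, which is the whole point of the alignment step.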

Here is an updated version (http://masm32.com/board/index.php?topic=4067.msg43334#msg43334) of strlen using 32 bytes.

This is a "page safe" version of strchr:
Code: [Select]
.686
.model flat, stdcall
.xmm
.code

OPTION PROLOGUE:NONE, EPILOGUE:NONE

strchr  proc string, char
push edi
push edx
mov ecx,8[esp+4]
mov eax,8[esp+8]
and eax,000000FFh
imul eax,eax,01010101h
movd xmm1,eax
pshufd  xmm1,xmm1,0 ; populate char in xmm1
xorps xmm0,xmm0 ; populate zero in xmm0
mov edi,ecx
and ecx,16-1 ; unaligned bytes - 0..15
and edi,-16 ; align EDI 16
movdqa  xmm2,[edi] ; load 16 bytes
movdqa  xmm3,xmm2 ; copy to xmm3
or edx,-1 ; create mask
shl edx,cl
pcmpeqb xmm3,xmm1 ; check for char
pmovmskb ecx,xmm3
pcmpeqb xmm2,xmm0 ; check for zero
pmovmskb eax,xmm2
or ecx,eax ; combine result
and ecx,edx ; remove bytes in front
jnz done
@@:
movdqa  xmm2,[edi+16] ; continue testing 16-byte blocks
movdqa  xmm3,xmm2
lea edi,[edi+16]
pcmpeqb xmm3,xmm1
pmovmskb ecx,xmm3
pcmpeqb xmm2,xmm0
pmovmskb eax,xmm2
or ecx,eax
jz @B
done:
bsf ecx,ecx ; get index of bit
lea eax,[edi+ecx] ; load address to result
cmp byte ptr [eax],1 ; points to zero or char
sbb ecx,ecx ; if zero CARRY flag set
not ecx ; 0 or -1
and eax,ecx ; 0 or pointer
pop edx
pop edi
ret 8
strchr  endp

END

This is a new "safe" version of memcpy using 32 bytes. The problem here was reading 16 bytes before the size test.
Code: [Select]
memcpy  proc dst, src, count
push esi
push edi
mov eax,[esp+12] ; dst -- return value
mov esi,[esp+16] ; src
mov ecx,[esp+20] ; count
cmp ecx,64 ; skip short strings
jb copy_0_31
movdqu  xmm3,[esi] ; save aligned and tail bytes
movdqu  xmm4,[esi+16]
movdqu  xmm5,[esi+ecx-16]
movdqu  xmm6,[esi+ecx-32]
mov edi,eax ; align pointer
neg edi
and edi,32-1
add esi,edi
sub ecx,edi
add edi,eax
and ecx,-32
cmp esi,edi ; get direction of copy
ja move_R
loop_L:
sub ecx,32
movdqu  xmm1,[esi+ecx] ; copy 32 bytes
movdqu  xmm2,[esi+ecx+16]
movdqa  [edi+ecx],xmm1
movdqa  [edi+ecx+16],xmm2
jnz loop_L
jmp done
db 13 dup(90h) ; align loop_R 16
move_R:
add edi,ecx
add esi,ecx
neg ecx
loop_R:
movdqu  xmm1,[esi+ecx]
movdqu  xmm2,[esi+ecx+16]
movdqa  [edi+ecx],xmm1
movdqa  [edi+ecx+16],xmm2
add ecx,dword ptr 32
jnz loop_R
done:
mov ecx,[esp+20] ; fixup after copy
movdqu  [eax],xmm3 ; 0..31 unaligned from start
movdqu  [eax+16],xmm4 ;
movdqu  [eax+ecx-16],xmm5 ; 0..31 unaligned tail bytes
movdqu  [eax+ecx-32],xmm6 ;
toend:
pop edi
pop esi
ret 12
;----------------------------------------------------------------------
; Copy 0..63 byte
;----------------------------------------------------------------------
copy_0_31:
test ecx,ecx
jz toend
test ecx,-2
jz copy_1
test ecx,-4
jz copy_2
test ecx,-8
jz copy_4
test ecx,-16
jz copy_8
test ecx,-32
movdqu  xmm1,[esi]
movdqu  xmm3,[esi+ecx-16]
jz copy_16
movdqu  xmm2,[esi+16]
movdqu  xmm4,[esi+ecx-32]
movdqu  [eax+ecx-32],xmm4
movdqu  [eax+16],xmm2
copy_16:
movdqu  [eax],xmm1
movdqu  [eax+ecx-16],xmm3
jmp toend
copy_8:
movq xmm2,[esi]
movq xmm1,[esi+ecx-8]
movq [eax],xmm2
movq [eax+ecx-8],xmm1
jmp toend
copy_4:
mov edi,[esi]
mov esi,[esi+ecx-4]
mov [eax],edi
mov [eax+ecx-4],esi
jmp toend
copy_2:
mov di,[esi]
mov si,[esi+ecx-2]
mov [eax+ecx-2],si
mov [eax],di
jmp toend
copy_1:
mov cl,[esi]
mov [eax],cl
jmp toend
memcpy  endp

With regard to functions using two pointers, this may be difficult to do safely. Consider strcmp(T, H): if both pointers are moved back 15 bytes to align T, the Head pointer will reach 15 bytes into the preceding page. In the samples above this is handled by reading (ahead) chunks of 16 bytes and comparing, which is not very safe.
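
One possible way to reason about the two-pointer case is to compute how far both pointers can be read before either one reaches a page boundary, and use wide reads only within that span. The C sketch below shows just that arithmetic. It assumes 4096-byte pages, the names are made up, and it is not how the functions in this thread work.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumption: 4 KB pages */

/* Bytes from address a up to the end of its page. */
static size_t bytes_to_page_end(uintptr_t a)
{
    return SKETCH_PAGE_SIZE - (size_t)(a & (SKETCH_PAGE_SIZE - 1));
}

/* How many bytes may be read from both a and b before either read
   would cross into the next page. */
static size_t safe_span(uintptr_t a, uintptr_t b)
{
    size_t ra = bytes_to_page_end(a);
    size_t rb = bytes_to_page_end(b);
    return ra < rb ? ra : rb;
}
```

When the span is shorter than the chunk size, the remaining bytes up to the boundary would have to be compared one at a time.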