Author Topic: Code location sensitivity of timings  (Read 30100 times)

RuiLoureiro

  • Member
  • ****
  • Posts: 819
Re: Code location sensitivity of timings
« Reply #45 on: July 26, 2014, 03:38:14 AM »
File length = 977412

1484 ms
1453 ms
1516 ms
1547 ms
Press any key to continue ...

RuiLoureiro

Re: Code location sensitivity of timings
« Reply #46 on: July 26, 2014, 03:40:43 AM »
1344 ms
1344 ms
1343 ms
2016 ms...

1343 ms
1344 ms
1344 ms
2015 ms...

1344 ms
1359 ms
1344 ms
2016 ms
If we remove the worst case ...
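Dropping the worst case can be done mechanically. A minimal sketch in C, applied to the last group of timings above; the helper name `mean_without_worst` is made up for illustration and is not part of the test program:

```c
#include <stddef.h>

/* Mean of n timings after discarding the single largest one.
   Hypothetical helper for illustration; requires n >= 2. */
double mean_without_worst(const unsigned *ms, size_t n)
{
    unsigned long sum = 0;
    unsigned worst = ms[0];

    for (size_t i = 0; i < n; i++) {
        sum += ms[i];
        if (ms[i] > worst)
            worst = ms[i];          /* remember the slowest run */
    }
    return (double)(sum - worst) / (double)(n - 1);
}
```

For 1344, 1359, 1344 and 2016 ms this gives 1349 ms, against about 1516 ms when the 2016 ms run is kept.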

RuiLoureiro

Re: Code location sensitivity of timings
« Reply #47 on: July 26, 2014, 04:08:58 AM »
If I am not mistaken, you are using two counters:
           First counter  = 1000
           Second counter = count (= 4000, etc.)

You get the result only when the first counter reaches 0 (at counter_end),
so the result has something to do with the execution of this:
Code: [Select]
   mov edi,count
   mov ebx,esp
   .while edi
       pushargs
       call esi
       mov esp,ebx
       dec edi
   .endw
Is there any particular reason for this?
Quote
   counter_begin 1000, HIGH_PRIORITY_CLASS
   mov edi,count
   mov ebx,esp
   .while edi
       pushargs
       call esi
       mov esp,ebx
       dec edi
   .endw
   counter_end

jj2007

  • Member
  • *****
  • Posts: 10545
  • Assembler is fun ;-)
    • MasmBasic
Re: Code location sensitivity of timings
« Reply #48 on: July 26, 2014, 05:09:09 AM »
Quote
Check if the align is really needed
I normally tune them from the list file in the end

What I meant is that rep movsX may not need ANY alignment, simply because it isn't a loop at this level (at microcode level, it is of course a loop).

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (MMX, SSE, SSE2, SSE3)
movsd align 16  10476 µs
movsd align 3   10456 µs
movsd align 13  10347 µs
movsb align 16  10510 µs
movsb align 3   10503 µs
movsb align 13  10407 µs

movsd align 16  10514 µs
movsd align 3   10469 µs
movsd align 13  10516 µs
movsb align 16  10455 µs
movsb align 3   10515 µs
movsb align 13  10502 µs

movsd align 16  10526 µs
movsd align 3   10455 µs
movsd align 13  10469 µs
movsb align 16  10360 µs
movsb align 3   10485 µs
movsb align 13  10456 µs


Sample:
test4a proc uses esi edi ecx
  align 16
  nops 3
  rep movsb
  ret
test4a endp


Interesting, though, that movsb is indeed equally fast on my trusty old Celeron, at least for a 10 MB string.
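To put a number on "equally fast", the spread of the 18 timings can be computed directly. A small sketch in C; the function name `spread_percent` is made up, and the array simply repeats the values listed above:

```c
#include <stddef.h>

/* Relative spread (max - min) / min of a set of timings, in percent.
   Also reports the fastest and slowest value through the out pointers. */
double spread_percent(const unsigned *us, size_t n,
                      unsigned *min_out, unsigned *max_out)
{
    unsigned lo = us[0], hi = us[0];

    for (size_t i = 1; i < n; i++) {
        if (us[i] < lo) lo = us[i];
        if (us[i] > hi) hi = us[i];
    }
    *min_out = lo;
    *max_out = hi;
    return 100.0 * (double)(hi - lo) / (double)lo;
}

/* The 18 Celeron M timings from the three runs above, in microseconds. */
static const unsigned celeron_us[18] = {
    10476, 10456, 10347, 10510, 10503, 10407,
    10514, 10469, 10516, 10455, 10515, 10502,
    10526, 10455, 10469, 10360, 10485, 10456
};
```

The fastest run is 10347 µs and the slowest 10526 µs, a spread of about 1.7 %, and no alignment is consistently ahead across the three runs.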

nidud

  • Member
  • *****
  • Posts: 1980
    • https://github.com/nidud/asmc
Re: Code location sensitivity of timings
« Reply #49 on: July 26, 2014, 05:43:01 AM »
Quote
Is there any particular reason for this ?

Not really, no.

The macro can only be called with EDI, ESI, EBX, or an immediate value as the count. I think I just ran out of registers once and inserted a loop. The count for small functions is also rather high, so it's just a way of skipping zeros, I guess.

Quote
What I meant is that rep movsX may not need ANY alignment, simply because it isn't a loop at this level (at microcode level, it is of course a loop).

That seems to be correct.
Not sure why, but I assumed the alignment had an effect for some reason.

Gunther

  • Member
  • *****
  • Posts: 3585
  • Forgive your enemies, but never forget their names
Re: Code location sensitivity of timings
« Reply #50 on: July 26, 2014, 08:50:17 AM »
That's the result of memcpy.exe from 1234.zip:
Code: [Select]
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
----------------------------------------------
-- aligned strings --
498952    cycles -  10 (  0) 0: crt_memcpy
898756    cycles -  10 ( 75) 1: movsd - mov eax,ecx
903577    cycles -  10 ( 75) 2: movsd - push ecx
354813    cycles -  10 ( 59) 3: movsb
487954    cycles -  10 (182) 4: SSE
-- unaligned strings --
494936    cycles -  10 (  0) 0: crt_memcpy
895940    cycles -  10 ( 75) 1: movsd - mov eax,ecx
895968    cycles -  10 ( 75) 2: movsd - push ecx
373553    cycles -  10 ( 59) 3: movsb
491344    cycles -  10 (182) 4: SSE
-- short strings 15 --
175961    cycles - 8000 (  0) 0: crt_memcpy
361324    cycles - 8000 ( 75) 1: movsd - mov eax,ecx
361586    cycles - 8000 ( 75) 2: movsd - push ecx
313550    cycles - 8000 ( 59) 3: movsb
92719     cycles - 8000 (182) 4: SSE
-- short strings 271 --
841879    cycles - 8000 (  0) 0: crt_memcpy
780741    cycles - 8000 ( 75) 1: movsd - mov eax,ecx
806939    cycles - 8000 ( 75) 2: movsd - push ecx
623419    cycles - 8000 ( 59) 3: movsb
275466    cycles - 8000 (182) 4: SSE
-- short strings 2014 --
1002628   cycles - 4000 (  0) 0: crt_memcpy
2239737   cycles - 4000 ( 75) 1: movsd - mov eax,ecx
2226209   cycles - 4000 ( 75) 2: movsd - push ecx
962207    cycles - 4000 ( 59) 3: movsb
972245    cycles - 4000 (182) 4: SSE
--- ok ---

Gunther
Get your facts first, and then you can distort them.

RuiLoureiro

Re: Code location sensitivity of timings
« Reply #51 on: July 26, 2014, 09:02:01 AM »
Quote
The macro can only be called by EDI, ESI, or EBX or an immediate value.
I think I just run out of regs once and inserted a loop.
The count for small functions is also rather high so it's
just a way of skipping zeros I guess.
    I think you are talking about this macro:

        counter_begin MACRO loopcount:REQ, priority
    or  counter_end

    If so, we cannot use EBX, because cpuid overwrites EBX.

I modified counter_begin (written by MichaelW) to this
(COUNTERLOOPS = 1000, 10000, 100000, ...):
Code: [Select]
; This macro uses EDI internally as the current length, running from kIni to kEnd.
; An array is needed to save the means (see _TblTiming0 and SaveMeans).
; _LoopCount, _MaxLength, etc. must be defined in .DATA.
BEGIN_COUNTER_CYCLE_HIGH_PRIORITY_CLASS MACRO   kIni, kEnd
                                        LOCAL   labelA,labelB

                mov     _LoopCount, COUNTERLOOPS
                mov     _MaxLength, kEnd
                mov     edi, kIni
                ;mov     _MinLength, edi         ;; not used yet
                mov     _MeanValue, 0           ;; mean is 0

                invoke  GetCurrentProcess
                invoke  SetPriorityClass, eax, HIGH_PRIORITY_CLASS

    labelA:                                         ;; Begin test loop
   
                BEGIN_LOOP_TEST equ <labelA>
           
                xor     eax, eax        ;; Use same CPUID input value for each call
                cpuid                   ;; Flush pipe & wait for pending ops to finish
                rdtsc                   ;; Read Time Stamp Counter

                push    edx             ;; Preserve high-order 32 bits of start count
                push    eax             ;; Preserve low-order 32 bits of start count
           
                mov     _LoopCounter, COUNTERLOOPS
                xor     eax, eax
                cpuid                   ;; Make sure loop setup instructions finish
          ALIGN 16                      ;; Optimal loop alignment for P6
          @@:                           ;; Start an empty reference loop
                sub     _LoopCounter, 1
                jnz     short @B

                xor     eax, eax
                cpuid                   ;; Make sure loop instructions finish
                rdtsc                   ;; Read end count
                pop     ecx             ;; Recover low-order 32 bits of start count
                sub     eax, ecx        ;; Low-order 32 bits of overhead count in EAX
                pop     ecx             ;; Recover high-order 32 bits of start count
                sbb     edx, ecx        ;; High-order 32 bits of overhead count in EDX
                push    edx             ;; Preserve high-order 32 bits of overhead count
                push    eax             ;; Preserve low-order 32 bits of overhead count

                xor     eax, eax
                cpuid
                rdtsc
                push    edx             ;; Preserve high-order 32 bits of start count
                push    eax             ;; Preserve low-order 32 bits of start count
                ;;-------------------------------------
                ;;              Start
                ;;-------------------------------------
                mov         _LoopCounter, COUNTERLOOPS
                xor         eax, eax
                cpuid                   ;; Make sure loop setup instructions finish
    ALIGN 16                            ;; Optimal loop alignment for P6
    labelB:                             ;; Start test loop
                START_LOOP_TEST equ <labelB>
ENDM
; ------------------------------------------------------------------------
END_COUNTER_CYCLE       MACRO  arg
                        LOCAL  $tmpstr$

                sub         _LoopCounter, 1
                jnz         START_LOOP_TEST                ;; goto labelB
                ;;---------------------------
                ;;   stop this count
                ;;---------------------------
                xor         eax, eax
                cpuid                       ;; Make sure loop instructions finish
                rdtsc                       ;; Read end count
                pop         ecx             ;; Recover low-order 32 bits of start count
                sub         eax, ecx        ;; Low-order 32 bits of test count in EAX
                pop         ecx             ;; Recover high-order 32 bits of start count
                sbb         edx, ecx        ;; High-order 32 bits of test count in EDX
                pop         ecx             ;; Recover low-order 32 bits of overhead count
                sub         eax, ecx        ;; Low-order 32 bits of adjusted count in EAX
                pop         ecx             ;; Recover high-order 32 bits of overhead count
                sbb         edx, ecx        ;; High-order 32 bits of adjusted count in EDX

                mov         DWORD PTR _CounterQword, eax
                mov         DWORD PTR _CounterQword + 4, edx
                finit
                fild        _CounterQword
                fild        _LoopCount
                fdiv
                fistp       _CounterQword

                mov         ebx, dword ptr _CounterQword
               
                ;---------------------------------------------------
                ;               print cycles
                ;---------------------------------------------------
                add         ebx, _MeanValue
                mov         _MeanValue, ebx

                add         edi, 1
                cmp         edi, _MaxLength
                jbe         BEGIN_LOOP_TEST                      ;; goto labelA

                invoke      GetCurrentProcess
                invoke      SetPriorityClass, eax, NORMAL_PRIORITY_CLASS
               
                ; --------------------------------------------------
                ;          Save mean and print mean               
                ; --------------------------------------------------
                invoke      SaveMeans, ebx          ;; save it in one array
                                                       ;; one after another
               
                ;---------------------------------------------------
                print       str$(ebx)                       
                $tmpstr$    CATSTR <chr$(">, <arg>, <",13,10)>       
                print       $tmpstr$
                ;---------------------------------------------------                 
ENDM

Code: [Select]
.data
ALIGN 8                         ;; Optimal alignment for QWORD
_CounterQword   dq 0
_LoopCount      dd 0
_LoopCounter    dd 0                                   

_MinLength      dd 0
_MaxLength      dd 0
_MeanValue      dd 0
;------------------------------
ALIGN   4
                dd 0                ; <<<--- start with 0   
_TblTiming0     dd 600 dup (?)
.code
SaveMeans       proc        kMean:DWORD                   
                mov         eax, kMean
                mov         edx, offset _TblTiming0                   
                mov         ecx, [edx-4]            ; number of means
                mov         [edx+ecx*4], eax                   
                add         ecx, 1
                mov         [edx-4], ecx
                ret
SaveMeans       endp
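
The arithmetic behind the two macros is easier to check outside of assembly. A rough C model, with some assumptions: the four time-stamp readings are passed in as plain numbers, the names `adjusted_cycles`, `save_mean` and `MeanTable` are made up, integer division truncates where the macro's fistp rounds, and the clamp and the bounds check are additions that the macros above do not have:

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_MEANS 600            /* same capacity as _TblTiming0 */

typedef struct {
    uint32_t count;              /* number of values stored so far */
    uint32_t mean[MAX_MEANS];
} MeanTable;

/* Cycles per iteration after subtracting the empty reference loop.
   The arguments are raw counter readings taken around each loop. */
uint64_t adjusted_cycles(uint64_t ref_start, uint64_t ref_end,
                         uint64_t test_start, uint64_t test_end,
                         uint32_t loops)
{
    uint64_t overhead = ref_end - ref_start;
    uint64_t test = test_end - test_start;
    /* Noise can make the test loop look cheaper than the empty loop. */
    uint64_t net = test > overhead ? test - overhead : 0;

    return net / loops;
}

/* Append one value; returns 0 instead of writing past the end of the table. */
int save_mean(MeanTable *t, uint32_t value)
{
    if (t->count >= MAX_MEANS)
        return 0;
    t->mean[t->count++] = value;
    return 1;
}
```

With a reference loop of 1000 cycles and a test loop of 6000 cycles over 1000 iterations, the adjusted result is 5 cycles per iteration.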
« Last Edit: July 26, 2014, 08:28:45 PM by RuiLoureiro »

nidud

Re: Code location sensitivity of timings
« Reply #52 on: July 27, 2014, 02:10:39 AM »
I will assume the tipping point is then at level 4.1

This appears to be false:

Intel(R) Core(TM) i3 CPU    540  @ 3.07GHz (SSE4)
----------------------------------------------
-- aligned strings --
689962    cycles -  10 (  0) 0: crt_memcpy
1434759   cycles -  10 ( 75) 1: movsd - mov eax,ecx
1430928   cycles -  10 ( 75) 2: movsd - push ecx
3170836   cycles -  10 ( 59) 3: movsb
686200    cycles -  10 (182) 4: SSE
-- unaligned strings --
676937    cycles -  10 (  0) 0: crt_memcpy
1430499   cycles -  10 ( 75) 1: movsd - mov eax,ecx
1430349   cycles -  10 ( 75) 2: movsd - push ecx
3157179   cycles -  10 ( 59) 3: movsb
670373    cycles -  10 (182) 4: SSE
-- short strings 15 --
200367    cycles - 8000 (  0) 0: crt_memcpy
448189    cycles - 8000 ( 75) 1: movsd - mov eax,ecx
440419    cycles - 8000 ( 75) 2: movsd - push ecx
752747    cycles - 8000 ( 59) 3: movsb
152267    cycles - 8000 (182) 4: SSE
-- short strings 271 --
1473090   cycles - 8000 (  0) 0: crt_memcpy
1281263   cycles - 8000 ( 75) 1: movsd - mov eax,ecx
1328604   cycles - 8000 ( 75) 2: movsd - push ecx
3323304   cycles - 8000 ( 59) 3: movsb
344338    cycles - 8000 (182) 4: SSE
-- short strings 2014 --
1901915   cycles - 4000 (  0) 0: crt_memcpy
4364200   cycles - 4000 ( 75) 1: movsd - mov eax,ecx
4363512   cycles - 4000 ( 75) 2: movsd - push ecx
8643318   cycles - 4000 ( 59) 3: movsb
1110447   cycles - 4000 (182) 4: SSE


I wrote a test program to get the version:
SSE4.2 supported

and then used Gunther's AVX test:

      AVX check
      ---------

The CPU doesn't support AVX.
The Operating System hasn't enabled XSETBV/XGETBV instructions.
Operating System doesn't support YMM state.


so the tipping point seems to be AVX, at least on the machines tested so far:
Code: [Select]
bt sselevel,SSEBT_AVX
jnc @F

rep movsb
ret

@@: ; SSE2 copy..

I added some bits to Dave's test:
Code: [Select]
SSE_XGETBV equ 00010000000B
SSE_AVX equ 00100000000B
SSE_AVX2 equ 01000000000B
SSE_AVXOS equ 10000000000B

SSEBT_XGETBV equ 7
SSEBT_AVX equ 8
SSEBT_AVX2 equ 9
SSEBT_AVXOS equ 10

    pushfd
    pop     eax
    mov     ecx,200000h             ; EFLAGS.ID (bit 21)
    mov     edx,eax
    xor     eax,ecx
    push    eax
    popfd
    pushfd
    pop     eax
    xor     eax,edx
    and     eax,ecx                 ; nonzero if the ID bit can be toggled
    push    ebx
    .if !ZERO?
        xor     eax,eax
        cpuid                       ; EAX = highest basic leaf
        .if eax
            .if ah == 5
                xor     eax,eax
            .else
                xor     ebx,ebx     ; no leaf 7: report no AVX2
                .if eax >= 7
                    mov     eax,7
                    xor     ecx,ecx
                    cpuid           ; check AVX2 support
                .endif
                xor     eax,eax
                bt      ebx,5       ; AVX2
                rcl     eax,1       ; into bit 9
                push    eax
                mov     eax,1
                cpuid
                pop     eax
                bt      ecx,28      ; AVX support by CPU
                rcl     eax,1       ; into bit 8
                bt      ecx,27      ; XGETBV supported
                rcl     eax,1       ; into bit 7
                bt      ecx,20      ; SSE4.2
                rcl     eax,1       ; into bit 6
                bt      ecx,19      ; SSE4.1
                rcl     eax,1       ; into bit 5
                bt      ecx,9       ; SSSE3
                rcl     eax,1       ; into bit 4
                bt      ecx,0       ; SSE3
                rcl     eax,1       ; into bit 3
                bt      edx,26      ; SSE2
                rcl     eax,1       ; into bit 2
                bt      edx,25      ; SSE
                rcl     eax,1       ; into bit 1
                bt      edx,23      ; MMX (EDX bit 23, not ECX bit 0)
                rcl     eax,1       ; into bit 0
            .endif
        .endif
    .endif
    bt      eax,SSEBT_XGETBV
    jnc     @F
    push    eax
    xor     ecx,ecx
    xgetbv
    and     eax,6                   ; XCR0 bits 1 and 2: XMM and YMM state
    cmp     eax,6                   ; both must be enabled by the OS
    pop     eax                     ; POP leaves the flags alone
    jne     @F
    or      eax,SSE_AVXOS
@@:
    pop     ebx
    ret
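
To check the bit packing independently of the assembly, the same layout can be modelled in C with the register values passed in as plain arguments. This is only a sketch under assumptions: `pack_sse_level` and the `BT_*` names are made up here, bits 0 to 6 follow the "into bit N" comments, MMX, SSE and SSE2 are taken from CPUID leaf 1 EDX bits 23, 25 and 26, AVX2 is only read from leaf 7 when the CPU reports that leaf, and the OS flag requires both XCR0 bits 1 and 2:

```c
#include <stdint.h>

/* Bit positions of the packed result (bits 7..10 match SSEBT_* above). */
enum {
    BT_MMX = 0, BT_SSE, BT_SSE2, BT_SSE3, BT_SSSE3, BT_SSE41, BT_SSE42,
    BT_XGETBV, BT_AVX, BT_AVX2, BT_AVXOS
};

static uint32_t bit(uint32_t reg, unsigned n)
{
    return (reg >> n) & 1u;
}

/* max_leaf: EAX of leaf 0; ecx1/edx1: leaf 1; ebx7: leaf 7 subleaf 0;
   xcr0: low dword of XCR0 (only used when the XGETBV bit is set). */
uint32_t pack_sse_level(uint32_t max_leaf, uint32_t ecx1, uint32_t edx1,
                        uint32_t ebx7, uint32_t xcr0)
{
    uint32_t level = 0;

    level |= bit(edx1, 23) << BT_MMX;
    level |= bit(edx1, 25) << BT_SSE;
    level |= bit(edx1, 26) << BT_SSE2;
    level |= bit(ecx1, 0)  << BT_SSE3;
    level |= bit(ecx1, 9)  << BT_SSSE3;
    level |= bit(ecx1, 19) << BT_SSE41;
    level |= bit(ecx1, 20) << BT_SSE42;
    level |= bit(ecx1, 27) << BT_XGETBV;
    level |= bit(ecx1, 28) << BT_AVX;
    if (max_leaf >= 7)                      /* leaf 7 is undefined below that */
        level |= bit(ebx7, 5) << BT_AVX2;
    if (bit(level, BT_XGETBV) && (xcr0 & 6u) == 6u)
        level |= 1u << BT_AVXOS;
    return level;
}
```

Feeding it known register dumps makes it easy to compare against what the assembly version returns on the same machine.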

Gunther

Re: Code location sensitivity of timings
« Reply #53 on: July 27, 2014, 02:54:54 AM »
Hi nidud,

Quote
I added some bits to Dave's test:

There's nothing attached.

Gunther

nidud

Re: Code location sensitivity of timings
« Reply #54 on: July 27, 2014, 04:48:41 AM »
Quote
there's nothing attached.

Here is a test that is supposed to auto-detect AVX.
If it works, it should select MOVSB on your machine.

nidud

Re: Code location sensitivity of timings
« Reply #55 on: July 27, 2014, 05:17:38 AM »
here is the strrchr test

Code: [Select]
OPTION PROLOGUE:NONE, EPILOGUE:NONE     ; assumed: the [esp+N] offsets need a frameless proc

strrchr proc string, char

        push    edx
        mov     edx,[esp+4+4]           ; string
        movzx   eax,byte ptr [esp+8+4]  ; char
        mov     ah,al
        mov     ecx,eax
        shl     eax,16
        add     eax,ecx
        movd    xmm2,eax
        xorps   xmm1,xmm1               ; clear xmm1 for compare
        pshufd  xmm2,xmm2,0             ; populate char in xmm2
        mov     eax,edx                 ; keep string in EDX

        align 4
lupz:   movdqu  xmm0,[eax]              ; get length of string
        pcmpeqb xmm0,xmm1               ; compare
        pmovmskb ecx,xmm0               ; get result
        add     eax,16
        test    ecx,ecx
        jz      lupz
        bsf     ecx,ecx                 ; set pointer to end - 16
        lea     eax,[eax+ecx-32]        ; (below EDX if the string is shorter than 16)

        align 4
lupc:   movdqu  xmm0,[eax]              ; scan in reverse for char
        pcmpeqb xmm0,xmm2               ; compare
        pmovmskb ecx,xmm0               ; get result
        test    ecx,ecx
        jnz     found
        cmp     eax,edx
        jbe     not_found
        sub     eax,16
        jmp     lupc
        align 4
found:
        bsr     ecx,ecx                 ; last match in this block
        lea     eax,[eax+ecx]
        cmp     eax,edx                 ; a match before the string start does not count
        jae     toend
        align 4
not_found:
        xor     eax,eax
toend:
        pop     edx
        ret     8
strrchr endp


AMD Athlon(tm) II X2 245 Processor (SSE3)
-----------------------------------------
-- aligned strings --
493286  cycles - 10 (  0) 0: crt_strrchr
493067  cycles - 10 ( 40) 1: strrchr
215681  cycles - 10 (154) 2: x
44873   cycles - 10 (112) 3: SSE
-- unaligned strings --
496108  cycles - 10 (  0) 0: crt_strrchr
497215  cycles - 10 ( 40) 1: strrchr
217452  cycles - 10 (154) 2: x
48437   cycles - 10 (112) 3: SSE
-- small strings 128 --
155550  cycles - 500 (  0) 0: crt_strrchr
154105  cycles - 500 ( 40) 1: strrchr
65529   cycles - 500 (154) 2: x
25553   cycles - 500 (112) 3: SSE
-- small strings 1 --
27531   cycles - 500 (  0) 0: crt_strrchr
26022   cycles - 500 ( 40) 1: strrchr
9514    cycles - 500 (154) 2: x
18074   cycles - 500 (112) 3: SSE
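
For checking an SSE routine like this against known-good answers, a byte-at-a-time reference is handy. A minimal sketch in C; the name `ref_strrchr` is made up here and it is not one of the routines timed above:

```c
#include <stddef.h>

/* Plain reference: last occurrence of c in s, or NULL if there is none. */
char *ref_strrchr(const char *s, int c)
{
    const char *last = NULL;

    for (;; s++) {
        if (*s == (char)c)
            last = s;               /* also matches the terminator when c == 0 */
        if (*s == '\0')
            break;
    }
    return (char *)last;
}
```

Strings shorter than 16 bytes, a character that only occurs in the first block, and c == 0 are the cases worth comparing first.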


jj2007

Re: Code location sensitivity of timings
« Reply #56 on: July 27, 2014, 08:06:15 AM »
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
-----------------------------------------------
-- aligned strings --
995933  cycles - 10 (  0) 0: crt_strrchr
995891  cycles - 10 ( 40) 1: strrchr
273823  cycles - 10 (154) 2: x
94668   cycles - 10 (112) 3: SSE
-- unaligned strings --
996477  cycles - 10 (  0) 0: crt_strrchr
997094  cycles - 10 ( 40) 1: strrchr
298219  cycles - 10 (154) 2: x
121529  cycles - 10 (112) 3: SSE
-- small strings 128 --
324263  cycles - 500 (  0) 0: crt_strrchr
323710  cycles - 500 ( 40) 1: strrchr
84786   cycles - 500 (154) 2: x
34915   cycles - 500 (112) 3: SSE
-- small strings 1 --
67914   cycles - 500 (  0) 0: crt_strrchr
67286   cycles - 500 ( 40) 1: strrchr
12595   cycles - 500 (154) 2: x
16622   cycles - 500 (112) 3: SSE

Gunther

Re: Code location sensitivity of timings
« Reply #57 on: July 27, 2014, 10:40:16 AM »
Hi nidud,

here's the output of auto.zip:
Code: [Select]
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (AVX)
----------------------------------------------
-- aligned strings --
491469    cycles -  10 (  0) 0: crt_memcpy
889651    cycles -  10 ( 63) 1: movsd - mov eax,ecx
887273    cycles -  10 ( 63) 2: movsd - push ecx
355080    cycles -  10 ( 51) 3: movsb
487046    cycles -  10 (182) 4: SSE
355990    cycles -  10 (  0) 5: auto
-- unaligned strings --
490269    cycles -  10 (  0) 0: crt_memcpy
886259    cycles -  10 ( 63) 1: movsd - mov eax,ecx
886778    cycles -  10 ( 63) 2: movsd - push ecx
372520    cycles -  10 ( 51) 3: movsb
491780    cycles -  10 (182) 4: SSE
378881    cycles -  10 (  0) 5: auto
-- short strings 15 --
174897    cycles - 8000 (  0) 0: crt_memcpy
349626    cycles - 8000 ( 63) 1: movsd - mov eax,ecx
343812    cycles - 8000 ( 63) 2: movsd - push ecx
307384    cycles - 8000 ( 51) 3: movsb
98073     cycles - 8000 (182) 4: SSE
293479    cycles - 8000 (  0) 5: auto
-- short strings 271 --
832627    cycles - 8000 (  0) 0: crt_memcpy
773797    cycles - 8000 ( 63) 1: movsd - mov eax,ecx
764418    cycles - 8000 ( 63) 2: movsd - push ecx
586580    cycles - 8000 ( 51) 3: movsb
279676    cycles - 8000 (182) 4: SSE
557134    cycles - 8000 (  0) 5: auto
-- short strings 2014 --
998188    cycles - 4000 (  0) 0: crt_memcpy
2198740   cycles - 4000 ( 63) 1: movsd - mov eax,ecx
2195833   cycles - 4000 ( 63) 2: movsd - push ecx
935710    cycles - 4000 ( 51) 3: movsb
961563    cycles - 4000 (182) 4: SSE
906474    cycles - 4000 (  0) 5: auto
--- ok ---

Gunther

nidud

Re: Code location sensitivity of timings
« Reply #58 on: July 27, 2014, 11:12:01 PM »
ok, that's almost perfect   :P

For short strings, MOVSB and SSE seem to break even somewhere between 1000 and 2000 bytes:
Code: [Select]
memcpy  proc uses esi edi dst, src, count
mov edi,dst
mov esi,src
mov ecx,count
mov eax,edi ; return value

cmp ecx,1500 ; use SSE on short strings
jb SSE2
bt sselevel,SSEBT_AVX
jnc SSE2

rep movsb
ret

align 4
SSE2:
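
The selection logic on its own, as a C sketch for illustration. `choose_copy_path` and `CopyPath` are made-up names, and the 1500-byte threshold is only the break-even guess from the timings, not a tuned constant:

```c
#include <stddef.h>
#include <stdint.h>

#define SSEBT_AVX        8       /* bit index used by the CPUID test */
#define MOVSB_MIN_BYTES  1500    /* break-even guess, taken from the timings */

typedef enum { COPY_SSE2, COPY_REP_MOVSB } CopyPath;

/* Mirror of the two tests at the top of memcpy: size first, then the AVX bit. */
CopyPath choose_copy_path(size_t count, uint32_t sselevel)
{
    if (count < MOVSB_MIN_BYTES)
        return COPY_SSE2;                       /* short copy: SSE wins */
    if (((sselevel >> SSEBT_AVX) & 1u) == 0)
        return COPY_SSE2;                       /* no AVX bit: MOVSB is slow */
    return COPY_REP_MOVSB;
}
```

Testing the size first keeps the memory access to sselevel off the short-copy path.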

nidud

Re: Code location sensitivity of timings
« Reply #59 on: August 11, 2014, 08:46:40 PM »
I implemented some of the SSE functions in the library and did some testing. Most (if not all) of them failed the regression test, so a few adjustments had to be made. The problems with copying memory were overlapping strings and the alignment of the head and tail bytes. Using 16-byte blocks may create an overlap of up to 31 bytes for alignment and tail bytes, not 15 as I had assumed. The handling of the tail bytes also needed a fix:
Code: [Select]
; wrong:
movq xmm0,[esi] ; move 8..15 byte
movq [edi],xmm0 ; |8...|
movq xmm0,[esi+ecx-8] ; |...8|
movq [edi+ecx-8],xmm0
; correct:
movq xmm0,[esi] ; move 8..15 byte
movq xmm1,[esi+ecx-8] ; |...8|
movq [edi],xmm0 ; |8...|
movq [edi+ecx-8],xmm1
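
The difference is the order of loads and stores: when source and destination overlap, the first store can change bytes that the second load still has to read. The same idea in C, as a sketch; `copy_8_to_15` is a made-up name, and the small memcpy calls stand in for the movq instructions:

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Copy n bytes, 8 <= n <= 15, as two possibly overlapping 8-byte moves.
   Both loads happen before either store, so src and dst may overlap. */
void copy_8_to_15(unsigned char *dst, const unsigned char *src, size_t n)
{
    uint64_t head, tail;

    memcpy(&head, src, 8);              /* |8...| */
    memcpy(&tail, src + n - 8, 8);      /* |...8| */
    memcpy(dst, &head, 8);
    memcpy(dst + n - 8, &tail, 8);
}
```

Comparing the result with memmove for copy(m,m+1) and copy(m+1,m) is a quick regression test for this case.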

However, the corrected version was more or less as fast as the first one, and it now also handles overlapping copies, copy(m,m+1) and copy(m+1,m). memmove is now equ <memcpy>, and this is the version I ended up using:
Code: [Select]
OPTION PROLOGUE:NONE, EPILOGUE:NONE

ifdef __SSE__

strcpy  proc dst:ptr byte, src:ptr byte
mov ecx,esp
push [ecx]
mov eax,[ecx+4]
mov [ecx],eax
mov eax,[ecx+8]
mov [ecx+4],eax
xorps xmm1,xmm1
@@: movdqu  xmm0,[eax]
pcmpeqb xmm0,xmm1
pmovmskb ecx,xmm0
add eax,16
test ecx,ecx
jz @B
bsf ecx,ecx
sub eax,[esp+8]
lea eax,[eax+ecx-15]
mov [esp+12],eax
strcpy  endp

endif

memcpy  proc dst, src, count
push esi
push edi
mov edi,[esp+12]
mov esi,[esp+16]
mov ecx,[esp+20]
ifdef __SSE__
 ifdef __AVX__
cmp ecx,1500
jb SSE2
bt sselevel,SSEBT_AVX
jnc SSE2
mov eax,edi
cmp eax,esi
ja @F
rep movsb
pop edi
pop esi
ret 12
align 4
 @@: lea esi,[esi+ecx-1]
lea edi,[edi+ecx-1]
std
rep movsb
cld
pop edi
pop esi
ret 12
align 4
 SSE2:
 endif
movdqu  xmm2,[esi] ; save align bytes
test ecx,-32 ; need 31 byte for overlap..
jz tail
push edx
movdqu  xmm3,[esi+ecx-16] ; save tail bytes
mov eax,esi ; align ESI 16
neg eax
and eax,1111B
mov edx,esi
sub edx,edi
cmp edx,ecx
mov edx,ecx ; save count in EDX
jbe overlapped
sub ecx,eax
add esi,eax
xchg edi,eax
add edi,eax ; return address to EAX
and ecx,-16 ; align ECX 16
align 4
@@: sub ecx,16
movdqa  xmm0,[esi+ecx]
movdqu  [edi+ecx],xmm0
jnz @B
movdqu  [eax],xmm2 ; fix tail and aligned bytes
movdqu  [eax+edx-16],xmm3
pop edx
pop edi
pop esi
ret 12
align 4
overlapped:
sub ecx,eax
and ecx,-16 ; align ECX 16
add eax,ecx
add esi,eax
xchg edi,eax
add edi,eax ; return address to EAX
neg ecx
align 4
@@: movdqa  xmm0,[esi+ecx]
movdqu  [edi+ecx],xmm0
add ecx,16
jnz @B
movdqu  [eax],xmm2 ; fix tail and aligned bytes
movdqu  [eax+edx-16],xmm3
pop edx
pop edi
pop esi
ret 12
align 4
tail: test ecx,ecx
jz toend ; 0
test ecx,-2
jz @1 ; 1
test ecx,-4
jz @2 ; 2..3
test ecx,-8
jz @4 ; 4..7
test ecx,-16
jz @8 ; 8..15
movdqu  xmm1,[esi+ecx-16]
movdqu  [edi],xmm2 ; 16..31
movdqu  [edi+ecx-16],xmm1
align 4
toend:  mov eax,edi
pop edi
pop esi
ret 12
align 4
@8: movq xmm1,[esi+ecx-8]
movq [edi],xmm2 ; 8..15
movq [edi+ecx-8],xmm1
jmp toend
align 4
@4: mov eax,[esi]
mov esi,[esi+ecx-4]
mov [edi],eax
mov [edi+ecx-4],esi
jmp toend
align 4
@2: mov eax,[esi]
mov [edi],ax
cmp ecx,3
jb toend
shr eax,16
mov [edi+2],al
jmp toend
align 4
@1: mov al,[esi]
mov [edi],al
jmp toend
else
mov eax,edi
cmp eax,esi
ja @F
rep movsb
pop edi
pop esi
ret 12
align 4
 @@: lea esi,[esi+ecx-1]
lea edi,[edi+ecx-1]
std
rep movsb
cld
pop edi
pop esi
ret 12
endif
memcpy  endp

This enables a simple way of inserting or exchanging text in a buffer:
Code: [Select]
strcpy(head+length,tail) ; make room for new string
memcpy(head,string,length) ; insert new string

For search and replace ("%PATH%", "C:\MASM32\BIN"), head is the result from strstri(), tail is head + the length of "%PATH%", and head+length is > tail, so the strcpy source and destination overlap.
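
The same two steps in portable C, as a sketch. Standard strcpy and memcpy do not allow overlapping buffers, so memmove is swapped in for the first step; `replace_at` is a made-up helper and the caller is assumed to provide a large enough buffer:

```c
#include <string.h>
#include <stddef.h>

/* Replace old_len bytes at head with the new_len bytes of repl and shift
   the rest of the NUL-terminated string. No bounds checking is done. */
void replace_at(char *head, size_t old_len, const char *repl, size_t new_len)
{
    char *tail = head + old_len;

    memmove(head + new_len, tail, strlen(tail) + 1);   /* make room, NUL included */
    memcpy(head, repl, new_len);                       /* insert new string */
}
```

The same call also works when the replacement is shorter than the text it replaces, since memmove handles both directions.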

Code: [Select]
OPTION PROLOGUE:NONE, EPILOGUE:NONE

strstri proc dst:ptr byte, src:ptr byte

        mov     eax,esp
        push    edx
        push    ebx
        push    edi
        mov     edx,[eax+4]             ; dst: string to search
        mov     ebx,[eax+8]             ; src: string to find

        movzx   eax,byte ptr [ebx]
        or      al,20h
        mov     ah,al
        mov     ecx,eax
        shl     eax,16
        add     eax,ecx
        movd    xmm2,eax
        pshufd  xmm2,xmm2,0             ; populate char
        mov     eax,20202020h
        movd    xmm3,eax
        pshufd  xmm3,xmm3,0             ; populate 20h for case
        xorps   xmm4,xmm4               ; clear xmm4 for compare

        align 4
lup:    movdqu  xmm0,[edx]
        movdqa  xmm1,xmm0
        pcmpeqb xmm1,xmm4               ; test for zero
        pmovmskb ecx,xmm1
        orps    xmm0,xmm3               ; remove case..
        pcmpeqb xmm0,xmm2               ; test for char
        pmovmskb eax,xmm0
        lea     edx,[edx+16]
        or      ecx,eax
        jz      lup

        bsf     ecx,ecx
        lea     edx,[edx+ecx-16]
        cmp     byte ptr [edx],0
        je      not_found

        xor     ecx,ecx
        lea     eax,[ebx+1]
        mov     edi,edx                 ; EDI = start of this candidate
        inc     edx

        align 4
lup2:   xor     cl,[eax]
        jz      found
        mov     ch,[edx]
        test    ch,ch                   ; string ended inside the candidate:
        jz      not_found               ; no later match is possible
        or      cx,2020h
        sub     cl,ch
        jnz     retry
        inc     eax
        inc     edx
        jmp     lup2

        align 4
retry:  lea     edx,[edi+1]             ; mismatch: rescan from one byte past the
        jmp     lup                     ; candidate, not from the mismatch position

        align 4
not_found:
        xor     eax,eax
        jmp     toend

        align 4
found:  mov     ecx,eax
        mov     eax,edi
        sub     ecx,ebx                 ; ECX = length of src
        test    eax,eax                 ; ZF clear on success

        align 4
toend:
        pop     edi
        pop     ebx
        pop     edx
        ret     8
strstri endp
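
A scalar reference is useful for regression-testing a routine like this, in particular for inputs where a candidate fails partway and the real match starts inside the failed candidate. A sketch in C; `ref_strstri` is a made-up name, and it uses tolower rather than the OR 20h trick, so the two can disagree on non-letter characters:

```c
#include <ctype.h>
#include <stddef.h>

/* Case-insensitive search: first match of pat in text, or NULL.
   Every start position is tried, so overlapping candidates are not skipped. */
char *ref_strstri(const char *text, const char *pat)
{
    for (; *text; text++) {
        const char *t = text, *p = pat;

        while (*p && tolower((unsigned char)*t) == tolower((unsigned char)*p)) {
            t++;
            p++;
        }
        if (*p == '\0')
            return (char *)text;    /* whole pattern matched here */
    }
    return NULL;
}
```

Searching "xAaAB" for "aab" is such a case: the candidate at offset 1 fails on its third character and the match starts at offset 2.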

I added a System Information box to show the SSE level using Dave's test, and on my AMD I get this information:
Code: [Select]
  │        Streaming SIMD Extensions: [x] SSE    [ ] SSE4.1 │
  │                                   [x] SSE2   [ ] SSE4.2 │
  │                                   [x] SSE3   [ ] AVX    │
  │        [ ] AVX supported by OS    [ ] SSSE3  [ ] AVX2   │
  └─────────────────────────────────────────────────────────┘

However, on an Intel i3 I get this:
Code: [Select]
  │        Streaming SIMD Extensions: [x] SSE    [x] SSE4.1 │
  │                                   [x] SSE2   [ ] SSE4.2 │
  │                                   [ ] SSE3   [ ] AVX    │
  │        [ ] AVX supported by OS    [ ] SSSE3  [ ] AVX2   │
  └─────────────────────────────────────────────────────────┘

Is it possible to have SSE4.1 but not SSE3?

Note: SSE and SSE2 are pre-set, since the program exits if SSE2 is not present, so the SSE4.1 bit must have been set by the test.
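
A cheap plausibility check on the returned mask could flag results like the second box. The sketch below assumes the bit layout of the test (bit 0 = MMX, bits 1 to 6 = SSE up to SSE4.2) and assumes that each SSE level implies all lower ones, which as far as I know holds for shipping CPUs but is not something CPUID itself promises; `sse_level_plausible` is a made-up name:

```c
#include <stdint.h>

/* Nonzero when the SSE bits form an unbroken run starting at SSE,
   e.g. SSE+SSE2+SSE3, but not SSE+SSE2+SSE4.1. */
int sse_level_plausible(uint32_t level)
{
    uint32_t sse = (level >> 1) & 0x3Fu;    /* SSE .. SSE4.2 as bits 0..5 */

    /* x & (x + 1) is zero exactly when x has the form 0...0111. */
    return (sse & (sse + 1u)) == 0;
}
```

A mask that fails this check more likely points at the decoding or at the display code than at the CPU.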