Author Topic: Instr, strstr, find$  (Read 15218 times)

nidud

  • Member
  • *****
  • Posts: 1528
    • https://github.com/nidud/asmc
Re: Instr, strstr, find$
« Reply #15 on: July 15, 2014, 09:22:15 PM »
here is a simple one, 44 byte
Code: [Select]
strstr  proc dst, src
@1:     mov     ecx,src
        mov     edx,dst
@2:     xor     eax,eax
        xor     al,[edx]
        jz      @5
        inc     edx
        sub     al,[ecx]
        jnz     @2
        mov     dst,edx
        inc     ecx
@3:     xor     al,[ecx]
        jz      @4
        sub     al,[edx]
        jnz     @1
        inc     ecx
        inc     edx
        jmp     @3
@4:     mov     eax,dst
        dec     eax
@5:     ret
strstr  endp

jj2007

  • Member
  • *****
  • Posts: 8507
  • Assembler is fun ;-)
    • MasmBasic
Re: Instr, strstr, find$
« Reply #16 on: July 16, 2014, 03:31:34 AM »
Quite efficient :t
However,

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
++++++++++++++++++++
3296    cycles for 10 * MbInstr 0 (A)
4000    cycles for 10 * MbInstr 0 (B)
5169    cycles for 10 * crt_strstr
5519    cycles for 10 * M32 find$

but

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
++++++++++++++++++++
3540    cycles for 10 * MbInstr 0 (A)
3999    cycles for 10 * MbInstr 0 (B)
5159    cycles for 10 * crt_strstr
5335    cycles for 10 * M32 find$
3247    cycles for 10 * strstr_nidud


What did you smuggle in that destroys the performance of my algo??? :eusa_naughty:

What is even more worrying is the bad performance on my Celeron :eusa_boohoo:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
4094    cycles for 10 * MbInstr 0 (A)
3291    cycles for 10 * crt_strstr
3778    cycles for 10 * M32 find$
3347    cycles for 10 * strstr_nidud
« Last Edit: July 16, 2014, 04:53:52 AM by jj2007 »

nidud

  • Member
  • *****
  • Posts: 1528
    • https://github.com/nidud/asmc
Re: Instr, strstr, find$
« Reply #17 on: July 16, 2014, 04:51:11 AM »
What did you smuggle in that destroys the performance of my algo??? :eusa_naughty:

 :biggrin:

I think the new function is "stealing" (allocating) cache from the other functions, so the problem is you got a fake (good) result before. The more similar functions you have in the binary the faster it get.

498415  cycles - 50 (62) a 1: strstr
478468  cycles - 50 (66) a 3: x
428413  cycles - 50 (54) a 6: x
428425  cycles - 50 (48) a 7: x

if other functions are added (and not used):

481651  cycles - 50 (62) a 1: strstr
465114  cycles - 50 (66) a 3: x
632872  cycles - 50 (54) a 6: x
629542  cycles - 50 (48) a 7: x


so it’s difficult now to know which one to choose...

dedndave

  • Member
  • *****
  • Posts: 8789
  • Still using Abacus 2.0
    • DednDave
Re: Instr, strstr, find$
« Reply #18 on: July 16, 2014, 05:17:21 AM »
put them in seperate programs - run a batch file   :P

Gunther

  • Member
  • *****
  • Posts: 3585
  • Forgive your enemies, but never forget their names
Re: Instr, strstr, find$
« Reply #19 on: July 16, 2014, 06:00:48 AM »
put them in seperate programs - run a batch file   :P

Not a bad idea.

Gunther
Get your facts first, and then you can distort them.

nidud

  • Member
  • *****
  • Posts: 1528
    • https://github.com/nidud/asmc
Re: Instr, strstr, find$
« Reply #20 on: July 16, 2014, 06:20:58 AM »
function to test:
Code: [Select]
.486
.model flat,stdcall
.code
proc_4 proc string
mov eax,string
sub eax,4
@@: add eax,4
mov edx,[eax]
lea ecx,[edx-01010101H]
not edx
and ecx,edx
and ecx,80808080H
jz @B
bsf ecx,ecx
shr ecx,3
sub eax,string
add eax,ecx
mov ecx,eax
ret
proc_4  endp

end

jwasm -bin proc_4.asm

Code: [Select]
flag_0  equ 0
flag_1  equ 0
flag_2  equ 0
flag_3  equ 1
flag_4  equ 1
flag_5  equ 1
flag_6  equ 1

proc_0  equ crt_strlen
proc_1  equ szLen
proc_2  equ MbStrLen
proc_3  db "strlen\proc_3.bin",0
proc_4  db "strlen\proc_4.bin",0
proc_5  db "strlen\proc_5.bin",0
proc_6  db "strlen\proc_6.bin",0

info_0  db "crt_strlen",0
info_1  db "MASM32 - szLen() ",0
info_2  db "MasmBasic MbStrLen Len() - SSE",0
...

pushargs macro
push str_x
endm

; get the cycle count for each algo

test_algo macro x, loopcount
if flag_&x&
lea edx,proc_&x&
else
mov edx,proc_&x&
endif
invoke timeit,edx,0,flag_&x&,loopcount,x,addr info_&x&
endm

.code

readit proc uses ebx edi fname
sub edi,edi
invoke CreateFile,fname,80000000h,0,0,3,0,0
.if eax != -1
    mov ebx,eax
    push 0
    mov edx,esp ; lpNumberOfBytesRead
    invoke ReadFile,ebx,addr proc_x,4096,edx,0
    test eax,eax
    pop edi
    .if ZERO?
sub edi,edi
    .endif
    invoke CloseHandle,ebx
.endif
mov eax,edi
ret
readit endp

timeit proc uses ebx esi edi p,plen,ptype,count,id,info
.if ptype
    invoke readit,p
    mov plen,eax
    lea esi,proc_x
    test eax,eax
    jz error1
.else
     mov esi,p
.endif
counter_begin 1000, HIGH_PRIORITY_CLASS
mov edi,count
mov ebx,esp
.while edi
    pushargs
    call esi
    mov esp,ebx
    dec edi
.endw
counter_end
printf("%d\tcycles - %d (%3d) %d: %s\n",eax,count,plen,id,info)
toend:
ret
error1:
printf("error reading %s\n",info)
jmp toend
timeit endp

151336  cycles - 50 (  0) 0: crt_strlen
244234  cycles - 50 (  0) 1: MASM32 - szLen()
47350   cycles - 50 (  0) 2: MasmBasic MbStrLen Len() - SSE
157571  cycles - 50 ( 86) 3: AgnerFog
140117  cycles - 50 ( 49) 4: AgnerFog unaligned
340870  cycles - 50 ( 94) 5: Dave
38961   cycles - 50 ( 45) 6: strlen SSE

jj2007

  • Member
  • *****
  • Posts: 8507
  • Assembler is fun ;-)
    • MasmBasic
Re: Instr, strstr, find$
« Reply #21 on: July 16, 2014, 06:29:03 AM »
Here is Instr() with another setting: find echo WARNING in WinExtra.inc

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
33284   kCycles for 10 * MbInstr 0 (zero-delimited)
30100   kCycles for 10 * MbInstr 0 (file size)
34983   kCycles for 10 * crt_strstr
37981   kCycles for 10 * M32 find$
38525   kCycles for 10 * strstr_nidud


The second entry (file size) refers to the additional parameter mentioned above: The function knows how many bytes are available in the source string. The difference is surprisingly low, though - I had expected a stronger influence of the data cache.


function to test:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
10568   cycles for 100 * proc_4
3939    cycles for 100 * Len
13860   cycles for 100 * len

Gunther

  • Member
  • *****
  • Posts: 3585
  • Forgive your enemies, but never forget their names
Re: Instr, strstr, find$
« Reply #22 on: July 16, 2014, 08:24:24 AM »
Jochen,

that's what InstrTimingsNew did:


Here is the output of proc4:
Code: [Select]
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

6860    cycles for 100 * proc_4
1978    cycles for 100 * Len
9348    cycles for 100 * len

6205    cycles for 100 * proc_4
1957    cycles for 100 * Len
9312    cycles for 100 * len

6224    cycles for 100 * proc_4
1955    cycles for 100 * Len
9924    cycles for 100 * len

100     = eax proc_4
100     = eax Len
100     = eax len

--- ok ---

Gunther
Get your facts first, and then you can distort them.

jj2007

  • Member
  • *****
  • Posts: 8507
  • Assembler is fun ;-)
    • MasmBasic
Re: Instr, strstr, find$
« Reply #23 on: July 16, 2014, 08:41:00 AM »
that's what InstrTimingsNew did:

Gunther,
Either you have no \Masm32\include\winextra.inc (unlikely), or you launched the exe from a different drive than your Masm32 drive.

nidud

  • Member
  • *****
  • Posts: 1528
    • https://github.com/nidud/asmc
Re: Instr, strstr, find$
« Reply #24 on: July 16, 2014, 08:53:39 AM »
here is the aligned version
Code: [Select]
align 16
strstr  proc dst, src
nop
@1: mov ecx,src ; 04h
mov edx,dst
xor eax,eax ; fill code..
cmp [ecx],al
je @5
@2: xor eax,eax ; 10h
xor al,[edx]
jz @5
lea edx,[edx+1]
sub al,[ecx]
jnz @2
mov dst,edx
lea ecx,[ecx+1]
nop
@3: xor al,[ecx] ; 24h
jz @4
sub al,[edx]
jnz @1
inc ecx
inc edx
jmp @3
@4: mov eax,dst ; 30h
dec eax
@5: ret ; 34h
strstr  endp

Gunther

  • Member
  • *****
  • Posts: 3585
  • Forgive your enemies, but never forget their names
Re: Instr, strstr, find$
« Reply #25 on: July 16, 2014, 08:58:45 AM »
Jochen,

I've the winextra.inc, but I fired up the application from a different drive. Here is the new output:

Code: [Select]
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
+++++++++11 of 20 tests valid, loop overhead is approx. 43/10 cycles

20184   kCycles for 10 * MbInstr 0 (zero-delimited)
18778   kCycles for 10 * MbInstr 0 (file size)
22744   kCycles for 10 * crt_strstr
28534   kCycles for 10 * M32 find$
30970   kCycles for 10 * strstr_nidud

19942   kCycles for 10 * MbInstr 0 (zero-delimited)
18751   kCycles for 10 * MbInstr 0 (file size)
22837   kCycles for 10 * crt_strstr
28442   kCycles for 10 * M32 find$
30908   kCycles for 10 * strstr_nidud

20094   kCycles for 10 * MbInstr 0 (zero-delimited)
18727   kCycles for 10 * MbInstr 0 (file size)
22739   kCycles for 10 * crt_strstr
28537   kCycles for 10 * M32 find$
30943   kCycles for 10 * strstr_nidud

1068448 = eax MbInstr 0 (zero-delimited)
1068448 = eax MbInstr 0 (file size)
1068448 = eax crt_strstr
1068448 = eax M32 find$
1068448 = eax strstr_nidud

Gunther
Get your facts first, and then you can distort them.

nidud

  • Member
  • *****
  • Posts: 1528
    • https://github.com/nidud/asmc
Re: Instr, strstr, find$
« Reply #26 on: July 16, 2014, 09:14:31 PM »
here is the strlen test from the new template
the unaligned version (4) was faster in the old test

AMD Athlon(tm) II X2 245 Processor (SSE3)
------------------------------------------------------
151616  cycles - 50 (  0) 0: crt_strlen
244109  cycles - 50 (  0) 1: MASM32 - szLen()
47645   cycles - 50 (  0) 2: MasmBasic MbStrLen Len() - SSE
130962  cycles - 50 ( 86) 3: AgnerFog
156628  cycles - 50 ( 49) 4: AgnerFog unaligned
340432  cycles - 50 ( 94) 5: Dave
40714   cycles - 50 ( 45) 6: strlen SSE


here is the strstr test

24255   kCycles for 10 * MbInstr 0 (zero-delimited)
21810   kCycles for 10 * MbInstr 0 (file size)
27030   kCycles for 10 * crt_strstr
54318   kCycles for 10 * M32 find$
36887   kCycles for 10 * strstr_nidud


and with the aligned version

23684   kCycles for 10 * MbInstr 0 (zero-delimited)
21400   kCycles for 10 * MbInstr 0 (file size)
26562   kCycles for 10 * crt_strstr
52755   kCycles for 10 * M32 find$
27872   kCycles for 10 * strstr_nidud


after the changes I now get this result

513228  cycles - 50 ( 67) 1: strstr
465545  cycles - 50 ( 71) 2: x
426742  cycles - 50 ( 57) 4: x - new
628714  cycles - 50 ( 54) 5: x - old
628309  cycles - 50 ( 50) 7: x


compare to the (fake) first test

428413  cycles - 50 ( 54) 5: x
428425  cycles - 50 ( 50) 7: x

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 5484
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Re: Instr, strstr, find$
« Reply #27 on: July 18, 2014, 03:05:45 PM »
With Dave's suggestion, try this before the timing of each algo in each separate test piece. Set the priority class high enough to avoid the wanders and see if this helps to stabilise the results.

Code: [Select]
    cpuid                           ; serialising instruction for wider seperation
    pause                           ; spinlock delay instruction

    invoke SleepEx,10,0

    cpuid                           ; serialising instruction for wider seperation
    pause                           ; spinlock delay instruction

Usually I have found that some algos are much more sensitive to code location than others, usually intensive BYTE operations where dealing in larger data types reduces the variation.
hutch at movsd dot com
http://www.masm32.com    :biggrin:  :biggrin:

jj2007

  • Member
  • *****
  • Posts: 8507
  • Assembler is fun ;-)
    • MasmBasic
Re: Instr, strstr, find$
« Reply #28 on: July 24, 2014, 10:17:31 AM »
I've added a fast variant of Instr(). At least on my CPUs, it looks competitive:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
33220   kCycles for 10 * MbInstr 0 (zero-delimited)
30094   kCycles for 10 * MbInstr 0 (file size)
8362    kCycles for 10 * MbInstr FAST
34858   kCycles for 10 * crt_strstr
38010   kCycles for 10 * M32 find$
38399   kCycles for 10 * strstr_nidud

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
32413   kCycles for 10 * MbInstr 0 (zero-delimited)
24107   kCycles for 10 * MbInstr 0 (file size)
13446   kCycles for 10 * MbInstr FAST
57954   kCycles for 10 * crt_strstr
58467   kCycles for 10 * M32 find$
38112   kCycles for 10 * strstr_nidud

Intel(R) Core(TM) i5 CPU       M 520  @ 2.40GHz (SSE4)
22035   kCycles for 10 * MbInstr 0 (zero-delimited)
19469   kCycles for 10 * MbInstr 0 (file size)
5340    kCycles for 10 * MbInstr FAST
27871   kCycles for 10 * crt_strstr
28522   kCycles for 10 * M32 find$
24944   kCycles for 10 * strstr_nidud


To assemble the source, you will need MasmBasic of today, 24 July.

Usage: Instr_(1, "Test", "Te", FAST)   ; 4 args, last one is uppercase FAST
This is always case-sensitive (same for find$, strstr etc).
« Last Edit: July 24, 2014, 06:09:24 PM by jj2007 »

johnsa

  • Member
  • ****
  • Posts: 682
    • Uasm
Re: Instr, strstr, find$
« Reply #29 on: July 24, 2014, 11:50:58 PM »

Timings from me...

Code: [Select]

Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (SSE4)
++++++++12 of 20 tests valid, loop overhead is approx. 46/10 cycles

16446   kCycles for 10 * MbInstr 0 (zero-delimited)
15438   kCycles for 10 * MbInstr 0 (file size)
3842    kCycles for 10 * MbInstr FAST
26039   kCycles for 10 * crt_strstr
23588   kCycles for 10 * M32 find$
20650   kCycles for 10 * strstr_nidud

16556   kCycles for 10 * MbInstr 0 (zero-delimited)
15557   kCycles for 10 * MbInstr 0 (file size)
3839    kCycles for 10 * MbInstr FAST
25890   kCycles for 10 * crt_strstr
23566   kCycles for 10 * M32 find$
20681   kCycles for 10 * strstr_nidud

16534   kCycles for 10 * MbInstr 0 (zero-delimited)
15786   kCycles for 10 * MbInstr 0 (file size)
3788    kCycles for 10 * MbInstr FAST
25781   kCycles for 10 * crt_strstr
23581   kCycles for 10 * M32 find$
20714   kCycles for 10 * strstr_nidud

1068448 = eax MbInstr 0 (zero-delimited)
1068448 = eax MbInstr 0 (file size)
1068448 = eax MbInstr FAST
1068448 = eax crt_strstr
1068448 = eax M32 find$
1068448 = eax strstr_nidud