Author Topic: Optimizing some code  (Read 17610 times)

dedndave

  • Member
  • *****
  • Posts: 8829
  • Still using Abacus 2.0
    • DednDave
Re: Optimizing some code
« Reply #15 on: June 11, 2014, 11:46:09 AM »
the routines we have analyzed before have typically been optimized for longer strings
for shorter strings, say 50 characters or less, those routines have too much overhead

so, i tried to keep overhead to a minimum
and - access data as 4-aligned dwords
it hasn't been tested, but maybe it will give you some ideas....
Code: [Select]
        OPTION  PROLOGUE:None
        OPTION  EPILOGUE:None

ShortLen PROC _lpStr:LPSTR

        mov     edx,[esp+4]
        test    dl,3
        mov     ecx,[edx]
        jz      start_dwords3

        xor     eax,eax
        inc     edx
        test    cl,cl
        jz      all_done

        test    dl,3
        jz      start_dwords1

        inc     eax
        inc     edx
        test    ch,ch
        jz      all_done

        test    dl,3
        jz      start_dwords1

        inc     eax
        inc     edx
        test    ecx,0FF0000h
        jz      all_done

        jmp short start_dwords2

start_dwords1:
        add     edx,4

start_dwords2:
        mov     ecx,[edx]

start_dwords3:
        test    cl,cl
        mov     al,0
        jz      end_dwords

        test    ch,ch
        mov     al,1
        jz      end_dwords

        test    ecx,0FF0000h
        mov     al,2
        jz      end_dwords

        test    ecx,0FF000000h
        mov     al,3
        jnz     start_dwords1

end_dwords:
        sub     eax,[esp+4]
        add     eax,edx

all_done:
        ret     4

ShortLen ENDP

        OPTION  PROLOGUE:PrologueDef
        OPTION  EPILOGUE:EpilogueDef
« Last Edit: June 11, 2014, 02:55:04 PM by dedndave »

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 7642
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Re: Optimizing some code
« Reply #16 on: June 11, 2014, 01:11:27 PM »
This very simple algo is still one of my favourites, mainly because its small, simple and flexible enough to inline it into the middle of a larger more complex algo. The aligned DWORD read versions will always be faster on longer linear reads but often its of no real gain in a more complex set of tasks.


OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

slen proc ptxt:DWORD

    mov eax, [esp+4]
    sub eax, 1

  lbl0:
    add eax, 1
    cmp BYTE PTR [eax], 0
    jne lbl0

    sub eax, [esp+4]

    ret 4

slen endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
hutch at movsd dot com
http://www.masm32.com    :biggrin:  :skrewy:

jj2007

  • Member
  • *****
  • Posts: 10676
  • Assembler is fun ;-)
    • MasmBasic
Re: Optimizing some code
« Reply #17 on: June 11, 2014, 05:33:57 PM »
Well, well... so here are some brandnew algos, tested on 30 and 100 byte strings 8)

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)

11619   cycles for 100 * Rui
3512    cycles for 100 * MB
4734    cycles for 100 * Masm32
4036    cycles for 100 * CRT
11969   cycles for 100 * Habran
4930    cycles for 100 * ShortLen (Dave)
11413   cycles for 100 * slen (Hutch)

11615   cycles for 100 * Rui
3511    cycles for 100 * MB
5915    cycles for 100 * Masm32
4034    cycles for 100 * CRT
11841   cycles for 100 * Habran
4932    cycles for 100 * ShortLen (Dave)
11412   cycles for 100 * slen (Hutch)

32727   cycles for 100 * Rui
5932    cycles for 100 * MB
14423   cycles for 100 * Masm32
13117   cycles for 100 * CRT
33202   cycles for 100 * Habran
15347   cycles for 100 * ShortLen (Dave)
32430   cycles for 100 * slen (Hutch)

32743   cycles for 100 * Rui
5515    cycles for 100 * MB
14442   cycles for 100 * Masm32
13123   cycles for 100 * CRT
33367   cycles for 100 * Habran
15324   cycles for 100 * ShortLen (Dave)
32421   cycles for 100 * slen (Hutch)

100     = eax Rui
100     = eax MB
100     = eax Masm32
100     = eax CRT
100     = eax Habran
100     = eax ShortLen (Dave)
100     = eax slen (Hutch)

Gunther

  • Member
  • *****
  • Posts: 3585
  • Forgive your enemies, but never forget their names
Re: Optimizing some code
« Reply #18 on: June 11, 2014, 07:11:20 PM »
The string length topic was beaten to death. Anyway, here is it again:
Code: [Select]
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

8547    cycles for 100 * Rui
1234    cycles for 100 * MB
4607    cycles for 100 * Masm32
2826    cycles for 100 * CRT
6069    cycles for 100 * Habran
3855    cycles for 100 * ShortLen (Dave)
8492    cycles for 100 * slen (Hutch)

8494    cycles for 100 * Rui
1233    cycles for 100 * MB
4609    cycles for 100 * Masm32
2812    cycles for 100 * CRT
6042    cycles for 100 * Habran
5088    cycles for 100 * ShortLen (Dave)
8486    cycles for 100 * slen (Hutch)

18892   cycles for 100 * Rui
2545    cycles for 100 * MB
9483    cycles for 100 * Masm32
8950    cycles for 100 * CRT
20342   cycles for 100 * Habran
12571   cycles for 100 * ShortLen (Dave)
19062   cycles for 100 * slen (Hutch)

18737   cycles for 100 * Rui
2545    cycles for 100 * MB
9462    cycles for 100 * Masm32
7726    cycles for 100 * CRT
20296   cycles for 100 * Habran
11327   cycles for 100 * ShortLen (Dave)
19168   cycles for 100 * slen (Hutch)

100     = eax Rui
100     = eax MB
100     = eax Masm32
100     = eax CRT
100     = eax Habran
100     = eax ShortLen (Dave)
100     = eax slen (Hutch)

--- ok ---

Gunther
Get your facts first, and then you can distort them.

RuiLoureiro

  • Member
  • ****
  • Posts: 819
Re: Optimizing some code
« Reply #19 on: June 11, 2014, 08:44:13 PM »
Quote
Hutch:
This very simple algo is still one of my favourites, mainly because its small,
simple and flexible enough...

In DOS we used junk like rep scasb.
But i have that kind of instructions in the dustbin.
Now i avoid to move the addresses.
I like to use the index (=length, size,...) and
a minimum number of registers.
In any way, for me, strlen32 or strlen32M is good.

Remember that, in many applications, we need to
remove input spaces or to find CR/LF and then we
get the length. So this is more important than
one to get the length only. For me, it is.
Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)

12473   cycles for 100 * Rui
3762    cycles for 100 * MB
12026   cycles for 100 * Masm32
5466    cycles for 100 * CRT
14959   cycles for 100 * Habran
11102   cycles for 100 * ShortLen (Dave)
12101   cycles for 100 * slen (Hutch)

12692   cycles for 100 * Rui
3699    cycles for 100 * MB
11994   cycles for 100 * Masm32
5172    cycles for 100 * CRT
15321   cycles for 100 * Habran
11575   cycles for 100 * ShortLen (Dave)
12105   cycles for 100 * slen (Hutch)

28829   cycles for 100 * Rui
7346    cycles for 100 * MB
26394   cycles for 100 * Masm32
15833   cycles for 100 * CRT
35733   cycles for 100 * Habran
27929   cycles for 100 * ShortLen (Dave)
27561   cycles for 100 * slen (Hutch)

27941   cycles for 100 * Rui
7318    cycles for 100 * MB
26319   cycles for 100 * Masm32
15981   cycles for 100 * CRT
36253   cycles for 100 * Habran
24471   cycles for 100 * ShortLen (Dave)
27507   cycles for 100 * slen (Hutch)

100     = eax Rui
100     = eax MB
100     = eax Masm32
100     = eax CRT
100     = eax Habran
100     = eax ShortLen (Dave)
100     = eax slen (Hutch)

--- ok ---
« Last Edit: June 12, 2014, 08:19:09 AM by RuiLoureiro »

nidud

  • Member
  • *****
  • Posts: 2001
    • https://github.com/nidud/asmc
Re: Optimizing some code
« Reply #20 on: June 11, 2014, 08:48:19 PM »
new test:
Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)
STRLEN test:
short - <1,3,6,10,20,30,40,50,70>
long  - <100,200,300,400,500,700,1000>
------------------------------------------------------
712 cycles - std
910 cycles - Rui
671 cycles - habran
533 cycles - Dave
674 cycles - Hutch
593 cycles - JJ

6598 cycles - std
9789 cycles - Rui
6602 cycles - habran
4985 cycles - Dave
6589 cycles - Hutch
4011 cycles - JJ

Dave's function fails (3, 700)?:
Code: [Select]
0000000A != 3 - STRLEN error
000003BC != 700 - STRLEN error

RuiLoureiro

  • Member
  • ****
  • Posts: 819
Re: Optimizing some code
« Reply #21 on: June 11, 2014, 08:59:22 PM »
Jochen,
         Sorry, i think that i misunderstood your question.
         HERE are strlen32, strlen32M

Code: [Select]
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
strlen32    proc        pBuf:DWORD
            push        ebx           
            mov         ecx, [esp+8]           ; get pointer to string
            mov         eax, ecx               ; copy pointer
            and         ecx, 3                 ; lower 2 bits of address, check alignment
            jz          L2                     ; string is aligned by 4. Go to loop           
            and         eax, -4                ; align pointer by 4
            mov         ebx, [eax]             ; read from nearest preceding boundary
            shl         ecx, 3                 ; mul by 8 = displacement in bits
            mov         edx, -1
            shl         edx, cl                ; make byte mask
            not         edx                    ; mask = 0FFH for false bytes
            or          ebx, edx               ; mask out false bytes
            ; check first four bytes for zero
            ;-----------------------------------
            lea         ecx, [ebx-01010101H]   ; subtract 1 from each byte
            not         ebx                    ; invert all bytes
            and         ecx, ebx               ; and these two
            and         ecx, 80808080H         ; test all sign bits
            jnz         L3                     ; zero-byte found       
            ; Main loop, read 4 bytes aligned
            ;-----------------------------------
    L1:     add         eax, 4                 ; increment pointer by 4
    L2:     mov         ebx, [eax]             ; read 4 bytes of string
            lea         ecx, [ebx-01010101H]   ; subtract 1 from each byte
            not         ebx                    ; invert all bytes
            and         ecx, ebx               ; and these two
            and         ecx, 80808080H         ; test all sign bits
            jz          L1                     ; no zero bytes, continue loop       
    L3:     bsf         ecx, ecx               ; find right-most 1-bit
            shr         ecx, 3                 ; divide by 8 = byte index           
            sub         eax, [esp+8]           ; subtract start address
            add         eax, ecx               ; add index to byte           
            pop         ebx
            ret         4
strlen32    endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
;******************************************************************************
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
strlen32M       proc        pBuf:DWORD           
                mov         ecx, [esp+4]           ; get pointer to string
                mov         eax, ecx               ; copy pointer
                and         ecx, 3                 ; lower 2 bits of address, check alignment
                jz          L2                     ; string is aligned by 4. Go to loop           
                and         eax, -4                ; align pointer by 4               
                shl         ecx, 3                 ; mul by 8 = displacement in bits
                mov         edx, -1
                shl         edx, cl                ; make byte mask
                not         edx                    ; mask = 0FFH for false bytes
                or          edx, [eax]             ; read from nearest preceding boundary         
                ; check first four bytes for zero
                ;-----------------------------------
                lea         ecx, [edx-01010101H]   ; subtract 1 from each byte
                not         edx                    ; invert all bytes
                and         ecx, edx               ; and these two
                and         ecx, 80808080H         ; test all sign bits
                jnz         L3                     ; zero-byte found       
                ; Main loop, read 4 bytes aligned
                ;-----------------------------------
        L1:     add         eax, 4                 ; increment pointer by 4
        L2:     mov         edx, [eax]             ; read 4 bytes of string
                lea         ecx, [edx-01010101H]   ; subtract 1 from each byte
                not         edx                    ; invert all bytes
                and         ecx, edx               ; and these two
                and         ecx, 80808080H         ; test all sign bits
                jz          L1                     ; no zero bytes, continue loop       
        L3:     bsf         ecx, ecx               ; find right-most 1-bit
                shr         ecx, 3                 ; divide by 8 = byte index           
                sub         eax, [esp+4]           ; subtract start address
                add         eax, ecx               ; add index to byte           
                ret         4
strlen32M       endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

jj2007

  • Member
  • *****
  • Posts: 10676
  • Assembler is fun ;-)
    • MasmBasic
Re: Optimizing some code
« Reply #22 on: June 11, 2014, 09:15:57 PM »
Jochen,
         Sorry, i think that i misunderstood your question.
         HERE are strlen32, strlen32M

OK, integrated in attached testbed. They are fast indeed.

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)

11620   cycles for 100 * Rui
3512    cycles for 100 * MB
4338    cycles for 100 * strlen32
4755    cycles for 100 * strlen32M
11842   cycles for 100 * Habran
4932    cycles for 100 * ShortLen (Dave)
11417   cycles for 100 * slen (Hutch)

11628   cycles for 100 * Rui
3514    cycles for 100 * MB
4331    cycles for 100 * strlen32
4736    cycles for 100 * strlen32M
11842   cycles for 100 * Habran
4934    cycles for 100 * ShortLen (Dave)
11425   cycles for 100 * slen (Hutch)

32747   cycles for 100 * Rui
5516    cycles for 100 * MB
11223   cycles for 100 * strlen32
13623   cycles for 100 * strlen32M
33197   cycles for 100 * Habran
15323   cycles for 100 * ShortLen (Dave)
32441   cycles for 100 * slen (Hutch)

32974   cycles for 100 * Rui
5515    cycles for 100 * MB
11256   cycles for 100 * strlen32
13629   cycles for 100 * strlen32M
33159   cycles for 100 * Habran
15327   cycles for 100 * ShortLen (Dave)
32436   cycles for 100 * slen (Hutch)

Dave's function fails (3, 700)?:

Change
ShortLen PROC _lpStr:LPSTR
to
ShortLen PROC ; _lpStr:LPSTR

nidud

  • Member
  • *****
  • Posts: 2001
    • https://github.com/nidud/asmc
Re: Optimizing some code
« Reply #23 on: June 11, 2014, 09:26:44 PM »
fast  :t

Quote
AMD Athlon(tm) II X2 245 Processor (SSE3)
------------------------------------------------------
712   cycles - std
910   cycles - Rui
671   cycles - habran
521   cycles - Dave
663   cycles - Hutch
561   cycles - JJ
445   cycles - Rui32
462   cycles - Rui32M

6623   cycles - std
9859   cycles - Rui
6627   cycles - habran
4242   cycles - Dave
6627   cycles - Hutch
4029   cycles - JJ
2651   cycles - Rui32
2823   cycles - Rui32M

Code: [Select]
00000301 != 1 - Dave: STRLEN error
0000000A != 3 - Dave: STRLEN error
000003BC != 700 - Dave: STRLEN error

jj2007

  • Member
  • *****
  • Posts: 10676
  • Assembler is fun ;-)
    • MasmBasic
Re: Optimizing some code
« Reply #24 on: June 11, 2014, 09:32:34 PM »
fast  :t
...
561   cycles - JJ

Too slow for my taste, but changing
   void len(string)
to
   void Len(string)
helps a lot ;-)

RuiLoureiro

  • Member
  • ****
  • Posts: 819
Re: Optimizing some code
« Reply #25 on: June 11, 2014, 09:35:28 PM »
You need to write a simple algo to sort it, Jochen.
Where is your algo ? I bet it uses rep scasb !

This table is sorted.
Quote
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)

 3704    cycles for 100 * MB
 5093    cycles for 100 * strlen32M
 5349    cycles for 100 * strlen32

11087   cycles for 100 * ShortLen (Dave)
12093   cycles for 100 * slen (Hutch)
12549   cycles for 100 * Rui
15481   cycles for 100 * Habran
 3742   cycles for 100 * MB
 5332   cycles for 100 * strlen32
 5462   cycles for 100 * strlen32M

11163   cycles for 100 * ShortLen (Dave)
12060   cycles for 100 * slen (Hutch)
12775   cycles for 100 * Rui
14858   cycles for 100 * Habran

 7334   cycles for 100 * MB
17334   cycles for 100 * strlen32M
17613   cycles for 100 * strlen32

27930   cycles for 100 * slen (Hutch)
28026   cycles for 100 * ShortLen (Dave)
28658   cycles for 100 * Rui
43521   cycles for 100 * Habran

 7344   cycles for 100 * MB
17316   cycles for 100 * strlen32M
17779   cycles for 100 * strlen32

24022   cycles for 100 * ShortLen (Dave)
27805   cycles for 100 * slen (Hutch)
28074   cycles for 100 * Rui
38015   cycles for 100 * Habran

100     = eax Rui
100     = eax MB
100     = eax strlen32
100     = eax strlen32M
100     = eax Habran
100     = eax ShortLen (Dave)
100     = eax slen (Hutch)

--- ok ---

RuiLoureiro

  • Member
  • ****
  • Posts: 819
Re: Optimizing some code
« Reply #26 on: June 11, 2014, 09:39:49 PM »
 :biggrin:
: we have optimizations and ... optimizations !
 for all tastes ! (we see it not only here)

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
STRLEN test:
short - <1,3,6,10,20,30,40,50,70>
long  - <100,200,300,400,500,700,1000>
------------------------------------------------------
2516    cycles - std
1128    cycles - Rui
1297    cycles - habran
1048    cycles - Dave
1443    cycles - Hutch
1200    cycles - JJ
  839    cycles - Rui32
  841    cycles - Rui32M

13747   cycles - std
  7607    cycles - Rui
  9441    cycles - habran
  6250    cycles - Dave
  7470    cycles - Hutch
  7079    cycles - JJ
  4154    cycles - Rui32
  4137    cycles - Rui32M

--- ok ---
« Last Edit: June 12, 2014, 08:21:12 AM by RuiLoureiro »

nidud

  • Member
  • *****
  • Posts: 2001
    • https://github.com/nidud/asmc
Re: Optimizing some code
« Reply #27 on: June 11, 2014, 10:00:55 PM »
yep, that was better  :t
Quote
AMD Athlon(tm) II X2 245 Processor (SSE3)
STRLEN test:
short - <1,3,6,10,20,30,40,50,70>
long  - <100,200,300,400,500,700,1000>
------------------------------------------------------
712   cycles - std
909   cycles - Rui
671   cycles - habran
544   cycles - Dave
664   cycles - Hutch
606   cycles - JJ
444   cycles - Rui32
440   cycles - Rui32M
308   cycles - JJ2

6598   cycles - std
10244   cycles - Rui
6590   cycles - habran
5022   cycles - Dave
6576   cycles - Hutch
4019   cycles - JJ
2641   cycles - Rui32
2792   cycles - Rui32M
1049   cycles - JJ2

RuiLoureiro

  • Member
  • ****
  • Posts: 819
Re: Optimizing some code
« Reply #28 on: June 11, 2014, 11:07:08 PM »

Quote
yep, that was better
What ? Did you see it ? Where is it ? Show us.
I never saw, i never read any proc written by Jochen.
We should compare only comparable things.


This is what we may see:
for my taste this is junk
Quote
align 16
TestA_s:
NameA equ Rui   ; assign a descriptive name here
TestA proc
  mov ebx, AlgoLoops-1   ; loop e.g. 100x
  align 4
  .Repeat
   push offset somestring
   call GetStringLenY   ; Rui
   dec ebx
  .Until Sign?
  ret
TestA endp
TestA_endp:
; ---------------------------------------------------
align 16
TestB_s:
NameB equ MB   ; assign a descriptive name here
TestB proc
  mov ebx, AlgoLoops-1   ; loop e.g. 100x
  align 4
  .Repeat
   void Len(offset somestring)
   dec ebx
  .Until Sign?
  ret
TestB endp
TestB_endp:
align 16
TestC_s:
; ----------------------------------------------------
; useC=0      ; uncomment to exclude TestC
NameC equ <strlen32>   ; assign a descriptive name here
TestC proc
  mov ebx, AlgoLoops-1   ; loop e.g. 100x
  align 4
  .Repeat
   invoke strlen32, offset somestring
   dec ebx
  .Until Sign?
  ret
TestC endp
TestC_endp:
« Last Edit: June 12, 2014, 03:16:13 AM by RuiLoureiro »

dedndave

  • Member
  • *****
  • Posts: 8829
  • Still using Abacus 2.0
    • DednDave
Re: Optimizing some code
« Reply #29 on: June 11, 2014, 11:09:02 PM »
Dave's function fails (3, 700)?:
Code: [Select]
0000000A != 3 - STRLEN error
000003BC != 700 - STRLEN error

i don't understand what that means
i'll fix it if you like