Author Topic: Slight modification to StrLen in the masm32 library - 9% speed increase  (Read 13127 times)

TWell

  • Member
  • ****
  • Posts: 748
Re: Slight modification to StrLen in the masm32 library - 9% speed increase
« Reply #30 on: September 14, 2015, 03:42:22 PM »
With AMD x64 prosessor
Something wrong with it?
Code: [Select]
  209 should be cycle count minus overhead       since function is empty should be zero or close to it
   34 should be cycle count minus overhead       since function is empty should be zero or close to it
 -180 should be cycle count minus overhead       since function is empty should be zero or close to it
  -52 should be cycle count minus overhead       since function is empty should be zero or close to it
 -814 should be cycle count minus overhead       since function is empty should be zero or close to it
   21 should be cycle count minus overhead       since function is empty should be zero or close to it
-2666 should be cycle count minus overhead       since function is empty should be zero or close to it
   24 should be cycle count minus overhead       since function is empty should be zero or close to it
   71 should be cycle count minus overhead       since function is empty should be zero or close to it
  -85 should be cycle count minus overhead       since function is empty should be zero or close to it
   81 should be cycle count minus overhead       since function is empty should be zero or close to it
    4 should be cycle count minus overhead       since function is empty should be zero or close to it
   39 should be cycle count minus overhead       since function is empty should be zero or close to it
  706 should be cycle count minus overhead       since function is empty should be zero or close to it
   24 should be cycle count minus overhead       since function is empty should be zero or close to it
-3367 should be cycle count minus overhead       since function is empty should be zero or close to it
   18 should be cycle count minus overhead       since function is empty should be zero or close to it
  110 should be cycle count minus overhead       since function is empty should be zero or close to it
   72 should be cycle count minus overhead       since function is empty should be zero or close to it
  123 should be cycle count minus overhead       since function is empty should be zero or close to it
Code: [Select]
-213
 -12
  12
 -12
  18
 354
  -9
 -45
  18
-303
  36
 -39
  27
  21
  39
 -15
-3564
  27
  15
 -12

zedd151

  • Member
  • ****
  • Posts: 871
Re: Slight modification to StrLen in the masm32 library - 9% speed increase
« Reply #31 on: September 14, 2015, 04:19:53 PM »
With AMD x64 prosessor
Something wrong with it?

The code in your attachment appears to be the code snippet from the Intel manual
on RDTSC posted earlier. You'd have to look at the manual to see if AMD is supported.

I make no claims regarding outside sources. It was only posted because I found it
to be of interest. That snippet is only to 'Initialize' a counter using RDTSC, if I had
read the documentation correctly, btw. I posted it for comparison to how MichaelW
initializes the counter macro.

Most of the work in this thread is *experimental*, btw and mostly for research.
Trying to find a better way to set the "LoopCount" for Michaels timer/counter macros.

As for the test you made with one of the testbeds I posted here, they are
a work-in-progress. Thank you for the input. :t

NO Guarantees are made as to the test pieces posted in this thread.

edit == clarification. and more clarification.
I'm not always the sharpest knife in the drawer, but I have my moments.  :P

zedd151

  • Member
  • ****
  • Posts: 871
Re: Slight modification to StrLen in the masm32 library - 9% speed increase
« Reply #32 on: September 14, 2015, 04:57:10 PM »
I might just give up with the better counter pursuit.
My computer is at fault, or I should say my processor.

While browsing the threads here in the lab, I came across this thread:
http://masm32.com/board/index.php?topic=3161.10

And ran jj's test These are two results which are very different,
ran one after the other

First run

Code: [Select]
Genuine Intel(R) CPU           T2060  @ 1.60GHz (SSE3)

525     cycles for 100 * eax*5, lea
567     cycles for 100 * eax*5, imul
575     cycles for 100 * eax*7, lea
560     cycles for 100 * eax*7, imul
825     cycles for 100 * eax*7, mul

524     cycles for 100 * eax*5, lea
570     cycles for 100 * eax*5, imul
577     cycles for 100 * eax*7, lea
570     cycles for 100 * eax*7, imul
825     cycles for 100 * eax*7, mul

526     cycles for 100 * eax*5, lea
575     cycles for 100 * eax*5, imul
575     cycles for 100 * eax*7, lea
567     cycles for 100 * eax*7, imul
829     cycles for 100 * eax*7, mul

5       bytes for eax*5, lea
5       bytes for eax*5, imul
8       bytes for eax*7, lea
5       bytes for eax*7, imul
9       bytes for eax*7, mul

Second run

Code: [Select]
Genuine Intel(R) CPU           T2060  @ 1.60GHz (SSE3)

262     cycles for 100 * eax*5, lea
283     cycles for 100 * eax*5, imul
287     cycles for 100 * eax*7, lea
278     cycles for 100 * eax*7, imul
416     cycles for 100 * eax*7, mul

262     cycles for 100 * eax*5, lea
283     cycles for 100 * eax*5, imul
288     cycles for 100 * eax*7, lea
283     cycles for 100 * eax*7, imul
412     cycles for 100 * eax*7, mul

262     cycles for 100 * eax*5, lea
283     cycles for 100 * eax*5, imul
288     cycles for 100 * eax*7, lea
283     cycles for 100 * eax*7, imul
412     cycles for 100 * eax*7, mul

5       bytes for eax*5, lea
5       bytes for eax*5, imul
8       bytes for eax*7, lea
5       bytes for eax*7, imul
9       bytes for eax*7, mul

So, until I get a better computer, I will have to pass on any
further experiments and investigations regarding counters/timers.

Now I know why it was so hard to get a fix on the right LoopCount.  :(

zedd
I'm not always the sharpest knife in the drawer, but I have my moments.  :P

jj2007

  • Member
  • *****
  • Posts: 10548
  • Assembler is fun ;-)
    • MasmBasic
Re: Slight modification to StrLen in the masm32 library - 9% speed increase
« Reply #33 on: September 14, 2015, 05:18:25 PM »
Try this version.

zedd151

  • Member
  • ****
  • Posts: 871
Re: Slight modification to StrLen in the masm32 library - 9% speed increase
« Reply #34 on: September 14, 2015, 05:29:13 PM »
Try this version.

ran it a dozen times...
Code: [Select]
Genuine Intel(R) CPU           T2060  @ 1.60GHz (SSE3)

260     cycles for 100 * eax*5, lea
286     cycles for 100 * eax*5, imul
285     cycles for 100 * eax*7, lea
276     cycles for 100 * eax*7, imul
409     cycles for 100 * eax*7, mul

263     cycles for 100 * eax*5, lea
282     cycles for 100 * eax*5, imul
286     cycles for 100 * eax*7, lea
281     cycles for 100 * eax*7, imul
412     cycles for 100 * eax*7, mul

263     cycles for 100 * eax*5, lea
282     cycles for 100 * eax*5, imul
285     cycles for 100 * eax*7, lea
281     cycles for 100 * eax*7, imul
409     cycles for 100 * eax*7, mul

5       bytes for eax*5, lea
5       bytes for eax*5, imul
8       bytes for eax*7, lea
5       bytes for eax*7, imul
9       bytes for eax*7, mul

The results of the others were way similar.

But still I think my 'puter is ^$&^$#^%#&^$*^$*

Hard to get good, consistent readings sometimes.
Not only from my own counters, but the ones from the macros, and
other members testbeds as well.

Timers are quite a bit better in general, at least for this aging laptop.
Probably has a cache of  < 1kb lol.

edit = typo
I'm not always the sharpest knife in the drawer, but I have my moments.  :P

jj2007

  • Member
  • *****
  • Posts: 10548
  • Assembler is fun ;-)
    • MasmBasic
Re: Slight modification to StrLen in the masm32 library - 9% speed increase
« Reply #35 on: September 14, 2015, 07:20:54 PM »
Your last results look very consistent. Your cpu is OK, we must improve our timers, that's all.

zedd151

  • Member
  • ****
  • Posts: 871
Re: Slight modification to StrLen in the masm32 library - 9% speed increase
« Reply #36 on: September 14, 2015, 07:50:27 PM »
Your last results look very consistent. Your cpu is OK, we must improve our timers, that's all.

Get rid of P4 and it's offspring too. lol

I have read about the many problems with certain cpus and RDTSC. Mine for instance with
RDTSC, it always returns a multiple of 4. Making matters worse. I wouldn't want to simply
SHR 2, and find out I am wrong.

It's all good, I'm still in the learning process.

btw, with the timers my cpu is ok. It's the counters
that my cpu struggles with. I have been reading up on Intels CoreDuo
and NetBurst, and the problems related to accurate counts with RDTSC.

I'm not always the sharpest knife in the drawer, but I have my moments.  :P

zedd151

  • Member
  • ****
  • Posts: 871
I did it!!! --- well, sort of
« Reply #37 on: October 07, 2015, 06:55:18 AM »
I was experimenting wih one of my favorite guinea pigs again, and came across a couple of optimizations for StrLen in the masm32 library.

It increases speed by a whopping 33% or better for most strings.

There is a catch however, explained in the source code.

Code: [Select]

Testing zStrLen macro smaller string with a smaller string

64  = cycles
63  = cycles
63  = cycles
63  = cycles
63  = cycles
--------------
109  = result for validation

 Testing StrLen (from the masm32 library) with a smaller string

100  = cycles
100  = cycles
100  = cycles
100  = cycles
100  = cycles
--------------
109  = result for validation

 Testing zStrLen macro with a  larger string

639  = cycles
639  = cycles
640  = cycles
641  = cycles
643  = cycles
--------------
1149  = result for validation

 Testing StrLen (from the masm32 library) with a  larger string

1012  = cycles
1006  = cycles
1003  = cycles
1011  = cycles
1004  = cycles
--------------
1149  = result for validation



-- done! --



Here is the code:
Code: [Select]

            include \masm32\include\masm32rt.inc
        .686
        .xmm
            include \masm32\macros\timers.asm
           
; -------------------------------------------------------------------------
;   This macro is derived from the StrLen procedure in the masm32 library,
; it is 35-38% faster than StrLen, and accurate....
;
; ..There is a catch however, it will only count characters below 129d or
; 81h, which is fine in most instances - except for strings containing
; characters in the upper 80h bytes of the ascii table.
;
;   In the case where it does encounter a character from the upper ascii
; range, it treats it as a terminator, and stops counting at that point.
;
;   So for masm32 .asm files that use 0ABh (or others > 81h) in a comment-
; line for instance, it will return the wrong result. Sorry, hutch.
; -------------------------------------------------------------------------

; -------------------------------------------------------------------------
; Notes:
; - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
; Some instruction sequences have been reordered, as well as a couple have
; been removed (hence, no upper ascii). This was all done to create the
; dramatic speed increase shown in this example.
; Some registers also have been rearranged during the time I was trying to
; reduce register usage. It turned out the the macro still uses the same
; amount of registers, but some in a different fashion. All constants used
; are now in registers for faster access.
; -------------------------------------------------------------------------
; This example has been tested on data over 100MB, and retains its accuracy,
; but only when all the characters (in the text under test) are in the lower
; 80h bytes of the ascii table.
; -------------------------------------------------------------------------
; fyi result from 100MB test (file filled with random chars 1 to 80h):
; ~ 79,4xx,xxx  cycles zStrLen macro
; ~ 101,xxx,xxx cycles for StrLen.
; - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
; As you can see the speed increase has been reduced with this very large
; 'string'. But for most uses, the speed increase remains ~ 33% or so.
; -------------------------------------------------------------------------

; ---------------------------------
; Begin zedds zStrLen macro
;----------------------------------
    zStrLen macro item
    local nxt
    align 4
        push esi
        push ebx
       
        mov esi, 0FEFEFEFFh
        mov eax, item
        mov ebx, 80808080h
        push eax
    @@:                     ; discrete unrolling of loop - x 6 (best case)
        mov edx, [eax]
        lea ecx, [edx+esi]
        add eax, 4
        and ecx, ebx
        jnz nxt

        mov edx, [eax]
        lea ecx, [edx+esi]
        add eax, 4
        and ecx, ebx
        jnz nxt

        mov edx, [eax]
        lea ecx, [edx+esi]
        add eax, 4
        and ecx, ebx
        jnz nxt

        mov edx, [eax]
        lea ecx, [edx+esi]
        add eax, 4
        and ecx, ebx
        jnz nxt

        mov edx, [eax]
        lea ecx, [edx+esi]
        add eax, 4
        and ecx, ebx
        jnz nxt

        mov edx, [eax]
        lea ecx, [edx+esi]
        add eax, 4
        and ecx, ebx
        jz @B
    nxt:
        test cx, bx
        jnz @F
        add eax, 2
        shr ecx, 16
    @@:
        pop edx
        add edx, 3
        shl cl, 1
        sbb eax, edx
        pop ebx
        pop esi
    endm
; ---------------------------------
; End zedds zStrLen macro
;----------------------------------

    .data
   
        align 4
        test1\
        db "01234567890123456789012345678901234567890123456789"
        db "98765432109876543210987654321098765432109876543210"
        db "012345678", 0
       
        align 4
        test2\
        db "01234567890123456789012345678901234567890123456789"
        db "98765432109876543210987654321098765432109876543210"
        db "01234567890123456789012345678901234567890123456789"
        db "98765432109876543210987654321098765432109876543210"
        db "01234567890123456789012345678901234567890123456789"
        db "98765432109876543210987654321098765432109876543210"
        db "01234567890123456789012345678901234567890123456789"
        db "98765432109876543210987654321098765432109876543210"
        db "01234567890123456789012345678901234567890123456789"
        db "98765432109876543210987654321098765432109876543210"
        db "01234567890123456789012345678901234567890123456789"
        db "98765432109876543210987654321098765432109876543210"
        db "01234567890123456789012345678901234567890123456789"
        db "98765432109876543210987654321098765432109876543210"
        db "01234567890123456789012345678901234567890123456789"
        db "98765432109876543210987654321098765432109876543210"
        db "01234567890123456789012345678901234567890123456789"
        db "98765432109876543210987654321098765432109876543210"
        db "01234567890123456789012345678901234567890123456789"
        db "98765432109876543210987654321098765432109876543210"
        db "01234567890123456789012345678901234567890123456789"
        db "98765432109876543210987654321098765432109876543210"
        db "0123456789012345678901234567890123456789012345678", 0


        ctrctr0     dd 0
        ctrctr1     dd 0
        ctrctr2     dd 0
        ctrctr3     dd 0
        ctrctr4     dd 0
    .code


        start:
       
    ;;  stabilizer  ;;
; ------------------------------------------------------------------------------
; ----  This section attempts to stabilize the counts
; ----  I prefer this method to using 'Sleep'.
; ------------------------------------------------------------------------------
        print "stabilizing ", 13, 10, 13, 10

    mov ctrctr0, 12
    ctop0:
        counter_begin 50000000, HIGH_PRIORITY_CLASS
        counter_end
        print "X"
    dec ctrctr0
    jnz ctop0
    cls
; ------------------------------------------------------------------------------
; ---- Testing begins here
; ------------------------------------------------------------------------------

        print " Testing zStrLen macro smaller string", 13, 10, 13, 10

    mov ctrctr1, 5
    ctop1:
       
        counter_begin 10000000, HIGH_PRIORITY_CLASS
        zStrLen offset test1
        counter_end
        print str$(eax), 20h, " = cycles", 13, 10 ; print result for each test
       
    dec ctrctr1
    jnz ctop1
       
        print "--------------", 13, 10
        zStrLen offset test1
       
        print str$(eax), 20h, " = result for validation", 13, 10, 13, 10
       
; ------------------------------------------------------------------------------
   
        print " Testing StrLen smaller string ", 13, 10, 13, 10
   
    mov ctrctr2, 5
    ctop2:
       
        counter_begin 10000000, HIGH_PRIORITY_CLASS
        invoke StrLen, offset test1
        counter_end
        print str$(eax), 20h, " = cycles", 13, 10 ; print result for each test
       
    dec ctrctr2
    jnz ctop2
   
        print "--------------", 13, 10
        invoke StrLen, offset test1
       
        print str$(eax), 20h, " = result for validation", 13, 10, 13, 10
       
; ------------------------------------------------------------------------------

        print " Testing zStrLen macro larger string ", 13, 10, 13, 10

    mov ctrctr3, 5
    ctop3:
       
        counter_begin 1000000, HIGH_PRIORITY_CLASS
        zStrLen offset test2
        counter_end
        print str$(eax), 20h, " = cycles", 13, 10 ; print result for each test
       
    dec ctrctr3
    jnz ctop3
       
        print "--------------", 13, 10
        zStrLen offset test2
       
        print str$(eax), 20h, " = result for validation", 13, 10, 13, 10
       
; ------------------------------------------------------------------------------
   
        print " Testing StrLen larger string ", 13, 10, 13, 10
   
    mov ctrctr4, 5
    ctop4:
       
        counter_begin 1000000, HIGH_PRIORITY_CLASS
        invoke StrLen, offset test2
        counter_end
        print str$(eax), 20h, " = cycles", 13, 10 ; print result for each test
       
    dec ctrctr4
    jnz ctop4
   
        print "--------------", 13, 10
        invoke StrLen, offset test2
       
        print str$(eax), 20h, " = result for validation", 13, 10, 13, 10
       
; ------------------------------------------------------------------------------
; ---- Testing ends here
; ------------------------------------------------------------------------------

        print chr$(13, 10)
        inkey chr$(13, 10, "-- done! --", 13)
        exit


        end start


« Last Edit: July 20, 2018, 04:09:23 AM by zedd151 »
I'm not always the sharpest knife in the drawer, but I have my moments.  :P

dedndave

  • Member
  • *****
  • Posts: 8827
  • Still using Abacus 2.0
    • DednDave
Re: Slight modification to StrLen in the masm32 library - 9% speed increase
« Reply #38 on: October 07, 2015, 07:15:09 AM »
that's similar to Randy Hyde's algo - but he provided a way to work around the "catch"
i unrolled his algo - the result is that it's even faster for long strings
for shorter strings, the other algorithms are slightly better
because we most often call the routine for shorter strings, it only makes sense if you're doing several long ones

zedd151

  • Member
  • ****
  • Posts: 871
Re: Slight modification to StrLen in the masm32 library - 9% speed increase
« Reply #39 on: October 07, 2015, 07:20:06 AM »
that's similar to Randy Hyde's algo -

As long as it is not exactly the same as his :P

Where is it? (Or exactly what is the algo named so I can search for it) I'd like to take a look at it.
I'm not always the sharpest knife in the drawer, but I have my moments.  :P

dedndave

  • Member
  • *****
  • Posts: 8827
  • Still using Abacus 2.0
    • DednDave
Re: Slight modification to StrLen in the masm32 library - 9% speed increase
« Reply #40 on: October 07, 2015, 07:25:14 AM »
Randy's original code

http://www.masmforum.com/board/index.php?topic=14626.msg132439#msg132439

i unrolled it a bit, by loading the constants into regs
this made the code smaller so that the loop could be longer, and still have a short branch at the end

http://www.masmforum.com/board/index.php?topic=18349.msg154921#msg154921

dedndave

  • Member
  • *****
  • Posts: 8827
  • Still using Abacus 2.0
    • DednDave
Re: Slight modification to StrLen in the masm32 library - 9% speed increase
« Reply #41 on: October 07, 2015, 07:27:58 AM »
what makes that routine slow for shorter strings is the somewhat involved set-up
once it's set up, it runs fast, because few strings have 80h chars in them

zedd151

  • Member
  • ****
  • Posts: 871
Re: Slight modification to StrLen in the masm32 library - 9% speed increase
« Reply #42 on: October 07, 2015, 07:33:07 AM »
I'm looking at them right now Dave, thanks for the links.

Now I am going to compare apples and oranges  :P

(Just to see how my algo fares against the two in the links you provided)
I'm not always the sharpest knife in the drawer, but I have my moments.  :P