New right trim algo.

Started by hutch--, June 13, 2014, 01:14:42 AM


jj2007

#15
Thanks. The performance of any rtrim() algo basically depends on:
1. the cycles required to find the end of the string (len, Len, lstrlen, whatever)
2. the cycles required to find, scanning backwards, the first char above ASCII 32

Once found, Masm32 inserts the zero delimiter there; MasmBasic Rtrim$() does not change the original but rather passes the pointer to the original plus the string len without spaces to the string engine.
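
As a rough illustration of the two steps above, here is a minimal C sketch (C only because it is compact and easy to check; it is not the Masm32 or MasmBasic code). The names `rtrim_len` and `rtrim_inplace` are made up for this example. The first reports the length without trailing blanks and leaves the string alone; the second writes the zero terminator in place.

```c
#include <stddef.h>
#include <string.h>

/* Length of s once trailing bytes <= 32 are ignored. Does not modify s. */
size_t rtrim_len(const char *s)
{
    size_t n = strlen(s);                         /* step 1: find the end */
    while (n > 0 && (unsigned char)s[n - 1] <= 32)
        n--;                                      /* step 2: walk back over blanks */
    return n;
}

/* In-place variant: put the zero terminator right after the last kept byte. */
char *rtrim_inplace(char *s)
{
    s[rtrim_len(s)] = '\0';
    return s;
}
```

A caller that only needs the trimmed length can stop at `rtrim_len`; the in-place version costs one extra byte write.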

LarryC

Allocating 589824 megabytes for testing
If it crashes due to lack of memory, reduce the allocation
size in the 'lpcount' equate above
94 milliseconds rtrim1
78 milliseconds rtrim2
78 milliseconds rtrim1
93 milliseconds rtrim2
78 milliseconds rtrim1
78 milliseconds rtrim2
78 milliseconds rtrim1
93 milliseconds rtrim2
Press any key to continue ...

dedndave

prescott w/htt @ 3 GHz
171 milliseconds rtrim1
171 milliseconds rtrim2
172 milliseconds rtrim1
187 milliseconds rtrim2
171 milliseconds rtrim1
172 milliseconds rtrim2
172 milliseconds rtrim1
172 milliseconds rtrim2


may i suggest.....
separate the functions
you want to work on the RTrim operation, not StrLen
and what about making it UNICODE aware ?

hutch--

Dave,

The two algos do not use a len operation; each is a single pass from left to right that finds the last non-blank character and then places a terminator after it. If I have it right, the two algos time the same because both are limited by memory access, and a byte scan design probably will not go any faster.
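
A single pass of that kind can be modelled in C roughly as follows. This is a sketch with a made-up name (`rtrim_onepass`), not a translation of the assembly. It remembers the position one past the last byte above 32, so a string whose only printable character is the first one needs no special case.

```c
/* One left-to-right pass: remember where the last byte > 32 ended,
   then terminate the string there. */
char *rtrim_onepass(char *s)
{
    char *end = s;                      /* one past the last kept byte so far */
    for (char *p = s; *p != '\0'; p++) {
        if ((unsigned char)*p > 32)
            end = p + 1;
    }
    *end = '\0';                        /* empty or all-blank input ends at s */
    return s;
}
```

Every byte is read exactly once and only one byte is written, which fits the observation that memory access sets the speed.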

A Unicode version would look very similar but would read WORD-sized characters rather than bytes.
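
For comparison, a 16-bit variant might look like the C sketch below. It assumes UTF-16 code units held in `uint16_t` and keeps the same "above 32" rule; the name `rtrim_onepass_w` is hypothetical.

```c
#include <stdint.h>

/* Same single pass, reading 16-bit code units instead of bytes. */
uint16_t *rtrim_onepass_w(uint16_t *s)
{
    uint16_t *end = s;                  /* one past the last kept unit so far */
    for (uint16_t *p = s; *p != 0; p++) {
        if (*p > 32)
            end = p + 1;
    }
    *end = 0;                           /* 16-bit zero terminator */
    return s;
}
```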

nidud

#19
deleted

Gunther

Hi nidud,

results from strtrim:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
------------------------------------------------------
171741  cycles - 0: standard (scasb)
10004   cycles - 1: rtrim1
11408   cycles - 2: rtrim2
11234   cycles - 3: new strlen (AgnerFog)

71151   cycles - 0: standard (scasb)
9874    cycles - 1: rtrim1
26642   cycles - 2: rtrim2
27218   cycles - 3: new strlen (AgnerFog)

133009  cycles - 0: standard (scasb)
23452   cycles - 1: rtrim1
26731   cycles - 2: rtrim2
27158   cycles - 3: new strlen (AgnerFog)

--- ok ---


Gunther
You have to know the facts before you can distort them.

hutch--

Here is a version that also tests an algo that scans the length and then back scans to find the last acceptable character. It is substantially slower than the single pass versions (125 ms against 94 ms in the run below). It uses Agner Fog's old StrLen algo unrolled by 4, then back scans the string to find that last character. It could be optimised some more but there is little gain in it as the back scanner only has a single memory access.
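
The idea behind the DWORD length scan plus back scan can be sketched in C as below. This is only a model with made-up names (`has_zero_byte`, `strlen4`, `rtrim_len_backscan`), not the unrolled assembly. It assumes the buffer is readable up to the next multiple of 4 bytes past the terminator, which holds for the 48-byte test strings.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if any of the four bytes in v is zero. */
static uint32_t has_zero_byte(uint32_t v)
{
    return (v - 0x01010101u) & ~v & 0x80808080u;
}

/* Length scan, four bytes per step.
   ASSUMPTION: s may be read up to 3 bytes past its terminator. */
size_t strlen4(const char *s)
{
    size_t n = 0;
    for (;;) {
        uint32_t v;
        memcpy(&v, s + n, 4);           /* 4-byte load that tolerates misalignment */
        if (has_zero_byte(v))
            break;
        n += 4;
    }
    while (s[n] != '\0')                /* find the zero inside the last group */
        n++;
    return n;
}

/* Length scan first, then walk back over bytes <= 32 and terminate. */
char *rtrim_len_backscan(char *s)
{
    size_t n = strlen4(s);
    while (n > 0 && (unsigned char)s[n - 1] <= 32)
        n--;
    s[n] = '\0';
    return s;
}
```

The forward scan touches four bytes per test; the back scan is byte-wise but only covers the trailing blanks.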


Allocating 589824 megabytes for testing
If it crashes due to lack of memory, reduce the allocation
size in the 'lpcount' equate above
94 milliseconds rtrim1
94 milliseconds rtrim2
125 milliseconds rtrim3
94 milliseconds rtrim1
93 milliseconds rtrim2
125 milliseconds rtrim3
94 milliseconds rtrim1
94 milliseconds rtrim2
125 milliseconds rtrim3
94 milliseconds rtrim1
94 milliseconds rtrim2
125 milliseconds rtrim3
Press any key to continue ...


This is the test piece.


IF 0  ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
                      Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include\masm32rt.inc

    rtrim1 PROTO :DWORD         ; first test algo
    rtrim2 PROTO :DWORD         ; second test algo
    rtrim3 PROTO :DWORD         ; third test algo

    .data
      item1 db "this is a test of rtrim algos                  ",0  ; 48 bytes total
      item2 db "       this is a test of rtrim algos           ",0  ; 48 bytes total
      item3 db "              this is a test of rtrim algos    ",0  ; 48 bytes total
      item4 db "                                               ",0  ; 48 bytes total

    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    lpcount equ <1024*1024*12>

main proc

    LOCAL txt   :DWORD
    LOCAL hArr  :DWORD
    LOCAL pArr  :DWORD
    LOCAL aInd  :DWORD
    LOCAL lcnt  :DWORD
    LOCAL icnt  :DWORD

    push esi
    push edi

    print "Allocating "
    print ustr$(lpcount*48 / 1024)," megabytes for testing",13,10
    print "If it crashes due to lack of memory, reduce the allocation",13,10
    print "size in the 'lpcount' equate above",13,10

    invoke SetPriorityClass,rv(GetCurrentProcess),HIGH_PRIORITY_CLASS

  mov icnt, 4
  tlp:

  ; ******************************************
  ; write an array of strings
  ; ------------------------------------------
    mov eax, lpcount
    shr eax, 2
    mov lcnt, eax

    mov pArr, rv(create_array,lpcount,48)       ; pointer array handle
    mov hArr, ecx                               ; main array memory handle
    mov edi, pArr

  larr1:
    cst [edi],    OFFSET item1
    cst [edi+4],  OFFSET item2
    cst [edi+8],  OFFSET item3
    cst [edi+12], OFFSET item4
    add edi, 4
    sub lcnt, 1
    jnz larr1
  ; ------------------------------------------

    invoke GetTickCount
    push eax

    mov eax, lpcount
    shr eax, 2
    mov lcnt, eax

  tarr1:
    mov txt, rv(rtrim1,[edi])
    mov txt, rv(rtrim1,[edi+4])
    mov txt, rv(rtrim1,[edi+8])
    mov txt, rv(rtrim1,[edi+12])
    add edi, 4
    sub lcnt, 1
    jnz tarr1

    free pArr
    free hArr

    invoke GetTickCount
    pop ecx
    sub eax, ecx

    print ustr$(eax)," milliseconds rtrim1",13,10

  ; ******************************************
  ; write an array of strings
  ; ------------------------------------------
    mov eax, lpcount
    shr eax, 2
    mov lcnt, eax

    mov pArr, rv(create_array,lpcount,48)       ; pointer array handle
    mov hArr, ecx                               ; main array memory handle
    mov edi, pArr

  larr2:
    cst [edi],    OFFSET item1
    cst [edi+4],  OFFSET item2
    cst [edi+8],  OFFSET item3
    cst [edi+12], OFFSET item4
    add edi, 4
    sub lcnt, 1
    jnz larr2
  ; ------------------------------------------

    invoke GetTickCount
    push eax

    mov eax, lpcount
    shr eax, 2
    mov lcnt, eax

  tarr2:
    mov txt, rv(rtrim2,[edi])
    mov txt, rv(rtrim2,[edi+4])
    mov txt, rv(rtrim2,[edi+8])
    mov txt, rv(rtrim2,[edi+12])
    add edi, 4
    sub lcnt, 1
    jnz tarr2

    free pArr
    free hArr

    invoke GetTickCount
    pop ecx
    sub eax, ecx

    print ustr$(eax)," milliseconds rtrim2",13,10

  ; ------------------------------------------
    mov eax, lpcount
    shr eax, 2
    mov lcnt, eax

    mov pArr, rv(create_array,lpcount,48)       ; pointer array handle
    mov hArr, ecx                               ; main array memory handle
    mov edi, pArr

  larr3:
    cst [edi],    OFFSET item1
    cst [edi+4],  OFFSET item2
    cst [edi+8],  OFFSET item3
    cst [edi+12], OFFSET item4
    add edi, 4
    sub lcnt, 1
    jnz larr3

  ; ------------------------------------------

    invoke GetTickCount
    push eax

    mov eax, lpcount
    shr eax, 2
    mov lcnt, eax

  tarr3:
    mov txt, rv(rtrim3,[edi])
    mov txt, rv(rtrim3,[edi+4])
    mov txt, rv(rtrim3,[edi+8])
    mov txt, rv(rtrim3,[edi+12])
    add edi, 4
    sub lcnt, 1
    jnz tarr3

    free pArr
    free hArr

    invoke GetTickCount
    pop ecx
    sub eax, ecx

    print ustr$(eax)," milliseconds rtrim3",13,10


  ; ******************************************

    sub icnt, 1
    jnz tlp

    invoke SetPriorityClass,rv(GetCurrentProcess),NORMAL_PRIORITY_CLASS

    pop edi
    pop esi

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

rtrim1 proc ptxt:DWORD

  ; ------------------------------------------------
  ; one pass left to right to determine last valid
  ; character then terminate it after that character
  ; ------------------------------------------------

    mov edx, [esp+4]                ; load address into EDX
    lea ecx, [edx-1]                ; ECX = start - 1 until a valid character is found
    sub edx, 1

  lpst:
    add edx, 1
    movzx eax, BYTE PTR [edx]       ; zero extend byte into EAX
    test eax, eax                   ; test for zero terminator
    jz lpout                        ; exit loop on zero
    cmp eax, 32                     ; test if space character or lower
    jbe lpst                        ; jump back if below or equal
    mov ecx, edx                    ; store updated last character location in ECX
    jmp lpst                        ; jump back to loop start

  lpout:
    mov BYTE PTR [ecx+1], 0         ; write the terminator after the last valid character,
                                    ; or at the start address if none was found

    mov eax, [esp+4]                ; return the original string address in EAX
    ret 4                           ; BYE !

rtrim1 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

rtrim2 proc ptxt:DWORD

  ; ------------------------------------------------
  ; one pass left to right to determine last valid
  ; character then terminate it after that character
  ; ------------------------------------------------
    mov edx, [esp+4]                ; load address into EDX
    lea ecx, [edx-1]                ; ECX = start - 1 until a valid character is found
    jmp ji

  pre:
    mov ecx, edx                    ; store updated last character location in ECX
  lpst:
    add edx, 1
  ji:
    movzx eax, BYTE PTR [edx]       ; zero extend byte into EAX
    cmp eax, 32                     ; test if space character or lower
    ja pre
    test eax, eax                   ; test for zero terminator
    jnz lpst                        ; fall through on zero

  lpout:
    mov BYTE PTR [ecx+1], 0         ; write the terminator after the last valid character,
                                    ; or at the start address if none was found

    mov eax, [esp+4]                ; return the original string address in EAX
    ret 4                           ; BYE !

rtrim2 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

rtrim3 proc ptxt:DWORD

    mov     eax, [esp+4]            ; get pointer to string
    lea     edx, [eax+3]            ; pointer+3 used in the end
    push    esi
    push    edi
    mov     edi, 80808080h

  @@:     
  REPEAT 3
    mov     esi, [eax]              ; read first 4 bytes
    add     eax, 4                  ; increment pointer
    lea     ecx, [esi-01010101h]    ; subtract 1 from each byte
    not     esi                     ; invert all bytes
    and     ecx, esi                ; and these two
    and     ecx, edi
    jnz     nxt
  ENDM

    mov     esi, [eax]              ; read first 4 bytes
    add     eax, 4                  ; 4 increment DWORD pointer
    lea     ecx, [esi-01010101h]    ; subtract 1 from each byte
    not     esi                     ; invert all bytes
    and     ecx, esi                ; and these two
    and     ecx, edi
    jz      @B                      ; no zero bytes, continue loop

  nxt:
    test    ecx, 00008080h          ; test first two bytes
    jnz     @F
    shr     ecx, 16                 ; not in the first 2 bytes
    add     eax, 2
  @@:
    shl     cl, 1                   ; use carry flag to avoid branch
    sbb     eax, edx                ; compute length

  ; --------------------------------------
  ; set up for back scanning end of string
  ; --------------------------------------
    mov esi, [esp+4][8]             ; lower: start address (8 bytes pushed since entry)
    lea edi, [esi+eax]              ; higher: address of the zero terminator

  bscn:
    cmp esi, edi                    ; don't scan below start address
    je bsout
    sub edi, 1
    cmp BYTE PTR [edi], 32          ; test if ascii 32 or less
    jbe bscn                        ; loop back if unwanted character

    add edi, 1                      ; step past the last wanted character

  bsout:
    mov BYTE PTR [edi], 0           ; terminate string
    mov eax, esi                    ; copy start address into EAX

    pop     edi
    pop     esi

    ret 4                           ; BYE !

rtrim3 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start

nidud

#22
deleted

dedndave

prescott w/htt @ 3.0 GHz
Allocating 589824 megabytes for testing
If it crashes due to lack of memory, reduce the allocation
size in the 'lpcount' equate above
188 milliseconds rtrim1
172 milliseconds rtrim2
297 milliseconds rtrim3
172 milliseconds rtrim1
188 milliseconds rtrim2
297 milliseconds rtrim3
188 milliseconds rtrim1
187 milliseconds rtrim2
297 milliseconds rtrim3
188 milliseconds rtrim1
188 milliseconds rtrim2
297 milliseconds rtrim3

dedndave

prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
------------------------------------------------------
31757   cycles - 0: standard (scasb)
27877   cycles - 1: rtrim1
17383   cycles - 2: rtrim2
12989   cycles - 3: new strlen (AgnerFog)
18351   cycles - 4: rtrimUltra (Masm32 len)
16126   cycles - 5: rtrim3
13127   cycles - 6: repeat 4 (AgnerFog)
13788   cycles - 7: rtrimUltra (MasmBasic Len) - SSE
8063    cycles - 8: strtrim unaligned - SSE

31439   cycles - 0: standard (scasb)
27853   cycles - 1: rtrim1
18167   cycles - 2: rtrim2
12934   cycles - 3: new strlen (AgnerFog)
18120   cycles - 4: rtrimUltra (Masm32 len)
15627   cycles - 5: rtrim3
13031   cycles - 6: repeat 4 (AgnerFog)
13014   cycles - 7: rtrimUltra (MasmBasic Len) - SSE
8047    cycles - 8: strtrim unaligned - SSE

33208   cycles - 0: standard (scasb)
27703   cycles - 1: rtrim1
17897   cycles - 2: rtrim2
15139   cycles - 3: new strlen (AgnerFog)
18173   cycles - 4: rtrimUltra (Masm32 len)
15778   cycles - 5: rtrim3
13135   cycles - 6: repeat 4 (AgnerFog)
10862   cycles - 7: rtrimUltra (MasmBasic Len) - SSE
10029   cycles - 8: strtrim unaligned - SSE

jj2007

Quote from: nidud on June 15, 2014, 09:24:15 PM
Some new results including SSE functions.

Nice work.
It seems your SSE2 version is a bit like Len() without the overhead.

But Hutch has created a testbed that identifies other champions, and I still have not fully understood why. The cache is of no help, of course, with hundreds of megabytes to process. And there are a lot of trailing spaces (which makes the testbed a bit unrealistic), but that still doesn't explain why the one-pass approach should be faster than the len-plus-search-backwards strategy.

hutch--

I did the data this way for a reason: small to large runs of trailing spaces on a reasonable length string (48 bytes) to test the linear speed of the forward scan, something that should suit a DWORD scanner over a BYTE scanner, although the DWORD scanner then takes extra time to back scan to the last non-blank character position. The very large sample was to get a long enough duration. I used to test on 100 MB, but with faster processors it's too small.

Testing on a very large linear source defeats loop and cache effects and you are left with processing time on always new data, not looped in cache data.

nidud

#27
deleted

Gunther

Results strtrim3:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
------------------------------------------------------
13273   cycles - 0: standard (scasb)
12690   cycles - 1: rtrim1
10177   cycles - 2: rtrim2
4261    cycles - 3: new strlen (AgnerFog)
5656    cycles - 4: rtrimUltra (Masm32 len)
3653    cycles - 5: rtrim3
3450    cycles - 6: repeat 4 (AgnerFog)
2589    cycles - 7: rtrimUltra (MasmBasic Len) - SSE
1501    cycles - 8: strtrim unaligned - SSE

13906   cycles - 0: standard (scasb)
32205   cycles - 1: rtrim1
24778   cycles - 2: rtrim2
11307   cycles - 3: new strlen (AgnerFog)
13530   cycles - 4: rtrimUltra (Masm32 len)
8814    cycles - 5: rtrim3
8101    cycles - 6: repeat 4 (AgnerFog)
5996    cycles - 7: rtrimUltra (MasmBasic Len) - SSE
4181    cycles - 8: strtrim unaligned - SSE

32142   cycles - 0: standard (scasb)
30590   cycles - 1: rtrim1
24709   cycles - 2: rtrim2
10966   cycles - 3: new strlen (AgnerFog)
13328   cycles - 4: rtrimUltra (Masm32 len)
8860    cycles - 5: rtrim3
8013    cycles - 6: repeat 4 (AgnerFog)
5974    cycles - 7: rtrimUltra (MasmBasic Len) - SSE
3501    cycles - 8: strtrim unaligned - SSE

--- ok ---



hutch--

I suspected there was a stuff-up in the benchmark I had used because it was not responding to code changes, and I found the problem: I was not resetting EDI before each read block that performs the rtrim function. The results are very different and now make sense. Once I got it working correctly I unrolled the 3rd algo (Agner Fog's StrLen algo with back scanner) by a factor of 8, which is overdoing it for an algo of this type, but on this old quad it's about 5 times faster than the one pass scanners. I also misaligned the data by 1 byte to reduce the effectiveness of aligned reads but it made no difference in timings.
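
The shape of the corrected harness, with the cursor reset between the write pass and the timed pass, can be modelled in C as below. This is a simplified sketch, not the MASM32 test piece: it uses one contiguous block of 48-byte slots instead of a pointer array, `clock()` instead of GetTickCount, and the names `fill_slot`, `rtrim_ref` and `bench` are made up.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

enum { SLOT = 48 };                      /* bytes per string, terminator included */

static const char text[] = "this is a test of rtrim algos";

/* Write one 47-character test string: patterns 0..2 place the text after
   0, 7 or 14 leading spaces, pattern 3 is all spaces. */
static void fill_slot(char *slot, int pattern)
{
    memset(slot, ' ', SLOT - 1);
    slot[SLOT - 1] = '\0';
    if (pattern < 3)
        memcpy(slot + 7 * pattern, text, sizeof text - 1);
}

/* Stand-in for the routine under test. */
char *rtrim_ref(char *s)
{
    size_t n = strlen(s);
    while (n > 0 && (unsigned char)s[n - 1] <= 32)
        n--;
    s[n] = '\0';
    return s;
}

/* Fill `count` slots, then time one trim per slot. Returns seconds. */
double bench(char *block, size_t count, char *(*trim)(char *))
{
    char *cur = block;                   /* cursor for the write pass */
    for (size_t i = 0; i < count; i++, cur += SLOT)
        fill_slot(cur, (int)(i % 4));

    cur = block;                         /* reset: otherwise the timed loop starts
                                            where the write pass ended */
    clock_t t0 = clock();
    for (size_t i = 0; i < count; i++, cur += SLOT)
        trim(cur);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Passing `block + 1` of a buffer allocated one byte larger gives the one-byte misalignment described above.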


391 milliseconds rtrim1
422 milliseconds rtrim2
78 milliseconds rtrim3
375 milliseconds rtrim1
406 milliseconds rtrim2
78 milliseconds rtrim3
390 milliseconds rtrim1
406 milliseconds rtrim2
78 milliseconds rtrim3
390 milliseconds rtrim1
406 milliseconds rtrim2
78 milliseconds rtrim3
Press any key to continue ...


The modified benchmark is as follows.


IF 0  ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
                      Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include\masm32rt.inc

    rtrim1 PROTO :DWORD         ; first test algo
    rtrim2 PROTO :DWORD         ; second test algo
    rtrim3 PROTO :DWORD         ; third test algo

    .data
    align 4
    db 0
      item1 db "this is a test of rtrim algos                  ",0  ; 48 bytes total

    align 4
    db 0
      item2 db "       this is a test of rtrim algos           ",0  ; 48 bytes total

    align 4
    db 0
      item3 db "              this is a test of rtrim algos    ",0  ; 48 bytes total

    align 4
    db 0
      item4 db "                                               ",0  ; 48 bytes total

    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    lpcount equ <1024*1024*4>

main proc

    LOCAL txt   :DWORD
    LOCAL hArr  :DWORD
    LOCAL pArr  :DWORD
    LOCAL aInd  :DWORD
    LOCAL lcnt  :DWORD
    LOCAL icnt  :DWORD

    push esi
    push edi

    invoke SetPriorityClass,rv(GetCurrentProcess),HIGH_PRIORITY_CLASS

  mov icnt, 4
  tlp:

  ; ******************************************
  ; write an array of strings
  ; ------------------------------------------
    mov eax, lpcount
    shr eax, 2
    mov lcnt, eax

    mov pArr, rv(create_array,lpcount,48)       ; pointer array handle
    mov hArr, ecx                               ; main array memory handle
    mov edi, pArr

  larr1:
    cst [edi],    OFFSET item1
    cst [edi+4],  OFFSET item2
    cst [edi+8],  OFFSET item3
    cst [edi+12], OFFSET item4
    add edi, 16                                 ; advance past the four pointers just filled
    sub lcnt, 1
    jnz larr1
  ; ------------------------------------------

    invoke GetTickCount
    push eax

    mov eax, lpcount
    shr eax, 2
    mov lcnt, eax

    mov edi, pArr

  tarr1:
    mov txt, rv(rtrim1,[edi])
    mov txt, rv(rtrim1,[edi+4])
    mov txt, rv(rtrim1,[edi+8])
    mov txt, rv(rtrim1,[edi+12])
    add edi, 16                                 ; advance past the four pointers just trimmed
    sub lcnt, 1
    jnz tarr1

    free pArr
    free hArr

    invoke GetTickCount
    pop ecx
    sub eax, ecx

    print ustr$(eax)," milliseconds rtrim1",13,10

  ; ******************************************
  ; write an array of strings
  ; ------------------------------------------
    mov eax, lpcount
    shr eax, 2
    mov lcnt, eax

    mov pArr, rv(create_array,lpcount,48)       ; pointer array handle
    mov hArr, ecx                               ; main array memory handle
    mov edi, pArr

  larr2:
    cst [edi],    OFFSET item1
    cst [edi+4],  OFFSET item2
    cst [edi+8],  OFFSET item3
    cst [edi+12], OFFSET item4
    add edi, 16                                 ; advance past the four pointers just filled
    sub lcnt, 1
    jnz larr2
  ; ------------------------------------------

    invoke GetTickCount
    push eax

    mov eax, lpcount
    shr eax, 2
    mov lcnt, eax

    mov edi, pArr

  tarr2:
    mov txt, rv(rtrim2,[edi])
    mov txt, rv(rtrim2,[edi+4])
    mov txt, rv(rtrim2,[edi+8])
    mov txt, rv(rtrim2,[edi+12])
    add edi, 16                                 ; advance past the four pointers just trimmed
    sub lcnt, 1
    jnz tarr2

    free pArr
    free hArr

    invoke GetTickCount
    pop ecx
    sub eax, ecx

    print ustr$(eax)," milliseconds rtrim2",13,10

  ; ------------------------------------------
    mov eax, lpcount
    shr eax, 2
    mov lcnt, eax

    mov pArr, rv(create_array,lpcount,48)       ; pointer array handle
    mov hArr, ecx                               ; main array memory handle
    mov edi, pArr

  larr3:
    cst [edi],    OFFSET item1
    cst [edi+4],  OFFSET item2
    cst [edi+8],  OFFSET item3
    cst [edi+12], OFFSET item4
    add edi, 16                                 ; advance past the four pointers just filled
    sub lcnt, 1
    jnz larr3

  ; ------------------------------------------

    invoke GetTickCount
    push eax

    mov eax, lpcount
    shr eax, 2
    mov lcnt, eax

    mov edi, pArr

  tarr3:
    mov txt, rv(rtrim3,[edi])
    mov txt, rv(rtrim3,[edi+4])
    mov txt, rv(rtrim3,[edi+8])
    mov txt, rv(rtrim3,[edi+12])
    add edi, 16                                 ; advance past the four pointers just trimmed
    sub lcnt, 1
    jnz tarr3

    free pArr
    free hArr

    invoke GetTickCount
    pop ecx
    sub eax, ecx

    print ustr$(eax)," milliseconds rtrim3",13,10

  ; ******************************************

    sub icnt, 1
    jnz tlp

    invoke SetPriorityClass,rv(GetCurrentProcess),NORMAL_PRIORITY_CLASS

    pop edi
    pop esi

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

rtrim1 proc ptxt:DWORD

  ; ------------------------------------------------
  ; one pass left to right to determine last valid
  ; character then terminate it after that character
  ; ------------------------------------------------

    mov edx, [esp+4]                ; load address into EDX
    lea ecx, [edx-1]                ; ECX = start - 1 until a valid character is found
    sub edx, 1

  lpst:
    add edx, 1
    movzx eax, BYTE PTR [edx]       ; zero extend byte into EAX
    test eax, eax                   ; test for zero terminator
    jz lpout                        ; exit loop on zero
    cmp eax, 32                     ; test if space character or lower
    jbe lpst                        ; jump back if below or equal
    mov ecx, edx                    ; store updated last character location in ECX
    jmp lpst                        ; jump back to loop start

  lpout:
    mov BYTE PTR [ecx+1], 0         ; write the terminator after the last valid character,
                                    ; or at the start address if none was found

    mov eax, [esp+4]                ; return the original string address in EAX
    ret 4                           ; BYE !

rtrim1 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

rtrim2 proc ptxt:DWORD

  ; ------------------------------------------------
  ; one pass left to right to determine last valid
  ; character then terminate it after that character
  ; ------------------------------------------------
    mov edx, [esp+4]                ; load address into EDX
    lea ecx, [edx-1]                ; ECX = start - 1 until a valid character is found
    jmp ji

  pre:
    mov ecx, edx                    ; store updated last character location in ECX
  lpst:
    add edx, 1
  ji:
    movzx eax, BYTE PTR [edx]       ; zero extend byte into EAX
    cmp eax, 32                     ; test if space character or lower
    ja pre
    test eax, eax                   ; test for zero terminator
    jnz lpst                        ; fall through on zero

  lpout:
    mov BYTE PTR [ecx+1], 0         ; write the terminator after the last valid character,
                                    ; or at the start address if none was found

    mov eax, [esp+4]                ; return the original string address in EAX
    ret 4                           ; BYE !

rtrim2 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

rtrim3 proc ptxt:DWORD

    mov     eax, [esp+4]            ; get pointer to string
    lea     edx, [eax+3]            ; pointer+3 used in the end
    push    esi
    push    edi
    mov     edi, 80808080h

  @@:     
  REPEAT 7
    mov     esi, [eax]              ; read first 4 bytes
    add     eax, 4                  ; increment pointer
    lea     ecx, [esi-01010101h]    ; subtract 1 from each byte
    not     esi                     ; invert all bytes
    and     ecx, esi                ; and these two
    and     ecx, edi
    jnz     nxt
  ENDM

    mov     esi, [eax]              ; read first 4 bytes
    add     eax, 4                  ; 4 increment DWORD pointer
    lea     ecx, [esi-01010101h]    ; subtract 1 from each byte
    not     esi                     ; invert all bytes
    and     ecx, esi                ; and these two
    and     ecx, edi
    jz      @B                      ; no zero bytes, continue loop

  nxt:
    test    ecx, 00008080h          ; test first two bytes
    jnz     @F
    shr     ecx, 16                 ; not in the first 2 bytes
    add     eax, 2
  @@:
    shl     cl, 1                   ; use carry flag to avoid branch
    sbb     eax, edx                ; compute length

  ; --------------------------------------
  ; set up for back scanning end of string
  ; --------------------------------------
    mov esi, [esp+4][8]             ; lower: start address (8 bytes pushed since entry)
    lea edi, [esi+eax]              ; higher: address of the zero terminator

  bscn:
  REPEAT 7
    cmp esi, edi                    ; don't scan below start address
    je bsout
    sub edi, 1
    cmp BYTE PTR [edi], 32          ; test if ascii 32 or less
    ja bnxt                         ; leave the scan at the first wanted character
  ENDM

    cmp esi, edi                    ; don't scan below start address
    je bsout
    sub edi, 1
    cmp BYTE PTR [edi], 32          ; test if ascii 32 or less
    jbe bscn                        ; loop back if unwanted character

  bnxt:
    add edi, 1                      ; step past the last wanted character

  bsout:
    mov BYTE PTR [edi], 0           ; terminate string
    mov eax, esi                    ; copy start address into EAX

    pop     edi
    pop     esi

    ret 4                           ; BYE !

rtrim3 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start