#### dedndave

• Member
• Posts: 8829
• Still using Abacus 2.0
« Reply #45 on: September 27, 2015, 03:07:05 AM »
i highly recommend Ray's tutorial (and his libraries, too)

http://www.ray.masmcode.com/

#### 0000

• Member
• Posts: 872
« Reply #46 on: September 27, 2015, 03:08:17 AM »
i highly recommend Ray's tutorial (and his libraries, too)

http://www.ray.masmcode.com/

I just posted above yours :P

#### rrr314159

• Member
• Posts: 1382
« Reply #47 on: September 27, 2015, 04:00:04 AM »
Quote from: dedndave
if you just want to add 2 strings together,  working directly on ascii may be ok

- I bet just adding 2 numbers (or even doing a few additions) can be done faster working directly with ascii, than any other way. You have to remember that conversion from ascii decimal to binary integer - and back again - takes time. Not so sure about subtraction. For mult and div, or many adds/subtracts in a row, undoubtedly better to convert to binary first.

Quote from: dedndave
if you want fast, learn to use the FPU and SSE floats

- for exact precision of very large numbers (like 10^256, or 10^2048, as used in RSA algo) it's not so clear that floating point is the right way? 80 bits only gets you to 10^24 so you still need to chain them together - tricky to get exact precision? Don't forget 64-bit integers (with 64-bit code) are very easy to use. I suppose somebody knows what's the best approach; my guess, 64-bit integer is better than FPU (or SSE) - for exact precision. AVX, not to mention AVX512, is another story

Quote from: zedd151
Was a fun little project, wasn't it? ...
Was a fun project though.  8)

- well, sure! ... also, as I say, if you only want one or a few additions ascii approach is probably the best
« Last Edit: September 27, 2015, 07:13:43 AM by rrr314159 »
I am NaN ;)

#### 0000

• Member
• Posts: 872
« Reply #48 on: September 27, 2015, 04:06:50 AM »
... also, as I say, if you only want one or a few additions ascii approach is probably the best

You just gave me an idea  :idea:

Suppose we have a variable number of additions to do. I am going to *try* to run a simple
test using 3, 4, 5 inputs, to determine how much longer each extra calculation takes.

During those tests, for simplicity all the input strings will have the same length.
Then again, they don't necessarily have to be the same length....

thinking....................

#### 0000

• Member
• Posts: 872
« Reply #49 on: September 27, 2015, 09:08:26 AM »
Tried a minor rewrite, attempting to free up ebp from usage.

Code: [Select]
`    OPTION PROLOGUE:NONE    OPTION EPILOGUE:NONE    align 16    nops 12    asciiadderNA proc src1:dword, src2:dword, dst:dword        push ebx        push esi        push edi        mov esi, [esp+10h]        mov ebx, [esp+14h]        invoke StrLen, esi        push eax        invoke StrLen, ebx        mov edx, eax        pop ecx        mov edi, [esp+18h]        add edi, ecx        add edi, 1    @@:        xor eax, eax        mov al, byte ptr [esi+ecx]        add al, byte ptr [ebx+edx]        AAA        add al, 30h        add byte ptr [edi], al        mov byte ptr [edi-1], ah        dec edi        dec edx        dec ecx        cmp ecx, -1        jnz @b        dec edi        add word ptr [edi], 3030h        pop edi        pop esi        pop ebx        ret 12    asciiadderNA endp    OPTION PROLOGUE:PrologueDef    OPTION EPILOGUE:EpilogueDef`
Results on m,y machine were about ~20 cycles slower. I even used the unrolled StrLen proc,

and took a shorcut (necessitating the input strings be the same length)

It was a half-assed try any way.

I have another idea to do this two bytes at a time, no shr or ror needed....
lemme try it out.
I still want to exclude ebp from casual usage.....

#### 0000

• Member
• Posts: 872
« Reply #50 on: September 27, 2015, 09:30:56 AM »

#### rrr314159

• Member
• Posts: 1382
« Reply #51 on: September 27, 2015, 09:45:23 AM »
I managed to squeeze another 17 cycles out of the fastest (so far) algo. Now ebp is left alone.

- I thought you were moving on to subtraction! Maybe I'll revisit the addition algo, see if I can squeeze out a few cycles.

- BTW, do u (and / or anyone else) agree that (as I think) for a few additions, this ascii-based way ought to be the fastest? If so it makes the exercise worthwhile; if not, not. For mult, div, any serious manipulations, clearly it's best to convert to binary and do all arithmetic there, then convert back. But there could be apps where just additions are desired. So do u (and / or anyone else) agree that "ascii arithmetic" is probably the fastest way to add two (or a few) large decimal ascii numbers?
I am NaN ;)

#### 0000

• Member
• Posts: 872
« Reply #52 on: September 27, 2015, 10:00:17 AM »
Ummmm...the project is on hold. I found a bug where if one string is much shorter, a sort of buffer under run, if that makes any sense.

Now back to the drawing board.

I know it was working 100% at one point
Code: [Select]
`1plus9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999equals10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000`
all good now. Unfortunately that is basically the old algo.

Somewhere in the optimized main loop of the 'fast' algo, there is some sort of bug....
.................................................
..................................... now to go bughunting.

That'll teach me to make a bunch of changes without testing the results.  ::)

The old algo is the only one that works with inputs with uneven lengths.
I have tried to coerce the new algo to work properly, but have failed.

The new algo only works if the input string lengths are equal. :(

My software does not have bugs, it just has features you need to understand...
....to use it properly.
« Last Edit: September 27, 2015, 12:02:58 PM by zedd151 »

#### 0000

• Member
• Posts: 872
« Reply #53 on: September 27, 2015, 01:22:18 PM »
Okay a fresh new look at this little program.

I removed ebp as a general purpose register. Also using StrLen, to obtain of course the length of the strings.

Code: [Select]
`    include \masm32\include\masm32rt.inc    asciiadder PROTO :DWORD, :DWORD, :DWORD    .data         ss1 db "1", 0         ss2 db "99999999999999999999999999999999999", 0         dd1 db 40h dup (0)    .code    start:        invoke asciiadder, addr ss2, addr ss1, addr dd1        invoke MessageBoxA, 0, addr dd1, 0, 0        invoke ExitProcess, 0    OPTION PROLOGUE:NONE    OPTION EPILOGUE:NONE    asciiadder proc src1:dword, src2:dword, dst:dword        push ebx        push esi        push edi        mov esi, [esp+10h]        mov ebx, [esp+14h]        mov edi, [esp+18h]        invoke StrLen, esi        dec eax        push eax        invoke StrLen, ebx        dec eax        mov edx, eax        pop ecx        cmp ecx, edx        jl edxgr        add edi, ecx        jmp setdst    edxgr:        add edi, edx    setdst:        inc edi        ; ---------------------- main loop    calc:        mov al, 0        cmp ecx, -1        js @f        mov al, byte ptr [esi+ecx]    @@:        cmp edx, -1        js @f        add al, byte ptr [ebx+edx]    @@:        sub al, 30h        cmp al, 0Ah        jl @f        sub al, 30h    @@:        cmp al, 0Ah        jl @f        sub al, 0Ah        inc byte ptr [edi-1]    @@:        add al, 30h        add al, byte ptr [edi]        cmp al, 39h        jng @f        sub al, 0Ah        inc byte ptr [edi-1]    @@:        mov byte ptr [edi], al        dec edi        dec edx        dec ecx        cmp edi, [esp+18h]        ja calc        ; ---------------------- end main loop        cmp byte ptr [edi], 30h        jg @f        add byte ptr [edi], 30h    @@:        pop edi        pop esi        pop ebx        ret 12    asciiadder endp    OPTION PROLOGUE:PrologueDef    OPTION EPILOGUE:EpilogueDef    end start`
Now I will once again concentrate on optimizing the main loop. Or at least try to minimize the number of jumps in this thing.
« Last Edit: July 20, 2018, 04:53:23 AM by zedd151 »

#### 0000

• Member
• Posts: 872
« Reply #54 on: September 27, 2015, 02:24:52 PM »
New Test Piece.

Easy to use fixture for testing the modifications I will be making while 'optimising' the AsciiAdder.
;)
« Last Edit: July 20, 2018, 04:52:47 AM by zedd151 »

#### rrr314159

• Member
• Posts: 1382
« Reply #55 on: September 27, 2015, 03:04:58 PM »
We both tried the approach of adding dwords at a time, and subtracting 30303030h 4 bytes at a time, etc. Makes sense but doesn't gain enough for the trouble; byte-by-byte is still faster. Obvious is to use SSE: 16 bytes at a time will probably be fastest way. If I get around to it ...
I am NaN ;)

#### 0000

• Member
• Posts: 872
« Reply #56 on: September 27, 2015, 03:18:54 PM »
I have reduced the register usage:

Code: [Select]
`    OPTION PROLOGUE:NONE    OPTION EPILOGUE:NONE    asciiadder proc src1:dword, src2:dword, dst:dword        push edi        mov edi, [esp+10h]        invoke StrLen, [esp+8]        push eax        invoke StrLen, [esp+10h]        mov edx, eax        pop ecx        cmp ecx, edx        jl edxgr        add edi, ecx        jmp setdst    edxgr:        add edi, edx    setdst:        add ecx, [esp+8]        add edx, [esp+0Ch]        dec ecx        dec edx        ; ---------------------- main loop            calc:        xor eax, eax        cmp ecx, [esp+8]        jl @f        add al, byte ptr [ecx]    @@:        cmp edx, [esp+0Ch]        jl @f        add al, byte ptr [edx]    @@:        sub al, 30h        cmp al, 0Ah        jl @f        sub al, 30h    @@:        cmp al, 0Ah        jl @f        sub al, 0Ah        inc byte ptr [edi-1]    @@:        add al, 30h        add al, byte ptr [edi]        cmp al, 39h        jng @f        sub al, 0Ah        inc byte ptr [edi-1]    @@:        mov byte ptr [edi], al        dec ecx        dec edx        dec edi        cmp edi, [esp+10h]        jg calc        inc edi        ; ---------------------- end main loop                cmp byte ptr [edi-1], 30h        jg @f        add byte ptr [edi-1], 30h    @@:        pop edi        ret 12    asciiadder endp    OPTION PROLOGUE:PrologueDef    OPTION EPILOGUE:EpilogueDef`
Now I think I can put AAA back in. using a different mechnism to detect the start of the uneven string. I had previously relied on checking the sign flag for the buffers' counter register.

But since the buffers no longer have counters associated, I am simply directly comparing with the src pointer for each buffer. It took a little while to get to this point.

Now I will go through the process again of trying to get it up to speed. (again)

#### 0000

• Member
• Posts: 872
« Reply #57 on: September 27, 2015, 03:53:23 PM »
Okay!! I have finally managed to both keep register usage to minimum, and also reinstate using the AAA instruction.  :icon_exclaim:

It took longer than I thought it should

Code: [Select]
`    OPTION PROLOGUE:NONE    OPTION EPILOGUE:NONE    asciiadder proc src1:dword, src2:dword, dst:dword        push edi        mov edi, [esp+10h]            invoke StrLen, [esp+8]        push eax            invoke StrLen, [esp+10h]        mov edx, eax        pop ecx        cmp ecx, edx        jl edxgr        add edi, ecx        jmp setdst    edxgr:        add edi, edx    setdst:        add ecx, [esp+8]        add edx, [esp+0Ch]        dec ecx        dec edx    calc:        xor eax, eax        mov al, byte ptr [edi]        cmp ecx, [esp+8]      jl @f        add al, byte ptr [ecx]      @@:        cmp edx, [esp+0Ch]      jl @f        add al, byte ptr [edx]      @@:        cmp al, 39h      jle @f        AAA                         ; AAA per suggestion by dedndave        add al, 30h      @@:        mov byte ptr [edi-1], ah    ; ah contains the carry byte        mov byte ptr [edi], al      ; al contains the sum        dec ecx        dec edx        dec edi        cmp edi, [esp+10h]        jg calc        inc edi        cmp byte ptr [edi-1], 30h      jg @f        add byte ptr [edi-1], 30h      @@:        pop edi        ret 12    asciiadder endp    OPTION PROLOGUE:PrologueDef    OPTION EPILOGUE:EpilogueDef`
Still don't like all the cmp's and jxx's in there...
« Last Edit: July 20, 2018, 04:51:31 AM by zedd151 »

#### 0000

• Member
• Posts: 872
« Reply #58 on: September 27, 2015, 04:05:47 PM »
Okay, here are the numbers so far..

Code: [Select]
` newstyle --- * with the lower register usage. and AAA33993344334533633350original old style - from the very first version, for comarison54985543551055135555`
« Last Edit: July 20, 2018, 04:50:59 AM by zedd151 »

#### dedndave

• Member
• Posts: 8829
• Still using Abacus 2.0
`newstyle --- *84928504850585048745original old style77838219785378737852`