News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests

Main Menu

__int64 multiplication with overflow check

Started by weltling, October 16, 2014, 06:25:08 AM

Previous topic - Next topic

weltling

Hi all,

I'm writing a routine to multiply signed 64 bit integers with catching overflow. Here's my code




.code
        mult proc

                ; rcx == __int64 a
                ; rdx == __int64 b
                ; r8  == __int64 *lret
                ; r9 == double *dret

xorpd xmm0,xmm0
xorpd xmm1,xmm1
xor             rax,rax

                cvtsi2sd        xmm0,rcx
                cvtsi2sd        xmm1,rdx
                mulsd           xmm0,xmm1

neg_chk:
xorpd xmm1,xmm1
xor             r10,r10
                cvtsi2sd        xmm1,r10
                comisd          xmm0,xmm1
                jb              neg_int
jae pos_int

pos_int:
xorpd xmm1,xmm1
mov             r10,9223372036854775807
                cvtsi2sd        xmm1,r10
                comisd          xmm0,xmm1
jbe dbl_as_int
ja as_dbl

neg_int:
xorpd xmm1,xmm1
mov            r10,-9223372036854775808
                cvtsi2sd       xmm1,r10
                comisd         xmm0,xmm1
jae dbl_as_int
jb as_dbl


as_dbl:
movsd           qword ptr[r9],xmm0
                mov             rax,1

                jmp             done

dbl_as_int:
                cvttsd2si       r10,xmm0
                mov             qword ptr[r8],r10
xor             rax,rax
jmp done

done:           
ret
        mult endp
end




The corresponding signature I use in C

__int8 mult(__int64 a, __int64 b, __int64 *lval, double *dval);

This code works fine until it meets values a=2147452546 b=2147514748, in that case it delivers 4611686013165148160 instead of 4611686013165148408 . I'm just wondering what could possibly be going wrong? This seems to be an issue with the 128 bit arithmetics or conversion to 64 bit. Say, writing


push rdx
imul rdx,rcx

delivers correct. So wondering, how such an error can remotely happen with values of even bigger bitness? And, maybe there's a way to check this kind of thing through flags?

Thanks.

qWord

You are converting 64 bit integers to floating point value with 53 precision bits thus rounding can occur.
Test the overflow- or carry flag (after IMUL) to detect if the multiplication overflows.
MREAL macros - when you need floating point arithmetic while assembling!

weltling

Ah, thanks for the explanation. From the description - CVTSI2SD and others looked more like they really convert, not cast.

My first variant was exactly what you say - IMUL, then JO to another label redoing with double. However I was looking for a faster solution which a coprosessor might bring. Currently any variant I write is slower than a C program optimized with PGO, and that's somehow inciting  :)

Thanks.

qWord

Quote from: leta on October 16, 2014, 07:17:31 AM
Ah, thanks for the explanation. From the description - CVTSI2SD and others looked more like they really convert, not cast.
sorry, my above stamen was maybe a bit misleading/incomplete: the problem is that you are using 53 bit precision arithmetic, while (at least) 63 bit precision would be required.

Quote from: leta on October 16, 2014, 07:17:31 AMHowever I was looking for a faster solution
if inline-assembler is possible with your compiler, inline that few instructions (IMUL+JO+...).
MREAL macros - when you need floating point arithmetic while assembling!

weltling

Quote from: qWord on October 16, 2014, 07:28:57 AM
Quote from: leta on October 16, 2014, 07:17:31 AM
Ah, thanks for the explanation. From the description - CVTSI2SD and others looked more like they really convert, not cast.
sorry, my above stamen was maybe a bit misleading/incomplete: the problem is that you are using 53 bit precision arithmetic, while (at least) 63 bit precision would be required.

Quote from: leta on October 16, 2014, 07:17:31 AMHowever I was looking for a faster solution
if inline-assembler is possible with your compiler, inline that few instructions (IMUL+JO+...).

Do you mean using something like __mul128() intrinsic? That would probably solve that with integers only, but in the overflow case a repeated multiplication would be needed to convert to double.

I use Visual Studio which doesn't support inline ASM in 64 bit mode, sadly. It has that great PGO (profile guided optimization) feature to train binaries for certain use cases. Using it to benchmark implementations I do for ml64. Looking at the disassembly, it's horrible, uses a lot of stack whereby registers are free. But it's faster in this case, especially in edge cases. Maybe that's one of those rare cases where a C compiler does a better job, lets see :)

qWord

maybe the following function does it for you:
mult proc
       
        mov eax,1
        cvtsi2sd xmm0,rcx
        cvtsi2sd xmm1,rdx
        imul rcx,rdx
        mulsd xmm0,xmm1
        mov [r8],rcx
        movsd REAL8 ptr [r9],xmm0
        jno @NOF
        xor eax,eax ; integer overflow -> rax = 0
@NOF:   ret
       
mult endp


EDIT: new proc
MREAL macros - when you need floating point arithmetic while assembling!

weltling

Yeah, this works! And this is much more elegant than things I do :) I just to had to revrse the return value (expecting 1 on overflow).

While testing on this, I've realized some strange behaviors. Namely, of course I randomize the input, but when measuring two calls one right after the other, the result is almost neglecting, 1 tick . Whereby when I insert a sleep of 1ms between the calls - the ASM implementation seems faster for about 10 ticks. This makes me harder to trust the synthetic test less in this case. Actually doing always by a scenario to make a test not predictable was working well (make it depend on input, on some IO, etc.). Is there some specific I'm not aware of for ASM? Here's my test code https://github.com/weltling/ml64_exercise/blob/master/int64_overflow/mul_bench.c

Another question, about your implementation - do I get it right that imul and mulsd will be executing possibly in parallel?

Thanks.

Gunther

Hi weltling,

Quote from: weltling on October 16, 2014, 09:53:23 PM
Another question, about your implementation - do I get it right that imul and mulsd will be executing possibly in parallel?

I think that these instructions are not pairable. You can check it here. And welcome to the forum.

Gunther
You have to know the facts before you can distort them.

qWord

Quote from: weltling on October 16, 2014, 09:53:23 PMThis makes me harder to trust the synthetic test less in this case. Actually doing always by a scenario to make a test not predictable was working well (make it depend on input, on some IO, etc.). Is there some specific I'm not aware of for ASM?
Of course,  timing the code as part of an real algorithm with (in best case) real data is more meaningful.
If you want to measure just the function, do it in Assembler using the RDTSC (read time stamp counter) instruction. Two macros to generate a test loop:

; x64-Version of MichaelW's macros
counter_begin MACRO loopcount:REQ
LOCAL label

IFNDEF tmcb__nLoops
.data
align 16
tmcb__nLoops dd 0
tmcb__cntr dd 0
tmcb__qw dq 2 dup (?)
.code
ENDIF

mov tmcb__nLoops,loopcount
xor rax,rax
cpuid
rdtsc

mov DWORD ptr tmcb__qw[0],eax
mov DWORD ptr tmcb__qw[4],edx
mov tmcb__cntr, loopcount
xor rax,rax
cpuid
align 16
@@:
sub tmcb__cntr,1
jnz @B

xor rax,rax
cpuid
rdtsc
shl rdx,32
or rax,rdx
sub rax,tmcb__qw[0]
mov tmcb__qw[0],rax

xor rax, rax
cpuid
rdtsc
mov tmcb__cntr,loopcount
mov DWORD ptr tmcb__qw[8],eax
mov DWORD ptr tmcb__qw[12],edx
xor rax,rax
cpuid
align 16
label:
tmcb__label equ <label>
ENDM

counter_end MACRO
sub tmcb__cntr,1
jnz tmcb__label

xor rax,rax
cpuid
rdtsc
shl rdx,32
or rax,rdx
sub rax,tmcb__qw[0]
sub rax,tmcb__qw[8]
mov tmcb__qw[0],rax

finit
fild tmcb__qw[0]
fild tmcb__nLoops
fdiv
fistp tmcb__qw[0]

mov rax,tmcb__qw[0]
ENDM

; usage:
counter_begin 10000 ; <- iteration count
    ; one or more calls to your function
counter_end
; rax = average tick count




Quote from: weltling on October 16, 2014, 09:53:23 PM
do I get it right that imul and mulsd will be executing possibly in parallel?
yes.
MREAL macros - when you need floating point arithmetic while assembling!

Gunther

Hi weltling,

Quote from: qWord on October 17, 2014, 12:22:56 AM
Quote from: weltling on October 16, 2014, 09:53:23 PM
do I get it right that imul and mulsd will be executing possibly in parallel?
yes.

I should read your post more precise. QWord is right: IMUL (integer unit) and MULSD (floating point unit) can execute in parallel (concurrent processing). But IMUL execute only in the integer U-pipe.

Gunther
You have to know the facts before you can distort them.

weltling

Many thanks for the tips and the sample implementation, learned a lot from that already. I'd be probably better working on integrating this into the real app where it's actually dedicated. It has numerous test cases, so will be easier to test both perf and functional parts. Will be reporting the experience then.