__int64 multiplication with overflow check

leta · October 16, 2014, 06:25:08 AM

Hi all,

I'm writing a routine to multiply signed 64 bit integers with catching overflow. Here's my code




.code
        mult proc

                ; rcx == __int64 a
                ; rdx == __int64 b
                ; r8  == __int64 *lret
                ; r9 == double *dret

		xorpd		xmm0,xmm0
		xorpd		xmm1,xmm1
		xor             rax,rax

                cvtsi2sd        xmm0,rcx
                cvtsi2sd        xmm1,rdx
                mulsd           xmm0,xmm1

neg_chk:
		xorpd		xmm1,xmm1
		xor             r10,r10
                cvtsi2sd        xmm1,r10
                comisd          xmm0,xmm1
                jb              neg_int
		jae		pos_int

pos_int:
		xorpd		xmm1,xmm1
		mov             r10,9223372036854775807
                cvtsi2sd        xmm1,r10
                comisd          xmm0,xmm1
		jbe		dbl_as_int
		ja		as_dbl

neg_int:
		xorpd		xmm1,xmm1
		mov            r10,-9223372036854775808
                cvtsi2sd       xmm1,r10
                comisd         xmm0,xmm1
		jae		dbl_as_int
		jb		as_dbl


as_dbl:
		movsd           qword ptr[r9],xmm0
                mov             rax,1

                jmp             done

dbl_as_int:
                cvttsd2si       r10,xmm0
                mov             qword ptr[r8],r10
		xor             rax,rax
		jmp		done

done:           
		ret
        mult endp
end

The corresponding signature I use in C

__int8 mult(__int64 a, __int64 b, __int64 *lval, double *dval);

This code works fine until it meets values a=2147452546 b=2147514748, in that case it delivers 4611686013165148160 instead of 4611686013165148408 . I'm just wondering what could possibly be going wrong? This seems to be an issue with the 128 bit arithmetics or conversion to 64 bit. Say, writing

Code Select


push rdx
imul rdx,rcx

delivers correct. So wondering, how such an error can remotely happen with values of even bigger bitness? And, maybe there's a way to check this kind of thing through flags?

Thanks.

qWord · October 16, 2014, 06:38:09 AM

You are converting 64 bit integers to floating point value with 53 precision bits thus rounding can occur.
Test the overflow- or carry flag (after IMUL) to detect if the multiplication overflows.

leta · October 16, 2014, 07:17:31 AM

Ah, thanks for the explanation. From the description - CVTSI2SD and others looked more like they really convert, not cast.

My first variant was exactly what you say - IMUL, then JO to another label redoing with double. However I was looking for a faster solution which a coprosessor might bring. Currently any variant I write is slower than a C program optimized with PGO, and that's somehow inciting :)

Thanks.

qWord · October 16, 2014, 07:28:57 AM

Quote from: leta on October 16, 2014, 07:17:31 AM
Ah, thanks for the explanation. From the description - CVTSI2SD and others looked more like they really convert, not cast.

sorry, my above stamen was maybe a bit misleading/incomplete: the problem is that you are using 53 bit precision arithmetic, while (at least) 63 bit precision would be required.

Quote from: leta on October 16, 2014, 07:17:31 AMHowever I was looking for a faster solution

if inline-assembler is possible with your compiler, inline that few instructions (IMUL+JO+...).

leta · October 16, 2014, 08:03:12 AM

Quote from: qWord on October 16, 2014, 07:28:57 AM
Quote from: leta on October 16, 2014, 07:17:31 AM
Ah, thanks for the explanation. From the description - CVTSI2SD and others looked more like they really convert, not cast.
sorry, my above stamen was maybe a bit misleading/incomplete: the problem is that you are using 53 bit precision arithmetic, while (at least) 63 bit precision would be required.

Quote from: leta on October 16, 2014, 07:17:31 AMHowever I was looking for a faster solution
if inline-assembler is possible with your compiler, inline that few instructions (IMUL+JO+...).

Do you mean using something like __mul128() intrinsic? That would probably solve that with integers only, but in the overflow case a repeated multiplication would be needed to convert to double.

I use Visual Studio which doesn't support inline ASM in 64 bit mode, sadly. It has that great PGO (profile guided optimization) feature to train binaries for certain use cases. Using it to benchmark implementations I do for ml64. Looking at the disassembly, it's horrible, uses a lot of stack whereby registers are free. But it's faster in this case, especially in edge cases. Maybe that's one of those rare cases where a C compiler does a better job, lets see :)

qWord · October 16, 2014, 06:40:29 PM

maybe the following function does it for you:

Code Select

mult proc
        
        mov eax,1
        cvtsi2sd xmm0,rcx
        cvtsi2sd xmm1,rdx
        imul rcx,rdx
        mulsd xmm0,xmm1
        mov [r8],rcx
        movsd REAL8 ptr [r9],xmm0
        jno @NOF
        xor eax,eax			; integer overflow -> rax = 0
@NOF:   ret
        
mult endp

EDIT: new proc

leta · October 16, 2014, 09:53:23 PM

Yeah, this works! And this is much more elegant than things I do :) I just to had to revrse the return value (expecting 1 on overflow).

While testing on this, I've realized some strange behaviors. Namely, of course I randomize the input, but when measuring two calls one right after the other, the result is almost neglecting, 1 tick . Whereby when I insert a sleep of 1ms between the calls - the ASM implementation seems faster for about 10 ticks. This makes me harder to trust the synthetic test less in this case. Actually doing always by a scenario to make a test not predictable was working well (make it depend on input, on some IO, etc.). Is there some specific I'm not aware of for ASM? Here's my test code https://github.com/weltling/ml64_exercise/blob/master/int64_overflow/mul_bench.c

Another question, about your implementation - do I get it right that imul and mulsd will be executing possibly in parallel?

Thanks.

Gunther · October 17, 2014, 12:00:08 AM

Hi weltling,

Quote from: weltling on October 16, 2014, 09:53:23 PM
Another question, about your implementation - do I get it right that imul and mulsd will be executing possibly in parallel?

I think that these instructions are not pairable. You can check it here. And welcome to the forum.

Gunther

qWord · October 17, 2014, 12:22:56 AM

Quote from: weltling on October 16, 2014, 09:53:23 PMThis makes me harder to trust the synthetic test less in this case. Actually doing always by a scenario to make a test not predictable was working well (make it depend on input, on some IO, etc.). Is there some specific I'm not aware of for ASM?

Of course, timing the code as part of an real algorithm with (in best case) real data is more meaningful.
If you want to measure just the function, do it in Assembler using the RDTSC (read time stamp counter) instruction. Two macros to generate a test loop:

Code Select


; x64-Version of MichaelW's macros
counter_begin MACRO loopcount:REQ
LOCAL label

	IFNDEF tmcb__nLoops
		.data
			align 16
			tmcb__nLoops	dd 0
			tmcb__cntr		dd 0
			tmcb__qw		dq 2 dup (?)
		.code
	ENDIF

	mov tmcb__nLoops,loopcount
	xor rax,rax
	cpuid
	rdtsc

	mov DWORD ptr tmcb__qw[0],eax
	mov DWORD ptr tmcb__qw[4],edx
	mov tmcb__cntr, loopcount
	xor rax,rax
	cpuid
	align 16
@@:
	sub tmcb__cntr,1
	jnz @B

	xor rax,rax
	cpuid
	rdtsc
	shl rdx,32
	or rax,rdx
	sub rax,tmcb__qw[0]
	mov tmcb__qw[0],rax

	xor rax, rax
	cpuid
	rdtsc
	mov tmcb__cntr,loopcount
	mov DWORD ptr tmcb__qw[8],eax
	mov DWORD ptr tmcb__qw[12],edx
	xor rax,rax
	cpuid
	align 16
label:
	tmcb__label equ <label>
ENDM

counter_end MACRO
	sub tmcb__cntr,1
	jnz tmcb__label

	xor rax,rax
	cpuid
	rdtsc
	shl rdx,32
	or rax,rdx
	sub rax,tmcb__qw[0]
	sub rax,tmcb__qw[8]
	mov tmcb__qw[0],rax

	finit
	fild tmcb__qw[0]
	fild tmcb__nLoops
	fdiv
	fistp tmcb__qw[0]

	mov rax,tmcb__qw[0]
ENDM

; usage:
counter_begin 10000 ; <- iteration count
    ; one or more calls to your function
counter_end
; rax = average tick count

Quote from: weltling on October 16, 2014, 09:53:23 PM
do I get it right that imul and mulsd will be executing possibly in parallel?

yes.

Gunther · October 17, 2014, 03:41:11 AM

Hi weltling,

Quote from: qWord on October 17, 2014, 12:22:56 AM
Quote from: weltling on October 16, 2014, 09:53:23 PM
do I get it right that imul and mulsd will be executing possibly in parallel?
yes.

I should read your post more precise. QWord is right: IMUL (integer unit) and MULSD (floating point unit) can execute in parallel (concurrent processing). But IMUL execute only in the integer U-pipe.

Gunther

leta · October 17, 2014, 06:20:24 AM

Many thanks for the tips and the sample implementation, learned a lot from that already. I'd be probably better working on integrating this into the real app where it's actually dedicated. It has numerous test cases, so will be easier to test both perf and functional parts. Will be reporting the experience then.

The MASM Forum

News:

__int64 multiplication with overflow check

leta

qWord

leta

qWord

leta

qWord

leta

Gunther

qWord

Gunther

leta