Hi all,
I'm writing a routine to multiply signed 64 bit integers with catching overflow. Here's my code
.code
mult proc
; rcx == __int64 a
; rdx == __int64 b
; r8 == __int64 *lret
; r9 == double *dret
xorpd xmm0,xmm0
xorpd xmm1,xmm1
xor rax,rax
cvtsi2sd xmm0,rcx
cvtsi2sd xmm1,rdx
mulsd xmm0,xmm1
neg_chk:
xorpd xmm1,xmm1
xor r10,r10
cvtsi2sd xmm1,r10
comisd xmm0,xmm1
jb neg_int
jae pos_int
pos_int:
xorpd xmm1,xmm1
mov r10,9223372036854775807
cvtsi2sd xmm1,r10
comisd xmm0,xmm1
jbe dbl_as_int
ja as_dbl
neg_int:
xorpd xmm1,xmm1
mov r10,-9223372036854775808
cvtsi2sd xmm1,r10
comisd xmm0,xmm1
jae dbl_as_int
jb as_dbl
as_dbl:
movsd qword ptr[r9],xmm0
mov rax,1
jmp done
dbl_as_int:
cvttsd2si r10,xmm0
mov qword ptr[r8],r10
xor rax,rax
jmp done
done:
ret
mult endp
end
The corresponding signature I use in C
__int8 mult(__int64 a, __int64 b, __int64 *lval, double *dval);
This code works fine until it meets values a=2147452546 b=2147514748, in that case it delivers 4611686013165148160 instead of 4611686013165148408 . I'm just wondering what could possibly be going wrong? This seems to be an issue with the 128 bit arithmetics or conversion to 64 bit. Say, writing
push rdx
imul rdx,rcx
delivers correct. So wondering, how such an error can remotely happen with values of even bigger bitness? And, maybe there's a way to check this kind of thing through flags?
Thanks.
You are converting 64 bit integers to floating point value with 53 precision bits thus rounding can occur.
Test the overflow- or carry flag (after IMUL) to detect if the multiplication overflows.
Ah, thanks for the explanation. From the description - CVTSI2SD and others looked more like they really convert, not cast.
My first variant was exactly what you say - IMUL, then JO to another label redoing with double. However I was looking for a faster solution which a coprosessor might bring. Currently any variant I write is slower than a C program optimized with PGO, and that's somehow inciting :)
Thanks.
Quote from: leta on October 16, 2014, 07:17:31 AM
Ah, thanks for the explanation. From the description - CVTSI2SD and others looked more like they really convert, not cast.
sorry, my above stamen was maybe a bit misleading/incomplete: the problem is that you are using 53 bit precision arithmetic, while (at least) 63 bit precision would be required.
Quote from: leta on October 16, 2014, 07:17:31 AMHowever I was looking for a faster solution
if inline-assembler is possible with your compiler, inline that few instructions (IMUL+JO+...).
Quote from: qWord on October 16, 2014, 07:28:57 AM
Quote from: leta on October 16, 2014, 07:17:31 AM
Ah, thanks for the explanation. From the description - CVTSI2SD and others looked more like they really convert, not cast.
sorry, my above stamen was maybe a bit misleading/incomplete: the problem is that you are using 53 bit precision arithmetic, while (at least) 63 bit precision would be required.
Quote from: leta on October 16, 2014, 07:17:31 AMHowever I was looking for a faster solution
if inline-assembler is possible with your compiler, inline that few instructions (IMUL+JO+...).
Do you mean using something like __mul128() intrinsic? That would probably solve that with integers only, but in the overflow case a repeated multiplication would be needed to convert to double.
I use Visual Studio which doesn't support inline ASM in 64 bit mode, sadly. It has that great PGO (profile guided optimization) feature to train binaries for certain use cases. Using it to benchmark implementations I do for ml64. Looking at the disassembly, it's horrible, uses a lot of stack whereby registers are free. But it's faster in this case, especially in edge cases. Maybe that's one of those rare cases where a C compiler does a better job, lets see :)
maybe the following function does it for you:
mult proc
mov eax,1
cvtsi2sd xmm0,rcx
cvtsi2sd xmm1,rdx
imul rcx,rdx
mulsd xmm0,xmm1
mov [r8],rcx
movsd REAL8 ptr [r9],xmm0
jno @NOF
xor eax,eax ; integer overflow -> rax = 0
@NOF: ret
mult endp
EDIT: new proc
Yeah, this works! And this is much more elegant than things I do :) I just to had to revrse the return value (expecting 1 on overflow).
While testing on this, I've realized some strange behaviors. Namely, of course I randomize the input, but when measuring two calls one right after the other, the result is almost neglecting, 1 tick . Whereby when I insert a sleep of 1ms between the calls - the ASM implementation seems faster for about 10 ticks. This makes me harder to trust the synthetic test less in this case. Actually doing always by a scenario to make a test not predictable was working well (make it depend on input, on some IO, etc.). Is there some specific I'm not aware of for ASM? Here's my test code https://github.com/weltling/ml64_exercise/blob/master/int64_overflow/mul_bench.c (https://github.com/weltling/ml64_exercise/blob/master/int64_overflow/mul_bench.c)
Another question, about your implementation - do I get it right that imul and mulsd will be executing possibly in parallel?
Thanks.
Hi weltling,
Quote from: weltling on October 16, 2014, 09:53:23 PM
Another question, about your implementation - do I get it right that imul and mulsd will be executing possibly in parallel?
I think that these instructions are not pairable. You can check it here (http://www.xiaohui.com/dev/mmx/mmx_p_10_1.htm). And welcome to the forum.
Gunther
Quote from: weltling on October 16, 2014, 09:53:23 PMThis makes me harder to trust the synthetic test less in this case. Actually doing always by a scenario to make a test not predictable was working well (make it depend on input, on some IO, etc.). Is there some specific I'm not aware of for ASM?
Of course, timing the code as part of an real algorithm with (in best case) real data is more meaningful.
If you want to measure just the function, do it in Assembler using the RDTSC (read time stamp counter) instruction. Two macros to generate a test loop:
; x64-Version of MichaelW's macros
counter_begin MACRO loopcount:REQ
LOCAL label
IFNDEF tmcb__nLoops
.data
align 16
tmcb__nLoops dd 0
tmcb__cntr dd 0
tmcb__qw dq 2 dup (?)
.code
ENDIF
mov tmcb__nLoops,loopcount
xor rax,rax
cpuid
rdtsc
mov DWORD ptr tmcb__qw[0],eax
mov DWORD ptr tmcb__qw[4],edx
mov tmcb__cntr, loopcount
xor rax,rax
cpuid
align 16
@@:
sub tmcb__cntr,1
jnz @B
xor rax,rax
cpuid
rdtsc
shl rdx,32
or rax,rdx
sub rax,tmcb__qw[0]
mov tmcb__qw[0],rax
xor rax, rax
cpuid
rdtsc
mov tmcb__cntr,loopcount
mov DWORD ptr tmcb__qw[8],eax
mov DWORD ptr tmcb__qw[12],edx
xor rax,rax
cpuid
align 16
label:
tmcb__label equ <label>
ENDM
counter_end MACRO
sub tmcb__cntr,1
jnz tmcb__label
xor rax,rax
cpuid
rdtsc
shl rdx,32
or rax,rdx
sub rax,tmcb__qw[0]
sub rax,tmcb__qw[8]
mov tmcb__qw[0],rax
finit
fild tmcb__qw[0]
fild tmcb__nLoops
fdiv
fistp tmcb__qw[0]
mov rax,tmcb__qw[0]
ENDM
; usage:
counter_begin 10000 ; <- iteration count
; one or more calls to your function
counter_end
; rax = average tick count
Quote from: weltling on October 16, 2014, 09:53:23 PM
do I get it right that imul and mulsd will be executing possibly in parallel?
yes.
Hi weltling,
Quote from: qWord on October 17, 2014, 12:22:56 AM
Quote from: weltling on October 16, 2014, 09:53:23 PM
do I get it right that imul and mulsd will be executing possibly in parallel?
yes.
I should read your post more precise. QWord is right: IMUL (integer unit) and MULSD (floating point unit) can execute in parallel (concurrent processing). But IMUL execute only in the integer U-pipe.
Gunther
Many thanks for the tips and the sample implementation, learned a lot from that already. I'd be probably better working on integrating this into the real app where it's actually dedicated. It has numerous test cases, so will be easier to test both perf and functional parts. Will be reporting the experience then.